Paper 5

Pairwise Similarity for Cluster Ensemble Problem: Link-Based and Approximate Approaches

Authors: Natthakan Iam-On and Tossapon Boongoen

Volume 9 (2013)

Abstract

Cluster ensemble methods have emerged as powerful techniques, aggregating several input data clusterings to generate a single output clustering, with improved robustness and stability. In particular, link-based similarity techniques have recently been introduced with superior performance to the conventional co-association method. Their potential and applicability are, however limited due to the underlying time complexity. In light of such shortcoming, this paper presents two approximate approaches that mitigate the problem of time complexity: the approximate algorithm approach (Approximate SimRank Based Similarity matrix) and the approximate data approach (Prototype-based cluster ensemble model). The first approach involves decreasing the computational requirement of the existing link-based technique; the second reduces the size of the problem by finding a smaller, representative, approximate dataset, derived by a density-biased sampling technique. The advantages of both approximate approaches are empirically demonstrated over 22 datasets (both artificial and real data) and statistical comparisons of performance (with 95% confidence level) with three well-known validity criteria. Results obtained from these experiments suggest that approximate techniques can efficiently help scaling up the application of link-based similarity methods to wider range of data sizes.