Spectral clustering is a technique for finding group structure in data. It uses the spectrum of the data similarity matrix to reduce dimensionality, so that clustering is performed in fewer dimensions. Spectral clustering algorithms have been shown to be more effective in finding clusters than traditional algorithms such as k-means. However, spectral clustering scales poorly in both memory use and computation time as the dataset grows. To cluster large datasets, in this work we parallelize both memory use and computation using MapReduce and MPI. Through an empirical study on a document set of 534,135 instances and a photo set of 2,121,863 images, we show that our parallel algorithm can effectively handle large problems.
Clustering is one of the most important tasks in machine learning and data mining. In the last decade, spectral clustering (e.g., Shi and Malik, 2000; Meila and Shi, 2000; Fowlkes et al., 2004), motivated by normalized graph cut, has attracted much attention. Unlike traditional partition-based clustering, spectral clustering exploits a pairwise data similarity matrix. It has been shown to be more effective than traditional methods such as k-means, which considers only the similarity between instances and k centroids (Ng, Jordan, and Weiss, 2001). Because of its effectiveness, spectral clustering has been widely used in several areas such as information retrieval and computer vision (e.g., Dhillon, 2001; Xu, Liu, and Gong, 2003; Shi and Malik, 2000; Yu and Shi, 2003).
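To make the contrast with k-means concrete, the following is a minimal serial sketch of normalized spectral clustering in the style commonly attributed to Ng, Jordan, and Weiss (2001). It is illustrative only and is not the parallel algorithm of this work. The function names, the Gaussian similarity, the bandwidth parameter `sigma`, and the deterministic farthest-point initialization of k-means are all assumptions introduced for this example. Note that the dense n-by-n similarity matrix built here is the source of the memory and computation cost discussed above.

```python
import numpy as np


def spectral_embedding(X, k, sigma=1.0):
    """Map n points to k dimensions using the normalized similarity matrix."""
    # Pairwise squared Euclidean distances (dense n x n matrix).
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)

    # Gaussian similarity with zero diagonal (assumed form; sigma is hypothetical).
    S = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)

    # Symmetric normalization D^{-1/2} S D^{-1/2}.
    deg = np.maximum(S.sum(axis=1), 1e-12)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = S * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; keep the k largest.
    _, eigvecs = np.linalg.eigh(L)
    V = eigvecs[:, -k:]

    # Normalize each row to unit length before clustering.
    norms = np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    return V / norms


def kmeans(V, k, n_iter=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [V[0]]
    for _ in range(1, k):
        dist_to_nearest = np.min(
            [np.sum((V - c) ** 2, axis=1) for c in centers], axis=0
        )
        centers.append(V[np.argmax(dist_to_nearest)])
    centers = np.array(centers)

    for _ in range(n_iter):
        dist = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            V[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_cluster(X, k, sigma=1.0):
    """Cluster rows of X into k groups via the spectral embedding."""
    return kmeans(spectral_embedding(X, k, sigma), k)
```

In this sketch k-means still runs at the end, but on the rows of the eigenvector matrix rather than on the raw instances, which is how the pairwise similarity structure enters the clustering.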