
12 - Large-Scale Spectral Clustering with MapReduce and MPI

from Part Two - Supervised and Unsupervised Learning Algorithms

Published online by Cambridge University Press: 05 February 2012

Wen-Yen Chen, University of California
Yangqiu Song, Tsinghua University
Hongjie Bai, Google Research, Beijing, China
Chih-Jen Lin, National Taiwan University
Edward Y. Chang, Google Research, Beijing, China

Edited by
Ron Bekkerman, LinkedIn Corporation, Mountain View, California
Mikhail Bilenko, Microsoft Research, Redmond, Washington
John Langford, Yahoo! Research, New York

Summary

Spectral clustering is a technique for finding group structure in data. It uses the spectrum of the data similarity matrix to perform dimensionality reduction, then clusters in the reduced space. Spectral clustering algorithms have been shown to be more effective at finding clusters than traditional algorithms such as k-means. However, spectral clustering suffers from scalability problems in both memory use and computation time when the dataset is large. In this work, to perform clustering on large datasets, we parallelize both memory use and computation using MapReduce and MPI. Through an empirical study on a document set of 534,135 instances and a photo set of 2,121,863 images, we show that our parallel algorithm effectively handles large problems.
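To make the pipeline concrete, the following is a minimal serial sketch of the spectral clustering steps that this chapter parallelizes (normalized affinity, leading eigenvectors, then k-means), in the style of Ng, Jordan, and Weiss (2001). The use of scipy/numpy and scikit-learn's KMeans is an illustrative assumption, not the chapter's MapReduce/MPI implementation, which distributes each of these steps.

    # A minimal serial sketch of the spectral clustering pipeline; an
    # illustration only, not the chapter's distributed implementation.
    # scikit-learn's KMeans stands in for the final k-means step.
    import numpy as np
    from scipy.sparse import diags
    from scipy.sparse.linalg import eigsh
    from sklearn.cluster import KMeans

    def spectral_clustering(S, k):
        """Cluster n points given a symmetric sparse n x n similarity matrix S.

        Assumes every point has at least one neighbor (no zero-degree rows).
        """
        # Degree vector and normalized affinity L = D^{-1/2} S D^{-1/2}
        d = np.asarray(S.sum(axis=1)).ravel()
        D_inv_sqrt = diags(1.0 / np.sqrt(d))
        L = D_inv_sqrt @ S @ D_inv_sqrt
        # The k largest eigenvectors of L span the spectral embedding
        _, U = eigsh(L, k=k, which='LA')
        # Normalize each embedded point to unit length, then run k-means
        U /= np.linalg.norm(U, axis=1, keepdims=True)
        return KMeans(n_clusters=k, n_init=10).fit_predict(U)

In the serial setting, the memory bottleneck is the n x n similarity matrix S and the computational bottleneck is the eigendecomposition; these are exactly the parts the chapter distributes across machines.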

Clustering is one of the most important tasks in machine learning and data mining. In the last decade, spectral clustering (e.g., Shi and Malik, 2000; Meila and Shi, 2000; Fowlkes et al., 2004), motivated by the normalized graph cut, has attracted much attention. Unlike traditional partition-based clustering, spectral clustering exploits a pairwise data similarity matrix. It has been shown to be more effective than traditional methods such as k-means, which considers only the similarity between instances and the k centroids (Ng, Jordan, and Weiss, 2001). Because of its effectiveness, spectral clustering has been widely used in areas such as information retrieval and computer vision (e.g., Dhillon, 2001; Xu, Liu, and Gong, 2003; Shi and Malik, 2000; Yu and Shi, 2003).
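As a concrete example of the pairwise similarity matrix mentioned above, the sketch below builds a sparse, symmetric Gaussian affinity over each point's t nearest neighbors, keeping memory proportional to nt rather than n squared. The construction, the parameters t and sigma, and the use of scikit-learn's NearestNeighbors are illustrative assumptions rather than a description of the chapter's distributed implementation; its output can feed the spectral_clustering sketch above.

    # Illustrative sparse similarity construction; t, sigma, and the use
    # of scikit-learn are assumptions, not the chapter's MapReduce code.
    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.neighbors import NearestNeighbors

    def knn_similarity(X, t=10, sigma=1.0):
        """Return a symmetric sparse n x n similarity matrix over the rows of X."""
        n = X.shape[0]
        # With exact search over the same data, each point's nearest
        # neighbor (column 0) is itself, so retrieve t + 1 and drop it.
        dist, idx = NearestNeighbors(n_neighbors=t + 1).fit(X).kneighbors(X)
        rows = np.repeat(np.arange(n), t)
        cols = idx[:, 1:].ravel()
        vals = np.exp(-(dist[:, 1:].ravel() ** 2) / (2.0 * sigma ** 2))
        S = csr_matrix((vals, (rows, cols)), shape=(n, n))
        return S.maximum(S.T)  # symmetrize the affinity

A usage example, under the same assumptions: labels = spectral_clustering(knn_similarity(X), k) partitions the rows of X into k clusters.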

Type: Chapter
Information: Scaling up Machine Learning: Parallel and Distributed Approaches, pp. 240–261
Publisher: Cambridge University Press
Print publication year: 2011


References

Achlioptas, D., McSherry, F., and Schölkopf, B. 2002. Sampling Techniques for Kernel Methods. Pages 335–342 of: Proceedings of NIPS.
Barnett, M., Gupta, S., Payne, D. G., Shuler, L., van de Geijn, R., and Watts, J. 1994. Interprocessor Collective Communication Library (InterCom). Pages 357–364 of: Proceedings of the Scalable High Performance Computing Conference.
Bekkerman, R., and Scholz, M. 2008. Data Weaving: Scaling Up the State-of-the-Art in Data Clustering. Pages 1083–1092 of: Proceedings of CIKM.
Bentley, J. L. 1975. Multidimensional Binary Search Trees Used for Associative Searching. Communications of the ACM, 18(9), 509–517.
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2006. Bigtable: A Distributed Storage System for Structured Data. Pages 205–218 of: Proceedings of OSDI.
Chen, W.-Y., Song, Y., Bai, H., Lin, C.-J., and Chang, E. Y. 2011. Parallel Spectral Clustering in Distributed Systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3), 568–586.
Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y., and Olukotun, K. 2007. Map-Reduce for Machine Learning on Multicore. Pages 281–288 of: Proceedings of NIPS.
Chung, F. 1997. Spectral Graph Theory. Number 92 in CBMS Regional Conference Series in Mathematics. American Mathematical Society.
Dean, J., and Ghemawat, S. 2008. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107–113.
Dhillon, I. S. 2001. Co-clustering Documents and Words Using Bipartite Spectral Graph Partitioning. Pages 269–274 of: Proceedings of SIGKDD.
Dhillon, I. S., and Modha, D. S. 1999. A Data-Clustering Algorithm on Distributed Memory Multiprocessors. Pages 245–260 of: Large-Scale Parallel Data Mining.
Fowlkes, C., Belongie, S., Chung, F., and Malik, J. 2004. Spectral Grouping Using the Nyström Method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 214–225.
Ghemawat, S., Gobioff, H., and Leung, S.-T. 2003. The Google File System. Pages 29–43 of: Proceedings of SOSP. New York: ACM.
Gionis, A., Indyk, P., and Motwani, R. 1999. Similarity Search in High Dimensions via Hashing. Pages 518–529 of: Proceedings of VLDB.
Grama, A., Karypis, G., Kumar, V., and Gupta, A. 2003. Introduction to Parallel Computing, 2nd ed. Reading, MA: Addison-Wesley.
Gropp, W., Lusk, E., and Skjellum, A. 1999. Using MPI-2: Advanced Features of the Message-Passing Interface. Cambridge, MA: MIT Press.
Gürsoy, A. 2003. Data Decomposition for Parallel k-Means Clustering. Pages 241–248 of: Proceedings of PPAM.
Hernandez, V., Roman, J. E., Tomas, A., and Vidal, V. 2005a. A Survey of Software for Sparse Eigenvalue Problems. Technical Report. Universidad Politecnica de Valencia.
Hernandez, V., Roman, J. E., and Vidal, V. 2005b. SLEPc: A Scalable and Flexible Toolkit for the Solution of Eigenvalue Problems. ACM Transactions on Mathematical Software, 31, 351–362.
Lang, K. 1995. NewsWeeder: Learning to Filter Netnews. Pages 331–339 of: Proceedings of ICML.
Lehoucq, R. B., Sorensen, D. C., and Yang, C. 1998. ARPACK User's Guide. SIAM.
Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. 2004. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5, 361–397.
Li, B., Chang, E. Y., and Wu, Y.-L. 2003. Discovery of a Perceptual Distance Function for Measuring Image Similarity. Multimedia Systems, 8(6), 512–522.
Liu, R., and Zhang, H. 2004. Segmentation of 3D Meshes through Spectral Clustering. In: Proceedings of Pacific Graphics.
Liu, T., Moore, A., Gray, A., and Yang, K. 2004. An Investigation of Practical Approximate Nearest Neighbor Algorithms. In: Proceedings of NIPS.
Llorente, I. M., Tirado, F., and Vázquez, L. 1996. Some Aspects about the Scalability of Scientific Applications on Parallel Architectures. Parallel Computing, 22(9), 1169–1195.
Luxburg, U. 2007. A Tutorial on Spectral Clustering. Statistics and Computing, 17(4), 395–416.
Marques, O. A. 1995. BLZPACK: Description and User's Guide. Technical Report TR/PA/95/30. CERFACS, Toulouse, France.
Maschhoff, K., and Sorensen, D. 1996. A Portable Implementation of ARPACK for Distributed Memory Parallel Architectures. In: Proceedings of CMCIM.
Meila, M., and Shi, J. 2000. Learning Segmentation by Random Walks. Pages 873–879 of: Proceedings of NIPS.
Micchelli, C. A. 1986. Interpolation of Scattered Data: Distance Matrices and Conditionally Positive Definite Functions. Constructive Approximation, 2, 11–22.
Ng, A. Y., Jordan, M. I., and Weiss, Y. 2001. On Spectral Clustering: Analysis and an Algorithm. Pages 849–856 of: Proceedings of NIPS.
Rennie, J. D. M. 2001. Improving Multi-class Text Classification with Naive Bayes. M.Phil. thesis, Massachusetts Institute of Technology.
Shi, J., and Malik, J. 2000. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.
Snir, M., and Otto, S. 1998. MPI – The Complete Reference: The MPI Core. Cambridge, MA: MIT Press.
Strehl, A., and Ghosh, J. 2002. Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research, 3, 583–617.
Thakur, R., Rabenseifner, R., and Gropp, W. 2005. Optimization of Collective Communication Operations in MPICH. International Journal of High Performance Computing Applications, 19(1), 49–66.
Wu, K., and Simon, H. 1999. TRLAN User Guide. Technical Report LBNL-41284. Lawrence Berkeley National Laboratory.
Xu, S. T., and Zhang, J. 2004. A Hybrid Parallel Web Document Clustering Algorithm and Its Performance Study. Journal of Supercomputing, 30(2), 117–131.
Xu, W., Liu, X., and Gong, Y. 2003. Document Clustering Based on Non-negative Matrix Factorization. Pages 267–273 of: Proceedings of SIGIR.
Yan, D., Huang, L., and Jordan, M. I. 2009. Fast Approximate Spectral Clustering. Pages 907–916 of: Proceedings of SIGKDD.
Yu, S. X., and Shi, J. 2003. Multiclass Spectral Clustering. Page 313 of: Proceedings of ICCV.
Zelnik-Manor, L., and Perona, P. 2005. Self-Tuning Spectral Clustering. Pages 1601–1608 of: Proceedings of NIPS.
Zhong, S., and Ghosh, J. 2003. A Unified Framework for Model-Based Clustering. Journal of Machine Learning Research, 4, 1001–1037.
