7 - Clustering

Published online by Cambridge University Press:  05 December 2014

Jure Leskovec
Affiliation:
Stanford University, California
Anand Rajaraman
Affiliation:
Milliways Laboratories, California
Jeffrey David Ullman
Affiliation:
Stanford University, California

Summary

Clustering is the process of examining a collection of “points,” and grouping the points into “clusters” according to some distance measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another. Fig. 1.1 suggested what clusters might look like. There, the intent was to show three clusters around three different road intersections, but two of the clusters blended into one another because they were not sufficiently separated.

Our goal in this chapter is to offer methods for discovering clusters in data. We are particularly interested in situations where the data is very large, and/or where the space is either high-dimensional or not Euclidean at all. We shall therefore discuss several algorithms that assume the data does not fit in main memory. However, we begin with the basics: the two general approaches to clustering and the methods for dealing with clusters in a non-Euclidean space.

7.1 Introduction to Clustering Techniques

We begin by reviewing the notions of distance measures and spaces. The two major approaches to clustering – hierarchical and point-assignment – are defined. We then turn to a discussion of the “curse of dimensionality,” which makes clustering in high-dimensional spaces difficult, but also, as we shall see, enables some simplifications if used correctly in a clustering algorithm.

7.1.1 Points, Spaces, and Distances

A dataset suitable for clustering is a collection of points, which are objects belonging to some space. In its most general sense, a space is just a universal set of points, from which the points in the dataset are drawn. However, we should be mindful of the common case of a Euclidean space (see Section 3.5.2), which has a number of important properties useful for clustering. In particular, a Euclidean space's points are vectors of real numbers.
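
To make the idea of points as vectors of real numbers concrete, here is a minimal sketch in Python. It assumes the usual straight-line (L2) distance as the distance measure on a Euclidean space. The function name `euclidean_distance` and the sample points are illustrative only and do not come from the text.

```python
import math
from typing import Sequence


def euclidean_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Return the L2 (straight-line) distance between two points.

    Each point is a vector of real numbers; both must have the same
    number of dimensions. (Illustrative helper, not from the text.)
    """
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical 2-D dataset: each point is a vector of two real numbers.
points = [(1.0, 2.0), (4.0, 6.0), (10.0, 10.0)]

# Distance between the first two points: sqrt(3^2 + 4^2) = 5.0
d01 = euclidean_distance(points[0], points[1])
```

Under a distance measure like this one, a clustering algorithm would aim to put points with small mutual distances in the same cluster and points at large distances in different clusters.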

Type
Chapter
Information
Publisher: Cambridge University Press
Print publication year: 2014

