  • Print publication year: 2012
  • Online publication date: November 2012

9 - Clustering, classification and data mining


Multivariate analysis discussed in Chapter 8 seeks to characterize structural relationships among the p variables that may be present in addition to random scatter. The primary structural relations may link the subpopulations without characterizing the structure of any one population.

In such cases, the scientist should first attempt to discriminate the subpopulations. This is the subject of multivariate clustering and classification. Clustering refers to situations where the subpopulations must be estimated from the dataset alone, whereas classification refers to situations where training datasets of known populations are available independently of the dataset under study. When the datasets are very large with well-characterized training sets, classification is a major component of data mining. The efforts to find concentrations in a multivariate distribution of points are closely allied with clustering analysis of spatial distributions when p = 2 or 3; for such low-p problems, the reader is encouraged to examine Chapter 12 along with the present discussion.
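The distinction between clustering and classification can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the chapter: the synthetic two-population dataset with p = 2, the helper names (`kmeans`, `fit_centroids`, `classify`, `sample`), the choice of k-means as the clustering method, and the choice of a nearest-centroid rule as the classifier. Clustering sees only the dataset, while classification also uses an independent training set with known populations.

```python
import random

# Illustrative sketch only: synthetic data and helper names are hypothetical.

def mean(points):
    """Componentwise mean of a non-empty list of p-dimensional points."""
    n = len(points)
    return tuple(sum(x[j] for x in points) / n for j in range(len(points[0])))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda k: sq_dist(point, centers[k]))

def kmeans(data, k, n_iter=50, seed=0):
    """Clustering: estimate k groups from the dataset alone (Lloyd's k-means)."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # initialize centers at k distinct data points
    for _ in range(n_iter):
        labels = [nearest(x, centers) for x in data]
        new_centers = []
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            # keep the old center if a cluster happens to be empty
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(x, centers) for x in data]
    return labels, centers

def fit_centroids(train_points, train_labels):
    """Classification, step 1: one centroid per known population."""
    classes = sorted(set(train_labels))
    return {c: mean([x for x, lab in zip(train_points, train_labels) if lab == c])
            for c in classes}

def classify(point, centroids):
    """Classification, step 2: assign the population with the nearest centroid."""
    return min(centroids, key=lambda c: sq_dist(point, centroids[c]))

def sample(rng, center, n, sigma=0.5):
    """Draw n points with independent Gaussian scatter around a center."""
    return [tuple(rng.gauss(m, sigma) for m in center) for _ in range(n)]

rng = random.Random(42)
true_centers = {"A": (0.0, 0.0), "B": (5.0, 5.0)}  # assumed subpopulations

# Dataset under study: two subpopulations mixed together.
data = sample(rng, true_centers["A"], 100) + sample(rng, true_centers["B"], 100)
truth = ["A"] * 100 + ["B"] * 100

# Clustering: the labels are arbitrary integers, with no population names attached.
cluster_labels, cluster_centers = kmeans(data, k=2)

# Classification: an independent training set with known populations.
train_points = sample(rng, true_centers["A"], 30) + sample(rng, true_centers["B"], 30)
train_labels = ["A"] * 30 + ["B"] * 30
centroids = fit_centroids(train_points, train_labels)
predicted = [classify(x, centroids) for x in data]
accuracy = sum(p == t for p, t in zip(predicted, truth)) / len(truth)

print("cluster sizes:", [cluster_labels.count(c) for c in range(2)])
print("classification accuracy:", accuracy)
```

In this sketch the clustering step returns only anonymous group indices, so any physical interpretation of the groups must come afterwards, whereas the classifier returns named populations because those names were supplied by the training set.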

The astronomical context

Since the advent of astrophotography and spectroscopy over a century ago, astronomers have faced the challenge of characterizing and understanding vast numbers of asteroids, stars, galaxies and other cosmic populations. A crucial step towards astrophysical understanding was the classification of objects into distinct, and often ordered, categories which contain objects sharing similar properties. Over a century ago, A. J. Cannon examined hundreds of thousands of low-resolution photographic stellar spectra, classifying them in the OBAFGKM sequence of decreasing surface temperature.
