✓ To apply the k-means algorithm in Weka and R
✓ To interpret the results of clustering
✓ To identify the optimum number of clusters
✓ To apply classification to unlabeled data by using clustering as an intermediate step
As discussed earlier, data that is not labeled can be analyzed by clustering, that is, by grouping a set of objects into classes of similar objects.
In this chapter, we will apply clustering to Fisher's Iris dataset. We will use clustering algorithms to group flower samples into clusters with similar flower dimensions. These clusters then become candidate groupings of the flower samples into species. We will implement a simple k-means algorithm to cluster numerical attributes with the help of Weka and R.
In the case of classification, we know both the attributes and the classes of the instances. For example, the flower dimensions and classes were already known to us for the Iris dataset. Our goal was to predict the class of an unknown sample, as shown in Figure 8.1.
Earlier, we used the Weka J48 classification algorithm to build a decision tree on Fisher's Iris dataset from samples with known classes, and this tree helped in predicting the class of unknown samples. The attributes used were the flower's sepal length and width and its petal length and width. Using this tree and the flower dimensions, we can identify an unknown Iris as one of three species: Setosa, Versicolor, or Virginica.
In clustering, we know the attributes of the instances, but we don't know the classes. For example, we know the flower dimensions for the samples of the Iris dataset, but we don't know what classes exist, as shown in Figure 8.2. Therefore, our goal is to group instances into clusters with similar attributes or dimensions and then identify the class.
In this chapter, we will learn what happens if we don't know which classes the samples belong to, how many classes there are, or even what defines a class. Since Fisher's Iris dataset is already labeled, we will first make it unlabeled by removing the class attribute, i.e., the species column. Then, we will apply clustering algorithms to cluster this data on the basis of its input attributes, i.e., sepal length, sepal width, petal length, and petal width.
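The two steps just described, removing the class attribute and then clustering on the four input attributes, can be sketched in R as follows. This is only a minimal illustration, not the procedure developed later in the chapter: it assumes the copy of the Iris data that ships with base R as the data frame `iris`, uses the `kmeans()` function from the standard `stats` package, and fixes the number of clusters at three purely as an assumption. The seed and the `nstart` value are likewise arbitrary choices, and the name `iris_unlabeled` is introduced here for illustration.

```r
# Fisher's Iris data is available in base R as the data frame `iris`.
data(iris)

# Drop the class attribute (the Species column) to obtain unlabeled data,
# keeping only the four numerical input attributes.
iris_unlabeled <- iris[, c("Sepal.Length", "Sepal.Width",
                           "Petal.Length", "Petal.Width")]

# Assumption: k = 3 clusters. With truly unlabeled data, the number of
# clusters would not be known in advance.
set.seed(42)  # arbitrary seed, so that the random starts are repeatable
fit <- kmeans(iris_unlabeled, centers = 3, nstart = 10)

# Inspect the size of each cluster and the cluster centroids.
print(fit$size)
print(fit$centers)

# For checking only: cross-tabulate the cluster assignments against the
# species labels that were withheld from the algorithm.
print(table(cluster = fit$cluster, species = iris$Species))
```

The final cross-tabulation is possible only because the Iris labels are actually known; with genuinely unlabeled data, the clusters would have to be interpreted from the centroids alone.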