Performing and Interpreting Cluster Analysis

As an example of using the agnes function from the cluster package, consider the famous Fisher iris data, available as the data frame iris in R. First, let's look at some of the data:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

We will consider only the numeric variables in the cluster analysis. As mentioned previously, two functions can compute the distance matrix: dist and daisy. For data that's all numeric, the two functions give the same answers (up to floating-point rounding) when each is called with its defaults. We can demonstrate this as follows:

> iris.use = subset(iris,select=-Species)
> d = dist(iris.use)
> library(cluster)
> d1 = daisy(iris.use)
> sum(abs(d - d1))
[1] 1.072170e-12

Of course, if we choose a non-default metric for dist, the answers will be different:

> dd = dist(iris.use,method='manhattan')
> sum(abs(as.matrix(dd) - as.matrix(d1)))
[1] 38773.86

The values are very different!
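The difference comes from the metric, not from the function used. As a small sketch (the object name dm is introduced here only for illustration), asking daisy for the same Manhattan metric through its metric= argument should bring the two results back into agreement:

```r
library(cluster)

iris.use <- subset(iris, select = -Species)

# Manhattan distances computed by each function
dd <- dist(iris.use, method = "manhattan")
dm <- daisy(iris.use, metric = "manhattan")

# Largest absolute discrepancy between corresponding entries;
# this should be zero or on the order of rounding error
max(abs(as.vector(dd) - as.vector(dm)))
```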

Continuing with the cluster example, we can calculate the cluster solution as follows:

> z = agnes(d)
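Called this way, agnes uses its default linkage, which is average linkage. The sketch below makes that choice explicit and compares the agglomerative coefficient (the ac component of the result) across a few of the other linkage methods agnes accepts. The names z.avg and methods are illustrative only, and the comparison is a rough guide rather than a rule for choosing a method:

```r
library(cluster)

iris.use <- subset(iris, select = -Species)
d <- dist(iris.use)

# The default call and an explicit average-linkage call
z <- agnes(d)
z.avg <- agnes(d, method = "average")

# TRUE if both calls produced the same merge heights
identical(z$height, z.avg$height)

# Agglomerative coefficient for several linkage methods;
# values closer to 1 suggest a stronger clustering structure
methods <- c("average", "single", "complete", "ward")
sapply(methods, function(m) agnes(d, method = m)$ac)
```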

The plotting method for agnes objects presents two different views of the cluster solution. When we plot such an object, the plotting function sets the graphics parameter ask=TRUE, and the following appears in your R session each time a plot is to be drawn:

Hit <Return> to see next plot:

If you know you want a particular plot, you can pass the which.plots= argument an integer telling which plot you want.
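For instance, a minimal sketch that requests each plot on its own, so that no prompt is needed between them:

```r
library(cluster)

z <- agnes(dist(subset(iris, select = -Species)))

plot(z, which.plots = 1)  # first plot only
plot(z, which.plots = 2)  # second plot only
```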

The first plot that is displayed is known as a banner plot. The banner plot for the iris data is shown below:

The white area on the left of the banner plot represents the unclustered data, while the white lines that stick into the red area show the heights at which the clusters were formed. Since we don't want to include too many clusters that joined together at similar heights, three clusters, at a height of about 2, look like a good solution. It's clear from the banner plot that if we lowered the height to, say, 1.5, we'd create a fourth cluster with only a few observations.
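The same reading can be checked numerically. The following sketch converts the agnes result with as.hclust and cuts the tree at the two heights discussed above, tabulating how many observations land in each cluster (the name hc is illustrative only):

```r
library(cluster)

z <- agnes(dist(subset(iris, select = -Species)))

# cutree() can cut at a height once the result is in hclust form
hc <- as.hclust(z)

# Cluster sizes when cutting at a height of 2
table(cutree(hc, h = 2))

# Cluster sizes when cutting lower, at a height of 1.5
table(cutree(hc, h = 1.5))
```

Cutting lower can only split existing clusters, so the second table never has fewer clusters than the first.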

The banner plot is just an alternative to the dendrogram, which is the second plot that's produced from an agnes object:

The dendrogram shows the same relationships, and it's a matter of individual preference as to which one is easier to use.

Let's see how well the clusters do in grouping the irises by species:

> table(cutree(z,3),iris$Species)
   
    setosa versicolor virginica
  1     50          0         0
  2      0         50        14
  3      0          0        36
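One simple way to summarize a table like this in a single number is to credit each cluster with its most common species and take the proportion of flowers that fall in those cells. This is only a sketch, and the names tab and majority are illustrative; with the counts printed above, the proportion works out to (50 + 50 + 36)/150, or about 0.91.

```r
library(cluster)

z <- agnes(dist(subset(iris, select = -Species)))

# Cross-tabulate the three-cluster solution against the species
tab <- table(cluster = cutree(z, 3), species = iris$Species)

# For each cluster, the count of its most common species
majority <- apply(tab, 1, max)

# Proportion of flowers that sit in their cluster's majority species
sum(majority) / sum(tab)
```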

All of the setosa and versicolor flowers were grouped correctly, although 14 of the 50 virginica flowers ended up in the versicolor cluster. The following plot gives some insight into why we were so successful:

> library(lattice)
> splom(~iris,groups=iris$Species,auto.key=TRUE)
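For comparison, the same kind of plot can be colored by cluster membership instead of by species, which makes it easier to see where the two groupings disagree. This is a sketch only; the name p is illustrative, and the explicit print is there so the plot is also drawn when the code is run from a script:

```r
library(cluster)
library(lattice)

iris.use <- subset(iris, select = -Species)
z <- agnes(dist(iris.use))

# Scatterplot matrix of the numeric variables, grouped by cluster
p <- splom(~iris.use, groups = factor(cutree(z, 3)), auto.key = TRUE)
print(p)
```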
