Performing and Interpreting Cluster Analysis

For the hierarchial clustering methods, the dendogram is the main graphical tool for getting insight into a cluster solution. When you use hclust or agnes to perform a cluster analysis, you can see the dendogram by passing the result of the clustering to the plot function.
To illustrate interpretation of the dendogram, we'll look at a cluster analysis performed on a set of cars from 1978-1979; the data can be found at http://www.stat.berkeley.edu/~spector/s133/data/cars.tab. Since the data is a tab-delimited file, we use read.delim:
> cars = read.delim('cars.tab',stringsAsFactors=FALSE)

To get an idea of what information we have, let's look at the first few records;
 
> head(cars) 
  Country                       Car  MPG Weight Drive_Ratio Horsepower
1    U.S.        Buick Estate Wagon 16.9  4.360        2.73        155
2    U.S. Ford Country Squire Wagon 15.5  4.054        2.26        142
3    U.S.        Chevy Malibu Wagon 19.2  3.605        2.56        125
4    U.S.    Chrysler LeBaron Wagon 18.5  3.940        2.45        150
5    U.S.                  Chevette 30.0  2.155        3.70         68
6   Japan             Toyota Corona 27.5  2.560        3.05         95
  Displacement Cylinders
1          350         8
2          351         8
3          267         8
4          360         8
5           98         4
6          134         4

It looks like the variables are measured on different scales, so we will likely want to standardize the data before proceeding. The daisy function in the cluster library will automatically perform standardization, but it doesn't give you complete control. If you have a particular method of standardization in mind, you can use the scale function. You pass scale a matrix or data frame to be standardized, and two optional vectors. The first, called center, is a vector of values, one for each column of the matrix or data frame to be standardized, which will be subtracted from every entry in that column. The second, called scale, is similar to center, but is used to divide the values in each column. Thus, to get z-scores, you could pass scale a vector of means for center, and a vector of standard deviations for scale. These vectors can be created with the apply function, that performs the same operation on each row or column of a matrix. Suppose we want to standardize by subtracting the median and dividing by the mean average deviation:
> cars.use = cars[,-c(1,2)]
> medians = apply(cars.use,2,median)
> mads = apply(cars.use,2,mad)
> cars.use = scale(cars.use,center=medians,scale=mads)

(The 2 used as the second argument to apply means to apply the function to the columns of the matrix or data frame; a value of 1 means to use the rows.) The country of origin and name of the car will not be useful in the cluster analysis, so they have been removed. Notice that the scale function doesn't change the order of the rows of the data frame, so it will be easy to identify observations using the omitted columns from the original data.
First, we'll take a look at a hierarchical method, since it will provide information about solutions with different numbers of clusters. The first step is calculating a distance matrix. For a data set with n observations, the distance matrix will have n rows and n columns; the (i,j)th element of the distance matrix will be the difference between observation i and observation j. There are two functions that can be used to calculate distance matrices in R; the dist function, which is included in every version of R, and the daisy function, which is part of the cluster library. We'll use the dist function in this example, but you should familiarize yourself with the daisy function (by reading its help page), since it offers some capabilities that dist does not. Each function provides a choice of distance metrics; in this example, we'll use the default of Euclidean distance, but you may find that using other metrics will give different insights into the structure of your data.
cars.dist = dist(cars.use)

If you display the distance matrix in R (for example, by typing its name), you'll notice that only the lower triangle of the matrix is displayed. This is to remind us that the distance matrix is symmetric, since it doesn't matter which observation we consider first when we calculate a distance. R takes advantage of this fact by only storing the lower triangle of the distance matrix. All of the clustering functions will recognize this and have no problems, but if you try to access the distance matrix in the usual way (for example, with subscripting), you'll see an error message. Thus, if you need to use the distance matrix with anything other than the clustering functions, you'll need to use as.matrix to convert it to a regular matrix.
To get started, we'll use the hclust method; the cluster library provides a similar function, called agnes to perform hierarchical cluster analysis.
> cars.hclust = hclust(cars.dist)

Once again, we're using the default method of hclust, which is to update the distance matrix using what R calls "complete" linkage. Using this method, when a cluster is formed, its distance to other objects is computed as the maximum distance between any object in the cluster and the other object. Other linkage methods will provide different solutions, and should not be ignored. For example, using method=ward tends to produce clusters of fairly equal size, and can be useful when other methods find clusters that contain just a few observations.
Now that we've got a cluster solution (actually a collection of cluster solutions), how can we examine the results? The main graphical tool for looking at a hierarchical cluster solution is known as a dendogram. This is a tree-like display that lists the objects which are clustered along the x-axis, and the distance at which the cluster was formed along the y-axis. (Distances along the x-axis are meaningless in a dendogram; the observations are equally spaced to make the dendogram easier to read.) To create a dendogram from a cluster solution, simply pass it to the plot function. The result is displayed below.
plot(cars.hclust,labels=cars$Car,main='Default from hclust')

If you choose any height along the y-axis of the dendogram, and move across the dendogram counting the number of lines that you cross, each line represents a group that was identified when objects were joined together into clusters. The observations in that group are represented by the branches of the dendogram that spread out below the line. For example, if we look at a height of 6, and move across the x-axis at that height, we'll cross two lines. That defines a two-cluster solution; by following the line down through all its branches, we can see the names of the cars that are included in these two clusters. Since the y-axis represents how close together observations were when they were merged into clusters, clusters whose branches are very close together (in terms of the heights at which they were merged) probably aren't very reliable. But if there's a big difference along the y-axis between the last merged cluster and the currently merged one, that indicates that the clusters formed are probably doing a good job in showing us the structure of the data. Looking at the dendogram for the car data, there are clearly two very distinct groups; the right hand group seems to consist of two more distinct cluster, while most of the observations in the left hand group are clustering together at about the same height For this data set, it looks like either two or three groups might be an interesting place to start investigating. This is not to imply that looking at solutions with more clusters would be meaningless, but the data seems to suggest that two or three clusters might be a good start. For a problem of this size, we can see the names of the cars, so we could start interpreting the results immediately from the dendogram, but when there are larger numbers of observations, this won't be possible.
One of the first things we can look at is how many cars are in each of the groups. We'd like to do this for both the two cluster and three cluster solutions. You can create a vector showing the cluster membership of each observation by using the cutree function. Since the object returned by a hierarchical cluster analysis contains information about solutions with different numbers of clusters, we pass the cutree function the cluster object and the number of clusters we're interested in. So to get cluster memberships for the three cluster solution, we could use:
> groups.3 = cutree(cars.hclust,3)

Simply displaying the group memberships isn't that revealing. A good first step is to use the table function to see how many observations are in each cluster. We'd like a solution where there aren't too many clusters with just a few observations, because it may make it difficult to interpret our results. For the four cluster solution, the distribution among the clusters looks good:
> table(groups.3)
groups.3
 1  2  3
 8 20 10

Notice that you can get this information for many different groupings at once by combining the calls to cutree and table in a call to sapply. For example, to see the sizes of the clusters for solutions ranging from 2 to 6 clusters, we could use:
> counts = sapply(2:6,function(ncl)table(cutree(cars.hclust,ncl)))
> names(counts) = 2:6
> counts
$"2"

 1  2
18 20

$"3"

 1  2  3
 8 20 10

$"4"

 1  2  3  4
 8 17  3 10

$"5"

 1  2  3  4  5
 8 11  6  3 10

$"6"

 1  2  3  4  5  6
 8 11  6  3  5  5


To see which cars are in which clusters, we can use subscripting on the vector of car names to choose just the observations from a particular cluster. Since we used all of the observations in the data set to form the distance matrix, the ordering of the names in the original data will coincide with the values returned by cutree. If observations were removed from the data before the distance matrix is computed, it's important to remember to make the same deletions in the vector from the original data set that will be used to identify observations. So, to see which cars were in the first cluster for the four cluster solution, we can use:
> cars$Car[groups.3 == 1]
[1] Buick Estate Wagon        Ford Country Squire Wagon
[3] Chevy Malibu Wagon        Chrysler LeBaron Wagon
[5] Chevy Caprice Classic     Ford LTD
[7] Mercury Grand Marquis     Dodge St Regis

As usual, if we want to do the same thing for all the groups at once, we can use sapply:
> sapply(unique(groups.3),function(g)cars$Car[groups.3 == g])
[[1]]
[1] Buick Estate Wagon        Ford Country Squire Wagon
[3] Chevy Malibu Wagon        Chrysler LeBaron Wagon
[5] Chevy Caprice Classic     Ford LTD
[7] Mercury Grand Marquis     Dodge St Regis

[[2]]
 [1] Chevette         Toyota Corona    Datsun 510       Dodge Omni
 [5] Audi 5000        Saab 99 GLE      Ford Mustang 4   Mazda GLC
 [9] Dodge Colt       AMC Spirit       VW Scirocco      Honda Accord LX
[13] Buick Skylark    Pontiac Phoenix  Plymouth Horizon Datsun 210
[17] Fiat Strada      VW Dasher        BMW 320i         VW Rabbit

[[3]]
 [1] Volvo 240 GL          Peugeot 694 SL        Buick Century Special
 [4] Mercury Zephyr        Dodge Aspen           AMC Concord D/L
 [7] Ford Mustang Ghia     Chevy Citation        Olds Omega
[10] Datsun 810


We could also see what happens when we use the four cluster solution
> groups.4 = cutree(cars.hclust,4)
> sapply(unique(groups.4),function(g)cars$Car[groups.4 == g])
[[1]]
[1] Buick Estate Wagon        Ford Country Squire Wagon
[3] Chevy Malibu Wagon        Chrysler LeBaron Wagon
[5] Chevy Caprice Classic     Ford LTD
[7] Mercury Grand Marquis     Dodge St Regis

[[2]]
 [1] Chevette         Toyota Corona    Datsun 510       Dodge Omni
 [5] Ford Mustang 4   Mazda GLC        Dodge Colt       AMC Spirit
 [9] VW Scirocco      Honda Accord LX  Buick Skylark    Pontiac Phoenix
[13] Plymouth Horizon Datsun 210       Fiat Strada      VW Dasher
[17] VW Rabbit

[[3]]
[1] Audi 5000   Saab 99 GLE BMW 320i

[[4]]
 [1] Volvo 240 GL          Peugeot 694 SL        Buick Century Special
 [4] Mercury Zephyr        Dodge Aspen           AMC Concord D/L
 [7] Ford Mustang Ghia     Chevy Citation        Olds Omega
[10] Datsun 810


The new cluster can be recognized as the third group in the above output.
Often there is an auxiliary variable in the original data set that was not included in the cluster analysis, but may be of interest. In fact, cluster analysis is sometimes performed to see if observations naturally group themselves in accord with some already measured variable. For this data set, we could ask whether the clusters reflect the country of origin of the cars, stored in the variable Country in the original data set. The table function can be used, this time passing two arguments, to produce a cross-tabulation of cluster group membership and country of origin:
> table(groups.3,cars$Country)

groups.3 France Germany Italy Japan Sweden U.S.
       1      0       0     0     0      0    8
       2      0       5     1     6      1    7
       3      1       0     0     1      1    7
>                                                 

Of interest is the fact that all of the cars in cluster 1 were manufactured in the US. Considering the state of the automobile industry in 1978, and the cars that were identified in cluster 1, this is not surprising.
In an example like this, with a small number of observations, we can often interpret the cluster solution directly by looking at the labels of the observations that are in each cluster. Of course, for larger data sets, this will be impossible or meaningless. A very useful method for characterizing clusters is to look at some sort of summary statistic, like the median, of the variables that were used to perform the cluster analysis, broken down by the groups that the cluster analysis identified. The aggregate function is well suited for this task, since it will perform summaries on many variables simultaneously. Let's look at the median values for the variables we've used in the cluster analysis, broken up by the cluster groups. One oddity of the aggregate function is that it demands that the variable(s) used to divide up the data are passed to it in a list, even if there's only one variable:
> aggregate(cars.use,list(groups.3),median)
  Group.1        MPG     Weight Drive_Ratio Horsepower Displacement  Cylinders
1       1 -0.7945273  1.5051136  -0.9133729  1.0476133    2.4775849  4.7214353
2       2  0.6859228 -0.5870568   0.5269459 -0.6027364   -0.5809970 -0.6744908
3       3 -0.4058377  0.5246039  -0.1686227  0.3587717    0.3272282  2.0234723


If the ranges of these numbers seem strange, it's because we standardized the data before performing the cluster analysis. While it is usually more meaningful to look at the variables in their original scales, when data is centered, negative values mean "lower than most" and positive values mean "higher than most". Thus, group 1 is cars with relatively low MPG, high weight, low drive ratios, high horsepower and displacement, and more than average number of cylinders. Group 2 are cars with high gas mileage, and low weight and horsepower; and group 3 is similar to group 1. It may be easier to understand the groupings if we look at the variables in their original scales:
> aggregate(cars[,-c(1,2)],list(groups.3),median)
  Group.1   MPG Weight Drive_Ratio Horsepower Displacement Cylinders
1       1 17.30  3.890       2.430      136.5          334         8
2       2 30.25  2.215       3.455       79.0          105         4
3       3 20.70  3.105       2.960      112.5          173         6


It may also be useful to add the numbers of observations in each group to the above display. Since aggregate returns a data frame, we can manipulate it in any way we want:
> a3 = aggregate(cars[,-c(1,2)],list(groups.3),median)
> data.frame(Cluster=a3[,1],Freq=as.vector(table(groups.3)),a3[,-1])
  Cluster Freq   MPG Weight Drive_Ratio Horsepower Displacement Cylinders
1       1    8 17.30  3.890       2.430      136.5          334         8
2       2   20 30.25  2.215       3.455       79.0          105         4
3       3   10 20.70  3.105       2.960      112.5          173         6

To see how the four cluster solution differed from the three cluster solution, we can perform the same analysis for that solution:
> a4 = aggregate(cars[,-c(1,2)],list(groups.4),median)
> data.frame(Cluster=a4[,1],Freq=as.vector(table(groups.4)),a4[,-1])
  Cluster Freq  MPG Weight Drive_Ratio Horsepower Displacement Cylinders
1       1    8 17.3  3.890        2.43      136.5          334         8
2       2   17 30.9  2.190        3.37       75.0           98         4
3       3    3 21.5  2.795        3.77      110.0          121         4
4       4   10 20.7  3.105        2.96      112.5          173         6


The main difference seems to be that the four cluster solution recognized a group of cars that have higher horsepower and drive ratios than the other cars in the cluster they came from.
Unlike the hierarchical clustering methods, techniques like k-means cluster analysis (available through the kmeans function) or partitioning around mediods (avaiable through the pam function in the cluster library) require that we specify the number of clusters that will be formed in advance. pam offers some additional diagnostic information about a clustering solution, and provides a nice example of an alternative technique to hierarchical clustering. To use pam, you must first load the cluster library. You can pass pam a data frame or a distance matrix; since we've already formed the distance matrix, we'll use that. pam also needs the number of clusters you wish to form. Let's look at the three cluster solution produced by pam:
> library(cluster)
> cars.pam = pam(cars.dist,3)

First of all, let's see if the pam solution agrees with the hclust solution. Since pam only looks at one cluster solution at a time, we don't need to use the cutree function as we did with hclust; the cluster memberships are stored in the clustering component of the pam object; like most R objects, you can use the names function to see what else is available. Further information can be found in the help page for pam.object.
> names(cars.pam)
[1] "medoids"    "id.med"     "clustering" "objective"  "isolation"
[6] "clusinfo"   "silinfo"    "diss"       "call"

We can use table to compare the results of the hclust and pam solutions:
> table(groups.3,cars.pam$clustering)

> table(groups.3,cars.pam$clustering)

groups.3  1  2  3
       1  8  0  0
       2  0 19  1
       3  0  0 10

The solutions seem to agree, except for 1 observations that hclust put in group 2 and pam put in group 3. Which observations was it?
> cars$Car[groups.3 != cars.pam$clustering]
[1] Audi 5000

Notice how easy it is to get information like this due to the power of R's subscripting operations.
One novel feature of pam is that it finds observations from the original data that are typical of each cluster in the sense that they are closest to the center of the cluster. The indexes of the medoids are stored in the id.med component of the pam object, so we can use that component as a subscript into the vector of car names to see which ones were selected:
> cars$Car[cars.pam$id.med]
> cars$Car[cars.pam$id.med]
[1] Dodge St Regis    Dodge Omni        Ford Mustang Ghia

Another feature available with pam is a plot known as a silhouette plot. First, a measure is calculated for each observation to see how well it fits into the cluster that it's been assigned to. This is done by comparing how close the object is to other objects in its own cluster with how close it is to objects in other clusters. (A complete description can be found in the help page for silhouette.) Values near one mean that the observation is well placed in its cluster; values near 0 mean that it's likely that an observation might really belong in some other cluster. Within each cluster, the value for this measure is displayed from smallest to largest. If the silhouette plot shows values close to one for each observation, the fit was good; if there are many observations closer to zero, it's an indication that the fit was not good. The sihouette plot is very useful in locating groups in a cluster analysis that may not be doing a good job; in turn this information can be used to help select the proper number of clusters. For the current example, here's the silhouette plot for the three cluster pam solution, produced by the command
> plot(cars.pam)

The plot indicates that there is a good structure to the clusters, with most observations seeming to belong to the cluster that they're in. There is a summary measure at the bottom of the plot labeled "Average Silhouette Width". This table shows how to use the value:
Range of SCInterpretation
0.71-1.0A strong structure has been found
0.51-0.70A reasonable structure has been found
0.26-0.50The structure is weak and could be artificial
< 0.25No substantial structure has been found
To create a silhouette plot for a particular solution derived from a hierarchical cluster analysis, the silhouette function can be used. This function takes the appropriate output from cutree along with the distance matrix used for the clustering. So to produce a silhouette plot for our 4 group hierarchical cluster (not shown), we could use the following statements:
plot(silhouette(cutree(cars.hclust,4),cars.dist))




File translated from TEX by TTH, version 3.67.
On 19 Oct 2007, 21:44.