> cars = read.delim('cars.tab',stringsAsFactors=FALSE)To get an idea of what information we have, let's look at the first few records;

> head(cars) Country Car MPG Weight Drive_Ratio Horsepower 1 U.S. Buick Estate Wagon 16.9 4.360 2.73 155 2 U.S. Ford Country Squire Wagon 15.5 4.054 2.26 142 3 U.S. Chevy Malibu Wagon 19.2 3.605 2.56 125 4 U.S. Chrysler LeBaron Wagon 18.5 3.940 2.45 150 5 U.S. Chevette 30.0 2.155 3.70 68 6 Japan Toyota Corona 27.5 2.560 3.05 95 Displacement Cylinders 1 350 8 2 351 8 3 267 8 4 360 8 5 98 4 6 134 4It looks like the variables are measured on different scales, so we will likely want to standardize the data before proceeding. The

> cars.use = cars[,-c(1,2)] > medians = apply(cars.use,2,median) > mads = apply(cars.use,2,mad) > cars.use = scale(cars.use,center=medians,scale=mads)(The 2 used as the second argument to apply means to apply the function to the columns of the matrix or data frame; a value of 1 means to use the rows.) The country of origin and name of the car will not be useful in the cluster analysis, so they have been removed. Notice that the

cars.dist = dist(cars.use)If you display the distance matrix in R (for example, by typing its name), you'll notice that only the lower triangle of the matrix is displayed. This is to remind us that the distance matrix is symmetric, since it doesn't matter which observation we consider first when we calculate a distance. R takes advantage of this fact by only storing the lower triangle of the distance matrix. All of the clustering functions will recognize this and have no problems, but if you try to access the distance matrix in the usual way (for example, with subscripting), you'll see an error message. Thus, if you need to use the distance matrix with anything other than the clustering functions, you'll need to use

> cars.hclust = hclust(cars.dist)Once again, we're using the default method of

plot(cars.hclust,labels=cars$Car,main='Default from hclust')If you choose any height along the y-axis of the dendogram, and move across the dendogram counting the number of lines that you cross, each line represents a group that was identified when objects were joined together into clusters. The observations in that group are represented by the branches of the dendogram that spread out below the line. For example, if we look at a height of 6, and move across the x-axis at that height, we'll cross two lines. That defines a two-cluster solution; by following the line down through all its branches, we can see the names of the cars that are included in these two clusters. Since the y-axis represents how close together observations were when they were merged into clusters, clusters whose branches are very close together (in terms of the heights at which they were merged) probably aren't very reliable. But if there's a big difference along the y-axis between the last merged cluster and the currently merged one, that indicates that the clusters formed are probably doing a good job in showing us the structure of the data. Looking at the dendogram for the car data, there are clearly two very distinct groups; the right hand group seems to consist of two more distinct cluster, while most of the observations in the left hand group are clustering together at about the same height For this data set, it looks like either two or three groups might be an interesting place to start investigating. This is not to imply that looking at solutions with more clusters would be meaningless, but the data seems to suggest that two or three clusters might be a good start. For a problem of this size, we can see the names of the cars, so we could start interpreting the results immediately from the dendogram, but when there are larger numbers of observations, this won't be possible. One of the first things we can look at is how many cars are in each of the groups. We'd like to do this for both the two cluster and three cluster solutions. You can create a vector showing the cluster membership of each observation by using the

> groups.3 = cutree(cars.hclust,3)Simply displaying the group memberships isn't that revealing. A good first step is to use the

> table(groups.3) groups.3 1 2 3 8 20 10Notice that you can get this information for many different groupings at once by combining the calls to

> counts = sapply(2:6,function(ncl)table(cutree(cars.hclust,ncl))) > names(counts) = 2:6 > counts $"2" 1 2 18 20 $"3" 1 2 3 8 20 10 $"4" 1 2 3 4 8 17 3 10 $"5" 1 2 3 4 5 8 11 6 3 10 $"6" 1 2 3 4 5 6 8 11 6 3 5 5To see which cars are in which clusters, we can use subscripting on the vector of car names to choose just the observations from a particular cluster. Since we used all of the observations in the data set to form the distance matrix, the ordering of the names in the original data will coincide with the values returned by

> cars$Car[groups.3 == 1] [1] Buick Estate Wagon Ford Country Squire Wagon [3] Chevy Malibu Wagon Chrysler LeBaron Wagon [5] Chevy Caprice Classic Ford LTD [7] Mercury Grand Marquis Dodge St RegisAs usual, if we want to do the same thing for all the groups at once, we can use

> sapply(unique(groups.3),function(g)cars$Car[groups.3 == g]) [[1]] [1] Buick Estate Wagon Ford Country Squire Wagon [3] Chevy Malibu Wagon Chrysler LeBaron Wagon [5] Chevy Caprice Classic Ford LTD [7] Mercury Grand Marquis Dodge St Regis [[2]] [1] Chevette Toyota Corona Datsun 510 Dodge Omni [5] Audi 5000 Saab 99 GLE Ford Mustang 4 Mazda GLC [9] Dodge Colt AMC Spirit VW Scirocco Honda Accord LX [13] Buick Skylark Pontiac Phoenix Plymouth Horizon Datsun 210 [17] Fiat Strada VW Dasher BMW 320i VW Rabbit [[3]] [1] Volvo 240 GL Peugeot 694 SL Buick Century Special [4] Mercury Zephyr Dodge Aspen AMC Concord D/L [7] Ford Mustang Ghia Chevy Citation Olds Omega [10] Datsun 810We could also see what happens when we use the four cluster solution

> groups.4 = cutree(cars.hclust,4) > sapply(unique(groups.4),function(g)cars$Car[groups.4 == g]) [[1]] [1] Buick Estate Wagon Ford Country Squire Wagon [3] Chevy Malibu Wagon Chrysler LeBaron Wagon [5] Chevy Caprice Classic Ford LTD [7] Mercury Grand Marquis Dodge St Regis [[2]] [1] Chevette Toyota Corona Datsun 510 Dodge Omni [5] Ford Mustang 4 Mazda GLC Dodge Colt AMC Spirit [9] VW Scirocco Honda Accord LX Buick Skylark Pontiac Phoenix [13] Plymouth Horizon Datsun 210 Fiat Strada VW Dasher [17] VW Rabbit [[3]] [1] Audi 5000 Saab 99 GLE BMW 320i [[4]] [1] Volvo 240 GL Peugeot 694 SL Buick Century Special [4] Mercury Zephyr Dodge Aspen AMC Concord D/L [7] Ford Mustang Ghia Chevy Citation Olds Omega [10] Datsun 810The new cluster can be recognized as the third group in the above output. Often there is an auxiliary variable in the original data set that was not included in the cluster analysis, but may be of interest. In fact, cluster analysis is sometimes performed to see if observations naturally group themselves in accord with some already measured variable. For this data set, we could ask whether the clusters reflect the country of origin of the cars, stored in the variable

> table(groups.3,cars$Country) groups.3 France Germany Italy Japan Sweden U.S. 1 0 0 0 0 0 8 2 0 5 1 6 1 7 3 1 0 0 1 1 7 >Of interest is the fact that all of the cars in cluster 1 were manufactured in the US. Considering the state of the automobile industry in 1978, and the cars that were identified in cluster 1, this is not surprising. In an example like this, with a small number of observations, we can often interpret the cluster solution directly by looking at the labels of the observations that are in each cluster. Of course, for larger data sets, this will be impossible or meaningless. A very useful method for characterizing clusters is to look at some sort of summary statistic, like the median, of the variables that were used to perform the cluster analysis, broken down by the groups that the cluster analysis identified. The

> aggregate(cars.use,list(groups.3),median) Group.1 MPG Weight Drive_Ratio Horsepower Displacement Cylinders 1 1 -0.7945273 1.5051136 -0.9133729 1.0476133 2.4775849 4.7214353 2 2 0.6859228 -0.5870568 0.5269459 -0.6027364 -0.5809970 -0.6744908 3 3 -0.4058377 0.5246039 -0.1686227 0.3587717 0.3272282 2.0234723If the ranges of these numbers seem strange, it's because we standardized the data before performing the cluster analysis. While it is usually more meaningful to look at the variables in their original scales, when data is centered, negative values mean "lower than most" and positive values mean "higher than most". Thus, group 1 is cars with relatively low MPG, high weight, low drive ratios, high horsepower and displacement, and more than average number of cylinders. Group 2 are cars with high gas mileage, and low weight and horsepower; and group 3 is similar to group 1. It may be easier to understand the groupings if we look at the variables in their original scales:

> aggregate(cars[,-c(1,2)],list(groups.3),median) Group.1 MPG Weight Drive_Ratio Horsepower Displacement Cylinders 1 1 17.30 3.890 2.430 136.5 334 8 2 2 30.25 2.215 3.455 79.0 105 4 3 3 20.70 3.105 2.960 112.5 173 6It may also be useful to add the numbers of observations in each group to the above display. Since

> a3 = aggregate(cars[,-c(1,2)],list(groups.3),median) > data.frame(Cluster=a3[,1],Freq=as.vector(table(groups.3)),a3[,-1]) Cluster Freq MPG Weight Drive_Ratio Horsepower Displacement Cylinders 1 1 8 17.30 3.890 2.430 136.5 334 8 2 2 20 30.25 2.215 3.455 79.0 105 4 3 3 10 20.70 3.105 2.960 112.5 173 6To see how the four cluster solution differed from the three cluster solution, we can perform the same analysis for that solution:

> a4 = aggregate(cars[,-c(1,2)],list(groups.4),median) > data.frame(Cluster=a4[,1],Freq=as.vector(table(groups.4)),a4[,-1]) Cluster Freq MPG Weight Drive_Ratio Horsepower Displacement Cylinders 1 1 8 17.3 3.890 2.43 136.5 334 8 2 2 17 30.9 2.190 3.37 75.0 98 4 3 3 3 21.5 2.795 3.77 110.0 121 4 4 4 10 20.7 3.105 2.96 112.5 173 6The main difference seems to be that the four cluster solution recognized a group of cars that have higher horsepower and drive ratios than the other cars in the cluster they came from.

> library(cluster) > cars.pam = pam(cars.dist,3)First of all, let's see if the

> names(cars.pam) [1] "medoids" "id.med" "clustering" "objective" "isolation" [6] "clusinfo" "silinfo" "diss" "call"We can use

> table(groups.3,cars.pam$clustering) groups.3 1 2 3 1 8 0 0 2 0 19 1 3 0 0 10The solutions seem to agree, except for 1 observations that

> cars$Car[groups.3 != cars.pam$clustering] [1] Audi 5000Notice how easy it is to get information like this due to the power of R's subscripting operations. One novel feature of

> cars$Car[cars.pam$id.med] > cars$Car[cars.pam$id.med] [1] Dodge St Regis Dodge Omni Ford Mustang GhiaAnother feature available with

> plot(cars.pam)The plot indicates that there is a good structure to the clusters, with most observations seeming to belong to the cluster that they're in. There is a summary measure at the bottom of the plot labeled "Average Silhouette Width". This table shows how to use the value:

Range of SC | Interpretation |

0.71-1.0 | A strong structure has been found |

0.51-0.70 | A reasonable structure has been found |

0.26-0.50 | The structure is weak and could be artificial |

< 0.25 | No substantial structure has been found |

plot(silhouette(cutree(cars.hclust,4),cars.dist))

> head(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosaWe will only consider the numeric variables in the cluster analysis. As mentioned previously, there are two functions to compute the distance matrix:

> iris.use = subset(iris,select=-Species) > d = dist(iris.use) > library(cluster) > d1 = daisy(iris.use) > sum(abs(d - d1)) [1] 1.072170e-12Of course, if we choose a non-default metric for

> dd = dist(iris.use,method='manhattan') > sum(abs(as.matrix(dd) - as.matrix(d1))) [1] 38773.86The values are very different! Continuing with the cluster example, we can calculate the cluster solution as follows:

> z = agnes(d)The plotting method for

Hit <Return> to see next plot:If you know you want a particular plot, you can pass the

> table(cutree(z,3),iris$Species) setosa versicolor virginica 1 50 0 0 2 0 50 14 3 0 0 36We were able to classify all the setosa and versicolor varieties correctly. The following plot gives some insight into why we were so successful:

> splom(~iris,groups=iris$Species,auto.key=TRUE)

File translated from T

On 16 Mar 2011, 15:27.