Cluster Analysis

1  Clustering Techniques

Much of the history of cluster analysis is concerned with developing algorithms that were not too computer intensive, since early computers were not nearly as powerful as they are today. Accordingly, computational shortcuts have traditionally been used in many cluster analysis algorithms. These algorithms have proven to be very useful, and can be found in most computer software.
More recently, many of these older methods have been revisited and updated to reflect the fact that certain computations that once would have overwhelmed the available computers can now be performed routinely. In R, a number of these updated versions of cluster analysis algorithms are available through the cluster library, providing us with a large selection of methods to perform cluster analysis, and the possibility of comparing the old methods with the new to see if they really provide an advantage.
One of the oldest methods of cluster analysis is known as k-means cluster analysis, and is available in R through the kmeans function. The first step (and certainly not a trivial one) when using k-means cluster analysis is to specify the number of clusters (k) that will be formed in the final solution. The process begins by choosing k observations to serve as centers for the clusters. Then, the distance from each of the other observations is calculated for each of the k clusters, and observations are put in the cluster to which they are the closest. After each observation has been put in a cluster, the center of the clusters is recalculated, and every observation is checked to see if it might be closer to a different cluster, now that the centers have been recalculated. The process continues until no observations switch clusters.
Looking on the good side, the k-means technique is fast, and doesn't require calculating all of the distances between each observation and every other observation. It can be written to efficiently deal with very large data sets, so it may be useful in cases where other methods fail. On the down side, if you rearrange your data, it's very possible that you'll get a different solution every time you change the ordering of your data. Another criticism of this technique is that you may try, for example, a 3 cluster solution that seems to work pretty well, but when you look for the 4 cluster solution, all of the structure that the 3 cluster solution revealed is gone. This makes the procedure somewhat unattractive if you don't know exactly how many clusters you should have in the first place.
The R cluster library provides a modern alternative to k-means clustering, known as pam, which is an acronym for "Partitioning around Medoids". The term medoid refers to an observation within a cluster for which the sum of the distances between it and all the other members of the cluster is a minimum. pam requires that you know the number of clusters that you want (like k-means clustering), but it does more computation than k-means in order to insure that the medoids it finds are truly representative of the observations within a given cluster. Recall that in the k-means method the centers of the clusters (which might or might not actually correspond to a particular observation) are only recaculated after all of the observations have had a chance to move from one cluster to another. With pam, the sums of the distances between objects within a cluster are constantly recalculated as observations move around, which will hopefully provide a more reliable solution. Furthermore, as a by-product of the clustering operation it identifies the observations that represent the medoids, and these observations (one per cluster) can be considered a representative example of the members of that cluster which may be useful in some situations. pam does require that the entire distance matrix is calculated to facilitate the recalculation of the medoids, and it does involve considerably more computation than k-means, but with modern computers this may not be a important consideration. As with k-means, there's no guarantee that the structure that's revealed with a small number of clusters will be retained when you increase the number of clusters.
Another class of clustering methods, known as hierarchical agglomerative clustering methods, starts out by putting each observation into its own separate cluster. It then examines all the distances between all the observations and pairs together the two closest ones to form a new cluster. This is a simple operation, since hierarchical methods require a distance matrix, and it represents exactly what we want - the distances between individual observations. So finding the first cluster to form simply means looking for the smallest number in the distance matrix and joining the two observations that the distance correspnds to into a new cluster. Now there is one less cluster than there are observations. To determine which observations will form the next cluster, we need to come up with a method for finding the distance between an existing cluster and individual observations, since once a cluster has been formed, we'll determine which observation will join it based on the distance between the cluster and the observation. Some of the methods that have been proposed to do this are to take the minimum distance between an observation and any member of the cluster, to take the maximum distance, to take the average distance, or to use some kind of measure that minimizes the distances between observations within the cluster. Each of these methods will reveal certain types of structure within the data. Using the minimum tends to find clusters that are drawn out and "snake"-like, while using the maximum tends to find compact clusters. Using the mean is a compromise between those methods. One method that tends to produce clusters of more equal size is known as Ward's method. It attempts to form clusters keeping the distances within the clusters as small as possible, and is often useful when the other methods find clusters with only a few observations. Agglomerative Hierarchical cluster analysis is provided in R through the hclust function.
Notice that, by its very nature, solutions with many clusters are nested within the solutions that have fewer clusters, so observations don't "jump ship" as they do in k-means or the pam methods. Furthermore, we don't need to tell these procedures how many clusters we want - we get a complete set of solutions starting from the trivial case of each observation in a separate cluster all the way to the other trivial case where we say all the observations are in a single cluster.
Traditionally, hierarchical cluster analysis has taken computational shortcuts when updating the distance matrix to reflect new clusters. In particular, when a new cluster is formed and the distance matrix is updated, all the information about the individual members of the cluster is discarded in order to make the computations faster. The cluster library provides the agnes function which uses essentially the same technique as hclust, but which uses fewer shortcuts when updating the distance matrix. For example, when the mean method of calculating the distance between observations and clusters is used, hclust only uses the two observations and/or clusters which were recently merged when updating the distance matrix, while agnes calculates those distances as the average of all the distances between all the observations in the two clusters. While the two techniques will usually agree quite closely when minimum or maximum updating methods are used, there may be noticeable differences when updating using the average distance or Ward's method.

2  Hierarchial Clustering

For the hierarchial clustering methods, the dendogram is the main graphical tool for getting insight into a cluster solution. When you use hclust or agnes to perform a cluster analysis, you can see the dendogram by passing the result of the clustering to the plot function.
To illustrate interpretation of the dendogram, we'll look at a cluster analysis performed on a set of cars from 1978-1979; the data can be found at Since the data is a tab-delimited file, we use read.delim:
> cars = read.delim('',stringsAsFactors=FALSE)

To get an idea of what information we have, let's look at the first few records;
> head(cars) 
  Country                       Car  MPG Weight Drive_Ratio Horsepower
1    U.S.        Buick Estate Wagon 16.9  4.360        2.73        155
2    U.S. Ford Country Squire Wagon 15.5  4.054        2.26        142
3    U.S.        Chevy Malibu Wagon 19.2  3.605        2.56        125
4    U.S.    Chrysler LeBaron Wagon 18.5  3.940        2.45        150
5    U.S.                  Chevette 30.0  2.155        3.70         68
6   Japan             Toyota Corona 27.5  2.560        3.05         95
  Displacement Cylinders
1          350         8
2          351         8
3          267         8
4          360         8
5           98         4
6          134         4

It looks like the variables are measured on different scales, so we will likely want to standardize the data before proceeding. The daisy function in the cluster library will automatically perform standardization, but it doesn't give you complete control. If you have a particular method of standardization in mind, you can use the scale function. You pass scale a matrix or data frame to be standardized, and two optional vectors. The first, called center, is a vector of values, one for each column of the matrix or data frame to be standardized, which will be subtracted from every entry in that column. The second, called scale, is similar to center, but is used to divide the values in each column. Thus, to get z-scores, you could pass scale a vector of means for center, and a vector of standard deviations for scale. These vectors can be created with the apply function, that performs the same operation on each row or column of a matrix. Suppose we want to standardize by subtracting the median and dividing by the mean average deviation:
> cars.use = cars[,-c(1,2)]
> medians = apply(cars.use,2,median)
> mads = apply(cars.use,2,mad)
> cars.use = scale(cars.use,center=medians,scale=mads)

(The 2 used as the second argument to apply means to apply the function to the columns of the matrix or data frame; a value of 1 means to use the rows.) The country of origin and name of the car will not be useful in the cluster analysis, so they have been removed. Notice that the scale function doesn't change the order of the rows of the data frame, so it will be easy to identify observations using the omitted columns from the original data.
First, we'll take a look at a hierarchical method, since it will provide information about solutions with different numbers of clusters. The first step is calculating a distance matrix. For a data set with n observations, the distance matrix will have n rows and n columns; the (i,j)th element of the distance matrix will be the difference between observation i and observation j. There are two functions that can be used to calculate distance matrices in R; the dist function, which is included in every version of R, and the daisy function, which is part of the cluster library. We'll use the dist function in this example, but you should familiarize yourself with the daisy function (by reading its help page), since it offers some capabilities that dist does not. Each function provides a choice of distance metrics; in this example, we'll use the default of Euclidean distance, but you may find that using other metrics will give different insights into the structure of your data.
cars.dist = dist(cars.use)

If you display the distance matrix in R (for example, by typing its name), you'll notice that only the lower triangle of the matrix is displayed. This is to remind us that the distance matrix is symmetric, since it doesn't matter which observation we consider first when we calculate a distance. R takes advantage of this fact by only storing the lower triangle of the distance matrix. All of the clustering functions will recognize this and have no problems, but if you try to access the distance matrix in the usual way (for example, with subscripting), you'll see an error message. Thus, if you need to use the distance matrix with anything other than the clustering functions, you'll need to use as.matrix to convert it to a regular matrix.
To get started, we'll use the hclust method; the cluster library provides a similar function, called agnes to perform hierarchical cluster analysis.
> cars.hclust = hclust(cars.dist)

Once again, we're using the default method of hclust, which is to update the distance matrix using what R calls "complete" linkage. Using this method, when a cluster is formed, its distance to other objects is computed as the maximum distance between any object in the cluster and the other object. Other linkage methods will provide different solutions, and should not be ignored. For example, using method=ward tends to produce clusters of fairly equal size, and can be useful when other methods find clusters that contain just a few observations.
Now that we've got a cluster solution (actually a collection of cluster solutions), how can we examine the results? The main graphical tool for looking at a hierarchical cluster solution is known as a dendogram. This is a tree-like display that lists the objects which are clustered along the x-axis, and the distance at which the cluster was formed along the y-axis. (Distances along the x-axis are meaningless in a dendogram; the observations are equally spaced to make the dendogram easier to read.) To create a dendogram from a cluster solution, simply pass it to the plot function. The result is displayed below.
plot(cars.hclust,labels=cars$Car,main='Default from hclust')

If you choose any height along the y-axis of the dendogram, and move across the dendogram counting the number of lines that you cross, each line represents a group that was identified when objects were joined together into clusters. The observations in that group are represented by the branches of the dendogram that spread out below the line. For example, if we look at a height of 6, and move across the x-axis at that height, we'll cross two lines. That defines a two-cluster solution; by following the line down through all its branches, we can see the names of the cars that are included in these two clusters. Since the y-axis represents how close together observations were when they were merged into clusters, clusters whose branches are very close together (in terms of the heights at which they were merged) probably aren't very reliable. But if there's a big difference along the y-axis between the last merged cluster and the currently merged one, that indicates that the clusters formed are probably doing a good job in showing us the structure of the data. Looking at the dendogram for the car data, there are clearly two very distinct groups; the right hand group seems to consist of two more distinct cluster, while most of the observations in the left hand group are clustering together at about the same height For this data set, it looks like either two or three groups might be an interesting place to start investigating. This is not to imply that looking at solutions with more clusters would be meaningless, but the data seems to suggest that two or three clusters might be a good start. For a problem of this size, we can see the names of the cars, so we could start interpreting the results immediately from the dendogram, but when there are larger numbers of observations, this won't be possible.
One of the first things we can look at is how many cars are in each of the groups. We'd like to do this for both the two cluster and three cluster solutions. You can create a vector showing the cluster membership of each observation by using the cutree function. Since the object returned by a hierarchical cluster analysis contains information about solutions with different numbers of clusters, we pass the cutree function the cluster object and the number of clusters we're interested in. So to get cluster memberships for the three cluster solution, we could use:
> groups.3 = cutree(cars.hclust,3)

Simply displaying the group memberships isn't that revealing. A good first step is to use the table function to see how many observations are in each cluster. We'd like a solution where there aren't too many clusters with just a few observations, because it may make it difficult to interpret our results. For the three cluster solution, the distribution among the clusters looks good:
> table(groups.3)
 1  2  3
 8 20 10

Notice that you can get this information for many different groupings at once by combining the calls to cutree and table in a call to sapply. For example, to see the sizes of the clusters for solutions ranging from 2 to 6 clusters, we could use:
> counts = sapply(2:6,function(ncl)table(cutree(cars.hclust,ncl)))
> names(counts) = 2:6
> counts

 1  2
18 20


 1  2  3
 8 20 10


 1  2  3  4
 8 17  3 10


 1  2  3  4  5
 8 11  6  3 10


 1  2  3  4  5  6
 8 11  6  3  5  5

To see which cars are in which clusters, we can use subscripting on the vector of car names to choose just the observations from a particular cluster. Since we used all of the observations in the data set to form the distance matrix, the ordering of the names in the original data will coincide with the values returned by cutree. If observations were removed from the data before the distance matrix is computed, it's important to remember to make the same deletions in the vector from the original data set that will be used to identify observations. So, to see which cars were in the first cluster for the four cluster solution, we can use:
> cars$Car[groups.3 == 1]
[1] Buick Estate Wagon        Ford Country Squire Wagon
[3] Chevy Malibu Wagon        Chrysler LeBaron Wagon
[5] Chevy Caprice Classic     Ford LTD
[7] Mercury Grand Marquis     Dodge St Regis

As usual, if we want to do the same thing for all the groups at once, we can use sapply:
> sapply(unique(groups.3),function(g)cars$Car[groups.3 == g])
[1] Buick Estate Wagon        Ford Country Squire Wagon
[3] Chevy Malibu Wagon        Chrysler LeBaron Wagon
[5] Chevy Caprice Classic     Ford LTD
[7] Mercury Grand Marquis     Dodge St Regis

 [1] Chevette         Toyota Corona    Datsun 510       Dodge Omni
 [5] Audi 5000        Saab 99 GLE      Ford Mustang 4   Mazda GLC
 [9] Dodge Colt       AMC Spirit       VW Scirocco      Honda Accord LX
[13] Buick Skylark    Pontiac Phoenix  Plymouth Horizon Datsun 210
[17] Fiat Strada      VW Dasher        BMW 320i         VW Rabbit

 [1] Volvo 240 GL          Peugeot 694 SL        Buick Century Special
 [4] Mercury Zephyr        Dodge Aspen           AMC Concord D/L
 [7] Ford Mustang Ghia     Chevy Citation        Olds Omega
[10] Datsun 810

We could also see what happens when we use the four cluster solution
> groups.4 = cutree(cars.hclust,4)
> sapply(unique(groups.4),function(g)cars$Car[groups.4 == g])
[1] Buick Estate Wagon        Ford Country Squire Wagon
[3] Chevy Malibu Wagon        Chrysler LeBaron Wagon
[5] Chevy Caprice Classic     Ford LTD
[7] Mercury Grand Marquis     Dodge St Regis

 [1] Chevette         Toyota Corona    Datsun 510       Dodge Omni
 [5] Ford Mustang 4   Mazda GLC        Dodge Colt       AMC Spirit
 [9] VW Scirocco      Honda Accord LX  Buick Skylark    Pontiac Phoenix
[13] Plymouth Horizon Datsun 210       Fiat Strada      VW Dasher
[17] VW Rabbit

[1] Audi 5000   Saab 99 GLE BMW 320i

 [1] Volvo 240 GL          Peugeot 694 SL        Buick Century Special
 [4] Mercury Zephyr        Dodge Aspen           AMC Concord D/L
 [7] Ford Mustang Ghia     Chevy Citation        Olds Omega
[10] Datsun 810

The new cluster can be recognized as the third group in the above output.
Often there is an auxiliary variable in the original data set that was not included in the cluster analysis, but may be of interest. In fact, cluster analysis is sometimes performed to see if observations naturally group themselves in accord with some already measured variable. For this data set, we could ask whether the clusters reflect the country of origin of the cars, stored in the variable Country in the original data set. The table function can be used, this time passing two arguments, to produce a cross-tabulation of cluster group membership and country of origin:
> table(groups.3,cars$Country)

groups.3 France Germany Italy Japan Sweden U.S.
       1      0       0     0     0      0    8
       2      0       5     1     6      1    7
       3      1       0     0     1      1    7

Of interest is the fact that all of the cars in cluster 1 were manufactured in the US. Considering the state of the automobile industry in 1978, and the cars that were identified in cluster 1, this is not surprising.
In an example like this, with a small number of observations, we can often interpret the cluster solution directly by looking at the labels of the observations that are in each cluster. Of course, for larger data sets, this will be impossible or meaningless. A very useful method for characterizing clusters is to look at some sort of summary statistic, like the median, of the variables that were used to perform the cluster analysis, broken down by the groups that the cluster analysis identified. The aggregate function is well suited for this task, since it will perform summaries on many variables simultaneously. Let's look at the median values for the variables we've used in the cluster analysis, broken up by the cluster groups. One oddity of the aggregate function is that it demands that the variable(s) used to divide up the data are passed to it in a list, even if there's only one variable:
> aggregate(cars.use,list(groups.3),median)
  Group.1        MPG     Weight Drive_Ratio Horsepower Displacement  Cylinders
1       1 -0.7945273  1.5051136  -0.9133729  1.0476133    2.4775849  4.7214353
2       2  0.6859228 -0.5870568   0.5269459 -0.6027364   -0.5809970 -0.6744908
3       3 -0.4058377  0.5246039  -0.1686227  0.3587717    0.3272282  2.0234723

If the ranges of these numbers seem strange, it's because we standardized the data before performing the cluster analysis. While it is usually more meaningful to look at the variables in their original scales, when data is centered, negative values mean "lower than most" and positive values mean "higher than most". Thus, group 1 is cars with relatively low MPG, high weight, low drive ratios, high horsepower and displacement, and more than average number of cylinders. Group 2 are cars with high gas mileage, and low weight and horsepower; and group 3 is similar to group 1. It may be easier to understand the groupings if we look at the variables in their original scales:
> aggregate(cars[,-c(1,2)],list(groups.3),median)
  Group.1   MPG Weight Drive_Ratio Horsepower Displacement Cylinders
1       1 17.30  3.890       2.430      136.5          334         8
2       2 30.25  2.215       3.455       79.0          105         4
3       3 20.70  3.105       2.960      112.5          173         6

It may also be useful to add the numbers of observations in each group to the above display. Since aggregate returns a data frame, we can manipulate it in any way we want:
> a3 = aggregate(cars[,-c(1,2)],list(groups.3),median)
> data.frame(Cluster=a3[,1],Freq=as.vector(table(groups.3)),a3[,-1])
  Cluster Freq   MPG Weight Drive_Ratio Horsepower Displacement Cylinders
1       1    8 17.30  3.890       2.430      136.5          334         8
2       2   20 30.25  2.215       3.455       79.0          105         4
3       3   10 20.70  3.105       2.960      112.5          173         6

To see how the four cluster solution differed from the three cluster solution, we can perform the same analysis for that solution:
> a4 = aggregate(cars[,-c(1,2)],list(groups.4),median)
> data.frame(Cluster=a4[,1],Freq=as.vector(table(groups.4)),a4[,-1])
  Cluster Freq  MPG Weight Drive_Ratio Horsepower Displacement Cylinders
1       1    8 17.3  3.890        2.43      136.5          334         8
2       2   17 30.9  2.190        3.37       75.0           98         4
3       3    3 21.5  2.795        3.77      110.0          121         4
4       4   10 20.7  3.105        2.96      112.5          173         6

The main difference seems to be that the four cluster solution recognized a group of cars that have higher horsepower and drive ratios than the other cars in the cluster they came from.

3  PAM: Partitioning Around Medoids

Unlike the hierarchical clustering methods, techniques like k-means cluster analysis (available through the kmeans function) or partitioning around mediods (avaiable through the pam function in the cluster library) require that we specify the number of clusters that will be formed in advance. pam offers some additional diagnostic information about a clustering solution, and provides a nice example of an alternative technique to hierarchical clustering. To use pam, you must first load the cluster library. You can pass pam a data frame or a distance matrix; since we've already formed the distance matrix, we'll use that. pam also needs the number of clusters you wish to form. Let's look at the three cluster solution produced by pam:
> library(cluster)
> cars.pam = pam(cars.dist,3)

First of all, let's see if the pam solution agrees with the hclust solution. Since pam only looks at one cluster solution at a time, we don't need to use the cutree function as we did with hclust; the cluster memberships are stored in the clustering component of the pam object; like most R objects, you can use the names function to see what else is available. Further information can be found in the help page for pam.object.
> names(cars.pam)
[1] "medoids"    ""     "clustering" "objective"  "isolation"
[6] "clusinfo"   "silinfo"    "diss"       "call"

We can use table to compare the results of the hclust and pam solutions:
> table(groups.3,cars.pam$clustering)
groups.3  1  2  3
       1  8  0  0
       2  0 19  1
       3  0  0 10

The solutions seem to agree, except for 1 observations that hclust put in group 2 and pam put in group 3. Which observations was it?
> cars$Car[groups.3 != cars.pam$clustering]
[1] Audi 5000

Notice how easy it is to get information like this due to the power of R's subscripting operations.
One novel feature of pam is that it finds observations from the original data that are typical of each cluster in the sense that they are closest to the center of the cluster. The indexes of the medoids are stored in the component of the pam object, so we can use that component as a subscript into the vector of car names to see which ones were selected:
> cars$Car[cars.pam$]
> cars$Car[cars.pam$]
[1] Dodge St Regis    Dodge Omni        Ford Mustang Ghia

Another feature available with pam is a plot known as a silhouette plot. First, a measure is calculated for each observation to see how well it fits into the cluster that it's been assigned to. This is done by comparing how close the object is to other objects in its own cluster with how close it is to objects in other clusters. (A complete description can be found in the help page for silhouette.) Values near one mean that the observation is well placed in its cluster; values near 0 mean that it's likely that an observation might really belong in some other cluster. Within each cluster, the value for this measure is displayed from smallest to largest. If the silhouette plot shows values close to one for each observation, the fit was good; if there are many observations closer to zero, it's an indication that the fit was not good. The sihouette plot is very useful in locating groups in a cluster analysis that may not be doing a good job; in turn this information can be used to help select the proper number of clusters. For the current example, here's the silhouette plot for the three cluster pam solution, produced by the command
> plot(cars.pam)

The plot indicates that there is a good structure to the clusters, with most observations seeming to belong to the cluster that they're in. There is a summary measure at the bottom of the plot labeled "Average Silhouette Width". This table shows how to use the value:
Range of SCInterpretation
0.71-1.0A strong structure has been found
0.51-0.70A reasonable structure has been found
0.26-0.50The structure is weak and could be artificial
< 0.25No substantial structure has been found
To create a silhouette plot for a particular solution derived from a hierarchical cluster analysis, the silhouette function can be used. This function takes the appropriate output from cutree along with the distance matrix used for the clustering. So to produce a silhouette plot for our 4 group hierarchical cluster (not shown), we could use the following statements:

4  AGNES: Agglomerative Nesting

As an example of using the agnes function from the cluster package, consider the famous Fisher iris data, available as the dataframe iris in R. First let's look at some of the data:
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

We will only consider the numeric variables in the cluster analysis. As mentioned previously, there are two functions to compute the distance matrix: dist and daisy. It should be mentioned that for data that's all numeric, using the function's defaults, the two methods will give the same answers. We can demonstrate this as follows:
> iris.use = subset(iris,select=-Species)
> d = dist(iris.use)
> library(cluster)
> d1 = daisy(iris.use)
> sum(abs(d - d1))
[1] 1.072170e-12

Of course, if we choose a non-default metric for dist, the answers will be different:
> dd = dist(iris.use,method='manhattan')
> sum(abs(as.matrix(dd) - as.matrix(d1)))
[1] 38773.86

The values are very different!
Continuing with the cluster example, we can calculate the cluster solution as follows:
> z = agnes(d)

The plotting method for agnes objects presents two different views of the cluster solution. When we plot such an object, the plotting function sets the graphics parameter ask=TRUE, and the following appears in your R session each time a plot is to be drawn:
Hit <Return> to see next plot: 

If you know you want a particular plot, you can pass the which.plots= argument an integer telling which plot you want.
The first plot that is displayed is known as a banner plot. The banner plot for the iris data is shown below:
The white area on the left of the banner plot represents the unclustered data while the white lines that stick into the red are show the heights at which the clusters were formed. Since we don't want to include too many clusters that joined together at similar heights, it looks like three clusters, at a height of about 2 is a good solution. It's clear from the banner plot that if we lowered the height to, say 1.5, we'd create a fourth cluster with only a few observations.
The banner plot is just an alternative to the dendogram, which is the second plot that's produced from an agnes object:
The dendogram shows the same relationships, and it's a matter of individual preference as to which one is easier to use.
Let's see how well the clusters do in grouping the irises by species:
> table(cutree(z,3),iris$Species)
    setosa versicolor virginica
  1     50          0         0
  2      0         50        14
  3      0          0        36

We were able to classify all the setosa and versicolor varieties correctly. The following plot gives some insight into why we were so successful:
> splom(~iris,groups=iris$Species,auto.key=TRUE)

File translated from TEX by TTH, version 3.67.
On 16 Mar 2011, 15:27.