Cluster Analysis

1  Introduction to Cluster Analysis

While we often think of statistics as giving definitive answers to well-posed questions, some statistical techniques are used simply to gain further insight into a group of observations. One such technique (which encompasses many different methods) is cluster analysis. The idea of cluster analysis is that we have a set of observations, on which we have several measurements available. Using these measurements, we want to find out whether the observations naturally group together in some predictable way. For example, we may have recorded physical measurements on many animals, and we want to know if there's a natural grouping (based, perhaps, on species) that distinguishes the animals from one another. (This use of cluster analysis is sometimes called "numerical taxonomy".) As another example, suppose we have information on the demographics and buying habits of many consumers. We could use cluster analysis on the data to see if there are distinct groups of consumers with similar demographics and buying habits (market segmentation).
It's important to remember that cluster analysis isn't about finding the right answer - it's about finding ways to look at data that allow us to understand the data better. For example, suppose we have a deck of playing cards, and we want to see if they form some natural groupings. One person may separate the black cards from the red; another may break the cards up into hearts, clubs, diamonds and spades; a third person might separate cards with pictures from cards with no pictures, and a fourth might make one pile of aces, one of twos, and so on. Each person is right in their own way, but in cluster analysis, there's really not a single "correct" answer.
Another aspect of cluster analysis is that there are an enormous number of possible ways of dividing a set of observations into groups. Even if we specify the number of groups, the number of possibilities is still enormous. For example, consider the task of dividing 25 observations into 5 groups. (25 observations is considered very small in the world of cluster analysis.) It turns out there are about 2.4 × 10^15 different ways to arrange those observations into 5 groups. If, as is often the case, we don't know the number of groups ahead of time, and we need to consider all possible numbers of groups (from 1 to 25), the number grows to more than 4 × 10^18. So any technique that simply tries all the different possibilities is doomed to failure.
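These counts come from a standard combinatorial formula: the number of ways to split n observations into k non-empty groups is the Stirling number of the second kind, and summing over every possible k gives the Bell number. A quick sketch in R reproducing the figures above (the function name stirling2 is just for illustration, and the huge counts are floating-point approximations):
> # S(n,k): number of ways to split n observations into k non-empty groups
> stirling2 <- function(n, k) {
+     j <- 0:k
+     sum((-1)^j * choose(k, j) * (k - j)^n) / factorial(k)
+ }
> stirling2(25, 5)                                 # exactly 5 groups
[1] 2.436685e+15
> # summing over every possible number of groups gives the Bell number
> sum(sapply(1:25, function(k) stirling2(25, k)))
[1] 4.63859e+18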

2  Standardization

There are two very important decisions that need to be made whenever you are carrying out a cluster analysis. The first regards the relative scales of the variables being measured. We'll see that the available cluster analysis algorithms all depend on the concept of measuring the distance (or some other measure of similarity) between the different observations we're trying to cluster. If one of the variables is measured on a much larger scale than the other variables, then whatever measure we use will be overly influenced by that variable. For example, recall the world data set that we used earlier in the semester. Here's a quick summary of the mean values of the variables in that data set:
> apply(world1[-c(1,6)],2,mean,na.rm=TRUE)
         gdp       income     literacy     military
9.053595e+03 1.025796e+04 8.094902e+01 5.679061e+09

Without some sort of standardization, a variable like literacy, which is measured on a scale of 0 to 100, has no chance of influencing our solution when the other variables take values that are so much larger.
The traditional way of standardizing variables is to subtract their mean and divide by their standard deviation. Variables standardized this way are sometimes referred to as z-scores, and always have a mean of zero and a variance of one. In the case of variables that contain outliers (observations that are much bigger or smaller than the vast majority of the data), this sort of standardization may be too severe, scaling down the outlying observations so that they appear to be closer to the others. One alternative is to use the median absolute deviation in place of the standard deviation; another possibility is to subtract the median and divide by either the interquartile range or the median absolute deviation. For the common methods of measuring distances (discussed below), centering the data by subtracting the mean or median is not really critical; it's the division by an appropriate scaling factor that's important.
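In R, the scale() function performs the traditional standardization, and its center and scale arguments accommodate the robust alternatives as well. Here's a sketch using the world1 variables from above (the names vars, zvars and rvars are just for illustration):
> vars <- world1[-c(1,6)]                  # the numeric variables
> # classical z-scores: subtract the mean, divide by the standard deviation
> zvars <- scale(vars)
> # a more outlier-resistant version: subtract the median, divide by the MAD
> rvars <- scale(vars,
+               center = apply(vars, 2, median, na.rm=TRUE),
+               scale  = apply(vars, 2, mad, na.rm=TRUE))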

3  Distance Measures

The most common distance measure, and the default for most programs that perform cluster analysis, is the Euclidean distance, which is an extension of the usual notion of the distance between two points in a plane. The Euclidean distance between two observations is calculated as the square root of the sum of the squared differences between corresponding variables in the two observations being considered. Another widely used measure is the Manhattan distance, so named because it resembles the distance between two points in a city, where you can only travel along a grid of streets. It's calculated by adding up the absolute values of the differences of the corresponding variables, and is less likely to be influenced by a very large difference in just one of the variables. The Canberra distance is interesting in that it performs its own standardization: absolute values of differences are divided by the absolute value of the sum of the corresponding variables in the two observations. Depending on the values and distributions of the variables in the data set being clustered, these different distance measures may point out different aspects of the structure of the data set.
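In R, all three measures are available through the dist() function (method = "euclidean" is the default). A sketch with two made-up observations, spelling out each formula alongside an equivalent dist() call:
> x <- c(1, 2, 3); y <- c(4, 6, 8)
> sqrt(sum((x - y)^2))                      # Euclidean
[1] 7.071068
> sum(abs(x - y))                           # Manhattan
[1] 12
> sum(abs(x - y) / abs(x + y))              # Canberra
[1] 1.554545
> dist(rbind(x, y), method = "canberra")    # the same, via dist()
         x
y 1.554545
Note that dist() computes the distances between the rows of its argument.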
Special consideration needs to be given to binary variables, that is, variables that take on only one of two values, like TRUE or FALSE, especially when they are used in conjunction with continuous variables. Generally there are two types of measures used with binary data. Symmetric measures view two observations as being close together if the binary feature is either absent in both or present in both, while asymmetric measures only view the observations as being close if the feature is present for both.
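R's dist() provides an asymmetric binary measure (method = "binary"), which ignores positions where the feature is absent in both observations; a simple symmetric counterpart is the matching distance, the proportion of positions that disagree. A sketch:
> a <- c(1, 1, 0, 0)
> b <- c(1, 0, 0, 0)
> dist(rbind(a, b), method = "binary")      # asymmetric: joint absences ignored
    a
b 0.5
> mean(a != b)                              # symmetric: simple matching distance
[1] 0.25
For data sets that mix binary and continuous variables, the daisy() function in the cluster package computes Gower's dissimilarity and lets you declare each binary variable as symmetric or asymmetric.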
For some clustering methods, the entire distance matrix must be calculated; for other methods, distances are only calculated as needed.
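The full matrix can get large quickly: for n observations it holds n(n-1)/2 pairwise distances, so 10,000 observations would already require nearly fifty million of them. A quick illustration with 25 made-up observations:
> # 25 observations yield 25*24/2 = 300 pairwise distances
> length(dist(matrix(rnorm(100), ncol = 4)))
[1] 300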


