Cluster Analysis
1 Database Tables and Factors
A different approach illustrates the relationship between factors and the
principles of normalization. Since a factor stores each of its levels only
once, normalized tables can often be used to form factors in R. Once again,
let's assume that we've loaded the RMySQL library and obtained a
connection object. This time, I'll trim off the unneeded variables from
the track data frame before processing:
> album = dbGetQuery(con,statement="select * from Album")
> track = dbGetQuery(con,statement="select * from Track")
> artist = dbGetQuery(con,statement="select * from Artist")
> track = track[,c('aid','alid','length')]
> names(track) = c('artist','album','length')
> track$artist = factor(track$artist,levels=artist$aid,labels=artist$name)
> track$album = factor(track$album,levels=album$alid,labels=album$title)
The tables formed in the process of normalizing the database essentially
contain the levels and labels of a factor variable in R.
Processing the data set is the same as in the previous example:
> track$albart = paste(track$album,track$artist,sep='\t')
> tot = aggregate(track$length,track['albart'],sum)
> aa = strsplit(tot$albart,'\t')
> tot$album = sapply(aa,'[',1)
> tot$artist = sapply(aa,'[',2)
> tot$albart = NULL
> use = tot[tot$x > 2400 & tot$x <= 2700,]
> use = use[order(use$x,decreasing=TRUE),]
> head(use)
x album artist
1996 2700 I Need Some Money Eddie Harris
4846 2700 We'll Be Together Again Joe Pass & J. J. Johnson
718 2699 Blues Groove Tiny Grimes & Coleman Hawkins
3239 2699 Pinky & Zoot Pinky Winters
1609 2698 Gerard Pochonet et son Quartette Lucky Thompson
3146 2698 Out Front! Jaki Byard
In practice, the most useful solutions would probably fall somewhere between the
extremes of doing everything in the database and doing everything in R.
2 Using SQL in R
If you like using SQL, or you see a handy way to get something done
with it, the sqldf package lets you operate on R
data frames using SQL.
When you pass a query to the sqldf function, it
will identify the data frames involved in the query, create a temporary
SQLite database, perform the query, and return the results.
For example, consider the task of merging two data frames, based
on the value of a variable id:
> data1 = data.frame(id=c(1,4,2,3,5),x=c(7,12,19,15,9))
> data2 = data.frame(id=c(2,3,4,1,5),y=c(21,32,29,35,19))
> library(sqldf)
> sqldf('select * from data1 natural join data2')
id x y
1 1 7 35
2 4 12 29
3 2 19 21
4 3 15 32
5 5 9 19
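For comparison, the same result can be obtained with R's built-in merge
function; the rows may come back in a different order, but the contents
should match (a quick sketch, output omitted):
> merged = merge(data1,data2,by='id')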
3 Reading Spreadsheets with the RODBC Library
Now that we've got the basics of SQL down, we can look at a
third option for reading spreadsheets into R, the RODBC
library. Note that this functionality is only available on Windows,
but you don't need a copy of Excel on your computer in order for it
to work. The ODBC interface provides access to spreadsheets using
SQL as a query language. Here's a sample session showing how we
could read the spreadsheet from the previous example using
RODBC:
> library(RODBC)
> con = odbcConnectExcel('2_2000_top200_postdoc.xls')
> tbls = sqlTables(con)
> tbls
TABLE_CAT TABLE_SCHEM
1 C:\\cygwin\\home\\spector\\2_2000_top200_postdoc <NA>
2 C:\\cygwin\\home\\spector\\2_2000_top200_postdoc <NA>
TABLE_NAME TABLE_TYPE REMARKS
1 'Postdocs -Top 200 $' TABLE <NA>
2 'Postdocs -Top 200 $'Print_Titles TABLE <NA>
> qry = paste('select * from ','[',tbls$TABLE_NAME[1],']',sep='')
> postdocs = sqlQuery(con,qry,stringsAsFactors=FALSE)
> head(postdocs)
F1 The Top 200 Institutions--Postdoctoral Appointees_(2005) F3 F4 F5 F6
1 NA <NA> NA NA NA NA
2 NA Top 50 Institutions\nin Postdoctoral Appointees \n(2005) NA NA NA NA
3 NA Harvard University 4384 1 1 NA
4 NA Johns Hopkins University 1442 2 2 NA
5 NA Stanford University 1259 3 3 NA
6 NA University of California - Los Angeles 1094 4 1 NA
> postdocs = postdocs[3:nrow(postdocs),c(2,3,4,5,7)]
> postdocs = postdocs[postdocs$F7 %in% c('Public','Private'),]
> dim(postdocs)
[1] 203 5
> sapply(postdocs,class)
The Top 200 Institutions--Postdoctoral Appointees_(2005)
"character"
F3
"numeric"
F4
"numeric"
F5
"numeric"
F7
"character"
> names(postdocs) = c('Institution','NPostdocs','Rank','ControlRank','Control')
The RODBC interface was able to correctly classify the numeric columns.
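Once we're done with the spreadsheet, it's good practice to close the
ODBC connection; we might also convert the Control column to a factor for
later use. A quick sketch (output omitted):
> postdocs$Control = factor(postdocs$Control)
> odbcClose(con)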
4 Introduction to Cluster Analysis
While we often think of statistics as giving definitive answers to
well-posed questions, there are some statistical techniques that are used
simply to gain further insight into a group of observations. One such
technique (which encompasses lots of different methods) is cluster analysis.
The idea of cluster analysis is that we have a set of observations, on which
we have available several measurements. Using these measurements, we want to
find out if the observations naturally group together in some predictable way.
For example, we may have recorded physical measurements on many animals, and
we want to know if there's a natural grouping (based, perhaps, on species) that
distinguishes the animals from one another. (This use of cluster analysis is
sometimes called "numerical taxonomy"). As another example, suppose we have
information on the demographics and buying habits of many consumers. We could
use cluster analysis on the data to see if there are distinct groups of
consumers with similar demographics and buying habits (market segmentation).
It's important to remember that cluster analysis isn't about finding the right
answer - it's about finding ways to look at data that allow us to understand the
data better. For example, suppose we have a deck of playing cards, and we want to
see if they form some natural groupings. One person may separate the black cards
from the red; another may break the cards up into hearts, clubs, diamonds and spades;
a third person might separate cards with pictures from cards with no pictures, and
a fourth might make one pile of aces, one of twos, and so on. Each
person is right in their own way, but in cluster analysis, there's really not a single
"correct" answer.
Another aspect of cluster analysis is that there are an enormous number of possible
ways of dividing a set of observations into groups. Even if we specify the number
of groups, the number of possibilities is still enormous. For example, consider the
task of dividing 25 observations into 5 groups. (25 observations is considered
very small in the world of cluster analysis). It turns out there are 2.4*10^15
different ways to arrange those observations into 5 groups. If, as is often the
case, we don't know the number of groups ahead of time, and we need to consider
all possible numbers of groups (from 1 to 25), the number is more than 4*10^18!
So any technique that simply tries all the different possibilities is doomed to
failure.
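These counts can be checked directly: the number of ways to split n
observations into exactly k non-empty groups is a Stirling number of the
second kind, and summing over all possible k gives the total. Here's a
small sketch in R (the function name stirling2 is just made up for this
example):
> # ways to split n objects into exactly k non-empty groups
> stirling2 = function(n,k) sum((-1)^(0:k) * choose(k,0:k) * (k - (0:k))^n) / factorial(k)
> stirling2(25,5)                                  # roughly 2.4*10^15
> sum(sapply(1:25,function(k) stirling2(25,k)))    # roughly 4.6*10^18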
5 Standardization
There are two very important decisions that need to be made whenever you are carrying
out a cluster analysis. The first regards the relative scales of the variables
being measured. We'll see that the available cluster analysis algorithms all depend
on the concept of measuring the distance (or some other measure of similarity)
between the different observations we're trying to cluster. If one of the variables
is measured on a much larger scale than the other variables, then whatever measure
we use will be overly influenced by that variable. For example, recall the
world data set that we used earlier in the semester. Here's a quick
summary of the mean values of the variables in that data set:
> apply(world1[-c(1,6)],2,mean,na.rm=TRUE)
gdp income literacy military
9.053595e+03 1.025796e+04 8.094902e+01 5.679061e+09
Without some sort of standardization, a variable like literacy, measured
on a scale of 0 to 100, has no chance of influencing our solution when the other variables
are measured on such larger scales.
The traditional way of standardizing variables is to subtract their mean, and divide
by their standard deviation. Variables standardized this way are sometimes referred to
as z-scores, and always have a mean of zero and variance of one. In the case of variables
that contain outliers (observations that are much bigger or smaller than the vast majority
of the data), this sort of standardization may be too severe, scaling down the
outlying observations so that they appear to be closer to the others.
One alternative is to use
the mean absolute deviation in place of the standard deviation; another possibility is to
subtract the median and divide by either the interquartile range
or the mean absolute deviation. For the common
methods of measuring distances (discussed below), centering the data by subtracting the
mean or median is not really critical; it's the division by an appropriate scaling factor
that's important.
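As a concrete example, here's a minimal sketch of both approaches for the
world data set from above (assuming world1 is still loaded; the object
names are just for illustration). The built-in scale function performs the
classical z-score standardization, and the second version uses the median
and interquartile range instead:
> # classical z-scores: subtract the mean, divide by the standard deviation
> world.std = scale(world1[-c(1,6)])
> # a more outlier-resistant alternative: subtract the median, divide by the IQR
> rstd = function(x) (x - median(x,na.rm=TRUE)) / IQR(x,na.rm=TRUE)
> world.rstd = sapply(world1[-c(1,6)],rstd)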
6 Distance Measures
The most common distance measure, and the default for most programs that perform cluster
analysis, is the Euclidean distance, which is an extension of the usual notion of the distance
between two points in a plane. The Euclidean distance between two observations is calculated
as the square root of the sum of the squares of the distances between corresponding variables
in the two observations being considered. Another widely used measure is the Manhattan distance,
so named because it is similar to the distance between two points in a city, where you can only
travel along a grid of streets. It's calculated by adding up the absolute value of the
differences of the corresponding variables, and is less likely to be influenced by a very
large difference between just one of the variables. The Canberra distance is interesting in
that it performs its own standardization; absolute values of differences are divided by
the absolute value of the sum of the corresponding variables in the two observations.
Depending on the values and distributions of the variables in the data set being clustered,
these different distance measures may point out different aspects of the structure of the
data set.
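All three of these measures are available in R through the dist function;
a brief sketch, using the standardized world.std matrix from the sketch in
the previous section:
> d.euc = dist(world.std,method='euclidean')        # the default method
> d.man = dist(world.std,method='manhattan')
> # the Canberra distance does its own standardization, so use the raw data
> d.can = dist(world1[-c(1,6)],method='canberra')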
Special consideration needs to be given to binary variables, that is, variables that take
on only one of two values like TRUE or FALSE, especially when they are used in conjunction
with continuous variables. Generally there are two types of measures that are used with
binary data. Symmetric measures view two observations as being close together if the binary
feature is either absent in both or present in both, while asymmetric measures only
view the observations as being close if the feature is present for both.
For some clustering methods, the entire distance matrix must be calculated; for other
methods, distances are only calculated as needed.
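One way to handle a mix of continuous and binary variables in R is the
daisy function from the recommended cluster package, which computes
dissimilarities using Gower's coefficient and lets you declare which binary
columns should be treated as asymmetric. Here's a sketch using a
hypothetical data frame mydata with a binary column named flag:
> library(cluster)
> # treat the binary column 'flag' as asymmetric; continuous columns are
> # rescaled and combined automatically
> d = daisy(mydata,type=list(asymm='flag'))
For purely binary data, dist with method='binary' also gives an asymmetric
measure, considering only the variables for which at least one of the two
observations has the feature present.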