Cluster Analysis
1 Database Tables and Factors
A different approach illustrates the relationship between factors and the
principles of normalization. Since a factor stores each of its levels only
once, normalized tables can often be used to form factors in R. Once again,
let's assume that we've loaded the RMySQL library and obtained a
connection object. This time, I'll trim off the unneeded variables from
the track data frame before processing:
> album = dbGetQuery(con,statement="select * from Album")
> track = dbGetQuery(con,statement="select * from Track")
> artist = dbGetQuery(con,statement="select * from Artist")
> track = track[,c('aid','alid','length')]
> names(track) = c('artist','album','length')
> track$artist = factor(track$artist,levels=artist$aid,labels=artist$name)
> track$album = factor(track$album,levels=album$alid,labels=album$title)
The tables formed in the process of normalizing the database essentially
contain the levels and labels of a factor variable in R.
Processing the data set is the same as in the previous example:
> track$albart = paste(track$album,track$artist,sep='\t')
> tot = aggregate(track$length,track['albart'],sum)
> aa = strsplit(tot$albart,'\t')
> tot$album = sapply(aa,'[',1)
> tot$artist = sapply(aa,'[',2)
> tot$albart = NULL
> use = tot[tot$x > 2400 & tot$x <= 2700,]
> use = use[order(use$x,decreasing=TRUE),]
> head(use)
x album artist
1996 2700 I Need Some Money Eddie Harris
4846 2700 We'll Be Together Again Joe Pass & J. J. Johnson
718 2699 Blues Groove Tiny Grimes & Coleman Hawkins
3239 2699 Pinky & Zoot Pinky Winters
1609 2698 Gerard Pochonet et son Quartette Lucky Thompson
3146 2698 Out Front! Jaki Byard
In practice, the most useful solutions would probably fall somewhere between the
extremes of doing everything in the database and doing everything in R.
2 Using SQL in R
If you like using SQL, or you see a handy way to get something done
with it, the sqldf package lets you operate on R
data frames using SQL.
When you pass a query to the sqldf function, it
will identify the data frames involved in the query, create a temporary
SQLite database, perform the query, and return the results.
For example, consider the task of merging two data frames, based
on the value of a variable id:
> data1 = data.frame(id=c(1,4,2,3,5),x=c(7,12,19,15,9))
> data2 = data.frame(id=c(2,3,4,1,5),y=c(21,32,29,35,19))
> library(sqldf)
> sqldf('select * from data1 natural join data2')
id x y
1 1 7 35
2 4 12 29
3 2 19 21
4 3 15 32
5 5 9 19
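For comparison, the same result can be obtained with R's built-in merge
function; the rows may come back in a different order, but the contents
should match (a quick sketch, output omitted):
> merged = merge(data1,data2,by='id')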
3 Reading Spreadsheets with the RODBC Library
Now that we've got the basics of SQL down, we can look at a
third option for reading spreadsheets into R, the RODBC
library. Note that this functionality is only available on Windows,
but you don't need a copy of Excel on your computer in order for it
to work. The ODBC interface provides access to spreadsheets using
SQL as a query language. Here's a sample session showing how we
could read the spreadsheet from the previous example using
RODBC:
> library(RODBC)
> con = odbcConnectExcel('2_2000_top200_postdoc.xls')
> tbls = sqlTables(con)
> tbls
TABLE_CAT TABLE_SCHEM
1 C:\\cygwin\\home\\spector\\2_2000_top200_postdoc <NA>
2 C:\\cygwin\\home\\spector\\2_2000_top200_postdoc <NA>
TABLE_NAME TABLE_TYPE REMARKS
1 'Postdocs -Top 200 $' TABLE <NA>
2 'Postdocs -Top 200 $'Print_Titles TABLE <NA>
> qry = paste('select * from ','[',tbls$TABLE_NAME[1],']',sep='')
> postdocs = sqlQuery(con,qry,stringsAsFactors=FALSE)
> head(postdocs)
F1 The Top 200 Institutions--Postdoctoral Appointees_(2005) F3 F4 F5 F6
1 NA <NA> NA NA NA NA
2 NA Top 50 Institutions\nin Postdoctoral Appointees \n(2005) NA NA NA NA
3 NA Harvard University 4384 1 1 NA
4 NA Johns Hopkins University 1442 2 2 NA
5 NA Stanford University 1259 3 3 NA
6 NA University of California - Los Angeles 1094 4 1 NA
> postdocs = postdocs[3:nrow(postdocs),c(2,3,4,5,7)]
> postdocs = postdocs[postdocs$F7 %in% c('Public','Private'),]
> dim(postdocs)
[1] 203 5
> sapply(postdocs,class)
The Top 200 Institutions--Postdoctoral Appointees_(2005)
"character"
F3
"numeric"
F4
"numeric"
F5
"numeric"
F7
"character"
> names(postdocs) = c('Institution','NPostdocs','Rank','ControlRank','Control')
The RODBC interface was able to correctly classify the numeric columns.
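Once we're done with the spreadsheet, it's good practice to close the
ODBC connection; we might also convert the Control column to a factor for
later use. A quick sketch (output omitted):
> postdocs$Control = factor(postdocs$Control)
> odbcClose(con)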
4 Introduction to Cluster Analysis
While we often think of statistics as giving definitive answers to
well-posed questions, there are some statistical techniques that are used
simply to gain further insight into a group of observations. One such
technique (which encompasses lots of different methods) is cluster analysis.
The idea of cluster analysis is that we have a set of observations, on which
we have available several measurements. Using these measurements, we want to
find out if the observations naturally group together in some predictable way.
For example, we may have recorded physical measurements on many animals, and
we want to know if there's a natural grouping (based, perhaps, on species) that
distinguishes the animals from one another. (This use of cluster analysis is
sometimes called "numerical taxonomy"). As another example, suppose we have
information on the demographics and buying habits of many consumers. We could
use cluster analysis on the data to see if there are distinct groups of
consumers with similar demographics and buying habits (market segmentation).
It's important to remember that cluster analysis isn't about finding the right
answer - it's about finding ways to look at data that allow us to understand the
data better. For example, suppose we have a deck of playing cards, and we want to
see if they form some natural groupings. One person may separate the black cards
from the red; another may break the cards up into hearts, clubs, diamonds and spades;
a third person might separate cards with pictures from cards with no pictures, and
a fourth might make one pile of aces, one of twos, and so on. Each
person is right in their own way, but in cluster analysis, there's really not a single
"correct" answer.
Another aspect of cluster analysis is that there are an enormous number of possible
ways of dividing a set of observations into groups. Even if we specify the number
of groups, the number of possibilities is still enormous. For example, consider the
task of dividing 25 observations into 5 groups. (25 observations is considered
very small in the world of cluster analysis). It turns out there are 2.4*10^15
different ways to arrange those observations into 5 groups. If, as is often the
case, we don't know the number of groups ahead of time, and we need to consider
all possible numbers of groups (from 1 to 25), the number is more than 4*10^18!
So any technique that simply tries all the different possibilities is doomed to
failure.
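These counts can be checked directly: the number of ways to split n
observations into exactly k non-empty groups is a Stirling number of the
second kind, and summing over all possible k gives the total. Here's a
small sketch in R (the function name stirling2 is just made up for this
example):
> # ways to split n objects into exactly k non-empty groups
> stirling2 = function(n,k) sum((-1)^(0:k) * choose(k,0:k) * (k - (0:k))^n) / factorial(k)
> stirling2(25,5)                                  # roughly 2.4*10^15
> sum(sapply(1:25,function(k) stirling2(25,k)))    # roughly 4.6*10^18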
5 Standardization
There are two very important decisions that need to be made whenever you are carrying
out a cluster analysis. The first regards the relative scales of the variables
being measured. We'll see that the available cluster analysis algorithms all depend
on the concept of measuring the distance (or some other measure of similarity)
between the different observations we're trying to cluster. If one of the variables
is measured on a much larger scale than the other variables, then whatever measure
we use will be overly influenced by that variable. For example, recall the
world data set that we used earlier in the semester. Here's a quick
summary of the mean values of the variables in that data set:
> apply(world1[-c(1,6)],2,mean,na.rm=TRUE)
gdp income literacy military
9.053595e+03 1.025796e+04 8.094902e+01 5.679061e+09
Without some sort of standardization, a variable like literacy, measured
on a scale of 0 to 100, has no chance of influencing our solution when the other variables
are measured on such larger scales.
The traditional way of standardizing variables is to subtract their mean, and divide
by their standard deviation. Variables standardized this way are sometimes referred to
as z-scores, and always have a mean of zero and variance of one. In the case of variables
that contain outliers (observations that are much bigger or smaller than the vast majority
of the data), this sort of standardization may be too severe, scaling down the
outlying observations so that they appear to be closer to the others.
One alternative is to use
the mean absolute deviation in place of the standard deviation; another possibility is to
subtract the median and divide by either the interquartile range
or the mean absolute deviation. For the common
methods of measuring distances (discussed below), centering the data by subtracting the
mean or median is not really critical; it's the division by an appropriate scaling factor
that's important.
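As a concrete example, here's a minimal sketch of both approaches for the
world data set from above (assuming world1 is still loaded; the object
names are just for illustration). The built-in scale function performs the
classical z-score standardization, and the second version uses the median
and interquartile range instead:
> # classical z-scores: subtract the mean, divide by the standard deviation
> world.std = scale(world1[-c(1,6)])
> # a more outlier-resistant alternative: subtract the median, divide by the IQR
> rstd = function(x) (x - median(x,na.rm=TRUE)) / IQR(x,na.rm=TRUE)
> world.rstd = sapply(world1[-c(1,6)],rstd)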
6 Distance Measures
The most common distance measure, and the default for most programs that perform cluster
analysis, is the Euclidean distance, which is an extension of the usual notion of the distance
between two points in a plane. The Euclidean distance between two observations is calculated
as the square root of the sum of the squares of the distances between corresponding variables
in the two observations being considered. Another widely used measure is the Manhattan distance,
so named because it is similar to the distance between two points in a city, where you can only
travel along a grid of streets. It's calculated by adding up the absolute value of the
differences of the corresponding variables, and is less likely to be influenced by a very
large difference between just one of the variables. The Canberra distance is interesting in
that it performs its own standardization; absolute values of differences are divided by
the absolute value of the sum of the corresponding variables in the two observations.
Depending on the values and distributions of the variables in the data set being clustered,
these different distance measures may point out different aspects of the structure of the
data set.
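All three of these measures are available in R through the dist function;
a brief sketch, using the standardized world.std matrix from the sketch in
the previous section:
> d.euc = dist(world.std,method='euclidean')        # the default method
> d.man = dist(world.std,method='manhattan')
> # the Canberra distance does its own standardization, so use the raw data
> d.can = dist(world1[-c(1,6)],method='canberra')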
Special consideration needs to be given to binary variables, that is, variables that take
on only one of two values like TRUE or FALSE, especially when they are used in conjunction
with continuous variables. Generally there are two types of measures that are used with
binary data. Symmetric measures view two observations as being close together if the binary
feature is either absent in both or present in both, while asymmetric measures only
view the observations as being close if the feature is present for both.
For some clustering methods, the entire distance matrix must be calculated; for other
methods, distances are only calculated as needed.
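One way to handle a mix of continuous and binary variables in R is the
daisy function from the recommended cluster package, which computes
dissimilarities using Gower's coefficient and lets you declare which binary
columns should be treated as asymmetric. Here's a sketch using a
hypothetical data frame mydata with a binary column named flag:
> library(cluster)
> # treat the binary column 'flag' as asymmetric; continuous columns are
> # rescaled and combined automatically
> d = daisy(mydata,type=list(asymm='flag'))
For purely binary data, dist with method='binary' also gives an asymmetric
measure, considering only the variables for which at least one of the two
observations has the feature present.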