Classification Analysis

1  Introduction to Classification Methods

When we apply cluster analysis to a data set, we let the values of the measured variables tell us whether there is any structure among the observations: we choose a suitable metric and see whether groups of observations that are all close together can be found. If we have an auxiliary variable (like the country of origin from the cars example), it may be interesting to see whether the natural clustering of the data corresponds to this variable, but it's important to remember that the goal of clustering is just to see whether any groups form naturally, not to determine which group an observation belongs to based on the values of the variables that we have.
When the true goal of our data analysis is to be able to predict which of several non-overlapping groups an observation belongs to, the techniques we use are known as classification techniques. We'll take a look at three classification techniques: kth nearest neighbor classification, linear discriminant analysis, and recursive partitioning.

2  kth Nearest Neighbor Classification

The idea behind nearest neighbor classification is simple and somewhat intuitive: find other observations in the data that are close to an observation we're interested in, and classify that observation based on the classes of its neighbors. The number of neighbors that we consider is where the "k" comes in; usually we'll have to look at several different values of k to determine which ones work well with a particular data set. Values in the range of one to ten are usually reasonable choices.
Since we need the distances between one observation and all the others in order to find its neighbors, it makes sense to form a distance matrix before starting a nearest neighbor classification. Each row of the distance matrix contains the distances from one observation to all the others, so we need to find the k smallest values in each row. Once we find those smallest values, we determine which observations they belong to, and look at how those observations were classified. Whichever class was most common among the k nearest neighbors becomes our guess (predicted value) for the current observation, and we then move on to the next row of the distance matrix. Once we've looked at every row, we'll have classified every observation, and can compare the predicted classifications with the actual ones. (A small sketch of this procedure in R appears at the end of this section.)
To see how well we've done, various error rates can be examined. If the observations are classified as TRUE / FALSE, for example disease or no disease, then we can look at two types of error rates. The first type, known as Type I error, occurs when we say that an observation should be classified as TRUE when it really should have been FALSE. The other type (Type II) occurs when we say that an observation should be classified as FALSE when it should have been TRUE. When the classification is something other than TRUE/FALSE, we can report an overall error rate, that is, the fraction of observations for which our prediction was not correct. In either case, the error rates can be calculated in R using the table function. As a simple example, suppose we have two vectors: actualvalues, which contains the actual values of a classification variable, and predvalues, the values that our classification predicted:
> actualvalues = c(TRUE,TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,TRUE,FALSE,FALSE)
> predvalues = c(TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,FALSE,TRUE)
> tt = table(actualvalues,predvalues)
> tt
            predvalues
actualvalues FALSE TRUE
       FALSE     3    2
       TRUE      1    4

The observations that contribute to Type I error (the actual value is false but we predicted true) can be found in the first row and second column; those that contribute to Type II error can be found in the second row and first column. Since the table function returns a matrix, we can calculate the error rates as follows:
> tot = sum(tt)
> type1 = tt['FALSE','TRUE'] / tot
> type2 = tt['TRUE','FALSE'] / tot
> type1
[1] 0.2
> type2
[1] 0.1
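
Returning to the nearest neighbor procedure itself, here's a minimal sketch of the algorithm described at the beginning of this section, written as an R function. The name knn.from.dist and its arguments are just for illustration: dmat is assumed to be a square distance matrix (for example, the result of as.matrix(dist(x))), and classes is the vector of known classifications:
knn.from.dist = function(dmat, classes, k = 3){
    classes = as.character(classes)
    n = nrow(dmat)
    pred = character(n)
    for(i in 1:n){
        d = dmat[i,]
        d[i] = Inf                     # never count an observation as its own neighbor
        nbrs = order(d)[1:k]           # indices of the k smallest distances in this row
        votes = table(classes[nbrs])   # how the k nearest neighbors were classified
        pred[i] = names(votes)[which.max(votes)]   # most common class wins
    }
    pred
}

The resulting predictions could then be compared to the actual classifications with the table function, just as above.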

3  Cross Validation

There's one problem with the above scheme: we used the data that we're making predictions about in the process of making those predictions. In other words, the data that we're making predictions for is not independent of the data that we're using to make the predictions. As might be expected, it's been shown in practice that calculating error rates this way will almost always make our classification method look better than it really is. If the data can be naturally (or even artificially) divided into two groups, then one can be used as a training set and the other as a test set; we'd calculate our error rates only from the classification of the test set, using the training set to make our predictions.
Many statisticians don't like the idea of having to "hold back" some of their data when building models, so an alternative way to bring some independence to our predictions, known as v-fold cross validation, has been devised. The idea is to first divide the entire data set into v groups. To classify objects in the first group, we don't use any of the first group to make our predictions; in the case of kth nearest neighbor classification, that means that when we're looking for the smallest distances in order to classify an observation, we don't consider any of the distances corresponding to other members of the group that the current observation belongs to. The basic idea is to make the prediction for an observation as independent from that observation as we can. We continue through each of the v groups, classifying observations in each group using only observations from the other groups. When we're done, we'll have a prediction for each observation, and can compare them to the actual values as in the previous example.
Another example of cross-validation is leave-out-one cross-validation. With this method, we predict the classification of an observation without using the observation itself. In other words, for each observation, we perform the analysis without using that observation, and then predict where that observation would be classified using that analysis.
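To make the idea concrete for kth nearest neighbor classification, here's a small sketch of v-fold cross validation built on the knn.from.dist example above. The name knn.vfold is again just for illustration; in practice, the knn.cv function introduced below handles the leave-out-one case for us:
knn.vfold = function(dmat, classes, k = 3, v = 10){
    classes = as.character(classes)
    n = nrow(dmat)
    folds = sample(rep(1:v, length.out = n))   # randomly assign each observation to one of v groups
    pred = character(n)
    for(i in 1:n){
        d = dmat[i,]
        d[folds == folds[i]] = Inf   # ignore every observation in the same group as observation i
        nbrs = order(d)[1:k]
        votes = table(classes[nbrs])
        pred[i] = names(votes)[which.max(votes)]
    }
    pred
}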

4  Linear Discriminant Analysis

One of the oldest forms of classification is known as linear discriminant analysis. The idea is to form linear combinations of the predictor variables (similar to a linear regression model) in such a way that the average value of these linear combinations will be as different as possible for the different levels of the classification variable. Based on the values of the linear combinations, linear discriminant analysis reports a set of posterior probabilities for every level of the classification, for each observation, along with the level of the classification variable that the analysis predicted. Suppose we have a classification variable that can take one of three values: after a linear discriminant analysis, we will have three probabilities (adding up to one) for each observation that tell how likely it is that the observation belongs to each of the three categories; the predicted classification is the one that had the highest probability, and we can get insight into the quality of the classification by looking at the values of the probabilities.
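As a quick sketch of what this looks like in R (using the lda function from the MASS library on a hypothetical data frame mydata whose column grp is the classification factor), the cross-validated predicted classes and posterior probabilities might be obtained like this:
library(MASS)
# 'mydata' and 'grp' are hypothetical; CV=TRUE requests leave-out-one
# cross-validated predictions
z = lda(grp ~ ., data = mydata, CV = TRUE)
head(z$posterior)   # one row per observation, one column per level of grp
head(z$class)       # predicted level: the one with the highest posterior probability
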
To study the different classification methods, we'll use a data set about different wines. This data set contains various measures regarding chemical and other properties of the wines, along with a variable identifying the Cultivar (the particular variety of the grape from which the wine was produced). We'll try to classify the observations based on the Cultivar, using the other variables. The data is available at http://www.stat.berkeley.edu/~spector/s133/data/wine.data; information about the variables is at http://www.stat.berkeley.edu/~spector/s133/data/wine.names
First, we'll read in the wine dataset:
wine = read.csv('http://www.stat.berkeley.edu/~spector/s133/data/wine.data',header=FALSE)
names(wine) = c("Cultivar", "Alcohol", "Malic.acid", "Ash", "Alkalinity.ash",
                "Magnesium", "Phenols", "Flavanoids", "NF.phenols", "Proanthocyanins",
                "Color.intensity","Hue","OD.Ratio","Proline")
wine$Cultivar = factor(wine$Cultivar)

Notice that I set wine$Cultivar to be a factor. Factors are very important and useful in modeling functions because categorical variables almost always have to be treated differently than numeric variables, and turning a categorical variable into a factor will ensure that it is always used properly in modeling functions. Not surprisingly, the dependent variable for lda must be a factor.
The class library of R provides two functions for nearest neighbor classification. The first, knn, takes the approach of using a training set and a test set, so it would require holding back some of the data. The other function, knn.cv, uses leave-out-one cross-validation, so it's more suitable for use on an entire data set.
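For example, if we had already divided a (standardized) data matrix into a training set and a test set, along with the known classifications for the training set (hypothetical objects train, test, and train.cl), a call to knn might look like this:
library(class)
# 'train', 'test', and 'train.cl' are hypothetical objects;
# classify each row of 'test' using its 3 nearest neighbors in 'train'
test.pred = knn(train, test, cl = train.cl, k = 3)

The error rate could then be calculated by comparing test.pred to the known classifications for the test set.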
Let's use knn.cv on the wine data set. Since this technique, like cluster analysis, is based on distances, the same considerations regarding standardization that we saw with cluster analysis apply. Let's examine a summary of the data frame:
> summary(wine)
 Cultivar    Alcohol        Malic.acid         Ash        Alkalinity.ash 
 1:59     Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60  
 2:71     1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20  
 3:48     Median :13.05   Median :1.865   Median :2.360   Median :19.50  
          Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49  
          3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50  
          Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00  
   Magnesium         Phenols        Flavanoids      NF.phenols    
 Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300  
 1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205   1st Qu.:0.2700  
 Median : 98.00   Median :2.355   Median :2.135   Median :0.3400  
 Mean   : 99.74   Mean   :2.295   Mean   :2.029   Mean   :0.3619  
 3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875   3rd Qu.:0.4375  
 Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600  
 Proanthocyanins Color.intensity       Hue            OD.Ratio    
 Min.   :0.410   Min.   : 1.280   Min.   :0.4800   Min.   :1.270  
 1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825   1st Qu.:1.938  
 Median :1.555   Median : 4.690   Median :0.9650   Median :2.780  
 Mean   :1.591   Mean   : 5.058   Mean   :0.9574   Mean   :2.612  
 3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200   3rd Qu.:3.170  
 Max.   :3.580   Max.   :13.000   Max.   :1.7100   Max.   :4.000  
    Proline      
 Min.   : 278.0  
 1st Qu.: 500.5  
 Median : 673.5  
 Mean   : 746.9  
 3rd Qu.: 985.0  
 Max.   :1680.0  

Since the scales of the variables differ widely, standardization is probably a good idea. We'll divide each variable by its standard deviation to try to give each variable more equal weight in determining the distances:
> wine.use = scale(wine[,-1],scale=apply(wine[,-1],2,sd))
> library(class)
> res = knn.cv(wine.use,wine$Cultivar,k=3)
> names(res)
NULL
> length(res)
[1] 178

Since there are no names, and the length of res is the same as the number of observations, knn.cv is simply returning the classifications that the method predicted for each observation using leave-out-one cross validation. This means we can compare the predicted values to the true values using table:
> table(res,wine$Cultivar)
   
res  1  2  3
  1 59  4  0
  2  0 63  0
  3  0  4 48

To calculate the proportion of incorrect classifications, we can use the row and col functions. These unusual functions don't seem to do anything very useful when we simply call them:
> tt = table(res,wine$Cultivar)
> row(tt)
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    2    2
[3,]    3    3    3
> col(tt)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    2    3
[3,]    1    2    3

However, recalling that the misclassified observations are those off the diagonal of the table, we can find those observations as follows:
> tt[row(tt) != col(tt)]
[1] 0 0 4 4 0 0

and the proportion of misclassified observations can be calculated as:
> sum(tt[row(tt) != col(tt)]) / sum(tt)
[1] 0.04494382

or a misclassification rate of about 4.5%.
Could we have done better if we used 5 nearest neighbors instead of 3?
> res = knn.cv(wine.use,wine$Cultivar,k=5)
> tt = table(res,wine$Cultivar)
> sum(tt[row(tt) != col(tt)]) / sum(tt)
[1] 0.02808989

How about using just the single nearest neighbor?
> res = knn.cv(wine.use,wine$Cultivar,k=1)
> tt = table(res,wine$Cultivar)
> sum(tt[row(tt) != col(tt)]) / sum(tt)
[1] 0.04494382

For this data set, using k=5 did slightly better than 1 or 3.
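If we wanted to examine a range of values of k at once, we could wrap the calculation in a small function (a sketch; since knn.cv breaks ties among the neighbors at random, the rate for a given k may vary slightly from run to run):
misclass.rate = function(k){
    res = knn.cv(wine.use, wine$Cultivar, k = k)
    tt = table(res, wine$Cultivar)
    sum(tt[row(tt) != col(tt)]) / sum(tt)
}
# misclassification rate for each of k = 1, ..., 10
sapply(1:10, misclass.rate)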

