> actualvalues = c(TRUE,TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,TRUE,FALSE,FALSE)
> predvalues = c(TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,FALSE,TRUE)
> tt = table(actualvalues,predvalues)
> tt
            predvalues
actualvalues FALSE TRUE
       FALSE     3    2
       TRUE      1    4

The observations that contribute to Type I error (the actual value is false, but we predicted true) can be found in the first row and second column; those that contribute to Type II error can be found in the second row and first column. Since the total number of observations is the sum of all the entries in the table, we can calculate each error rate by dividing the appropriate count by that total:

> tot = sum(tt)
> type1 = tt['FALSE','TRUE'] / tot
> type2 = tt['TRUE','FALSE'] / tot
> type1
[1] 0.2
> type2
[1] 0.1
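The same rates can also be read off in one step with prop.table, which becomes convenient as tables get bigger. A minimal sketch, using the same data as above:

```r
# prop.table() with no margin argument divides each count in a table
# by the grand total, giving a table of proportions.  The two error
# rates are then just the appropriate cells of that table.
actualvalues = c(TRUE,TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,TRUE,FALSE,FALSE)
predvalues = c(TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,FALSE,TRUE)
pt = prop.table(table(actualvalues,predvalues))
pt['FALSE','TRUE']    # Type I error rate: 0.2
pt['TRUE','FALSE']    # Type II error rate: 0.1
```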

wine = read.csv('http://www.stat.berkeley.edu/classes/s133/data/wine.data',header=FALSE)
names(wine) = c("Cultivar", "Alcohol", "Malic.acid", "Ash", "Alkalinity.ash",
                "Magnesium", "Phenols", "Flavanoids", "NF.phenols",
                "Proanthocyanins", "Color.intensity", "Hue", "OD.Ratio", "Proline")
wine$Cultivar = factor(wine$Cultivar)

Notice that I set Cultivar to be a factor, since its values are just labels for the three different cultivars, not numeric measurements.
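To see why the factor conversion matters, here is a small sketch with made-up cultivar codes; summary treats a numeric vector and a factor very differently:

```r
# As plain numbers, category codes get a meaningless numeric summary
# (mean, quartiles, etc.); as a factor, summary() counts the
# observations in each category, which is what we want for Cultivar.
cult = c(1, 2, 3, 1, 2)
summary(cult)           # Min., 1st Qu., Median, Mean, ... of the codes
summary(factor(cult))   # counts per category: two 1s, two 2s, one 3
```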

> summary(wine)
 Cultivar    Alcohol        Malic.acid         Ash         Alkalinity.ash
 1:59     Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60
 2:71     1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20
 3:48     Median :13.05   Median :1.865   Median :2.360   Median :19.50
          Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49
          3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50
          Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00
   Magnesium         Phenols        Flavanoids      NF.phenols
 Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300
 1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205   1st Qu.:0.2700
 Median : 98.00   Median :2.355   Median :2.135   Median :0.3400
 Mean   : 99.74   Mean   :2.295   Mean   :2.029   Mean   :0.3619
 3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875   3rd Qu.:0.4375
 Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600
 Proanthocyanins Color.intensity      Hue           OD.Ratio
 Min.   :0.410   Min.   : 1.280   Min.   :0.4800   Min.   :1.270
 1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825   1st Qu.:1.938
 Median :1.555   Median : 4.690   Median :0.9650   Median :2.780
 Mean   :1.591   Mean   : 5.058   Mean   :0.9574   Mean   :2.612
 3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200   3rd Qu.:3.170
 Max.   :3.580   Max.   :13.000   Max.   :1.7100   Max.   :4.000
    Proline
 Min.   : 278.0
 1st Qu.: 500.5
 Median : 673.5
 Mean   : 746.9
 3rd Qu.: 985.0
 Max.   :1680.0

Since the scales of the variables differ widely, standardization is probably a good idea. We'll divide each variable by its standard deviation to give each variable more equal weight in determining the distances:

> wine.use = scale(wine[,-1],scale=apply(wine[,-1],2,sd))
> library(class)
> res = knn.cv(wine.use,wine$Cultivar,k=3)
> names(res)
NULL
> length(res)
[1] 178

Since there are no names, and the length of res is 178 (the number of observations in the data set), knn.cv has simply returned a vector with one cross-validated classification per observation. We can compare these predictions to the true cultivars with table:

> table(res,wine$Cultivar)
res   1  2  3
  1  59  4  0
  2   0 63  0
  3   0  4 48

To calculate the proportion of incorrect classifications, we can use the row and col functions, which return matrices containing the row and column indices of each element of their argument:

> tt = table(res,wine$Cultivar)
> row(tt)
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    2    2
[3,]    3    3    3
> col(tt)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    2    3
[3,]    1    2    3

Since the misclassified observations are those off the diagonal (where the row index differs from the column index), we can extract their counts as follows:

> tt[row(tt) != col(tt)]
[1] 0 0 4 4 0 0

and the proportion of misclassified observations can be calculated as:

> sum(tt[row(tt) != col(tt)]) / sum(tt)
[1] 0.04494382

or a misclassification rate of about 4.5%. Could we have done better if we had used 5 nearest neighbors instead of 3?

> res = knn.cv(wine.use,wine$Cultivar,k=5)
> tt = table(res,wine$Cultivar)
> sum(tt[row(tt) != col(tt)]) / sum(tt)
[1] 0.02808989

How about using just the single nearest neighbor?

> res = knn.cv(wine.use,wine$Cultivar,k=1)
> tt = table(res,wine$Cultivar)
> sum(tt[row(tt) != col(tt)]) / sum(tt)
[1] 0.04494382

For this data set, using k=5 did slightly better than k=1 or k=3.
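The comparison above can be automated by sweeping over several values of k. The following is a sketch of the idea; since the wine file may not be available offline, it uses R's built-in iris data as a stand-in, and it computes the error rate with the equivalent expression 1 - sum(diag(tt))/sum(tt), since the correctly classified observations are exactly the ones on the diagonal:

```r
library(class)

# Sweep over several choices of k, computing the leave-one-out
# misclassification rate from knn.cv for each.  set.seed makes the
# run reproducible, since knn.cv breaks distance ties at random.
set.seed(1)
iris.use = scale(iris[,-5], scale=apply(iris[,-5], 2, sd))
ks = c(1, 3, 5, 7, 9)
errs = sapply(ks, function(k) {
    res = knn.cv(iris.use, iris$Species, k=k)
    tt = table(res, iris$Species)
    1 - sum(diag(tt)) / sum(tt)   # proportion off the diagonal
})
names(errs) = ks
round(errs, 4)
```

The same loop would work on the wine data by substituting wine.use and wine$Cultivar.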

File translated from TeX by TTH. On 30 Mar 2011, 16:12.