Programming in R

1 Operating on Groups of Data

We've already seen how the aggregate function let's us calculate data summaries for groups in our data. For example, using the world data set, we could find the average value of each variable, broken down by continent:

> world = read.csv('http://www.stat.berkeley.edu/~spector/s133/data/world2.txt',na.strings='.',comment='#')
> > names(world)
[1] "country"  "cont"     "gdp"      "income"   "literacy" "military" "phys"    
[8] "un.date" 
> aggregate(world[,-c(1,2,8)],world['cont'],mean,na.rm=TRUE)
  cont       gdp    income literacy    military      phys
1   AF  2723.404  3901.191 60.52979   356440000  26.31129
2   AS  7778.049  8868.098 84.25122  5006536341 161.83381
3   EU 19711.765 21314.324 98.40294  6311138235 311.18482
4   NA  8946.667 10379.143 85.52000 25919931267 187.08178
5   OC 14625.000 15547.500 87.50000  4462475000 127.63105
6   SA  6283.333  6673.083 92.29167  2137341667 154.81088

Since the aggregate function requires the grouping variables to be in a list, I took advantage of the fact that data frames are also lists, and passed a single element (using single brackets) to aggregate. Alternatively, I could have used a list directly:

> aggregate(world[,-c(1,2,8)],list(world$cont),mean,na.rm=TRUE)
  Group.1       gdp    income literacy    military      phys
1      AF  2723.404  3901.191 60.52979   356440000  26.31129
2      AS  7778.049  8868.098 84.25122  5006536341 161.83381
3      EU 19711.765 21314.324 98.40294  6311138235 311.18482
4      NA  8946.667 10379.143 85.52000 25919931267 187.08178
5      OC 14625.000 15547.500 87.50000  4462475000 127.63105
6      SA  6283.333  6673.083 92.29167  2137341667 154.81088

I would need to name the list element in order for its column in the output to be properly labeled.

Now suppose I'd like to calculate both the mean and the median for each continent:

> answer = aggregate(world[,-c(1,2,8)],world['cont'],function(x)c(mean(x,na.rm=TRUE),median(x,na.rm=TRUE)))
> answer
  cont     gdp.1     gdp.2  income.1  income.2 literacy.1 literacy.2
1   AF  2723.404  1400.000  3901.191  1813.000   60.52979   65.50000
2   AS  7778.049  4300.000  8868.098  4561.000   84.25122   89.90000
3   EU 19711.765 18150.000 21314.324 19804.000   98.40294   99.70000
4   NA  8946.667  4900.000        NA        NA   85.52000   87.70000
5   OC 14625.000 13700.000 15547.500 14970.000   87.50000   96.40000
6   SA  6283.333  5000.000  6673.083  5563.500   92.29167   92.30000
   military.1  military.2    phys.1    phys.2
1   356440000   101300000  26.31129   8.68680
2  5006536341   650000000 161.83381 139.92540
3  6311138235  1288500000 311.18482 317.88580
4 25919931267   147000000 187.08178 164.40130
5  4462475000   591500000 127.63105 128.11915
6  2137341667   742150000 154.81088 126.00035

There are eleven columns in the data, but when we look at the dimensions of the result, things seem strange:

> dim(answer)
[1] 6 6

That's because aggregate stores each set of the two values returned from the function we passed to it in matrices inside the answer data frame:

> sapply(answer,class)
       cont         gdp      income    literacy    military        phys 
"character"    "matrix"    "matrix"    "matrix"    "matrix"    "matrix" 
> names(answer)
[1] "cont"     "gdp"      "income"   "literacy" "military" "phys"    
> answer[,2]
          [,1]  [,2]
[1,]  2723.404  1400
[2,]  7778.049  4300
[3,] 19711.765 18150
[4,]  8946.667  4900
[5,] 14625.000 13700
[6,]  6283.333  5000

This actually makes it pretty easy to access the values that we want. For example, suppose we want the mean for the variable military. Since the mean was the first element returned by the function we passed to aggregate, it's the first column in the military matrix:

> data.frame(answer$cont,military.mean=answer$military[,1])
  answer.cont military.mean
1          AF     356440000
2          AS    5006536341
3          EU    6311138235
4          NA   25919931267
5          OC    4462475000
6          SA    2137341667

The output from aggregate is often easier to use if we give names to the values calculated by the function we pass to aggregate. For example, if we call aggregate like this:

  
> answer = aggregate(world[,-c(1,2,8)],world['cont'],function(x)c(mean=mean(x,na.rm=TRUE),median=median(x,na.rm=TRUE)))
> answer
  cont  gdp.mean gdp.median income.mean income.median literacy.mean
1   AF  2723.404   1400.000    3901.191      1813.000      60.52979
2   AS  7778.049   4300.000    8868.098      4561.000      84.25122
3   EU 19711.765  18150.000   21314.324     19804.000      98.40294
4   NA  8946.667   4900.000   10379.143      6753.000      85.52000
5   OC 14625.000  13700.000   15547.500     14970.000      87.50000
6   SA  6283.333   5000.000    6673.083      5563.500      92.29167
  literacy.median military.mean military.median phys.mean phys.median
1        65.50000     356440000       101300000  26.31129     8.68680
2        89.90000    5006536341       650000000 161.83381   139.92540
3        99.70000    6311138235      1288500000 311.18482   317.88580
4        87.70000   25919931267       147000000 187.08178   164.40130
5        96.40000    4462475000       591500000 127.63105   128.11915
6        92.30000    2137341667       742150000 154.81088   126.00035

Now we can say, for example:

> data.frame(cont=answer$cont,literacy.median=answer$literacy[,'median'])
  cont literacy.median
1   AF            65.5
2   AS            89.9
3   EU            99.7
4   NA            87.7
5   OC            96.4
6   SA            92.3

Another way to manage the output from aggregate takes advantage of the fact that when the data.frame function is called with multiple data frames, it doesn't store the data frames as separate columns, but rather makes one big data frame with lots of columns. So if we convert each of the matrices to data frames we can combine all of them into one big data frame:

> answer1 = data.frame(answer[,1],data.frame(answer[,2]),data.frame(answer[,3]),data.frame(answer[,4]),data.frame(answer[,5]),data.frame(answer[,6]))
> answer1
  answer...1.      mean median    mean.1 median.1   mean.2 median.2      mean.3
1          AF  2723.404   1400  3901.191   1813.0 60.52979     65.5   356440000
2          AS  7778.049   4300  8868.098   4561.0 84.25122     89.9  5006536341
3          EU 19711.765  18150 21314.324  19804.0 98.40294     99.7  6311138235
4          NA  8946.667   4900 10379.143   6753.0 85.52000     87.7 25919931267
5          OC 14625.000  13700 15547.500  14970.0 87.50000     96.4  4462475000
6          SA  6283.333   5000  6673.083   5563.5 92.29167     92.3  2137341667
    median.3    mean.4 median.4
1  101300000  26.31129   8.6868
2  650000000 161.83381 139.9254
3 1288500000 311.18482 317.8858
4  147000000 187.08178 164.4013
5  591500000 127.63105 128.1191
6  742150000 154.81088 126.0003
> dim(answer1)
[1]  6 11

We'd need to rename the columns, but there are now actually eleven columns as we would expect. Of course, we wouldn't want to do all that typing in practice. We would use some of R's powerful tools to automate the process. For example, we could try to use sapply to convert all the matrices to data frames

> sapply(answer[,-1],data.frame)
       gdp       income    literacy  military  phys     
mean   Numeric,6 Numeric,6 Numeric,6 Numeric,6 Numeric,6
median Numeric,6 Numeric,6 Numeric,6 Numeric,6 Numeric,6

As we've seen before this is usually an indication that sapply has simplified things too much. If we try lapply, this is what we get:

> lapply(answer[,-1],data.frame)
$gdp
       mean median
1  2723.404   1400
2  7778.049   4300
3 19711.765  18150
4  8946.667   4900
5 14625.000  13700
6  6283.333   5000

$income
       mean  median
1  3901.191  1813.0
2  8868.098  4561.0
3 21314.324 19804.0
4 10379.143  6753.0
5 15547.500 14970.0
6  6673.083  5563.5

We can convert the list to a data frame as follows:

> answer1= data.frame(answer[1],lapply(answer[,-1],data.frame))
> answer1
  cont  gdp.mean gdp.median income.mean income.median literacy.mean
1   AF  2723.404       1400    3901.191        1813.0      60.52979
2   AS  7778.049       4300    8868.098        4561.0      84.25122
3   EU 19711.765      18150   21314.324       19804.0      98.40294
4   NA  8946.667       4900   10379.143        6753.0      85.52000
5   OC 14625.000      13700   15547.500       14970.0      87.50000
6   SA  6283.333       5000    6673.083        5563.5      92.29167
  literacy.median military.mean military.median phys.mean phys.median
1            65.5     356440000       101300000  26.31129      8.6868
2            89.9    5006536341       650000000 161.83381    139.9254
3            99.7    6311138235      1288500000 311.18482    317.8858
4            87.7   25919931267       147000000 187.08178    164.4013
5            96.4    4462475000       591500000 127.63105    128.1191
6            92.3    2137341667       742150000 154.81088    126.0003

As is often the case, R takes care of the names more effectively than if we labouriously combined everything by hand. All that would be left to make this a truly useful display is to add the number of countries in each continent. Since this will be the same for each variable, it's usually not a good idea to return this value in the function you pass to aggregate. Instead, we can use the table, as.data.frame, and merge functions as follows:

> counts = as.data.frame(table(world$cont))
> names(counts) = c('cont','N')
> merge(counts,answer1)
  cont  N  gdp.mean gdp.median income.mean income.median literacy.mean
1   AF 47  2723.404       1400    3901.191        1813.0      60.52979
2   AS 41  7778.049       4300    8868.098        4561.0      84.25122
3   EU 34 19711.765      18150   21314.324       19804.0      98.40294
4   NA 15  8946.667       4900   10379.143        6753.0      85.52000
5   OC  4 14625.000      13700   15547.500       14970.0      87.50000
6   SA 12  6283.333       5000    6673.083        5563.5      92.29167
  literacy.median military.mean military.median phys.mean phys.median
1            65.5     356440000       101300000  26.31129      8.6868
2            89.9    5006536341       650000000 161.83381    139.9254
3            99.7    6311138235      1288500000 311.18482    317.8858
4            87.7   25919931267       147000000 187.08178    164.4013
5            96.4    4462475000       591500000 127.63105    128.1191
6            92.3    2137341667       742150000 154.81088    126.0003

Even though aggregate can handle cases with more than one value being calculated, it still can only work with one variable at a time. There are two approaches to solve the cases where you need to work with more than one variable. The first, uses the split function that we've seen for making boxplots. Suppose we want to find the two variables with the highest correlation within each continent. First, we can write a function that will find the variables for with the highest correlation in a data frame:

hicorr = function(x){
      x = x[,sapply(x,function(col)mode(col) == 'numeric' & class(col) != 'factor')]
      cc = cor(x,use='complete.obs')
      diag(cc) = 0
      wh = which(cc == max(cc),arr.ind=TRUE)
      c(rownames(cc)[wh[1]],colnames(cc)[wh[2]])
}

Now we can use sapply and split to find the variables with the highest correlation in each continent:

> as.data.frame(t(sapply(split(world,world$cont),hicorr)))
       V1       V2
AF   phys military
AS income      gdp
EU income      gdp
NA income      gdp
OC income      gdp
SA income      gdp

The other way to solve problems like this is with the by function. The by function produces nicely formatted output, but its output can be manipulated if necessary. Here's the basic approach for a problem like this.

> byvals = by(world,world$cont,hicorr)
> byvals
world$cont: AF
[1] "phys"     "military"
------------------------------------------------------------ 
world$cont: AS
[1] "income" "gdp"   
------------------------------------------------------------ 
world$cont: EU
[1] "income" "gdp"   
------------------------------------------------------------ 
world$cont: NA
[1] "income" "gdp"   
------------------------------------------------------------ 
world$cont: OC
[1] "income" "gdp"   
------------------------------------------------------------ 
world$cont: SA
[1] "income" "gdp"

Since each piece of the byvals object looks like part of a data frame we want, we can use do.call with rbind:

> ans = do.call(rbind,byvals)
> ans
   [,1]     [,2]      
AF "phys"   "military"
AS "income" "gdp"     
EU "income" "gdp"     
NA "income" "gdp"     
OC "income" "gdp"     
SA "income" "gdp"

As another example of do.call, consider the task of sorting a data frame by its first column, then using the second column to break any ties, and so on for each ofthe columns of the data frame. For example:

> x = data.frame(matrix(sample(1:5,100,replace=TRUE),ncol=5))
> head(x)
  X1 X2 X3 X4 X5
1  4  4  1  3  4
2  2  4  2  4  1
3  1  1  1  3  2
4  4  3  3  2  2
5  5  2  2  4  3
6  4  2  5  4  5

Since a data frame is a list, with each element representing a column, we can order the data frame as follows:

> x[do.call(order,x),]
   X1 X2 X3 X4 X5
3   1  1  1  3  2
17  1  2  3  5  5
10  1  4  5  1  5
20  2  4  2  1  1
2   2  4  2  4  1
11  3  1  4  4  2
16  3  4  1  1  3
8   4  2  1  1  1
14  4  2  2  3  5
18  4  2  3  4  2
13  4  2  5  1  1
6   4  2  5  4  5
7   4  3  1  4  2
4   4  3  3  2  2
1   4  4  1  3  4
9   4  5  3  1  2
5   5  2  2  4  3
12  5  3  4  2  4
15  5  3  5  1  5
19  5  4  5  1  3

Another interesting R function is Reduce. This function works with any binary function, that is, one which accepts two arguments, and repeatedly applies the binary function until the list is reduced to a single answer. One of the most important uses of Reduce is to merge more than two data frames at a time. Suppose we have three data frames that we wish to merge:

> one = data.frame(id=sample(LETTERS[1:10]),a=rnorm(10))
> two = data.frame(id=sample(LETTERS[1:10]),b=rnorm(10))
> three = data.frame(id=sample(LETTERS[1:10]),c=rnorm(10))

Of course, we could call merge twice:

> both = merge(one,two)
> both = merge(both,three)

But what if we had 20 such data frames:

> twenty = lapply(letters[1:20],function(i){
+                       z = data.frame(id=sample(LETTERS[1:10]),rnorm(10))
+	                names(z)[2] = i
+	                z})
> twenty[[1]]
   id          a
1   D  1.9112203
2   F -0.7198699
3   J -0.4290842
4   E  0.7849467
5   B  1.5758446
6   I  0.4998762
7   H  0.3568294
8   A -0.8567896
9   G -0.7182695
10  C -0.6894396
> twenty[[2]]
   id           b
1   C  2.03416915
2   J  0.69535618
3   E -0.66387813
4   H -0.25343952
5   D -1.04782198
6   G  1.56147090
7   B -0.28126409
8   F -0.30900749
9   I  0.52277648
10  A  0.03841634

We simply pass Reduce the merge function and the list of data frames, and it does the rest:

> all = Reduce(merge,twenty)
  id          a           b          c          d          e          f
1  A -0.8567896  0.03841634  1.1137667 -0.6139491 -1.2893453  0.5760507
2  B  1.5758446 -0.28126409  0.2777182 -0.1615362  0.8221153 -0.3328126
3  C -0.6894396  2.03416915  0.6990830  0.9430438 -0.7059770 -0.1309771
4  D  1.9112203 -1.04782198 -0.1040303  0.2433795  0.7645007 -0.3161749
5  E  0.7849467 -0.66387813  0.0683588 -0.5752617 -0.4059950 -1.2280617
6  F -0.7198699 -0.30900749 -1.0955231 -0.8593922  0.1585112  0.5434146
           g          h          i          j         k          l           m
1  1.1469512  1.8675396 -1.0730001 -0.6306937 2.5853058  0.2524264  0.67496861
2  0.3483675  0.6846907 -1.7896035  0.7710611 1.9890408  0.6174135  0.16124957
3 -0.9972951 -0.7140730  0.7919375  0.3678690 1.2336345 -0.2323708 -0.52465997
4  0.1179205 -0.1865189 -1.0364221  0.6013162 1.3426701 -0.3513651  0.48844598
5 -0.1171988 -0.1514659  0.2650891  0.1813734 0.9233442 -1.7945477 -0.07855065
6 -0.2897749 -0.5888773  1.8274135 -0.3195992 0.6230858 -0.4897521 -0.49559174
           n          o          p          q          r          s          t
1  0.5551094  1.6335976 -1.6149968 -0.0601484  0.1622304  0.1624207 -1.1408366
2  1.3832478 -0.8993505 -0.3657969 -0.2817061  0.2501859 -0.2096964 -0.5082768
3 -0.7369804  0.2675346 -0.3801290  2.2369038  1.7701125  0.6809073 -1.1597869
4  1.4028788 -0.9317347  0.1623700 -2.8401085 -2.0066582 -0.5500940  1.9968541
5 -0.2516113 -1.1269027  2.2224559 -0.2109566  1.6218192 -0.7154724  1.8204627
6  2.0342595 -0.1864191 -0.7241448  1.4295025 -2.1206024  0.4883462 -0.1468977

Another useful R function is reshape. This function reorganizes data in either the so-called wide or long format. In the wide format, all observations for a single individual are recorded in one row of a data frame, while in the long view, each observation is a separate row, with an additional variable to identify the observation. For example, here is a data set, in long format, representing test scores on several different observations:

> data = data.frame(id=rep(1:3,rep(3,3)),time=rep(1:3,3),x=rnorm(9))
> data
  id time           x
1  1    1 -0.85519374
2  1    2 -0.60227681
3  1    3  0.47671168
4  2    1  0.28097796
5  2    2 -2.20903570
6  2    3  0.08311228
7  3    1 -0.56753562
8  3    2  0.65588709
9  3    3  0.42419294

To convert to wide format, the following statements could be used:

> reshape(data,idvar='id',timevar='time',v.names='x',direction='wide')
  id        x.1        x.2        x.3
1  1 -0.8551937 -0.6022768 0.47671168
4  2  0.2809780 -2.2090357 0.08311228
7  3 -0.5675356  0.6558871 0.42419294

Notice that not all possible combinations of observations need to be present:

> data1 = data[-7,]
> reshape(data1,idvar='id',timevar='time',v.names='x',direction='wide')
  id        x.1        x.2        x.3
1  1 -0.8551937 -0.6022768 0.47671168
4  2  0.2809780 -2.2090357 0.08311228
8  3         NA  0.6558871 0.42419294

The missing observations are represented as NA in the output data frame.

To illustrace the wide-to-long transformation, consider the following data:

> data = data.frame(id = rep(1:3),x1=c(1,2,3),x2=c(10,20,30),x3=c(100,200,300))
> data
  id x1 x2  x3
1  1  1 10 100
2  2  2 20 200
3  3  3 30 300
> reshape(data,idvar='id',varying=list(c('x1','x2','x3')),direction='long')
    id time  x1
1.1  1    1   1
2.1  2    1   2
3.1  3    1   3
1.2  1    2  10
2.2  2    2  20
3.2  3    2  30
1.3  1    3 100
2.3  2    3 200
3.3  3    3 300

The trickiest part is noticing that the names of the three varying variables have to be placed in a list.

As another example of the use of reshape, consider a study looking at the BMI (Body Mass Index) of girls over time, measured for two different races. This data can be found in a tab-delimited file at http://www.stat.berkeley.edu/~spector/s133/data/bmi.tab. The variables in the file are ID, RACE, BMI, and VISIT. Suppose we wish to produce a plot of the average BMI versus visit, with separate lines for the two races. We could use xyplot from the lattice package as follows:

> bmi = read.delim('http://www.stat.berkeley.edu/~spector/s133/data/bmi.tab')
> avgbmi = aggregate(list(BMI=bmi$BMI),list(Race=bmi$RACE,Visit=bmi$VISIT),
+                     mean,na.rm=TRUE)
> library(lattice)
> xyplot(BMI~Visit,group=Race,type='l',auto.key=TRUE,
+        main='BMI vs. Visit',data=avgbmi)

The plot appears below:

Now suppose we wish to calculate the correlations between the BMIs at the different visits. The cor function expects all of the variables for which we want the correlations to be separate columns in the data frame we pass to the function. We can rearrange the data to make it suitable for calculating the correlations by calling reshape as follows:

> bmi1 = reshape(bmi,idvar=c('ID','RACE'),timevar='VISIT',v.names='BMI',
+                direction='wide')

We can now calculate the correlations by race:

> by(bmi1,bmi1$RACE,function(x)cor(x[,-c(1,2)],use='complete.obs'))
bmi1$RACE: 1
           BMI.0     BMI.2     BMI.3     BMI.4     BMI.5     BMI.6     BMI.7
BMI.0  1.0000000 0.9513181 0.9238520 0.8871666 0.8489529 0.8227992 0.8018502
BMI.2  0.9513181 1.0000000 0.9490917 0.9144459 0.8625492 0.8339581 0.8110637
BMI.3  0.9238520 0.9490917 1.0000000 0.9517413 0.9025264 0.8612542 0.8327787
BMI.4  0.8871666 0.9144459 0.9517413 1.0000000 0.9458218 0.9054674 0.8781602
BMI.5  0.8489529 0.8625492 0.9025264 0.9458218 1.0000000 0.9530917 0.9105118
BMI.6  0.8227992 0.8339581 0.8612542 0.9054674 0.9530917 1.0000000 0.9568980
BMI.7  0.8018502 0.8110637 0.8327787 0.8781602 0.9105118 0.9568980 1.0000000
BMI.8  0.7968433 0.8026011 0.8166798 0.8540803 0.8859762 0.9319712 0.9630340
BMI.9  0.7888784 0.7931164 0.8042629 0.8351485 0.8690846 0.9107269 0.9316619
BMI.10 0.7630397 0.7690169 0.7777677 0.8131666 0.8430350 0.8803034 0.8952306
           BMI.8     BMI.9    BMI.10
BMI.0  0.7968433 0.7888784 0.7630397
BMI.2  0.8026011 0.7931164 0.7690169
BMI.3  0.8166798 0.8042629 0.7777677
BMI.4  0.8540803 0.8351485 0.8131666
BMI.5  0.8859762 0.8690846 0.8430350
BMI.6  0.9319712 0.9107269 0.8803034
BMI.7  0.9630340 0.9316619 0.8952306
BMI.8  1.0000000 0.9596225 0.9209648
BMI.9  0.9596225 1.0000000 0.9488572
BMI.10 0.9209648 0.9488572 1.0000000
------------------------------------------------------------ 
bmi1$RACE: 2
           BMI.0     BMI.2     BMI.3     BMI.4     BMI.5     BMI.6     BMI.7
BMI.0  1.0000000 0.9628273 0.9281780 0.9010607 0.8646773 0.8410384 0.8173900
BMI.2  0.9628273 1.0000000 0.9607964 0.9334043 0.8940498 0.8730059 0.8543804
BMI.3  0.9281780 0.9607964 1.0000000 0.9656133 0.9300707 0.9081140 0.8886918
BMI.4  0.9010607 0.9334043 0.9656133 1.0000000 0.9605275 0.9378850 0.9130036
BMI.5  0.8646773 0.8940498 0.9300707 0.9605275 1.0000000 0.9722481 0.9465896
BMI.6  0.8410384 0.8730059 0.9081140 0.9378850 0.9722481 1.0000000 0.9751832
BMI.7  0.8173900 0.8543804 0.8886918 0.9130036 0.9465896 0.9751832 1.0000000
BMI.8  0.8051828 0.8423513 0.8746659 0.8979719 0.9293276 0.9561339 0.9769345
BMI.9  0.8003754 0.8389726 0.8689569 0.8868970 0.9147281 0.9410288 0.9582904
BMI.10 0.7862279 0.8224567 0.8464351 0.8640517 0.8894121 0.9157546 0.9316408
           BMI.8     BMI.9    BMI.10
BMI.0  0.8051828 0.8003754 0.7862279
BMI.2  0.8423513 0.8389726 0.8224567
BMI.3  0.8746659 0.8689569 0.8464351
BMI.4  0.8979719 0.8868970 0.8640517
BMI.5  0.9293276 0.9147281 0.8894121
BMI.6  0.9561339 0.9410288 0.9157546
BMI.7  0.9769345 0.9582904 0.9316408
BMI.8  1.0000000 0.9767038 0.9460700
BMI.9  0.9767038 1.0000000 0.9675484
BMI.10 0.9460700 0.9675484 1.0000000

Not surprisingly, there's a very high correlation between the BMI at the different visits for both races.

File translated from T_EX by T_TH, version 3.67.
On 28 Mar 2011, 23:44.