Data Frames

1  More on Data Frames

    Notice that if you want to extract more than one column of a data frame, you need to use single brackets, not double:
    > temps[c('min','max')]
        min  max
    1  50.7 59.5
    2  52.8 55.7
    3  48.6 57.3
    4  53.0 71.5
    5  49.9 69.8
    6  47.9 68.8
    7  54.1 67.5
    8  47.6 66.0
    9  43.6 66.1
    10 45.5 61.7
    > temps[[c('min','max')]]
    Error in .subset2(x, i, exact = exact) : subscript out of bounds
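
The difference between the two kinds of brackets can be checked on a small stand-in for temps. The three rows below are copied from the output above; the variable names both, one and vec are made up for this sketch.

```r
# Stand-in for the temps data frame: first three rows of the output above
temps = data.frame(min = c(50.7, 52.8, 48.6),
                   max = c(59.5, 55.7, 57.3))

both = temps[c('min','max')]   # single brackets, two names: a data frame
one  = temps['min']            # single brackets, one name: still a data frame
vec  = temps[['min']]          # double brackets, one name: the column itself

class(both)    # "data.frame"
class(one)     # "data.frame"
class(vec)     # "numeric"
```

Single brackets always give back a data frame, however many columns are requested, while double brackets extract exactly one column as a vector.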
    
    
  1. If you want to work with a data frame without having to constantly retype the data frame's name, you can use the with function. Suppose we want to convert our minimum and maximum temperatures to centigrade, and then calculate the difference between them. Using with, we can write:
    > with(temps,5/9*(max-32) - 5/9*(min-32))
     [1]  4.888889  1.611111  4.833333 10.277778 11.055556 11.611111  7.444444
     [8] 10.222222 12.500000  9.000000
    
    
    which may be more convenient than typing out the data frame name repeatedly:
    > 5/9*(temps$max-32) - 5/9*(temps$min-32)
     [1]  4.888889  1.611111  4.833333 10.277778 11.055556 11.611111  7.444444
     [8] 10.222222 12.500000  9.000000
    
    
  2. Finally, if the goal is to add one or more new columns to a data frame, you can combine a few operations into one using the transform function. The first argument to transform is the name of the data frame that will be used to construct the new columns. The remaining arguments to transform are name/value pairs describing the new columns. For example, suppose we wanted to create a new variable in the temps data frame called range, representing the difference between the min and max values for each day. We could use transform as follows:
    > temps = transform(temps,range = max - min)
    > head(temps)
        day  min  max range
      1   1 50.7 59.5   8.8
      2   2 52.8 55.7   2.9
      3   3 48.6 57.3   8.7
      4   4 53.0 71.5  18.5
      5   5 49.9 69.8  19.9
      6   6 47.9 68.8  20.9
    
    
    As can be seen, transform returns a new data frame like the original one, but with one or more new columns added.
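
As a rough sketch of how with and transform fit together, the following uses a three-row stand-in for temps (values taken from the output above). The column names minC and maxC and the variables cdiff and temps2 are invented for illustration.

```r
# Stand-in for the temps data frame: first three rows of the output above
temps = data.frame(day = 1:3,
                   min = c(50.7, 52.8, 48.6),
                   max = c(59.5, 55.7, 57.3))

# with() evaluates the expression using the columns of temps as variables
cdiff = with(temps, 5/9*(max-32) - 5/9*(min-32))

# transform() can add several columns in a single call
temps2 = transform(temps, minC = 5/9*(min-32), maxC = 5/9*(max-32))

names(temps)     # the original data frame is unchanged
names(temps2)    # the result has the two extra columns
```

Because transform returns a new data frame, the result has to be assigned (to a new name, or back to temps) for the added columns to be kept.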

2  Reading Data Frames from Files and URLs

While creating a data frame the way we just did is very handy for quick examples, it's actually pretty rare to enter a data frame in that way; usually we'll be reading data from a file or possibly a URL. In these cases, the read.table function (or one of its closely related variations described below) can be used. read.table tries to be clever about figuring out what type of data you'll be using, and automatically determines how each column of the data frame should be stored. One problem with this scheme has to do with a special type of variable known as a factor. A factor in R is a variable that is stored as an integer, but displayed as a character string. In versions of R before 4.0.0, read.table by default turns all the character variables that it reads into factors; from R 4.0.0 on, the default is to leave them as character strings. You can recognize factors by using either the is.factor function or by examining the object's class, using the class function. Factors are very useful for storing large data sets compactly, as well as for statistical modeling and other tasks, but when you're first working with R they'll most likely just get in the way. To prevent read.table from doing any factor conversions, whatever the version of R, pass the stringsAsFactors=FALSE argument as shown in the examples below.
By default, R expects there to be at least one space or tab between each of the data values in your input file; if you're using a different character to separate your values, you can specify it with the sep= argument. Two special versions of read.table are provided to handle two common cases: read.csv for files where the data is separated by commas, and read.delim when a tab character is used to separate values. On the other hand, if the variables in your input data occupy the same columns for every line in the file, the read.fwf function can be used to turn your data into a data frame.
If the first line of your input file contains the names of the variables in your data separated with the same separator used for the rest of the data, you can pass the header=TRUE argument to read.table and its variants, and the variables (columns) of your data frame will be named accordingly. Otherwise, names like V1, V2, etc. will be used.
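
The options just described can be tried without any files by passing the data through the text= argument of read.table and read.csv. The names and scores below are made up for this sketch, as are the variable names a, b, d and f.

```r
# Comma-separated values with a header line
csvtext = "name,score
ann,3.5
bob,4.0"
a = read.csv(text=csvtext, header=TRUE, stringsAsFactors=FALSE)

# The same values separated by semicolons, read with read.table and sep=
b = read.table(text="name;score
ann;3.5
bob;4.0", sep=';', header=TRUE, stringsAsFactors=FALSE)

# No header line: the columns get the default names V1 and V2
d = read.table(text="ann 3.5
bob 4.0", stringsAsFactors=FALSE)

# Fixed-width data: name in columns 1-3, score in columns 4-6
tc = textConnection(c("ann3.5", "bob4.0"))
f = read.fwf(tc, widths=c(3,3), stringsAsFactors=FALSE)
close(tc)

sapply(a, class)     # name is character, score is numeric
names(d)             # "V1" "V2"
is.factor(a$name)    # FALSE, since stringsAsFactors=FALSE was passed
```

For a real file or URL, the first argument would be the file name or address instead of text=.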
As an example of how to read data into a data frame, the URL http://www.stat.berkeley.edu/classes/s133/data/world.txt contains information about literacy, gross domestic product, income and military expenditures for about 150 countries. Here are the first few lines of the file:
country,gdp,income,literacy,military
Albania,4500,4937,98.7,56500000
Algeria,5900,6799,69.8,2.48e+09
Angola,1900,2457,66.8,183580000
Argentina,11200,12468,97.2,4.3e+09
Armenia,3900,3806,99.4,1.35e+08

(You can use your favorite browser to examine a file like this, or you can use R's download.file and file.edit functions to download a copy to your computer and examine it locally.)
Since the values are separated by commas, and the variable names can be found in the first line of the file, we can read the data into a data frame as follows:
> world = read.csv('http://www.stat.berkeley.edu/classes/s133/data/world.txt',header=TRUE,stringsAsFactors=FALSE)

Now that we've created the data frame, we need to look at some ways to understand what our data is like. The class and mode of objects in R are very important, but if we query them for our data frame, they're not very interesting:
> mode(world)
[1] "list"
> class(world)
[1] "data.frame"

Note that a data frame is also a list. We'll look at lists in more detail later. As we've seen, we can use the sapply function to see the modes of the individual columns. This function will apply a function to each element of a list; for a data frame these elements represent the columns (variables), so it will do exactly what we want:
> sapply(world,mode)
    country         gdp      income    literacy    military
"character"   "numeric"   "numeric"   "numeric"   "numeric"
> sapply(world,class)
    country         gdp      income    literacy    military
"character"   "integer"   "integer"   "numeric"   "numeric"

You might want to experiment with sapply using other functions to get familiar with some strategies for dealing with data frames.
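
As a starting point for such experiments, here is a small sketch that uses only the first five data lines of world.txt shown earlier, so that it runs without a network connection. The name world5 and the choice of functions are just for illustration.

```r
# First five data lines of world.txt, as listed earlier in this section
world5 = read.csv(text="country,gdp,income,literacy,military
Albania,4500,4937,98.7,56500000
Algeria,5900,6799,69.8,2.48e+09
Angola,1900,2457,66.8,183580000
Argentina,11200,12468,97.2,4.3e+09
Armenia,3900,3806,99.4,1.35e+08", stringsAsFactors=FALSE)

# Which columns hold numbers?
numcols = sapply(world5, is.numeric)
numcols

# How many missing values does each column have?
sapply(world5, function(x) sum(is.na(x)))

# Means of the numeric columns only
sapply(world5[numcols], mean)
```

Any function that accepts a vector can be used in place of is.numeric or mean; sapply simplifies the results to a named vector when each call returns a single value.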
You can always view the names of the variables in a data frame by using the names function, and the size (number of observations and number of variables) using the dim function:
> names(world)
[1] "country"  "gdp"      "income"   "literacy" "military"
> dim(world)
[1] 154   5

Suppose we want to see the country for which military spending is the highest. We can use the which.max function that we used before, but extra care is needed to make sure we get the piece of the data frame we want. Since each country occupies one row in the data frame, we want all of the columns in that row, and we can leave the second index of the data frame blank:
> world[which.max(world$military),]
    country   gdp income literacy  military
142     USA 37800  39496     99.9 3.707e+11

The 142 at the beginning of the line is the row number of the observation. If you'd like to use a more informative label for the rows, look at the row.names= argument in read.table and data.frame, or use the assignment form of the row.names function if the data frame already exists.
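
A minimal sketch of the assignment form of row.names, again using only the first five data lines of world.txt shown earlier (world5 and top are names chosen for this example):

```r
# First five data lines of world.txt, as listed earlier in this section
world5 = read.csv(text="country,gdp,income,literacy,military
Albania,4500,4937,98.7,56500000
Algeria,5900,6799,69.8,2.48e+09
Angola,1900,2457,66.8,183580000
Argentina,11200,12468,97.2,4.3e+09
Armenia,3900,3806,99.4,1.35e+08", stringsAsFactors=FALSE)

# Label the rows with the country names instead of 1, 2, 3, ...
row.names(world5) = world5$country

# The row with the largest military spending is now labeled by country
top = world5[which.max(world5$military),]
row.names(top)                   # "Argentina" among these five rows

# Rows can also be selected by their labels
world5['Angola','literacy']      # 66.8
```

Row names must be unique, so this approach only works when the labeling variable has no repeated values.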
These types of queries, where we want to find observations from a data frame that have certain properties, are so common that R provides a function called subset to make them easier and more readable. The subset function is usually called with two arguments: the first is a data frame, and the second is the condition that you want to use to create the subset. An optional third argument called select= allows you to specify which of the variables in the data frame you're interested in. The return value from subset is a data frame, so you can use it anywhere that you'd normally use a data frame. A very attractive feature of subset is that you can refer to the columns of a data frame directly in the second or third arguments; you don't need to keep retyping the data frame's name, or surround all the variable names with quotes. Suppose we want to find those countries whose literacy rate is below 20%. We could use the subset function like this:
> subset(world,literacy < 20)
         country  gdp income literacy military
22  Burkina Faso 1100   1258     12.8 64200000
88          Mali  900   1024     19.0 22400000
102        Niger  800    865     14.4 33300000
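
The condition and the select= argument can be combined in one call. The sketch below uses the first five data lines of world.txt shown earlier, with a cutoff of 90 chosen only so that this small sample returns some rows; world5 and res are names made up for the example.

```r
# First five data lines of world.txt, as listed earlier in this section
world5 = read.csv(text="country,gdp,income,literacy,military
Albania,4500,4937,98.7,56500000
Algeria,5900,6799,69.8,2.48e+09
Angola,1900,2457,66.8,183580000
Argentina,11200,12468,97.2,4.3e+09
Armenia,3900,3806,99.4,1.35e+08", stringsAsFactors=FALSE)

# Rows with literacy below 90, keeping only two of the columns;
# note that the column names are not quoted
res = subset(world5, literacy < 90, select=c(country, literacy))
res
```

The result is itself a data frame, here with the rows for Algeria and Angola and the columns country and literacy.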

One other nice feature of the select= argument is that it converts variable names to numbers before extracting the requested variables, so you can use "ranges" of variable names to specify contiguous columns in a data frame. For example, here are the names for the world data frame:
> names(world)
[1] "country"  "gdp"      "income"   "literacy" "military"

To create a data frame with just the last three variables, we could use
> subset(world,select=income:military)
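
Since the output of that command is not shown, here is a small check of what a range in select= produces, using the first five data lines of world.txt shown earlier. The names world5, last3 and nocountry are made up, and the negative form in the last call is an additional feature of select= that the text above does not cover.

```r
# First five data lines of world.txt, as listed earlier in this section
world5 = read.csv(text="country,gdp,income,literacy,military
Albania,4500,4937,98.7,56500000
Algeria,5900,6799,69.8,2.48e+09
Angola,1900,2457,66.8,183580000
Argentina,11200,12468,97.2,4.3e+09
Armenia,3900,3806,99.4,1.35e+08", stringsAsFactors=FALSE)

# income:military is treated like 3:5, the positions of those columns
last3 = subset(world5, select=income:military)
names(last3)        # "income" "literacy" "military"

# A minus sign drops a column instead of keeping it
nocountry = subset(world5, select=-country)
names(nocountry)    # "gdp" "income" "literacy" "military"
```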

If we were interested in a particular variable, it would be useful to reorder the rows of our data frame so that they were arranged in descending or ascending order of that variable. It's easy enough to sort a variable in R; using literacy as an example, we simply call the sort routine:
> sort(world$literacy)
 [1] 12.8 14.4 19.0 25.5 29.6 33.6 39.3 39.6 41.0 41.1 41.5 46.5 47.0 48.6 48.6
[16] 48.7 49.0 50.7 51.2 51.9 53.0 54.1 55.6 56.2 56.7 57.3 58.9 59.0 61.0 64.0
[31] 64.1 65.5 66.8 66.8 67.9 67.9 68.7 68.9 69.1 69.4 69.8 70.6 71.0 73.6 73.6
[46] 74.3 74.4 75.7 76.7 76.9 77.0 77.3 78.9 79.2 79.4 79.7 80.0 81.4 81.7 82.4
[61] 82.8 82.9 82.9 84.2 84.3 85.0 86.5 86.5 87.6 87.7 87.7 87.7 87.9 87.9 88.0
[76] 88.3 88.4 88.7 89.2 89.9 90.0 90.3 90.3 90.4 90.9 91.0 91.0 91.6 91.9 91.9
[91] 92.5 92.5 92.6 92.6 92.7 92.9 93.0 94.2 94.6 95.7 95.8 96.2 96.5 96.8 96.8
[106] 96.9 96.9 97.2 97.2 97.3 97.7 97.7 97.8 98.1 98.2 98.5 98.5 98.7 98.7 98.8
[121] 98.8 99.3 99.3 99.4 99.4 99.5 99.5 99.6 99.6 99.6 99.7 99.7 99.7 99.8 99.9
[136] 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9
[151] 99.9 99.9 99.9 99.9

To reorder the rows of a data frame to correspond to the sorted order of one of the variables in the data frame, the order function can be used. This function returns a set of indices which are in the proper order to rearrange the data frame appropriately. (Perhaps the easiest way to understand what the order function does is to realize that x[order(x)] is the same as sort(x).)
> sworld = world[order(world$literacy),]
> head(sworld)
         country  gdp income literacy  military
22  Burkina Faso 1100   1258     12.8  64200000
103        Niger  800    865     14.4  33300000
89          Mali  900   1024     19.0  22400000
29          Chad 1200   1555     25.5 101300000
121 Sierra Leone  500    842     29.6  13200000
14         Benin 1100   1094     33.6  96500000

To sort by descending values of a variable, pass the decreasing=TRUE argument to sort or order.
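
The relationship between order and sort, and the effect of decreasing=TRUE, can be seen on the first five data lines of world.txt shown earlier (world5, lit and desc are names chosen for this sketch):

```r
# First five data lines of world.txt, as listed earlier in this section
world5 = read.csv(text="country,gdp,income,literacy,military
Albania,4500,4937,98.7,56500000
Algeria,5900,6799,69.8,2.48e+09
Angola,1900,2457,66.8,183580000
Argentina,11200,12468,97.2,4.3e+09
Armenia,3900,3806,99.4,1.35e+08", stringsAsFactors=FALSE)

lit = world5$literacy

# order() returns positions, not values
order(lit)                               # 3 2 4 1 5

# Indexing by those positions gives the same result as sort()
identical(lit[order(lit)], sort(lit))    # TRUE

# Rows of the data frame from highest to lowest literacy
desc = world5[order(world5$literacy, decreasing=TRUE),]
desc$country
```

Note that the row labels of desc keep their original values, which is why the row numbers in the sorted output above are no longer in sequence.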
When you're first working with a data frame, it can be helpful to get some preliminary information about the variables. One easy way to do this is to pass the data frame to the summary function, which understands what a data frame is, and will give separate summaries for each of the variables:
> summary(world)
   country               gdp            income         literacy
 Length:154         Min.   :  500   Min.   :  569   Min.   :12.80
 Class :character   1st Qu.: 1825   1st Qu.: 2176   1st Qu.:69.17
 Mode  :character   Median : 4900   Median : 5930   Median :88.55
                    Mean   : 9031   Mean   :10319   Mean   :81.05
                    3rd Qu.:11700   3rd Qu.:15066   3rd Qu.:98.42
                    Max.   :55100   Max.   :63609   Max.   :99.90
                                    NA's   :    1
    military
 Min.   :6.500e+06
 1st Qu.:5.655e+07
 Median :2.436e+08
 Mean   :5.645e+09
 3rd Qu.:1.754e+09
 Max.   :3.707e+11
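
The NA's entry under income in the summary indicates one missing value in that column. As a hedged illustration of how missing values affect simple calculations, the vector below uses four income values from the first lines of the file, with an NA inserted by hand purely for the example:

```r
# Four income values from the file listing; the NA is inserted for illustration
income = c(4937, 6799, NA, 12468, 3806)

sum(is.na(income))          # 1 missing value
mean(income)                # NA: a missing value makes the mean missing
mean(income, na.rm=TRUE)    # 7002.5, computed from the four known values

# summary() drops the missing value and reports it in a separate NA's entry
summary(income)
```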

Another useful way to view the properties of a variable is with the stem function, which produces a text-based stem-and-leaf diagram. Each observation for the variable is represented by a number in the diagram showing that observation's value:
> stem(world$gdp)

  The decimal point is 4 digit(s) to the right of the |

  0 | 11111111111111111111111111112222222222222222222223333333333344444444
  0 | 55555555555666666666677777778889999
  1 | 000111111223334
  1 | 66788889
  2 | 0022234
  2 | 7778888999
  3 | 00013
  3 | 88
  4 |
  4 |
  5 |
  5 | 5

Graphical techniques are often useful when exploring a data frame. While we'll look at graphics in more detail later, the functions boxplot, hist, and plot combined with the density function are often good choices. Here are examples:
> boxplot(world$gdp,main='Boxplot of GDP')
> hist(world$gdp,main='Histogram of GDP')
> plot(density(world$gdp),main='Density of GDP')




File translated from TEX by TTH, version 3.67.
On 30 Jan 2011, 19:56.