4 Data Frames

We have used information about a 14-member family to demonstrate several features of the R language. This information appears in several vectors, each with 14 elements. It’s natural for us to picture this information arranged in a table, where each column of the table corresponds to a variable, like name, age, height and weight, and each row corresponds to a particular individual, such as the 77-year old named Tom in the first row and the 27 year-old named Sue in the 5th row. R provides a data structure, called a data frame, for collecting vectors into one object, which we can imagine as a table. More specifically, a data frame is an ordered collection of vectors, where the vectors must all be the same length but can be different types.

Example: A Data Frame of Measurements on a Family

The R objects that we loaded into our R session in Section 3.1 contains a data frame for our 14-member family. This data frame is called family. We can use head() to examine the first few rows of family with

head(family)
##   firstName sex age height weight      bmi overWt
## 1       Tom   m  77     70    175 25.16239   TRUE
## 2      Maya   f  33     64    124 21.50106  FALSE
## 3       Joe   m  79     73    185 24.45884  FALSE
## 4    Robert   m  47     67    156 24.48414  FALSE
## 5       Sue   f  27     61     98 18.51492  FALSE
## 6       Liz   f  33     68    190 28.94981   TRUE

The names of the variables in the data frame are not exactly the same as the names of the individual vectors, but we can see that these data are the same as in the separate vectors, e.g., fbmi and the column labeled bmi have the same values in the same order. These vectors are heterogenous in type. That is, firstName is a character vector, sex a factor, age is integer, bmi is numeric, and overWt is logical. The ordering of the vectors in the data frame is firstName, sex, age, and so on.

The types of some of the vectors are not immediately apparent from the examination of the head of the data frame. For example firstName and sex look similar in type. We can try to confirm this with a call to the class() function:

class(family)
## [1] "data.frame"

We did not get what we expected, i.e., the data types of all the variables in family. Calling class() with family returns the type of the family object, not the vectors that it contains. We will soon address how to get the data types of the vectors in family.

We can find additional information about the data frame with some of the same functions that we used to find information about vectors. For example, length() returns the number of vectors in the data frame, names() gives us the names of the vectors. Additionally, dim() provides the number of rows and columns in the data frame.

4.1 Dollar-Sign Notation

To access the vectors within a data frame we can use the $-notation. For example, we find the class of firstName and sex in family with

class(family$firstName)
## [1] "character"
class(family$sex)
## [1] "factor"

Now we have the answer that alluded us in our previous example.

Example: Exploring the family Data

We can use the $-notation to pass a vector in the family data frame as input to any of the functions we introduced in Chapter 3 (as long as these functions accept vectors as inputs). Of course, our family data is artificial so we can not make too much of what we find in our exploration, but it gives an idea as to what is possible.

We can find the average height and weight of the family members with, respectively,

mean(family$height)
## [1] 66.8571428571
mean(family$weight)
## [1] 157.714285714

Alternatively, the summary() function accepts a data frame as input and provides summary statistics for each variable. We call summary() with

summary(family)
##   firstName         sex        age                 height          
##  Length:14          f:6   Min.   :24.0000000   Min.   :61.0000000  
##  Class :character   m:8   1st Qu.:33.0000000   1st Qu.:65.2500000  
##  Mode  :character         Median :47.5000000   Median :67.0000000  
##                           Mean   :48.1428571   Mean   :66.8571429  
##                           3rd Qu.:58.0000000   3rd Qu.:68.0000000  
##                           Max.   :79.0000000   Max.   :73.0000000  
##      weight                bmi               overWt       
##  Min.   : 98.000000   Min.   :18.5149200   Mode :logical  
##  1st Qu.:128.750000   1st Qu.:22.7105300   FALSE:8        
##  Median :161.000000   Median :24.4714900   TRUE :6        
##  Mean   :157.714286   Mean   :24.6085507                  
##  3rd Qu.:182.500000   3rd Qu.:26.5116350                  
##  Max.   :215.000000   Max.   :30.0491100

Notice that summary() does not provide the same summary statistics for all the vectors in family. The factor, sex, is summarized with counts of the number of elements in each level, and the same type of summary is provided for the logical overWt. The character vector firstName is summarized only by its length. The summary of integer and numeric variables include mean and median, minimum and maximum, and upper and lower quartiles.

Lastly, we can also make a scatter plot of height and weight with

plot(y = family$height, x = family$weight, 
     pch = 19, col = 1 + family$overWt)
Example Scatter Plot

Figure 4.1: Example Scatter Plot

This scatter plot shows the heights and weights of the individuals in the artificial family that we are working with to demonstrate coding in R.
The red points correspond to over weight family members. We coerced overWt into a numeric vector with values 1 and 2 by adding 1 to each element, i.e., 1 + family$overWt. These numbers correspond to the colors black and red. The resulting plot appears in Figure 4.1

4.2 Applying Functions to Vectors in a Data Frame

Often, we want to apply a function to each variable in a data frame. For example, we may want to know each vector’s class. We have seen already that we can find this information for one vector with the $-notation, e.g.,

class(family$sex) 
## [1] "factor"

However, it is cumbersome to call class() 7 times, once for each of the variables, and we also need to know all of the variable names to make these 7 calls. Instead, we can ask R to apply the class() function to each of the vectors in family with

sapply(family, class)
##   firstName         sex         age      height      weight         bmi 
## "character"    "factor"   "integer"   "integer"   "integer"   "numeric" 
##      overWt 
##   "logical"

The sapply() function can be very useful when we work with data frames.
sapply() accepts additional arguments which can be passed on to the function that is being applied to each vector in the data frame. See the example below for an application of this concept.

Example: Quantiles of the Family Measurements

We can find the median for each of the variables in family with,

sapply(family, median)

Error in median.default(X[[2L]], ...) : need numeric data
In addition: Warning message:
In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
  argument is not numeric or logical: returning NA

Unfortunately, since firstName and sex are character and factor vectors, respectively, the median() function returns an error indicating that it takes only numeric inputs. If we want to avoid taking the median of these two variables, we can apply the median() function to all but the first two variables by excluding them. The select() function in the dplyr package allows us to choose a subset of the variables in a data frame to work with. Below, we select columns 3, 4, …, 7 of the data frame, and pass this smaller data frame to sapply().

require(dplyr)
sapply(select(family, 3:7), median)
##       age    height    weight       bmi    overWt 
##  47.50000  67.00000 161.00000  24.47149   0.00000

We have used subsetting to drop firstName and sex from the data frame (see Chapter ?? for more about subsetting). R applies the median() function to the remaining variables in the data frame. Also, notice that the logical vector, overWt, has been converted to numeric by the median() function.

Suppose that we want the lower quartile of these vectors, rather than the median. We can find the lower quartile for height with quantile(family$height, probs = 0.25). In addition to supplying the input family$height to the quantile() function, we also have provided the particular quantile that we want via the probs argument. In order to find the lower quantile of all the numeric vectors in family, we can use sapply(), but we need to provide the probs argument to quantile(). We can do this with

sapply(select(family,"age":"overWt"), quantile, probs = 0.25)
##    age.25% height.25% weight.25%    bmi.25% overWt.25% 
##   33.00000   65.25000  128.75000   22.71053    0.00000

Notice that we subsetted the columns by name in our call to subset(). This function allows the specification of a range of variables using their names, e.g., "age":"overWt" is shorthand for c("age", "height", "weight", "bmi", "overWt"). This kind of sequence is recognized by functions in dplyr.

R offers several apply functions for working with data frames and other data structures. We only describe sapply() here. Others, such as lapply(), tapply(), and apply() are discussed in Chapter ??.

4.3 Summary: Data Frames

Data from a study or an experiment often consist of different types of measurements on a set of subjects or experimental units. For example, the World Bank provides summary statistics on countries around the world. In this case, the unit is the country and the measurements include gross domestic product, life expectancy, and literacy rate. These data are naturally organized into a table format with one row for each country and one column for each summary statistic.

The data frame is a data structure designed for the purpose of working with data that naturally organize into rows that correspond to observations/units/individuals and columns that correspond to measurements take on the observations.

  • Rows: Each row in a data frame corresponds to a subject in the study, unit in an experiment, etc. The various measurements on a subject are in one row of the data frame.

  • Columns: The columns in a data frame correspond to variables. These can be different types. For example, a health study may contain height (numeric), sex (factor), education level (orded factor), and a unique identifier (character) for all subjects.

  • Working with Data Frames: The data frame facilitates many kinds of statistical analyses. For example, we can easily examine the relationship between income and sex using the formula income ~ sex in a call to the plot() function. And this same formula can be used to compare the average income levels for the two sexes via the lm() function. In both cases, the function performs a computation on the columns in a data frame. Depending on the data types, a different plot is created or a different model is fitted. For example try running the following code and comparing the results:

plot(height ~ weight, data = family)
plot(height ~ sex, data = family)

lm(height ~ weight, data = family)
lm(height ~ sex, data = family)