Data Frames and Plotting
1 Reading Data Frames from Files and URLs
It's actually pretty rare to enter a data frame the way we've done in these
examples; usually you'll be reading data from a file or possibly a URL. In
these cases, the read.table function (or one of its closely
related variations described below) can be used. read.table tries
to be clever about figuring out what type of data you'll be using, and
automatically determines how each column of the data frame should be
stored. One problem with this scheme has to do with a special type of
variable known as a factor. A factor in R is a variable that is stored as
an integer, but displayed as a character string.
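By default, read.table will automatically turn all the character
variables that it reads into factors. (This was the default in versions of R
before 4.0.0; later versions leave character variables alone unless you ask
for the conversion.)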
You can recognize factors
by using either the is.factor function or by examining the
object's class, using the class function. Factors are very useful
for storing large data sets compactly, as well as for statistical modeling
and other tasks, but when you're first working with R they'll most likely
just get in the way. To prevent read.table from doing any factor
conversions, pass the stringsAsFactors=FALSE argument as shown in the examples
below.
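As a small sketch of the factor behavior described above, we can read the same data twice, once with and once without the conversion. The two rows of inline data passed through text= are copied from the sample file shown later; the object names f and s are just for illustration.

```r
# A tiny comma-separated data set supplied inline instead of from a file.
txt = 'country,gdp
Albania,4500
Algeria,5900'

# Ask for the factor conversion explicitly.
f = read.csv(text=txt, stringsAsFactors=TRUE)
is.factor(f$country)     # TRUE
class(f$country)         # "factor"
as.integer(f$country)    # the integer codes stored behind the labels

# Suppress the conversion: character columns stay character.
s = read.csv(text=txt, stringsAsFactors=FALSE)
is.factor(s$country)     # FALSE
class(s$country)         # "character"
```

Passing stringsAsFactors explicitly, whichever value you want, means the result does not depend on the default of the R version you happen to be running.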
By default, R expects there to be at least one space or tab between each
of the data values in your input file; if you're using a different character
to separate your values, you can specify it with the sep= argument.
Two special versions of read.table are provided to handle two
common cases: read.csv for files where the data is separated by
commas, and read.delim when a tab character is used to separate
values. On the other hand, if the variables in your input data occupy
the same columns for every line in the file, the read.fwf function can be
used to turn your data into a data frame.
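Here is a hedged sketch of two of these cases: a file that uses semicolons as separators, and a fixed-width file. The data lines are made up for illustration and passed inline rather than read from a file; the column widths (10 and 6) are assumptions that match only this toy input.

```r
# Values separated by semicolons: tell read.table with sep=.
semi = 'Albania;4500;98.7
Algeria;5900;69.8'
a = read.table(text=semi, sep=';', stringsAsFactors=FALSE)

# Fixed-width input: country in columns 1-10, gdp in columns 11-16.
fixed = c('Albania     4500',
          'Algeria     5900')
con = textConnection(fixed)
b = read.fwf(con, widths=c(10,6), col.names=c('country','gdp'),
             strip.white=TRUE, stringsAsFactors=FALSE)
close(con)

a
b
```

The strip.white=TRUE argument is passed on to read.table, and removes the padding blanks that would otherwise remain in the character column.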
If the first line of your input file contains the names of the variables in
your data separated with the same separator used for the rest of the data,
you can pass the header=TRUE argument to read.table and
its variants, and the variables (columns) of your data frame will be
named accordingly. Otherwise, names like V1, V2, etc. will
be used.
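A quick way to see the effect of header=TRUE is to read the same input both ways. This is only a sketch using two inline rows copied from the sample file shown below; named and unnamed are illustrative object names.

```r
txt = 'country,gdp
Albania,4500
Algeria,5900'

# With header=TRUE the first line supplies the variable names.
named = read.table(text=txt, sep=',', header=TRUE, stringsAsFactors=FALSE)
names(named)      # "country" "gdp"

# Without it, the first line is treated as data and default names are used.
unnamed = read.table(text=txt, sep=',', stringsAsFactors=FALSE)
names(unnamed)    # "V1" "V2"
nrow(unnamed)     # 3
```

Notice that in the second case the gdp column is read as character, because the word gdp is now one of its values.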
As an example of how to read data into a data frame, the URL
http://www.stat.berkeley.edu/~spector/s133/data/world.txt
contains information about literacy, gross domestic product, income and
military expenditures for about 150 countries. Here are the first few
lines of the file:
country,gdp,income,literacy,military
Albania,4500,4937,98.7,56500000
Algeria,5900,6799,69.8,2.48e+09
Angola,1900,2457,66.8,183580000
Argentina,11200,12468,97.2,4.3e+09
Armenia,3900,3806,99.4,1.35e+08
Since the values are separated by commas, and the variable names
can be found in the first line of the file, we can read the data into
a data frame as follows:
world = read.csv('http://www.stat.berkeley.edu/~spector/s133/data/world.txt',header=TRUE,stringsAsFactors=FALSE)
Now that we've created the data frame, we need to look at some ways to
understand what our data is like. The class and mode of objects in R are
very important, but if we query them for our data frame, they're not
very interesting:
> mode(world)
[1] "list"
> class(world)
[1] "data.frame"
Note that a data frame is also a list. We'll look at lists in more
detail later.
In order to see the modes and classes of the individual columns,
we can use the sapply function. This function will apply a
function to each element of a list; for a data frame these elements
represent the columns (variables), so it will do exactly what we want:
> sapply(world,mode)
    country         gdp      income    literacy    military
"character"   "numeric"   "numeric"   "numeric"   "numeric"
> sapply(world,class)
    country         gdp      income    literacy    military
"character"   "integer"   "integer"   "numeric"   "numeric"
You might want to experiment with sapply using other functions
to get familiar with some strategies for dealing with data frames.
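For instance, here are a few other functions that are often handy with sapply. The three-row data frame df is a made-up stand-in for world (with one missing value added on purpose), so the sketch runs without downloading anything.

```r
df = data.frame(country=c('Albania','Algeria','Angola'),
                gdp=c(4500,5900,NA),
                literacy=c(98.7,69.8,66.8),
                stringsAsFactors=FALSE)

isnum = sapply(df, is.numeric)                  # which columns are numeric?
nmiss = sapply(df, function(x) sum(is.na(x)))   # missing values per column
means = sapply(df[isnum], mean, na.rm=TRUE)     # means of the numeric columns

isnum
nmiss
means
```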
You can always view the names of the variables in a data frame by using
the names function, and the size (number of observations and
number of variables) using the dim function:
> names(world)
[1] "country" "gdp" "income" "literacy" "military"
> dim(world)
[1] 154 5
Suppose we want to see the country for which military spending is the
highest. We can still use logical subscripts just as we did with vectors,
but extra care is needed to make sure we get the piece of the data frame
we want. Since each country occupies one row in the data frame, we want
all of the columns in that row, and we can leave the second index of the
data frame blank:
> world[world$military == max(world$military,na.rm=TRUE),]
    country   gdp income literacy  military
141     USA 37800  39496     99.9 3.707e+11
The 141 at the beginning of the line is the row number of
the observation. If you'd like to use a more informative label for the
rows, look at the row.names= argument in read.table and
data.frame, or use the assignment form of the row.names
function if the data frame already exists.
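As a sketch of the assignment form of row.names, using a two-row data frame built from values shown in these notes (df is just an illustrative name):

```r
df = data.frame(country=c('Albania','USA'),
                gdp=c(4500,37800),
                stringsAsFactors=FALSE)

# Use the country names as row labels; they must be unique.
row.names(df) = df$country

df['USA',]         # rows can now be selected by label
df['USA','gdp']    # 37800
```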
These types of queries, where we want to find observations from a data frame
that have certain properties, are so common that R provides a function called
subset to make them easier and more readable. The subset
function usually takes two arguments: the first is a data frame, and the second is
the condition that you want to use to create the subset.
An optional third argument called select= allows you to specify which
of the variables in the data frame you're interested in.
The return value from subset is a data frame, so you can use it anywhere
that you'd normally use a data frame.
A very attractive
feature of subset is that you can refer to the columns of a data frame directly
in the second or third arguments; you don't need to keep retyping the data frame's name,
or surround all the variable names with quotes.
So the previous query could be rewritten as:
> subset(world,military==max(military,na.rm=TRUE))
    country   gdp income literacy  military
141     USA 37800  39496     99.9 3.707e+11
One other nice feature of the select= argument is that it converts
variable names to numbers before extracting the requested variables, so you
can use "ranges" of variable names to specify contiguous columns in a
data frame. For example, here are the names for the world data
frame:
> names(world)
[1] "country" "gdp" "income" "literacy" "military"
To create a data frame with just the last three variables, we could
use
> subset(world,select=income:military)
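The condition and the select= argument can also be combined in one call. The following sketch uses the first four rows of the sample file shown earlier as a small stand-in for world; the cutoffs (90 and 5000) are arbitrary choices for illustration.

```r
df = data.frame(country=c('Albania','Algeria','Angola','Argentina'),
                gdp=c(4500,5900,1900,11200),
                income=c(4937,6799,2457,12468),
                literacy=c(98.7,69.8,66.8,97.2),
                stringsAsFactors=FALSE)

# Rows with high literacy, keeping only two named columns.
hi = subset(df, literacy > 90, select=c(country,income))

# Rows with low gdp, keeping a contiguous range of columns.
rng = subset(df, gdp < 5000, select=gdp:literacy)

hi
rng
```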
If we were interested in a particular variable, it would be useful to
reorder the rows of our data frame so that they were arranged in descending
or ascending order of that variable. It's easy enough to sort a variable
in R; using literacy as an example, we simply call the sort
routine:
> sort(world$literacy)
[1] 12.8 14.4 19.0 25.5 29.6 33.6 39.3 39.6 41.0 41.1 41.5 46.5 47.0 48.6 48.6
[16] 48.7 49.0 50.7 51.2 51.9 53.0 54.1 55.6 56.2 56.7 57.3 58.9 59.0 61.0 64.0
[31] 64.1 65.5 66.8 66.8 67.9 67.9 68.7 68.9 69.1 69.4 69.8 70.6 71.0 73.6 73.6
[46] 74.3 74.4 75.7 76.7 76.9 77.0 77.3 78.9 79.2 79.4 79.7 80.0 81.4 81.7 82.4
[61] 82.8 82.9 82.9 84.2 84.3 85.0 86.5 86.5 87.6 87.7 87.7 87.7 87.9 87.9 88.0
[76] 88.3 88.4 88.7 89.2 89.9 90.0 90.3 90.3 90.4 90.9 91.0 91.0 91.6 91.9 91.9
[91] 92.5 92.5 92.6 92.6 92.7 92.9 93.0 94.2 94.6 95.7 95.8 96.2 96.5 96.8 96.8
[106] 96.9 96.9 97.2 97.2 97.3 97.7 97.7 97.8 98.1 98.2 98.5 98.5 98.7 98.7 98.8
[121] 98.8 99.3 99.3 99.4 99.4 99.5 99.5 99.6 99.6 99.6 99.7 99.7 99.7 99.8 99.9
[136] 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9 99.9
[151] 99.9 99.9 99.9 99.9
To reorder the rows of a data frame to correspond to the sorted order of one of
the variables in the data frame, the order function can be used.
This function returns a set of indices which are in the proper order to
rearrange the data frame appropriately:
> sworld = world[order(world$literacy),]
> head(sworld)
         country  gdp income literacy  military
22  Burkina Faso 1100   1258     12.8  64200000
103        Niger  800    865     14.4  33300000
89          Mali  900   1024     19.0  22400000
29          Chad 1200   1555     25.5 101300000
121 Sierra Leone  500    842     29.6  13200000
14         Benin 1100   1094     33.6  96500000
To sort by descending values of a variable, pass the decreasing=TRUE
argument to sort or order.
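A short sketch of descending order, again using four rows from the sample file as a stand-in for world. For a numeric variable, negating it inside order gives the same arrangement.

```r
df = data.frame(country=c('Albania','Algeria','Angola','Argentina'),
                literacy=c(98.7,69.8,66.8,97.2),
                stringsAsFactors=FALSE)

# Highest literacy first.
top = df[order(df$literacy, decreasing=TRUE),]

# Equivalent for numeric variables: sort on the negated values.
top2 = df[order(-df$literacy),]

top
```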
When you're first working with a data frame, it can be helpful to get some
preliminary information about the variables. One easy way to do this is to
pass the data frame to the summary function, which understands what
a data frame is, and will give separate summaries for each of the variables:
> summary(world)
country gdp income literacy
Length:154 Min. : 500 Min. : 569 Min. :12.80
Class :character 1st Qu.: 1825 1st Qu.: 2176 1st Qu.:69.17
Mode :character Median : 4900 Median : 5930 Median :88.55
Mean : 9031 Mean :10319 Mean :81.05
3rd Qu.:11700 3rd Qu.:15066 3rd Qu.:98.42
Max. :55100 Max. :63609 Max. :99.90
NA's : 1
military
Min. :6.500e+06
1st Qu.:5.655e+07
Median :2.436e+08
Mean :5.645e+09
3rd Qu.:1.754e+09
Max. :3.707e+11
Another useful way to view the properties of a variable is with the stem
function, which produces a text-based stem-and-leaf diagram. Each observation
for the variable is represented by a number in the diagram showing that
observation's value:
> stem(world$gdp)
The decimal point is 4 digit(s) to the right of the |
0 | 11111111111111111111111111112222222222222222222223333333333344444444
0 | 55555555555666666666677777778889999
1 | 000111111223334
1 | 66788889
2 | 0022234
2 | 7778888999
3 | 00013
3 | 88
4 |
4 |
5 |
5 | 5
Graphical techniques are often useful when exploring a data frame. While we'll
look at graphics in more detail later, the functions boxplot,
hist, and plot combined with the density function
are often good choices. Here are examples:
> boxplot(world$gdp,main='Boxplot of GDP')
> hist(world$gdp,main='Histogram of GDP')
> plot(density(world$gdp),main='Density of GDP')
Suppose we want to add some additional information to our data frame, for example
the continents in which the countries can be found. Very often we have information
from different sources and it's very important to combine it correctly. The URL
http://www.stat.berkeley.edu/~spector/s133/data/conts.txt contains the information about the continents. Here are
the first few lines of that file:
country,cont
Afghanistan,AS
Albania,EU
Algeria,AF
American Samoa,OC
Andorra,EU
In R, the merge function allows you to combine two data frames
based on the value of a variable that's common to both of them. The new data
frame will have all of the variables from both of the original data frames. First,
we'll read the continent values into a data frame called conts:
conts = read.csv('http://www.stat.berkeley.edu/~spector/s133/data/conts.txt',na.strings='.',stringsAsFactors=FALSE)
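One likely reason for overriding the missing value string here (this is an interpretation, not something the file itself tells us) is that the continent code for North America is NA, which read.csv would otherwise treat as a missing value. The full name of the argument is na.strings; R also accepts the abbreviation na.string through partial argument matching. The two inline rows below are made up to show the difference.

```r
txt = 'country,cont
Canada,NA
Albania,EU'

# Default: the string NA is read as a missing value.
d1 = read.csv(text=txt, stringsAsFactors=FALSE)
is.na(d1$cont)

# With a different missing value string, NA survives as ordinary text.
d2 = read.csv(text=txt, na.strings='.', stringsAsFactors=FALSE)
d2$cont
```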
To merge two data frames, we simply need to tell the merge function which
variable(s) the two data frames have in common, in this case country:
world1 = merge(world,conts,by='country')
Notice that we pass the name of the variable that we want to merge by, not the
actual value of the variable itself. The first few records of the merged data
set look like this:
> head(world1)
    country   gdp income literacy   military cont
1   Albania  4500   4937     98.7 5.6500e+07   EU
2   Algeria  5900   6799     69.8 2.4800e+09   AF
3    Angola  1900   2457     66.8 1.8358e+08   AF
4 Argentina 11200  12468     97.2 4.3000e+09   SA
5   Armenia  3900   3806     99.4 1.3500e+08   AS
6 Australia 28900  29893     99.9 1.6650e+10   OC
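It's worth knowing that, by default, merge keeps only the observations whose key appears in both data frames. The sketch below uses two tiny made-up data frames (w and cn, standing in for world and conts; the country Atlantis is invented so that one row has no match) to show the default and the all.x=TRUE alternative.

```r
w = data.frame(country=c('Albania','Algeria','Atlantis'),
               gdp=c(4500,5900,100),
               stringsAsFactors=FALSE)
cn = data.frame(country=c('Afghanistan','Albania','Algeria'),
                cont=c('AS','EU','AF'),
                stringsAsFactors=FALSE)

# Default: only countries found in both data frames are kept.
m = merge(w, cn, by='country')
nrow(m)      # 2

# all.x=TRUE keeps every row of the first data frame,
# filling in NA where the second has no match.
m2 = merge(w, cn, by='country', all.x=TRUE)
m2
```

Comparing nrow before and after a merge is a simple check that no observations were silently dropped.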
We've already seen how to count specific conditions, like how many countries
in our data frame are in Europe:
> sum(world1$cont == 'EU')
[1] 34
It would be tedious to have to repeat this for each of the continents. Instead,
we can use the table function:
> table(world1$cont)
AF AS EU NA OC SA
47 41 34 15  4 12
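The result of table can be stored and used like a named vector. As a sketch, here are the six continent codes from the first rows of the merged data shown above:

```r
cont = c('EU','AF','AF','SA','AS','OC')

counts = table(cont)
counts['AF']                            # the count for a single value
sort(counts, decreasing=TRUE)           # most common values first
names(counts)[counts == max(counts)]    # "AF"
```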
We can now examine the variables taking into account the continent that each
country is in. For example, suppose we wanted to view the literacy rates of countries in
the different continents. We can produce side-by-side boxplots like this:
> boxplot(split(world1$literacy,world1$cont))
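To see what boxplot is being given here, it helps to look at the result of split on its own: a list with one vector per group. The sketch uses the literacy values and continent codes from the first six rows of the merged data shown above.

```r
literacy = c(98.7,69.8,66.8,97.2,99.4,99.9)
cont = c('EU','AF','AF','SA','AS','OC')

groups = split(literacy, cont)
names(groups)     # one element per continent
groups$AF         # the literacy values for Africa

# Since the result is a list, sapply can summarize each group.
meds = sapply(groups, median)
meds
```

The boxplot function also accepts a formula, so boxplot(literacy ~ cont, data=world1) is another way to ask for the same kind of display.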
Now let's concentrate on plots involving two variables. It may be surprising,
but R is smart enough to know how to "plot" a data frame. It actually calls
the pairs function, which will produce what's called a scatterplot
matrix. This is a display with many little graphs showing the relationships
between each pair of variables in the data frame. Before we can call
plot, we need to remove the character variables (country
and cont) from the data using negative subscripts:
> plot(world1[,-c(1,6)])
The resulting plot looks like this:
As we'd expect, gdp (Gross Domestic Product) and income seem
to have a very consistent relationship. The relation between literacy
and income appears to be interesting, so we'll examine it in more
detail, by making a separate plot for it:
> with(world,plot(literacy,income))
The first variable we pass to plot (literacy in this
example) will be used for the x-axis, and the second (income) will
be used on the y-axis. The plot looks like this:
In many cases, the most interesting points on a graph are the ones that don't
follow the usual relationships. In this case, there are a few points where the
income is a bit higher than we'd expect based on the other countries, considering
the rate of literacy. To see which countries they represent, we can use the
identify function. You call identify with the same arguments
as you passed to plot; then when you click on a point on the graph
with the left mouse button, its row number will be printed on the graph. It's
usually helpful to have more than just the row number, so identify is usually
called with a labels= argument. In this case, the obvious choice is
the country name. The way to stop identifying points depends on your operating
system; on Windows, right click on the plot and choose "Stop"; on Unix/Linux
click on the plot window with the middle button. Here's the previous graph
after some of the outlier points are identified:
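The steps for identify can be sketched as follows. Since identify needs a graphics window and mouse clicks, the calls are wrapped in a check for an interactive session; the small data frame df is a stand-in for world built from rows shown earlier in these notes.

```r
df = data.frame(country=c('Albania','Algeria','Angola','USA'),
                literacy=c(98.7,69.8,66.8,99.9),
                income=c(4937,6799,2457,39496),
                stringsAsFactors=FALSE)

if (interactive()) {
    with(df, plot(literacy, income))
    # Click near points to label them with the country name,
    # then stop as described above for your operating system.
    picked = with(df, identify(literacy, income, labels=country))
    # identify returns the row numbers of the chosen points.
    df[picked,]
}
```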
File translated from TeX by TTH, version 3.67, on 1 Feb 2010, 10:11.