Graphics and Spreadsheets
1 Lattice Graphics
The lattice library is actually flexible enough to produce separate plots
for each state:
> xyplot(Diabetes~Prim_Care_Phys_Rate|CHSI_State_Name,data=risk,main='Diabetes vs. Physician Rate by Region')
Looking at the plot, we should make some elements of the plot
smaller, namely the titles on the strips and the points themselves:
> xyplot(Diabetes~Prim_Care_Phys_Rate|CHSI_State_Name,data=risk,main='Diabetes vs. Physician Rate by Region',cex=.5,par.strip.text=list(cex=.6))
This results in the following plot:
Finally, let's suppose that we want to break up the plots into two pages,
each with 25 plots in a 5x5 arrangement. To get exactly 50 states, we'll
use the subset= argument of the lattice functions to remove Alaska
(for which there's no data), and the layout= argument to arrange
the plots the way we want, and the page= argument to call a function
at the end of each page:
> xyplot(Diabetes~Prim_Care_Phys_Rate|CHSI_State_Name,data=risk,main='Diabetes vs. Physician Rate by Region',subset=CHSI_State_Name != 'Alaska',layout=c(5,5,2),page=readline)
You can run this example to see the two pages.
Now that we've seen some of the basics of how the lattice library routines
work, we'll take a look at some of the functions that are available. Remember
that there are usually similar alternatives available among the traditional
graphics functions, so you can think of these as additional choices that are
available, and not necessarily the only possibility for producing a particular
type of plot.
2 Univariate Displays
Univariate displays are plots that are concerned with the
distribution of a single variable,
possibly comparing the distribution among several subsamples of the
data. They are especially useful when you are first getting acquainted
with a data set, since you may be able to identify outliers or other
problems that could get masked by more complex displays or analyses.
2.1 dotplot
A simple but surprisingly useful display for small to moderate amounts of
univariate data is the dotplot. Each observation's value for a variable
is plotted as a dot along a line that spans the range of the variable's
value. In the usual case, there will be several such lines, one for each
level of a grouping variable, making it very easy to spot differences
in the variable's distribution for different groups.
To illustrate, we'll use a a data set from a wine recognition experiment where
a number of chemical and other measurements were taken on wines from three
cultivars.
The data is available at http://www.stat.berkeley.edu/~spector/s133/data/wine.data; information about
the variables is at http://www.stat.berkeley.edu/~spector/s133/data/wine.names
Suppose we are interested in comparing the alcohol
level of wines from the three different cultivars:
> wine = read.csv('http://www.stat.berkeley.edu/~spector/s133/data/wine.data',header=FALSE)
> names(wine) = c("Cultivar", "Alcohol", "Malic.acid", "Ash", "Alkalinity.ash",
+ "Magnesium", "Phenols", "Flavanoids", "NF.phenols", "Proanthocyanins",
+ "Color.intensity","Hue","OD.Ratio","Proline")
> wine$Cultivar = factor(wine$Cultivar)
> dotplot(Cultivar~Alcohol,data=wine,ylab='Cultivar')
The plot is shown below.
2.2 bwplot
The bwplot produces box/whisker plots. Unfortunately, notched
boxplots are not currently available using bwplot. To create
a box/whisker plot of Alcohol for the three cultivars, we can use
the same formula we passed to dotplot:
> bwplot(Alcohol~Cultivar,data=wine,xlab='Cultivar',ylab='Alcohol')
The plot is shown below.
For both dotplot and bwplot, if you switch the roles of
the variables in the formula, the orientation of the plot will change. In
other words, the lines in the dotplot will be displayed vertically instead
of horizontally, and the boxplots will be displayed horizontally instead
of vertically.
As a second example, consider the literacy rates for the different continents
in the world data set. We can compare these rates using the following command:
> bwplot(literacy~cont,data=world,main='Literacy Rates by Continent')
The plot is shown below
2.3 densityplot
As its name implies, this function produces smoothed plots of densities,
similar to passing a call to the density function to the plot
function. To compare multiple groups, it's best to create a conditioning
plot:
> densityplot(~Alcohol|Cultivar,data=wine)
Notice that, for plots like this, the formula doesn't have a left hand
side. The plot is shown below:
As another example of a custom panel function, suppose we wanted to
eliminate the points that are plotted near the x-axis and replace them
with what is known as a rug - a set of tickmarks pointing up from the
x-axis that show where the observations were. In practice, many people
simply define panel functions like this on the fly. After consulting
the help pages for panel.densityplot and panel.rug,
we could replace the points
with a rug as follows:
> densityplot(~Alcohol|Cultivar,data=wine,panel=function(x,...){
+ panel.densityplot(x,plot.points=FALSE)
+ panel.rug(x=x)
+ })
Of course, if you find it easier or more convenient to define a custom
panel function separate from the call to the plotting function, you can
use that method. Here's the result:
3 barchart
A bar chart is like a histogram for categorical data. The barchart
function expects its input data frame to already have the numbers of
observations for each grouping tabulated. For the simplest case of a single
variable with no conditioning variable, you can use a call to table
on the right hand side of the tilda to produce a vertical bar chart:
> barchart(~table(cont),data=world)
The plot is shown below.
For more complex barcharts, a data frame containing the counts to be plotted
needs to be constructed. This can be done easily using the table
in conjunction with as.data.frame. To illustrate, we'll return to
the movies data set which has the release dates and box office receipts for
some of the all-time most popular movies. Suppose we want to see if the
distribution of the day of the week the movies opened on has changed over
time. First, we'll read the data and create a grouping variable for different
time periods:
> movies = read.delim('http://www.stat.berkeley.edu/~spector/s133/data/movies.txt',as.is=TRUE,sep='|')
> movies$box = as.numeric(sub('\\$','',movies$box))
> movies$date = as.Date(movies$date,'%B %d, %Y')
> movies$year = as.numeric(format(movies$date,'%Y'))
> movies$weekday = factor(weekdays(movies$date),
+ levels=c('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'))
> movies$yrgrp = cut(movies$year,c(1936,1980,1985,1990,1995,2000,2006,2011),
+ labels=c('1937-1979','1980-1984','1985-1990','1991-1995','1996-2000','2001-2006','2007-2010'))
> counts = as.data.frame(table(yrgrp=movies$yrgrp,weekday=movies$weekday))
> barchart(Freq~weekday|yrgrp,data=counts,scales=list(x=list(rot=90)),as.table=TRUE)
The plot is shown below.
If the roles of Freq and weekday were reversed in the
previous call to barchart, the bars would be drawn horizontally
instead of vertically.
4 3-D Plots: cloud
The three-dimensional analog of the two-dimensional xyplot in the
lattice library is cloud. By using a conditioning variable,
cloud can allow us to consider the relationship of four variables
at once. To illustrate, here's conditioning plot showing the relationship
among four variables from the wine data frame:
cloud(Alcohol ~ Color.intensity + Malic.acid|cut(Hue,4),data=wine)
The plot is shown below:
One important point about lattice graphics is that they operate slightly
differently from R's traditional graphics. When you type a lattice
command at the top level R session, it's R's automatic printing that
actually displays the plot. So if you're calling a lattice function
in a function or source file, you'll need to surround the call with a
call to the print function, or store the result of the lattice
plot in a variable, and print that variable.
5 Spreadsheets
Spreadsheets, in particular the "xls" format used by Microsoft Excel, are
one of the most common (if not the most common) methods for transfering
data from one organization to another. When you download or receive a
spreadsheet, it's often a good idea to make sure that it really is a
spreadsheet. Comma-separated files, which are easily read into spreadsheet
programs, are often erroneously labeled "spreadsheets" or "Excel files",
and may even have an incorrect extension of ".xls".
In addition, Excel spreadsheets are often used to hold largely textual
information, so simply finding a spreadsheet that appears to contain data
can be misleading.
Furthermore, many
systems are configured to use a spreadsheet program to automatically open
files with an extension of ".csv", further adding to the confusion.
For many years, the only reliable way to read an Excel spreadsheet was with
Microsoft Excel, which is only available on Windows and Mac OS computers;
users on UNIX were left with no option other than finding a different computer.
In the last few years, a number of programs, notably gnumeric and
OpenOffice.org (usually
available through the ooffice command)
have been developed through careful reverse engineering to allow Unix users
the ability to work with these files.
To insure its advantage in the marketplace, Microsoft doesn't publish a
detailed description of exactly how it creates its spreadsheet files. There
are often "secret" ways of doing things that are only known to the
developers within Microsoft.
Reverse engineering means looking at the way a program handles different
kinds of files, and then trying to write a program that imitates what
the other program does.
The end result of all this is that there are several ways
to get the data from spreadsheet files into R.
Spreadsheets are organized as a collection of one or more sheets, stored in
a single file. Each of the sheets represents a rectangular display of rows
and columns, not unlike a data frame. Unfortunately, people often embed
text, graphics and pictures into spreadsheets, so it's not a good idea to
assume that a spreadsheet has the consistent structure of a data frame or
matrix, even if portions of it do appear to have that structure. In addition,
spreadsheets often have a large amount of header information (more than just
a list of variable names), sometimes making it challenging to figure out
the correct variable names to use for the different columns in the spreadsheet.
Our main concern will be in getting the information from the spreadsheet into
R:
we won't look at how to work with data within Excel.
If you are simply working with a single Excel spreadsheet, the easiest way
to get it into R is to open the spreadsheet with any compatible program, go
to the File menu, and select Save As. When the file selector dialog appears,
notice that you will be given an option of different file format choices,
usually through a drop down menu. Choose something like "Tab delimited text",
or "CSV (Comma separated values)". If any of the fields in the spreadsheet
contain commas, tab delimited is a better choice, but generally it doesn't
matter which one you use. Now, the file is (more or less) ready to
be read into R, using read.csv or read.delim. A very common
occurence is for fields stored in a spreadsheet to contain either single
or double quotes, in which case the quote= argument can be used to
make sure R understands what you want. Spreadsheets often arrange their
data so that there are a different number of valid entries on different lines,
which will confuse read.table; the fill= argument may be of
use in these cases. Finally, as mentioned previously, there are often multiple
descriptive lines of text at the top of a spreadsheet, so some experimentation
with the skip= argument to read.table may be necessary.
Alternatively, you can simply delete those lines from the file into which
you save the tab- or comma-delimited data.
As you can see, this is a somewhat time-consuming process which needs to be
customized for each spreadsheet. It's further complicated by the fact that
it only handles one sheet of a multi-sheet spreadsheet; in those cases the
process would need to be repeated for each sheet. If you need to read a
number of spreadsheets, especially ones where you need to access the data from
more than one sheet, you need to use a more programmable solution.
R provides several methods to read spreadsheets without having to save them as
tab- or comma-delimited files. The first is the read.xls function
in the gdata library. This function uses a Perl program (included
as part of the library), that converts the spreadsheet to a comma-separated
file, and then uses read.csv to convert it to a data frame.
In order for this library to work, perl must also be installed on
the computer that's running R. perl will be installed on virtually any
Unix-based computer (including Mac OSX), but will most likely not be on
most Windows computers (although once perl is installed on a Windows system, the
read.xls function will work with no problems). However there
is a Windows-only package , xlsReadWrite, available from CRAN
which can both read and write Excel files. (On other platforms, the
dataframes2xls package provides a write.xls function.)
Another method utilizes a facility known as ODBC (Open DataBase Connectivity),
which is a method that allows different programs that handle data to communicate
with each other. The ODBC interface uses a query language known as SQL, which
we'll study later, so for now I'll just describe the use of read.xls.
(Remember, for just one or two data tables from a spreadsheet, saving as a
tab- or comma-delimited
file and using the correct read.table variant will usually be the
easiest route. )
Since read.xls does nothing more than convert the spreadsheet to
comma-separated form and then call read.csv, the guidelines for
using it are very similar to the normal use of read.csv. The
most useful feature of read.xls is that it accepts an optional
sheet= argument, allowing multiple sheets from a spreadsheet file to
be automatically read.
File translated from
TEX
by
TTH,
version 3.67.
On 6 Mar 2011, 22:47.