Mapping and Lattice Plots

As can be seen, the default for multiple plots leaves quite a bit of space between the plots. The graphics parameter that controls this is called mar. The value that determines the spacing is a vector of length 4, with the number of lines of space on the bottom, left, top, and right of the plots. We can find the current value of graphics parameters by passing the name of the parameter to par:

> par('mar')
[1] 5.1 4.1 4.1 2.1

Let's remove one line from each of the top and bottom, and replot the series of graphs:

> par(mar=c(4.1,4.1,3.1,2.1),mfrow=c(4,2))
> sapply(names(airquality)[1:4],twoplot)
     Ozone Solar.R Wind Temp
[1,]  1216    1216 1216 1216
[2,]  1247    1247 1247 1247
[3,]  1277    1277 1277 1277
[4,]  1308    1308 1308 1308
[5,]  1339    1339 1339 1339
[6,]  1369    1369 1369 1369

The plot is shown below:

After plotting multiple plots, you can restore R's normal behaviour of plotting each plot on a separate page by resetting mfrow as follows:

> par(mfrow=c(1,1))

1 More on Barplots

We briefly looked at barplots when we examined the day of the week on which popular movies opened. You may recall that we passed the result of a call from the table function to barplot. This idea can be extended to create side-by-side or stacked barplots. As an alternative to using table, we can produce a matrix with the values we wish to plot. By providing dimnames for this matrix, barplot will be able to appropriately label the plot.

Once again consider the risk data set we used earlier. To simplify the plot, the different states in the data set can be grouped into regions, using the built-in state.name and state.region objects:

> risk = merge(risk,data.frame(state=state.name,region=state.region),by.x='CHSI_State_Name',by.y='state')

Suppose we're interested in comparing the average smoking, diabetes, and obesity rates for the different regions. First, we use the aggregate function to find the averages of the variables, broken down by regions:

>  mns = aggregate(risk[,c('Smoker','Diabetes','Obesity')],risk['region'],mean,na.rm=TRUE)
> mns
         region   Smoker Diabetes  Obesity
1     Northeast 22.62537 7.161972 22.13430
2         South 25.15847 8.754302 25.71493
3 North Central 22.28592 7.318565 24.16283
4          West 19.50740 6.349167 21.01306

If we try to pass this data frame to barplot, we'll get an error, since barplot needs a matrix. In addition, the column that identifies the different groups (in this case region), needs to be part of the row names, not a variable in the data frame itself. These two steps create the row names, and eliminate the region column from the data frame:

> row.names(mns) = mns$region
> mns$region = NULL

If we were to use the following statement to create the barplot:

> barplot(as.matrix(mns))

we'd find that the bars were representing the variables, not the regions. To fix this, we simply pass the transpose of the matrix to barplot:

> barplot(t(as.matrix(mns)))

Here's what the plot looks like:

This is known as a stacked barplot, and while it's useful for some kinds of data, I don't find it appropriate for this type of data. By passing the beside=TRUE argument to barplot, we can produce a side-by-side barchart, which I find more useful. The legend=TRUE argument helps to identify the bars:

> barplot(t(as.matrix(mns)),beside=TRUE,legend=TRUE)

Here's the result:

2 Mapping

We've looked at some data about different countries around the world, but so far we haven't taken advantage of the fact that the data has a geographic origin (other than trying to see if relationships were different among different continents by using color to represent continent in scatterplots.) The R library maps gives us the ability to use different colors to represent values of variables on actual maps of the world or the United States. A current concern in the United States has to do with the number of people without health insurance. Since this continues to be of great concern, it would be interesting to see if there are geographic patterns to the rate of ininsured adults in the different states.

The data for this example comes from the swivel.com website, and can be downloaded from the class website from http://www.stat.berkeley.edu/~spector/s133/data/insurance.csv.

> library(maps)
> ins = read.csv('http://www.stat.berkeley.edu/~spector/s133/data/insurance.csv',stringsAsFactors=FALSE)
> head(ins)
       State Employer Individual Medicaid Medicare Other.Public Uninsured
1    Alabama     0.54       0.03     0.15     0.13         0.01      0.14
2     Alaska     0.52       0.04     0.16     0.06         0.05      0.17
3    Arizona     0.47       0.05     0.16     0.13         0.01      0.18
4   Arkansas     0.47       0.06     0.15     0.14         0.02      0.17
5 California     0.48       0.07     0.16     0.09         0.01      0.18
6   Colorado     0.59       0.07     0.07     0.08         0.02      0.16

The first step in preparing a map (after loading the maps library) is determining a variable to use to provide color to the different regions. In this case we'll use the Uninsured variable. Our goal will be to break the data up into four groups, based on the value of the Uninsured variable, and then to color in each state in the map using a color based on the groups.

To create the four groups based on values of Uninsured, we can use the cut function. If we pass cut a variable and a number of groups, it will divide the range of the variable by that number, and assign each variable to one of the groups. Depending on how the variable is distributed, there may be more observations in one group than the other. Alternatively, we can provide cut with a breaks= argument, giving one more value than the number of groups we want, and it will assign the values to groups based on the ranges that the breaks define. Thus, if different values of the variable being used to determine the ranges have special meanings, the grouping can be customized using this argument. A final alternative is to guarantee nearly equal-sized groups by using quantiles of the variable in question as the breaks. The include.lowest=TRUE argument should be included to make sure that the smallest value gets properly classified. To create four equal-sized groups for the Uninsured variable, we could use the following call to cut:

> ugroups = cut(ins$Uninsured,quantile(ins$Uninsured,(0:4)/4),include.lowest=TRUE)
> head(ugroups)
[1] (0.13,0.17] (0.13,0.17] (0.17,0.24] (0.13,0.17] (0.17,0.24] (0.13,0.17]
Levels: [0.08,0.11] (0.11,0.13] (0.13,0.17] (0.17,0.24]

Notice that cut produces a factor by default; to suppress this, use the labels=FALSE argument. The factor, since it contains information about the actual values, will turn out to be quite useful.

The maps library provides three databases: "world", "state", and "county". For each identifier in the database, there is information on what polygons need to be drawn to create an outline of the area in question. For example, the entry identified by california in the state database would contain the information about California's borders in the form of polygon coordinates which the polygon function will draw when we ask for a map of California. While the help page for the map function implies that we can pass a vector of region names and colors directly to map, things are complicated by the fact that some states can't be plotted by a single polygon, and map gets confused about the proper colors to use when it needs to draw more than one polygon for a state.

One way around the problem is to create multiple entries in a vector of colors for those states that have more than one polygon. To do this, we need to look at the region names stored inside the database. We can get a vector of region names by calling map with the names=TRUE and plot=FALSE arguments:

> map.names = map('state',names=TRUE,plot=FALSE)

The regions which represent multiple polygons for a single state will always contain semicolons:

> grep(':',map.names,value=TRUE)
 [1] "massachusetts:martha's vineyard" "massachusetts:main"
 [3] "massachusetts:nantucket"         "michigan:north"
 [5] "michigan:south"                  "new york:manhattan"
 [7] "new york:main"                   "new york:staten island"
 [9] "new york:long island"            "north carolina:knotts"
[11] "north carolina:main"             "north carolina:spit"
[13] "virginia:chesapeake"             "virginia:chincoteague"
[15] "virginia:main"                   "washington:san juan island"
[17] "washington:lopez island"         "washington:orcas island"
[19] "washington:whidbey island"       "washington:main"

To properly connect our data with these region names, we first create a vector of state names corresponding to the regions in the data base:

> map.states = sub('^([^:]*):.*','\\1',map.names)

Now we can use the match function to see which observations in the ins dataframe correspond to the regions in the database. Since the database uses all lower case, I'll use the tolower function to convert the state names in the ins data frame to lower case:

> which.state = match(map.states,tolower(ins$State))

Now that we know how the state names line up with the region names, we can create a vector of colors to properly create our map:

> mycolors = c('gray','white','pink','red')
> thecolors = mycolors[ugroups[which.state]]

This process is complicated enough that it might be worth making a function to give us the proper vector of groupings. Here's such a function, created by putting together the steps we just followed:

mapgroups = function(db,myregions,mygroups,tolower=TRUE){
    map.names = map(db,names=TRUE,plot=FALSE)
    map.regions = gsub('^([^:]*):.*$','\\1',map.names)
    if(tolower)myregions = tolower(myregions)
    which.region = match(map.regions,myregions)
    mygroups[which.region]
}

(I've included the tolower= argument because not all the map databases use lower case.) Using this function (or the step-by-step approach that led to it), we can now make our plot:

> mycolors = c('gray','white','pink','red')
> thecolors = mycolors[mapgroups('state',ins$State,ugroups)]
> map('state',col=thecolors,fill=TRUE)

Plots like this should always have a legend:

> title('Uninsured Rates by State')
> legend('bottomleft',legend=levels(ugroups),fill=mycolors,cex=.8)

The map is pictured below.

There are a variety of functions in R which will help us choose colors that will be useful for plotting. Some of the functions you might want to investigate include rainbow, heat.colors, topo.colors, as well as the color.gradient function in the plotrix library, the colorpanel function in the gplots library. and the functions in the RColorBrewer library. Let's look at another example, also from swivel.com, concerning rates of childhood obesity. A copy of the dataset is available on the class website.

> obesity = read.csv('http://www.stat.berkeley.edu/~spector/s133/data/obesity.csv')

Before preceding, it turns out that the state names in this data set have some embedded blanks:

> head(as.character(obesity$State))
[1] " Alabama "    " Alaska "     " Arizona "    " Arkansas "   " California "
[6] " Colorado "

Left uncorrected, this would make our mapgroups function fail to find any matching states in the map data base, and when we plot the map, there would be no colors. If you think you're doing everything right, but there are no colors on the map, check to make sure that the state names are correct.

Naturally fixing something like this is not a problem:

> obesity$State = gsub('^ +','',obesity$State)
> obesity$State = gsub(' +$','',obesity$State)

Now we can proceed with producing the map. We'll use the Childhood.Obesity.Rate variable for the example, and a color scheme generated by the heat.colors function:

> mycolors = rev(heat.colors(4))
> ogroups = cut(obesity$Childhood.Obesity.Rate,
+      quantile(obesity$Childhood.Obesity.Rate,(0:4)/4),include.lowest=TRUE)
> thecolors = mycolors[mapgroups('state',obesity$State,ogroups)]
> map('state',col=thecolors,fill=TRUE)
> title('Childhood Obesity Rates by State')
> legend('bottomleft',legend=levels(ogroups),fill=mycolors,cex=.8)

The map appears below:

File translated from T_EX by T_TH, version 3.67.
On 27 Feb 2011, 08:31.