We have had good results using S-Plus and R
in our courses to analyze Stat Labs data. Currently we are using R because
it is free, and versions are available for the Unix, Mac, and PC
environments. The syntax of the two languages is nearly identical.
R can be downloaded from
The Comprehensive R Archive Network,
where you can also find a User's Guide.
We also provide here a short list of commonly used commands
and their syntax.
S-Plus is available from
Insightful,
and a free student version is available from this
link.
Instructions for the student version are available
here.
We also recommend
Introductory Statistics with R by P. Dalgaard and
An Introduction to S and S-Plus by
P. Spector.
FAQ:
Here you will find
answers to our students' frequently asked questions on how to use R and
S-Plus. The responses provided are not a comprehensive user's guide
to the R or S-Plus language. Refer to the user's guides listed on this
page for a more thorough introduction to these languages.
The following examples all use the birth weight data, called babies. It contains
the variables bwt (birth weight in ounces), smoke (0=nonsmoker, 1=yes
now, 2=until pregnant, 3=once did, not now, 9=unknown), wt (mother's
prepregnancy weight), and educ (categories 0-7, 9=unknown).
Data manipulations
 Getting data from the web into R
 Replacing values
The 999 values in bwt denote missing values. To replace 999 with NA:
replace(bwt,bwt==999,NA)
The 9 values in smoke denote missing values. To replace 9 with NA and
recode 2 and 3 as 0, for nonsmoker:
ismoke <- replace(smoke,smoke==2 | smoke==3,0)
ismoke[ismoke==9] <- NA
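As a quick check of the recoding steps, the sketch below applies them to a small vector. The values in smoke here are made up for illustration and are not taken from the babies data.

```r
# Hypothetical smoking codes, invented for illustration only
smoke <- c(0, 1, 2, 3, 9, 1)

# Recode 2 and 3 as 0 (nonsmoker)
ismoke <- replace(smoke, smoke == 2 | smoke == 3, 0)

# Mark the unknown code 9 as missing
ismoke[ismoke == 9] <- NA

# Tabulate the result, keeping the missing value visible
print(table(ismoke, useNA = "ifany"))
```

Note that replace() returns a new vector, so the result has to be assigned to a name if it is to be used later.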
To convert the values of gestation into weeks, and to also collapse these
values into 5 categories:
gestwks <- floor(gestation/7)
gestcut <- cut(gestwks,5)
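When cut() is given a single number, it divides the range of the data into that many equal-width intervals. If particular cutoffs are wanted instead, they can be supplied through the breaks argument. The sketch below assumes made-up gestation lengths in days and hypothetical week cutoffs; neither comes from the babies data.

```r
# Made-up gestation lengths in days, for illustration only
gestation <- c(240, 259, 266, 280, 287, 301)

# Completed weeks of gestation
gestwks <- floor(gestation / 7)

# Hypothetical cutoffs; intervals are open on the left, closed on the right
gestcut <- cut(gestwks, breaks = c(0, 36, 38, 40, 42, Inf))

# Count the observations in each category
print(table(gestcut))
```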
 Creating matrices and vectors
To create a matrix of 0s and 1s, where there is one column for
each education level, and the 1s in that column indicate the mother has that
education level (multiplying by 1 turns TRUE/FALSE into 1/0):
1*outer(educ,unique(educ),"==")
To create a 3 by 4 matrix of all 0s:
matrix(0,nrow=3,ncol=4)
To create a vector (0, 0.1, ..., 0.9, 1):
seq(from=0,to=1,by=0.1)
To create a vector (1,1,1,12,34,34):
c(1,1,1,12,34,34)
or
rep(c(1,12,34),c(3,1,2))
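To see what the indicator matrix from outer() looks like, here is a minimal sketch on a made-up educ vector (the codes are invented for illustration). Sorting the levels and labelling the columns are additions that make the output easier to read.

```r
# Made-up education codes, for illustration only
educ <- c(1, 3, 1, 2)

# One column per distinct level, in sorted order
levs <- sort(unique(educ))

# outer() compares every element of educ with every level;
# multiplying by 1 converts TRUE/FALSE to 1/0
ind <- 1 * outer(educ, levs, "==")
colnames(ind) <- as.character(levs)

print(ind)
```

Each row contains exactly one 1, in the column for that mother's education level.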
 Data frames, matrices, and lists
Subsetting and Aggregating
 Computing statistics on subgroups
To compute the average birthweight for smokers and nonsmokers:
tapply(bwt,ismoke,mean)
To compute the average birthweight separately for mothers who are above
average in weight and for those who are not:
tapply(bwt,(wt > mean(wt)),mean)
 Selecting a subset
To select babies whose mother smoked:
smokerbabes <- babies[!is.na(ismoke) & ismoke==1,]
To count babies according to whether they are premature or low birthweight:
table(cut(gestation,4),bwt<90)
To find the mean of height and weight for mothers according to
education and smoking status:
apply(cbind(ht,wt),2,function(x)tapply(x,list(educ,ismoke),mean))
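The same group means can also be computed with aggregate(), which returns one row per combination of education and smoking status. This is an alternative sketch, not the approach used above, and the small data frame d is made up for illustration.

```r
# Made-up data, for illustration only
d <- data.frame(
  ht     = c(60, 62, 64, 66, 68, 70),
  wt     = c(100, 110, 120, 130, 140, 150),
  educ   = c(1, 1, 2, 2, 1, 2),
  ismoke = c(0, 0, 0, 1, 1, 1)
)

# Mean of ht and wt within each educ-by-ismoke group
means <- aggregate(cbind(ht, wt) ~ educ + ismoke, data = d, FUN = mean)

print(means)
```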
To regress bwt on mother's weight separately for smokers and nonsmokers:
reg <- tapply(seq(length(bwt)),ismoke,function(x)lm(bwt[x]~wt[x]))
lapply(reg,coef)
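Another way to fit the regression within each group is to split a data frame by smoking status and apply lm() to each piece. The sketch below uses made-up data in which bwt is an exact linear function of wt in each group, so the fitted coefficients are easy to check; the numbers are not from the babies data.

```r
# Made-up data: exact linear relationships, for illustration only
wt     <- c(100, 110, 120, 130, 105, 115, 125, 135)
ismoke <- c(0, 0, 0, 0, 1, 1, 1, 1)
bwt    <- ifelse(ismoke == 1, 60 + 0.4 * wt, 70 + 0.5 * wt)
d      <- data.frame(bwt, wt, ismoke)

# One data frame per smoking group, then one fit per data frame
fits <- lapply(split(d, d$ismoke), function(g) lm(bwt ~ wt, data = g))

# Collect the coefficients: one column per group
coefs <- sapply(fits, coef)
print(coefs)
```

Because each fit is stored with its own data, functions such as summary() or residuals() can be applied to the elements of fits in the same way.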
Plotting
 Multiple plots with the same axes
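To make two histograms, one for smokers and one for nonsmokers, where they
both have the same axes:
par(mfrow=c(2,1))
tapply(bwt,ismoke,hist,xlim=c(min(bwt),max(bwt)),ylim=c(0,50))
or
par(mfrow=c(2,1))
hist(bwt[ismoke==1],xlim=c(min(bwt),max(bwt)))
# S-Plus only: axis style "d" reuses the previous axes; R does not implement it,
# so in R pass the same xlim and ylim to the second hist() call instead
par(xaxs="d",yaxs="d")
hist(bwt[ismoke==0])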
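In R, one way to guarantee that both histograms share bins as well as axis limits is to compute the breaks once and reuse them. The following is a sketch under stated assumptions: bwt and ismoke are simulated here, and the choice of about 15 bins is arbitrary.

```r
# Simulated stand-ins for bwt and ismoke, for illustration only
set.seed(1)
bwt    <- round(rnorm(200, mean = 120, sd = 18))
ismoke <- rep(c(0, 1), each = 100)

# Common breaks covering the full range of bwt
brks <- pretty(range(bwt), n = 15)

# Compute both histograms without drawing, to find a common y limit
h0 <- hist(bwt[ismoke == 0], breaks = brks, plot = FALSE)
h1 <- hist(bwt[ismoke == 1], breaks = brks, plot = FALSE)
ymax <- max(h0$counts, h1$counts)

# Draw them one above the other with identical axes
par(mfrow = c(2, 1))
plot(h0, xlim = range(brks), ylim = c(0, ymax), main = "Nonsmokers", xlab = "bwt")
plot(h1, xlim = range(brks), ylim = c(0, ymax), main = "Smokers", xlab = "bwt")
```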
 Quantile plots
To make a quantile-quantile plot of bwt for smokers versus nonsmokers:
qqplot(bwt[ismoke==0],bwt[ismoke==1])
To make a gamma(5,1) quantile plot of bwt:
ps <- ppoints(length(bwt))
plot(quantile(bwt,ps),qgamma(ps,5,1))
 Multiple boxplots on one plot
To put
box and whisker plots of bwt, one for each education level, on the same
plot
boxplot(bwt~educ)
 Putting curves on plots
To make a
plot of bwt by wt, and place on it a curve that is computed from a
function g:
plot(wt,bwt)
pts <- seq(from=par("usr")[1],to=par("usr")[2],len=100)
lines(pts,g(pts),xpd=T)
 Sliding bin plots
 Contour plots
Bootstrapping
 Nonparametric bootstrap
Here we will use the data from the video game survey. The sample size is 91,
and the population size is 314. We first create a bootstrap population from
the variable time by replicating each observed value about 314/91 times.
Then, 100 times, we take a simple random sample of 91 from the bootstrap
population without replacement and record its mean:
dups <- round(table(time)*314/91)
bootpop <- rep(sort(unique(time)),dups)
m <- rep(0,100)
for(i in 1:100){s <- sample(bootpop,size=91,replace=F);m[i] <- mean(s)}
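The sketch below runs the same procedure end to end so that the pieces can be inspected. The vector time is simulated here as a stand-in for the survey variable, since the real data are not reproduced on this page; the values and probabilities are assumptions made only for illustration.

```r
# Simulated stand-in for the survey variable `time` (not the real data)
set.seed(314)
time <- sample(c(0, 0.5, 1, 2, 3, 5, 14, 30), size = 91, replace = TRUE,
               prob = c(0.5, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05))

N <- 314   # population size
n <- 91    # sample size

# Number of copies of each distinct value in the bootstrap population
dups <- round(table(time) * N / n)

# table() orders its counts by sorted value, so sort the values to match
bootpop <- rep(sort(unique(time)), as.vector(dups))

# Draw 100 simple random samples without replacement; keep each mean
m <- numeric(100)
for (i in 1:100) {
  s    <- sample(bootpop, size = n, replace = FALSE)
  m[i] <- mean(s)
}

# Spread of the bootstrap means: an estimate of the standard error
print(sd(m))
```

Because of rounding, the bootstrap population may contain a few more or fewer than 314 values.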
 Parametric bootstrap
Here we take 100 stratified random samples from a normal distribution,
where there are two strata. One stratum has a mean of 10 and sd of 1, and
the second has a mean of 0 and sd of 1. Each sample will consist of 10
observations from the first stratum and 15 from the second. Each column of
the result is one sample of 25:
means <- rep(c(10,0),c(10,15))
apply(matrix(0,nrow=100,ncol=1),1,function(x)rnorm(n=25,mean=means))
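Once the samples are generated, a statistic can be computed within each stratum of each sample. The sketch below is one possible continuation, not part of the original recipe: it uses replicate() in place of the apply() call and introduces a hypothetical strata vector to label the rows.

```r
set.seed(10)

# Mean for each of the 25 observations: 10 from stratum 1, 15 from stratum 2
means <- rep(c(10, 0), c(10, 15))

# 100 samples; each column is one stratified sample of 25 (sd defaults to 1)
samples <- replicate(100, rnorm(n = 25, mean = means))

# Stratum label for each row (hypothetical helper, not in the original)
strata <- rep(c(1, 2), c(10, 15))

# Stratum means within every sample: a 2 by 100 matrix
stratum_means <- apply(samples, 2, function(x) tapply(x, strata, mean))

# Average of the stratum means across the 100 samples
print(rowMeans(stratum_means))
```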
