HOW DO I...
Manipulate Data
Aggregate and Subset
Plot
Bootstrap
Reference Guide
| We have had good results using S-Plus and R
in our courses to analyze Stat Labs data. Currently we are using R because
it is free, and versions are available for the Unix, Mac, and PC
environments. The syntax of the two languages is nearly identical.
R can be downloaded from
The Comprehensive R Archive Network,
where you can also find a User's Guide.
We also provide here a short list of commonly used commands
and their syntax.
S-Plus is available from
Insightful,
and a free student version is available from this
link.
Instructions for the student version are available
here.
We also recommend
Introductory Statistics with R by P. Dalgaard and
An Introduction to S and S-Plus by
P. Spector.
FAQ:
Here you will find
answers to our students' frequently asked questions on how to use R and
S-Plus. The responses provided are not a comprehensive user's guide
to the R or S-Plus language. Refer to the user's guides listed on this
page for a more thorough introduction to these languages.
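The following examples all use the birth weight data, called babies. It contains
variables bwt (birth weight in ounces), smoke (0=nonsmoker, 1=yes
now, 2=until pregnant, 3=once did, not now, 9=unknown), wt (mother's
prepregnancy weight), educ (categories 0-7, 9=unknown).
The examples also use gestation (length of pregnancy in days) and ht (mother's
height), and refer to all variables directly by name, as they would be after
attach(babies).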
Data manipulations
- Getting data from the web into R
- Replacing values
The 999 values in bwt denote missing values. To replace 999 with NA
(replace returns a new vector, so assign the result):
bwt<-replace(bwt,bwt==999,NA)
The 9 values in smoke denote missing values. To replace 9 with NA and
recode 2 and 3 as 0, for nonsmoker:
ismoke<-replace(smoke,smoke==2 | smoke==3,0)
ismoke[ismoke==9]<-NA
To convert the values of gestation into weeks, and to also collapse these
values into 5 categories of equal width:
gestwks<-floor(gestation/7)
gestcut<-cut(gestwks,5)
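The recoding steps above can be tried on a few made-up values before running them on the full data. The vectors below are hypothetical stand-ins for smoke and gestation, and the week cutoffs in the last step are illustrative assumptions rather than part of the original recipe.

```r
# Hypothetical smoke codes (9 = unknown)
smoke <- c(0, 1, 2, 3, 9, 1, 0)
ismoke <- replace(smoke, smoke == 2 | smoke == 3, 0)  # 2 and 3 become nonsmoker
ismoke[ismoke == 9] <- NA                             # 9 becomes missing
print(ismoke)                                         # 0 1 0 0 NA 1 0
print(table(ismoke, useNA = "ifany"))                 # confirm the recode, NAs included

# Hypothetical gestation lengths in days
gestation <- c(245, 266, 280, 284, 301, 259, 273)
gestwks <- floor(gestation / 7)                       # 35 38 40 40 43 37 39
gestcut <- cut(gestwks, 5)                            # 5 equal-width intervals
print(table(gestcut))

# Alternative: explicit break points with labels (cutoffs chosen for illustration)
gestcut2 <- cut(gestwks, breaks = c(-Inf, 36, 41, Inf),
                labels = c("preterm", "term", "postterm"))
print(table(gestcut2))
```

Explicit breaks are worth considering when the categories should have a fixed meaning, since cut(x, 5) chooses its intervals from the range of the data at hand.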
- Creating matrices and vectors
To create a matrix of 0s and 1s, where there is one column for
each education level, and the 1s in that column indicate the mother has that
education level (outer returns TRUE/FALSE; multiplying by 1 converts to 0/1):
1*outer(educ,unique(educ),"==")
To create a 3 by 4 matrix of all 0s:
matrix(0,nrow=3,ncol=4)
To create a vector (0, 0.1, ..., 0.9, 1):
seq(from=0,to=1,by=0.1)
To create a vector (1,1,1,12,34,34):
c(1,1,1,12,34,34)
or
rep(c(1,12,34),c(3,1,2))
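As a small illustration of the indicator-matrix construction above, the sketch below uses a made-up educ vector (the values are hypothetical, not taken from babies). It sorts the levels so the columns come out in a predictable order and labels them, which the one-line version does not do.

```r
# Hypothetical education codes for five mothers (illustration only)
educ <- c(2, 5, 2, 1, 5)

# Sorted distinct levels, so columns appear in increasing order
lev <- sort(unique(educ))

# Compare every observation with every level; 1 * converts TRUE/FALSE to 1/0
ind <- 1 * outer(educ, lev, "==")
colnames(ind) <- paste("educ", lev, sep = "")
print(ind)

# Each mother has exactly one education level, so every row should sum to 1
print(rowSums(ind))
```

Checking the row sums is a quick way to confirm that no observation was lost or counted twice.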
- Data frames, matrices, and lists
Subsetting and Aggregating
- Computing statistics on subgroups
To compute the average birthweight for smokers and nonsmokers:
tapply(bwt,ismoke,mean)
To compute the average birthweight for mothers who are above average in
weight (the TRUE entry of the result; the FALSE entry covers the remaining
mothers):
tapply(bwt,(wt > mean(wt)),mean)
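Once 999 and 9 have been recoded to NA, group means need some care, because mean returns NA for any group that contains a missing value. The sketch below uses made-up values (not the babies data) to show the effect and one possible remedy, passing na.rm=TRUE through tapply.

```r
# Hypothetical birth weights and smoking indicators with some missing values
bwt    <- c(120, 113, NA, 128, 108, 136, 99)
ismoke <- c(0,   1,   0,  0,   1,   NA,  1)

# Without na.rm, the nonsmoker group mean is NA because one bwt is missing
print(tapply(bwt, ismoke, mean))

# Extra arguments to tapply are passed on to mean
m <- tapply(bwt, ismoke, mean, na.rm = TRUE)
print(m)
```

Observations whose group label is NA (here the baby weighing 136) are left out of every group.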
- Selecting a subset
To select babies whose mother smoked (rows with a missing ismoke are excluded;
the trailing comma selects rows of the data frame):
smokerbabes<-babies[!is.na(ismoke) & ismoke==1,]
To count babies according to whether they are premature or low birthweight:
table(cut(gestation,4),bwt<90)
To find the mean of height and weight for mothers according to
education and smoking status:
apply(cbind(ht,wt),2,function(x)tapply(x,list(educ,ismoke),mean))
To regress bwt on mother's weight separately for smokers and nonsmokers:
reg<-tapply(seq(length(bwt)),ismoke,function(x)lm(bwt[x]~wt[x]))
lapply(reg,coef)
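The per-group regression can also be written with split and lapply, which some readers may find easier to follow than indexing through tapply. This is a sketch on simulated data: the data frame below is made up, and its coefficients are arbitrary choices for illustration, not estimates from babies.

```r
set.seed(1)
n <- 60

# Simulated stand-in for the babies data (values are hypothetical)
babies <- data.frame(wt = round(rnorm(n, 128, 20)),
                     ismoke = rep(c(0, 1), each = n / 2))
babies$bwt <- round(100 + 0.15 * babies$wt - 9 * babies$ismoke + rnorm(n, 0, 10))

# Rows for smokers only; the !is.na() guard keeps missing values out
smokerbabes <- babies[!is.na(babies$ismoke) & babies$ismoke == 1, ]
print(nrow(smokerbabes))

# One data frame per smoking group, then one lm fit per data frame
fits <- lapply(split(babies, babies$ismoke), function(d) lm(bwt ~ wt, data = d))

# Collect intercepts and slopes into a matrix with one row per group
coefs <- t(sapply(fits, coef))
print(coefs)
```

Fitting with data=d keeps the coefficient names as (Intercept) and wt, instead of wt[x] as in the index-based version.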
Plotting
- Multiple plots with the same axes
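To make two histograms, one for smokers and one for nonsmokers, where they
both have the same axes:
par(mfrow=c(2,1))
tapply(bwt,ismoke,hist,xlim=c(min(bwt),max(bwt)),ylim=c(0,50))
or
par(mfrow=c(2,1))
hist(bwt[ismoke==1],xlim=c(min(bwt),max(bwt)))
par(xaxs="d",yaxs="d")
hist(bwt[ismoke==0])
The "d" axis style, which reuses the axes of the previous plot, is an S-Plus
feature. R implements only the "r" and "i" styles, so in R pass the same xlim
and ylim to the second hist call instead.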
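In R, one way to get matching axes without guessing a ylim is to compute both histograms first with plot=FALSE, using shared break points, and then draw them with limits taken from the results. The sketch below does this on simulated data; the bwt and ismoke values are made up, and the choice of 15 bins is arbitrary.

```r
set.seed(1)

# Simulated stand-ins for bwt and ismoke (hypothetical values)
bwt <- round(rnorm(200, 120, 18))
ismoke <- rbinom(200, 1, 0.4)

# Shared x range and break points for both groups
xlim <- range(bwt, na.rm = TRUE)
breaks <- seq(xlim[1], xlim[2], length.out = 16)

# Compute, but do not draw, one histogram per smoking group
h <- lapply(split(bwt, ismoke), hist, breaks = breaks, plot = FALSE)

# Common y range: the tallest bar across both histograms
ylim <- c(0, max(sapply(h, function(z) max(z$counts))))

# Draw the two histograms one above the other with identical axes
par(mfrow = c(2, 1))
for (g in names(h)) {
  plot(h[[g]], xlim = xlim, ylim = ylim,
       main = paste("ismoke =", g), xlab = "bwt")
}
```

Using the same breaks in both panels also makes bar heights directly comparable, which matching xlim alone does not guarantee.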
- Quantile plots
To make a quantile-quantile plot of bwt for smokers versus nonsmokers:
qqplot(bwt[ismoke==0],bwt[ismoke==1])
To make a gamma(5,1) quantile plot of bwt:
ps<-ppoints(length(bwt))
plot(quantile(bwt,ps),qgamma(ps,5,1))
- Multiple boxplots on one plot
To put box and whisker plots of bwt, one for each education level, on the
same plot:
boxplot(bwt~educ)
- Putting curves on plots
To make a plot of bwt by wt, and place on it a curve that is computed from a
function g:
plot(wt,bwt)
pts<-seq(from=par("usr")[1],to=par("usr")[2],len=100)
lines(pts,g(pts),xpd=T)
- Sliding bin plots
- Contour plots
Bootstrapping
- Nonparametric bootstrap
Here we will use the data from the video game survey. The sample size is 91,
and the population size is 314. We first create a bootstrap population of time
by repeating each observed value about 314/91 times. Then we take a simple
random sample of 91 from the bootstrap population 100 times, and record the
mean of each sample.
dups<-round(table(time)*314/91)
bootpop<-rep(sort(unique(time)),dups)
m<-rep(0,100)
for(i in 1:100){s<-sample(bootpop,size=91,replace=F);m[i]<-mean(s)}
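The bootstrap recipe above can be run end to end on simulated data. In the sketch below the time values are made up (the survey data are not reproduced here), and replicate is used in place of the explicit loop. The names N and n are introduced only for this illustration.

```r
set.seed(3)
N <- 314   # population size
n <- 91    # sample size

# Hypothetical sample of hours played (illustration only)
time <- sample(c(0, 0.5, 1, 2, 3, 5, 14, 30), size = n, replace = TRUE)

# Number of copies of each distinct value in the bootstrap population
dups <- round(table(time) * N / n)

# table() orders its counts by sorted value, so take the values from its names
bootpop <- rep(as.numeric(names(dups)), as.vector(dups))
print(length(bootpop))   # close to N, up to rounding

# 100 simple random samples without replacement; keep each sample mean
m <- replicate(100, mean(sample(bootpop, size = n, replace = FALSE)))

# The spread of the bootstrap means estimates the standard error of the mean
print(sd(m))
```

Because of rounding, the bootstrap population need not contain exactly 314 values, so it is worth printing its length before sampling.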
- Parametric bootstrap
Here we take 100 stratified random samples from a normal distribution, where
there are two strata. One stratum has a mean of 10 and sd of 1, and the second
has a mean of 0 and sd of 1. Each sample consists of 10 observations from the
first stratum and 15 from the second. Each column of the result is one sample.
means<-rep(c(10,0),c(10,15))
apply(matrix(0,nrow=100,ncol=1),1,function(x)rnorm(n=25,mean=means))
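The same stratified sampling can be sketched with replicate, which avoids the dummy matrix, followed by a check that the two strata behave as intended. The object names samples and strat_means are introduced here for illustration.

```r
set.seed(4)

# Mean for each of the 25 positions: 10 from stratum one, 15 from stratum two
means <- rep(c(10, 0), c(10, 15))

# 100 samples of size 25; rnorm recycles the vector of means position by position
samples <- replicate(100, rnorm(n = 25, mean = means, sd = 1))
print(dim(samples))   # 25 rows (observations) by 100 columns (samples)

# Per-sample stratum means: rows 1-10 are stratum one, rows 11-25 are stratum two
strat_means <- apply(samples, 2, function(s) c(mean(s[1:10]), mean(s[11:25])))

# Averaged over the 100 samples, these should be near 10 and 0
print(rowMeans(strat_means))
```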
|