BrownLab

To Download R to Personal Computer

Go to http://cran.us.r-project.org/
Under "Precompiled Binary Distributions" pick the appropriate system for your machine, and find the correct .exe file under the folders (for Windows, it's under the folder "base" and is called "R-2.2.1-win32.exe")

Getting Started in R

R is a command-line language, which means you type commands at the prompt ">" and the output appears after you hit return. In other words, there are no drop-down menus or mouse commands. Even if you use the "Windows" version, it still works basically this way. This gives you a lot of control, but can be a little intimidating at first. If you are doing a lot of complicated things, it's often a good idea to have another window open (like Notepad) for text editing so that you can save your commands.

We also are going to want to save files into a folder for future reference. When I refer to mydir you should type in an appropriate directory. An example might be

"C:/Documents and Settings/lelandID/My Documents"

Note that Windows uses "\" to separate subdirectories, unlike Unix, which uses "/". To give a full path name in R using Windows, you can use the Unix "/" or use "\\". You cannot just use the Windows "\".
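As a small illustration of the path rules above, the following sketch writes the same path both ways and checks that they agree. The directory is just the example from this page and does not need to exist; the variable names are made up for the illustration.

```r
# Two ways of writing the same Windows path inside R strings
p.forward <- "C:/Documents and Settings/lelandID/My Documents"
p.double  <- "C:\\Documents and Settings\\lelandID\\My Documents"

# "\\" typed in R code is stored as a single backslash character
nchar("\\")

# Replace each backslash with "/" and compare the two spellings
identical(gsub("\\", "/", p.double, fixed=TRUE), p.forward)

# file.path() joins the pieces with "/" for you
file.path("C:", "Documents and Settings", "lelandID", "My Documents")
```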

For more information about finding your working directory, etc., see
http://www.stanford.edu/~epurdom/Saving/Saving.html

Creating R Objects
To save output to some variable name, use "<-" (in very old code you may also see "_", which current versions of R no longer accept as an assignment operator)
- Simple Examples
  
  > x <- 3
  > y <- 5
  > x + y
  [1] 8
  
  You can create descriptive names, as you like, though remember you may have to type them many times, so try to think of short, descriptive names.
- Vectors
  You will often be working with vectors, particularly if you are fine-tuning options. The function c( ) creates a vector
  
  > Numb<-c(2,4,-1.4)
  > z<-c(4,-3,2)
  
  If you want to look and see the value of what you have created, you simply type its name at the prompt.
  
  > Numb
  [1]  2.0  4.0 -1.4
  
  Mathematical formulas are elementwise:
  
  > z + Numb
  > z-Numb
  > z*Numb
  > Numb - x
  
  You can make some kinds of vectors quickly using seq and rep
  
  > seq(0,3,length=10)
   [1] 0.0000000 0.3333333 0.6666667 1.0000000 1.3333333 1.6666667
   [7] 2.0000000 2.3333333 2.6666667 3.0000000
  > rep(1,3)
  [1] 1 1 1
  
  You can also give names to your vector to keep track of what they correspond to
  
  > names(Numb)<-c("Pat.1","Pat.2","Pat.3")
  > Numb
  Pat.1 Pat.2 Pat.3
    2.0   4.0  -1.4
- Matrices
  You can make matrices from scratch using matrix() or from previously made vectors using rbind(), cbind()
  
  > mat<-matrix(c(1,2,4,2,1,3),nrow=3,ncol=2,byrow=T)
  > mat2<-cbind(z,Numb)
  > colnames(mat2)<-c("X1","L3")
  > rownames(mat2)<-c("A","B","C")
  > mat2
    X1   L3
  A  4  2.0
  B -3  4.0
  C  2 -1.4
  
  Matrix operations:
  
  > mat2*mat
  > t(mat)
  > mat%*%t(mat2)
  > mat^2
- Factors
  This type of object allows R to deal with categorical variables. The different possible categories are called levels.
  
  > fac<-factor(c(2,1,2,2,4,-1),labels=c("A","B","C","D"))
  > fac
  [1] C B C C D A
  Levels: A B C D
  > levels(fac)
  [1] "A" "B" "C" "D"
- Dataframes
  Dataframes generally act like matrices, but allow different columns to be vectors of different types, like character, factor, numeric, etc.
  
  > mydf<-data.frame(x2=fac,y=c(z,Numb))
  > mydf
    x2    y
  1  C  4.0
  2  B -3.0
  3  C  2.0
  4  C  2.0
  5  D  4.0
  6  A -1.4
  
  versus
  
  > cbind(fac,c(z,Numb))
       fac
  [1,]   3  4.0
  [2,]   2 -3.0
  [3,]   3  2.0
  [4,]   3  2.0
  [5,]   4  4.0
  [6,]   1 -1.4
- Lists
  Lists allow you to put together any kind of data to keep track of it.
  
  > mylist<-list(mymat=mat,vec=z,afac=fac)
  > mylist
  $mymat
       [,1] [,2]
  [1,]    1    2
  [2,]    4    2
  [3,]    1    3
  
  $vec
  [1]  4 -3  2
  
  $afac
  [1] C B C C D A
  Levels: A B C D
- What do I have??
  You can use the functions of the format is.xx to figure out what type object you have.
  
  > is.vector(z)
  [1] TRUE
  > is.vector(mat)
  [1] FALSE
  
  You can also try to convert objects to different types using functions as.xx.
  
  > as.vector(mat)
  [1] 1 4 1 2 2 3
  > as.character(fac)
  [1] "C" "B" "C" "C" "D" "A"
  > as.matrix(mydf)
  > data.matrix(mydf)
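One conversion worth trying for yourself is factor to numeric, because a factor is stored as integer codes with labels attached (this is why cbind shows 3, 2, 3, ... for fac above). The sketch below uses a made-up factor f, not one of the objects from this page, to show the difference between converting the codes and converting the labels.

```r
# A hypothetical factor whose labels happen to look like numbers
f <- factor(c("10","20","10","30"))

is.factor(f)                  # check the type with an is.xx function
as.numeric(f)                 # the internal integer codes, not the labels
as.numeric(as.character(f))   # the numbers that the labels spell out
```

If you want the original numbers back, going through as.character first is usually what you mean.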
Manipulating Objects you have Made
- To see a list of the variables you have created and have available, type
  
  > ls()
  > ls(pattern="temp")
  
  NOTE: If you later reassign something else to that same name, you lose the old information, and there is no warning before you do it, so use ls() to see if you have already used the name. You also should not give a name to your variable that is already a name of an R function because there will be unexpected consequences.
- Quitting
  
  To quit from R, type
  
  > q()
  Save workspace image? [y/n/c]:
  
  Type y, and then you should be out of R. If you saved your session, all of the saved variables will still be there to work with when you come back, as long as you start R in the same directory in which you saved them (in Windows/Mac, this will be the default directory that you get by opening R).
- Saving Objects
  This part is useful but not necessary, so feel free to skip this as needed.
  If you want to save one particular object that you made, say to transfer to another computer or to back it up, you can use the command save. If you give your file the extension ".Rdata", Windows recognizes it as an R data file and will launch R automatically when you open it.
  
  > save(Numb,file="mydir/Numb.Rdata")
  
  If you exit R and look at the files in your directory, you will see Numb.Rdata. You can move this file around to different directories - this is a way of saving your information. dump() is another way, but not as nice. If you move it to a different directory you can get it back within R by using load()
  
  > load(file="mydir/Numb.Rdata")
  
  Of course, since you saved when you exited, you don't need to load the information back in. You can actually access all of your data that you saved when you exited by loading the .RData file.
- Deleting Objects
  You can also remove objects using the rm() command, but it's permanent (there is no question asking if you really meant it!):
  
  > rm(x)
  > rm(list=c("x","z"))
  > rm(list=ls(pattern="temp"))
- Libraries and Bringing in Packages:
  This is fairly basic, and largely applies to users of the Windows GUI, though similar GUI capabilities are available on Macs. Libraries are collections of additional functions that are available in R, usually more specialized. If the library is already on your computer (i.e. it's one of the standard libraries included in R or you've downloaded it) then you can just type:
```
> library(MASS)
```
  This brings the library up so you can access its functions. You can find out what is included in the package with:
```
> help(package=MASS)
> data(package=MASS)
```
  I can make a dataset available, such as UScereal, a dataset about American cereals:
  
  > data(UScereal)
  
  To download a package/library with Windows - using the example of the multiple testing procedure package, multtest:
  1. Click on "Packages" and scroll down to "Install Package(s) from CRAN..."
  2. Choose a mirror near you from the scrolling list and click "OK"
  3. A list of possible packages to install should appear in a new window. Scroll down and click on "multtest" and "OK" (you can click on more than 1 if you want)
  4. R should proceed to install the package. If asked "Delete downloaded files (y/N)?" typing "y" is fine.
  5. Bring in the library by typing:
    
    > library(multtest)
  You can also do all of this from the command line with commands like download.packages and install.packages. If you do this, you can choose to install the package somewhere other than the default location, among other options. Downloading a package by hand from the CRAN webpage and installing it from your download can be tricky in Windows, because it is easy to end up with a file that was not built for your system or is not the right zip file. Use the commands provided by R if at all possible.
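The save and load steps described under "Saving Objects" can be tried end to end with a short sketch like the one below. It writes to R's temporary directory (via tempdir()) as a stand-in for mydir, so that it runs on any machine; the name myfile is made up for the illustration.

```r
Numb <- c(2, 4, -1.4)

# tempdir() stands in for "mydir" so the example runs anywhere
myfile <- file.path(tempdir(), "Numb.Rdata")

save(Numb, file=myfile)   # write the object to disk
rm(Numb)                  # remove it from the workspace
load(file=myfile)         # bring it back under its original name
Numb
```

Note that load restores the object under the name it had when it was saved, so it can silently replace an object of the same name in your workspace.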
Data Indexing (using the dataset "UScereal")
- Basic Indexing
  You can look at all of your data as said before by just typing the name at the prompt
  
  > UScereal
                            mfr  calories    protein       fat    sodium
  100% Bran                   N 212.12121 12.1212121 3.0303030 393.93939
  All-Bran                    K 212.12121 12.1212121 3.0303030 787.87879
  All-Bran with Extra Fiber   K 100.00000  8.0000000 0.0000000 280.00000
  Apple Cinnamon Cheerios     G 146.66667  2.6666667 2.6666667 240.00000
  Apple Jacks                 K 110.00000  2.0000000 0.0000000 125.00000
  Basic 4                     G 173.33333  4.0000000 2.6666667 280.00000
  Bran Chex                   R 134.32836  2.9850746 1.4925373 298.50746
  Bran Flakes                 P 134.32836  4.4776119 0.0000000 313.43284
  Cap'n'Crunch                Q 160.00000  1.3333333 2.6666667 293.33333
  ... etc.
  
  But with a large data set, it will be too big to be displayed like this. Instead you want to look at a portion through indexing.
  Datasets are thought of like matrices, so you can pick off pieces of the dataset by specifying the row or column entry of part of the data by typing data[row,columns]. So UScereal[3,7] would list the entry in the 3rd row and 7th column. UScereal[ ,4] gives the entire 4th column, and so on. The following are examples of pulling out parts of the data.
  Some Examples:
  
  > UScereal[2,2:5]
  > UScereal[c(1,5,6) , ]
  > UScereal$mfr
  > UScereal$calories[1:5]
  > UScereal[c("All-Bran","Bran Chex"),]
  
  A single row/column is a vector, but you can force it to remain a matrix:
  
  > mat[1,]
  > mat[1,,drop=F]
  
  You can save a portion of your data, say to experiment on or to reduce the number of variables, by assigning it to a variable name
  
  > uscer<-UScereal[1:10,1:3 ]
  > uscer
                            mfr calories   protein
  100% Bran                   N 212.1212 12.121212
  All-Bran                    K 212.1212 12.121212
  All-Bran with Extra Fiber   K 100.0000  8.000000
  Apple Cinnamon Cheerios     G 146.6667  2.666667
  Apple Jacks                 K 110.0000  2.000000
  Basic 4                     G 173.3333  4.000000
  Bran Chex                   R 134.3284  2.985075
  Bran Flakes                 P 134.3284  4.477612
  Cap'n'Crunch                Q 160.0000  1.333333
  Cheerios                    G  88.0000  4.800000
- Logical indexing
  You can evaluate the truth of statements element-wise in R using traditional logical commands
  
  > Numb==2
  Pat.1 Pat.2 Pat.3
   TRUE FALSE FALSE
  > fac=="C"
  [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE
  
  You can use these T/F values to index a vector or matrix. (From here on, columns of UScereal such as mfr are referred to by name alone, which assumes you have first run attach(UScereal).)
  
  > Numb[Numb>1]
  Pat.1 Pat.2
      2     4
  > UScereal[mfr=="G" | mfr=="K", 1:3]
                            mfr  calories   protein
  All-Bran                    K 212.12121 12.121212
  All-Bran with Extra Fiber   K 100.00000  8.000000
  Apple Cinnamon Cheerios     G 146.66667  2.666667
  Apple Jacks                 K 110.00000  2.000000
  Basic 4                     G 173.33333  4.000000
  ....
  
  You can also find the indices directly using which
  
  > which(shelf==2 & sugars>5)
   [1]  5  9 11 13 16 17 22 24 27 29 33 38 39 42 49 56 62
  
  You can also find out if the elements of a vector are contained in another vector
  
  > c("G","F")%in%mfr
  > mfr%in%c("N","K")
- Factor Indexing
  Another related indexing is by factor variables, so that you can have the following
  
  > vit.colors<-c("red","green","purple")
  > vit.colors[vitamins]
   [1] "green"  "green"  "green"  "green"  "green"  "green"  "green"
   [8] "green"  "green"  "green"  "green"  "green"  "green"  "green"
  [15] "green"  "green"  "green"  "green"  "green"  "green"  "green"
  [22] "green"  "green"  "green"  "green"  "green"  "green"  "green"
  [29] "green"  "green"  "green"  "green"  "green"  "green"  "green"
  [36] "red"    "green"  "green"  "green"  "green"  "green"  "green"
  [43] "green"  "green"  "green"  "red"    "purple" "green"  "green"
  [50] "green"  "green"  "green"  "green"  "purple" "purple" "green"
  [57] "green"  "red"    "red"    "red"    "green"  "green"  "green"
  [64] "green"  "green"  ...
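The indexing styles in this section can be tried together on a small made-up data frame. In the sketch below, toy and its columns are hypothetical stand-ins (they are not part of UScereal), chosen so that the results are easy to check by eye.

```r
# A small made-up data frame (not part of UScereal)
toy <- data.frame(mfr=factor(c("G","K","G","N")),
                  sugars=c(10, 3, 7, 12),
                  row.names=c("c1","c2","c3","c4"))

toy[toy$sugars > 5, ]                 # logical indexing of rows
which(toy$sugars > 5)                 # the positions where the test is TRUE
toy$mfr %in% c("G","K")               # membership test, one answer per element

# Factor indexing uses the level number: G is level 1, K is 2, N is 3
c("red","green","purple")[toy$mfr]
```

Factor indexing follows the order of levels(toy$mfr), so it is worth printing the levels before relying on it.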
Functions
- How functions work
  Many things that you will be doing in R will be calling an already created function. Simple examples are mean( ), scan( ), plot( ), sd( ), median( ), sort( ). ( ) means the information (usually data) that you need to feed to the function.
  NOTE: sd( ) is the standard deviation, dividing by n-1
  Example: finding the mean/average of the column "protein" in the dataset UScereal
  
  > mean(UScereal$protein)
  [1] 3.683705
  > summary(UScereal$protein)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.7519  2.0000  3.0000  3.6840  4.4780 12.1200
  
  Functions often have "smart" defaults for different kinds of objects (note that in recent versions of R, mean() no longer accepts a whole data frame, while summary() still does):
  
  > mean(UScereal)
  > summary(UScereal)
- Learning about functions
  Functions may have many different options you can set when you call the function. You can find out about a function by typing help(FunctionName).
  
  > help(boxplot)
  > help.search("linear model")
  > help.start()
  
  If you just want to remember what the possible options are you can use args:
  
  > args(sort)
  function (x, partial = NULL, na.last = NA, decreasing = FALSE,
      method = c("shell", "quick"), index.return = FALSE)
  NULL
  
  You should save the outputs from your functions as a new variable so you can access them again
  
  > pmean<-mean(UScereal$potassium)
  > pmean+8
  [1] 167.1197
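To make the idea of options and defaults concrete, here is a minimal sketch using sort on a made-up vector v. It shows a call that relies on the defaults, a call that overrides one option by name, and saving the output for later use.

```r
v <- c(3, 1, 2)              # made-up data

sort(v)                      # all options left at their defaults
sort(v, decreasing=TRUE)     # override a single option by name
args(sd)                     # remind yourself what options a function has

v.sorted <- sort(v)          # save the output so you can use it again
v.sorted[1]                  # the smallest value
```

Naming the option (decreasing=TRUE) rather than relying on its position keeps the call readable and robust to changes in argument order.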
Bringing in data
Base R has no direct import from Excel. Excel files need to be saved as text files (tab or comma delimited).
- Reading tab/comma/character delimited text
  The function read.table reads in text where each row of the data is on a separate line and the columns of the data are separated by a fixed character. The default is ANY white space. Generally files will be tab delimited ("\t") or comma delimited (","), and you can specify this explicitly. You must give a file name or a URL
  
  > state.data<-read.table("http://www.stanford.edu/~epurdom/state.txt", sep="\t",header=T,row.names=1)
  
  The resulting object is a data.frame, and non-numeric values are made into factor variables (in R 4.0 and later they are kept as character strings unless you set stringsAsFactors=TRUE). If header=T, the first row is taken to contain the names of the columns, not data.
- Exporting data
  You can write data to files using write.table
  
  > write.table(UScereal,"mydir/cereal.txt",sep="\t")
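A quick way to convince yourself that the separator and header settings match is to write a small table and read it straight back. The sketch below uses a made-up data frame and a temporary file (tempfile()) as a stand-in for a file in mydir.

```r
# Made-up data and a temporary file standing in for "mydir/..."
df <- data.frame(id=c("a","b","c"), val=c(1.5, 2, 3.25))
f <- tempfile(fileext=".txt")

# Write tab-delimited text with a header row and no row names
write.table(df, f, sep="\t", quote=FALSE, row.names=FALSE)

# Read it back using the same separator
back <- read.table(f, sep="\t", header=TRUE)
back
```

If the columns come back fused into one, the sep used for reading does not match the one used for writing.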
Basic Statistics
- T-tests
  Use t.test for standard tests of the mean (and prop.test for the analogous tests of proportions)
  
  > cal.small<-calories[mfr%in%c("G","K")]
  > mfr.small<-mfr[mfr%in%c("G","K")]
  > t.test(cal.small[mfr.small=="G"],cal.small[mfr.small=="K"])
  > #equivalent to
  > t.test(cal.small~mfr.small)
  
          Welch Two Sample t-test
  
  data:  cal.small by mfr.small
  t = -0.858, df = 40.829, p-value = 0.3959
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval:
   -39.85606  16.08978
  sample estimates:
  mean in group G mean in group K
         137.7879        149.6710
  
  You can calculate power with power.t.test and power.prop.test:
  
  > power.t.test(delta=3,sd=1,sig.level=.01,power=.90,type="paired")
  
       Paired t test power calculation
  
                n = 5.032729
            delta = 3
               sd = 1
        sig.level = 0.01
            power = 0.9
      alternative = two.sided
  
   NOTE: n is number of *pairs*, sd is std.dev. of *differences* within pairs
- Regression
  Use the command lm and the syntax "Y~X1+X2+...+Xn" to describe the model:
  
  > fib.lm<-lm(UScereal$fibre~UScereal$calories)
  > summary(fib.lm)
  
  Call:
  lm(formula = UScereal$fibre~UScereal$calories)
  
  Residuals:
      Min      1Q  Median      3Q     Max
  -5.0901 -2.3674 -1.3674  0.6127 26.0141
  
  Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
  (Intercept)       -1.82928    1.84542  -0.991  0.32535
  UScereal$calories  0.03815    0.01141   3.344  0.00140 **
  --
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 5.697 on 63 degrees of freedom
  Multiple R-Squared: 0.1507,     Adjusted R-squared: 0.1372
  F-statistic: 11.18 on 1 and 63 DF,  p-value: 0.001396
- Anova - use lm with X a factor variable, then anova or aov to get standard anova table
  
  > fib.aov<-lm(UScereal$fibre~as.factor(UScereal$shelf))
  > anova(fib.aov)
  Analysis of Variance Table
  
  Response: UScereal$fibre
                            Df  Sum Sq Mean Sq F value   Pr(>F)
  as.factor(UScereal$shelf)  2  452.50  226.25  7.1748 0.001574 **
  Residuals                 62 1955.09   31.53
  --
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  If you do not make "shelf" a factor, R treats it as a continuous variable and fits a standard regression:
  
  > fib.notaov<-lm(UScereal$fibre~UScereal$shelf)
  > anova(fib.notaov)
  Analysis of Variance Table
  
  Response: UScereal$fibre
                 Df  Sum Sq Mean Sq F value   Pr(>F)
  UScereal$shelf  1  308.30  308.30   9.252 0.003427 **
  Residuals      63 2099.30   33.32
  --
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  > summary(fib.notaov)
  
  Call:
  lm(formula = UScereal$fibre~UScereal$shelf)
  
  Residuals:
     Min     1Q Median     3Q    Max
  -6.042 -3.375 -1.564  1.185 24.261
  
  Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
  (Intercept)     -1.7983     1.9966  -0.901  0.37119
  UScereal$shelf   2.6134     0.8592   3.042  0.00343 **
  --
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 5.773 on 63 degrees of freedom
  Multiple R-Squared: 0.1281,     Adjusted R-squared: 0.1142
  F-statistic: 9.252 on 1 and 63 DF,  p-value: 0.003427
  > summary(fib.aov)
  
  Call:
  lm(formula = UScereal$fibre~as.factor(UScereal$shelf))
  
  Residuals:
      Min      1Q  Median      3Q     Max
  -6.7830 -2.3054 -1.0408  0.6797 23.5200
  
  Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
  (Intercept)                  2.0090     1.3236   1.518  0.13413
  as.factor(UScereal$shelf)2  -0.9682     1.8718  -0.517  0.60683
  as.factor(UScereal$shelf)3   4.7740     1.6850   2.833  0.00621 **
  --
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 5.615 on 62 degrees of freedom
  Multiple R-Squared: 0.1879,     Adjusted R-squared: 0.1618
  F-statistic: 7.175 on 2 and 62 DF,  p-value: 0.001574
- Random Number generation and Probability Calculations
  R has a large number of standard distributions. They all have the same format, for the command: a root describing the distribution (e.g. norm for the normal distribution) and a prefix indicating what feature you want from the distribution (e.g. r for random number generation):
  
  > rnorm(5,mean=20,sd=2)
  [1] 21.21675 21.43155 19.18402 18.56639 17.55581
  > #P(Z<=1)
  > pbinom(1,prob=.5,size=6)
  [1] 0.109375
  > dbinom(1,prob=.5,size=6)+dbinom(0,prob=.5,size=6)
  [1] 0.109375
  
  To get random samples or permutations use sample
  
  > sample(1:5,replace=F)
  [1] 3 4 2 5 1
  > sample(1:5,replace=T)
  [1] 2 2 1 1 2
  > z
  [1]  4 -3  2
  > sample(z,size=10,replace=T)
   [1]  4 -3  4  4  2  4 -3 -3  4  4
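The prefix-plus-root naming described above can be summarized in a short sketch. Besides r, the usual prefixes are d (density or probability mass), p (cumulative probability), and q (quantile); the calls below use the normal and binomial roots, and set.seed is included only so the random draws can be repeated.

```r
dnorm(0)                   # d: height of the standard normal density at 0
pnorm(1.96)                # p: cumulative probability P(Z <= 1.96)
qnorm(0.975)               # q: quantile, the inverse of pnorm

set.seed(1)                # fix the seed so the draws are repeatable
rnorm(3, mean=20, sd=2)    # r: three random numbers

# For a discrete distribution, p is the running sum of d
all.equal(pbinom(1, size=6, prob=.5), sum(dbinom(0:1, size=6, prob=.5)))
```

The same four prefixes work with other roots such as binom, pois, unif, and t.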

Plots

Basic Plots
First, to reduce typing,

> attach(UScereal)
- plot(xdata,ydata) - the standard x-axis vs. y-axis plot
  
  > plot(potassium, protein)
  > plot(mfr,potassium)
  > plot(as.numeric(mfr),potassium)
  > plot(fib.lm)
- Other plots
  Histograms:
  
  > hist(potassium)
  
  Boxplots:
  
  > boxplot(potassium)
  > boxplot(potassium[mfr=="Q"], potassium[mfr=="R"])
  > boxplot(potassium~mfr)
  
  Barplots:
  
  > #colMeans gives the mean of every column within each manufacturer
  > group.mean<-matrix(unlist(by(UScereal[,2:8],mfr,colMeans)),nrow=7,byrow=F)
  > colnames(group.mean)<-levels(mfr)
  > rownames(group.mean)<-colnames(UScereal[,2:8])
  > barplot(group.mean["protein",])
  > barplot(group.mean,beside=F)
  > barplot(t(group.mean),beside=T)
Adding to Plots
The following functions add to an existing plot:

lines, points, curve, rect, segments, abline

> hist(sodium,breaks=20,freq=F)
> abline(v=mean(sodium),lty=3)
> lines(density(sodium))
> plot(sodium,potassium)
> lines(lowess(sodium,potassium))
> plot(residuals(fib.lm)~fitted(fib.lm))
> abline(fib.lm)

Note that lines just connects points in the order given

> plot(potassium,sodium,type="l")
> plot(potassium[order(potassium)],sodium[order(potassium)],type="l")
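The reason for the order() call is easier to see on a tiny made-up example. In the sketch below, x and y are hypothetical points given out of order; order(x) returns the positions that sort x, and using the same positions for y keeps each pair together. pdf(NULL) opens a throw-away device only so the example runs without a screen.

```r
x <- c(3, 1, 2)            # made-up points, deliberately out of order
y <- c(9, 1, 4)

o <- order(x)              # positions that put x in increasing order
x[o]                       # x sorted
y[o]                       # y rearranged the same way, so pairs stay matched

pdf(NULL)                  # throw-away device; leave this out to plot on screen
plot(x[o], y[o], type="l") # the line now runs left to right
invisible(dev.off())
```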

You must have already set up a plotting command with a function such as plot or hist to use these commands. You can set up the coordinates/axes/range without actually plotting anything using the option type="n"

> plot(sodium,potassium,type="n")
> points(sodium[mfr=="G"],potassium[mfr=="G"],col="red")
> abline(h=c(max(potassium)-1,min(potassium)+1),lty=c(2,3))

I don't actually need to do this, because I could instead have:

> plot(sodium[mfr=="G"],potassium[mfr=="G"],
    ylim=c(min(potassium),max(potassium)))
> abline(h=c(max(potassium)-1,min(potassium)+1),lty=c(2,3))
Manipulating plots
- Multiple plot windows
  When you type in a graphing command, a plotting window comes up automatically. Sometimes you would like to have multiple plotting windows for different graphs.
  The command win.graph() (Windows) or x11() (Unix/Windows/Mac?) brings up another graphing window. To pick one, use the numbers at the top of the window as the argument for dev.set().
  
  > x11()
  > boxplot(sugars)
  > dev.set(2)
  > plot(potassium, protein)
- Prettying your graph
  (see help(par) for more details) - you will find sometimes these are quite easy to implement, but other times some of the settings don't want to work with the plotting function you are using. It takes a good bit of experimenting.
  Some commands you call independently, through the function par () and affect all graphs
  - par(mfrow=c(x,y))
    creates a grid of plots with x rows and y columns.
    Most are options that you put in the plot command just for a particular plot
  - xlab, ylab
    x-axis or y-axis labels
  - lty
    Line type (0=blank, 1=solid, 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or one of the character strings "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash", where "blank" uses invisible lines (i.e., doesn't draw them).
  - pch
    Style of points in graph
  - bg, col.lab, col.main, col.sub
    The color for the background, labels, main title, and subtitle respectively. Usually use values like "red" or "tan" to pick color. Type colors() to see all options.
  - las
    Style of axis labels. (0=parallel, 1=all horizontal, 2=all perpendicular to axis, 3=all vertical)
  - font
    1=plain, 2=bold, 3=italic, 4=bold italic
    You can also set some of these things after you have already made your main plot.
  - asp (within plot command) giving the *asp*ect ratio y/x.
    If 'asp' is a finite positive value then the window is set up so that one data unit in the x direction is equal in length to 'asp' * one data unit in the y direction. The special case 'asp == 1' produces plots where distances between points are represented accurately on screen.
  - title(), axis()
  Try help on par, plot, plot.default, plot.window, points, lines to really find everything!
Examples:

> par(mfrow=c(2,2))
> plot(fib.lm)

> par(mfrow=c(1,2))
> #first plot
> plot(carbo,fibre,pch=3, las=1,main="Fiber versus Carbohydrates",
    sub="A cool subtitle is useful")
> #second plot
> hist(calories,freq=F,xlab="Calories",main="",sub="",col.lab="green")
> lines(density(calories),col="red",lwd=1.2)
> temp.fcn<-function(x)dnorm(x,mean=mean(calories),sd=sd(calories))
> curve(temp.fcn,col="blue",lwd=1.2,add=T)
> title(main="Histogram of Calories",
    sub="Normal density and empirical estimate of density overlaid")

> par(mfrow=c(1,1))
> plot(sodium,potassium,col=c("red","blue","green","yellow",
    "purple","pink")[mfr], pch=c(1:3)[vitamins])
> legend("bottomright",legend=c(levels(mfr),levels(vitamins)),
    col=c("red","blue","green","yellow","purple","pink",rep("black",3)),
    pch=c(rep(-1,6),1:3),ncol=2,lty=c(rep(1,6),rep(-1,3)),lwd=6,pt.lwd=1)
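Because settings made through par() affect all later graphs, it can be convenient to save the old settings and restore them when you are done. The sketch below is one way to do that, using made-up data; par() returns the previous values of whatever it changes, and pdf(NULL) is a throw-away device used only so the example runs without a screen. The names old.par and restored are made up for the illustration.

```r
pdf(NULL)                               # throw-away device; omit to draw on screen

old.par <- par(mfrow=c(1,2), las=1)     # change settings, keep the old ones

hist(rnorm(100), main="Left panel", xlab="made-up data")
plot(1:10, (1:10)^2, pch=3, xlab="x", ylab="x squared", main="Right panel")

par(old.par)                            # put the settings back
restored <- par("mfrow")                # back to a single plot per page
restored

invisible(dev.off())
```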
More Complicated Plots
We'll use microarray data with the multtest package:

> data(golub)
> resT<-mt.maxT(golub,classlabel=golub.cl,test="t") #takes a minute
> names(resT)
[1] "index"    "teststat" "rawp"     "adjp"
> golub.signif<-golub[resT$adjp<.05,]
- heatmap()
  
  > heatmap(golub.signif)
  
  More elaborate:
  
  > golub.fac<-factor(golub.cl,labels=c("ALL","AML"))
  > gnames.signif<-golub.gnames[resT$adjp<.05,3]
  > library(RColorBrewer)
  > mybreaks<-seq(min(golub.signif),max(golub.signif),length=12)
  > heatmap(golub.signif,ColSideColors=c("cyan","pink")[golub.fac],
      labRow=gnames.signif,col=brewer.pal(11,"PRGn"),scale="none",
      breaks=mybreaks)
  > #make a legend
  > win.graph()
  > par(mar=c(4.1,2.1,.1,2.1))
  > mids<-mybreaks[1:11]+diff(mybreaks)/2
  > image(x=mids,y=1,matrix(mids,ncol=1),breaks=mybreaks,
      col=brewer.pal(11,"PRGn"),axes=F,xlab="",ylab="")
  > axis(1,mybreaks,labels=signif(mybreaks,3))
  > box()
- matplot
  You can plot the columns of matrices using matplot (or matpoints and matlines for the lines functions). You can just plot them against their row index:
  
  > matplot(UScereal[,c("sodium","potassium","calories")],
      pch=1:3,col=1:3,ylab="Different Nutrients",type="l")
  > legend("topleft",legend=c("sodium","potassium","calories"),
      pch=1:3,col=1:3)
  
  or against another value:
  
  > matplot(fibre,UScereal[,c("sodium","potassium","calories")],
      lty=1,col=1:3,ylab="Different Nutrients")
  > legend("topleft",legend=c("sodium","potassium","calories"),
      lty=1,col=1:3)
- pairs()
  
  > pairs(UScereal)
Saving Plots
- Using the GUI
  When you are looking at a graph, you can save your existing plot by going to "File-Save As" (you can also use the command savePlot to save the existing plot). In this same way you can also copy the plot to a metafile/bitmap and paste the graph into another program, like Word. For larger projects, though, it's generally better to save plots using the written commands below to control the final format and have a record of the name of plots.
  Also check out the "Recording" option under the "History" menu (or recordPlot(), replayPlot()). If you turn this option on, R will remember the plots made on that screen and you can use the "Page Up" and "Page Down" keys to scroll between your plots.
- Adobe Acrobat (.pdf) format
  While R saves the variables you name, in order to save your plot to print later, you need to save it separately. The easiest is to save the plot into .PDF format (i.e. Adobe Acrobat format). The following saves the x-y plot into a file "protein.pdf" in the directory mydir.
  
  > pdf("mydir/protein.pdf")
  > plot(potassium, protein)
  > dev.off()
  
  NOTE: if you don't do dev.off(), the file is not finished, and any further plots you make are sent to that file instead of to the screen.
- Postscript (.ps)
  Similarly to save in postscript format (in portrait this time, so I say horizontal=F). This would be the preferred format for journals or for further editing in Adobe Illustrator as it seems to save the most information.
  
  > postscript("mydir/sugars.ps", horizontal=F)
  > hist(sugars)
  > dev.off()
You can create a jpeg file of a plot using jpeg(). This is particularly useful if you have a large number of points: pdf stores every point, which takes a lot of time and resources when opening or printing a graph of 10,000 points, while a jpeg is just a picture of it.
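All of these file formats follow the same three steps: open the file device, draw the plot, then close the device. A minimal sketch of that pattern with pdf() is below; it writes to R's temporary directory as a stand-in for mydir, the data are made up, and plotfile is a hypothetical name.

```r
# tempdir() stands in for "mydir" so the example runs anywhere
plotfile <- file.path(tempdir(), "example.pdf")

pdf(plotfile, width=6, height=4)     # 1. open the file device (size in inches)
plot(1:10, (1:10)^2, main="Made-up data")   # 2. draw as usual
invisible(dev.off())                 # 3. close the device to finish the file

file.exists(plotfile)
```

Until the device is closed, nothing appears on screen, because the plotting commands are being sent to the file.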

Writing Functions

Basic Control Functions
You define a function in R using the command function. The following function returns the mean, standard deviation, and upper and lower 95% confidence interval limits in the form of a list.

> mysum<-function(x, conf.inv=T){
    m<-mean(x)
    if(conf.inv==T){
        n<-length(x)
        uppconf<-mean(x)+2*sd(x)/sqrt(n)
        lowconf<-mean(x)-2*sd(x)/sqrt(n)
        return(list(mean=m,sd=sd(x),uppconf=uppconf,lowconf=lowconf))
    }
    else return(list(mean=m,sd=sd(x)))
}
> mysum(potassium)
$mean
[1] 159.1197

$sd
[1] 180.2886

$uppconf
[1] 203.8438

$lowconf
[1] 114.3957

Basic programming functions are,

if else while

break next for

stop and warning are functions that allow you to check that certain conditions are satisfied. You can comment your code using the # symbol.
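As a rough sketch of how stop and warning are typically used inside a function, consider the hypothetical safe.mean below (it is not part of R or of this handout's code). stop ends the function with an error message, while warning reports a problem and lets the function carry on.

```r
safe.mean <- function(x){
    # refuse input the function cannot handle
    if(!is.numeric(x)) stop("x must be numeric")
    # carry on, but tell the user that something looks off
    if(any(is.na(x))) warning("x contains NA values; they are dropped")
    mean(x, na.rm=TRUE)
}

safe.mean(c(1, 2, 3))

# try() catches the error so that the rest of a script keeps running
res <- try(safe.mean("a"), silent=TRUE)
class(res)
```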
Note that for loops are generally slow in R, and using apply or sapply is preferable if each step of the loop does not depend on the results of earlier steps. For example, the following code, which finds the upper confidence limit for each of several columns,

> my.ind<-c(2,4,8)
> x<-vector(length=length(my.ind))
> n<-nrow(UScereal)
> for(i in 1:length(my.ind) ){
      x[i]<-mean(UScereal[,my.ind[i]])+
            2*sd(UScereal[,my.ind[i]])/sqrt(n)
  }
> x
[1] 164.890738   1.831168  11.498387

could be written as

> x<-apply(UScereal[,my.ind],2,function(y){
          mean(y)+2*sd(y)/sqrt(length(y))})
> x
  calories        fat     sugars
164.890738   1.831168  11.498387

If the function is already defined, then apply is even easier:

> apply(UScereal[,my.ind],1,mean)

finds the row means.
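The second argument of apply decides the direction: 1 works across each row and 2 works down each column. The small sketch below, on a made-up matrix m, shows both, along with sapply, which does the same kind of thing for the elements of a list.

```r
m <- matrix(1:6, nrow=2)             # made-up 2 x 3 matrix

apply(m, 1, mean)                    # one value per row
apply(m, 2, mean)                    # one value per column

# sapply applies a function to each element of a list (or vector)
sapply(list(a=1:3, b=4:6), mean)

# For plain means, the built-in colMeans gives the same answer
all.equal(apply(m, 2, mean), colMeans(m))
```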
Tips:
- In if statements, you should use any or all for robust programming, because the condition must be a single TRUE or FALSE (recent versions of R stop with an error when it has length greater than one).
  
  > if(Numb>2) print(fac) else print(Numb);
  > if(any(Numb>2)) print(fac) else print(Numb);
  > if(all(Numb>2)) print(fac) else print(Numb);
- apply requires matrices
  
  > apply(Numb,1,sum)
  > apply(matrix(Numb,nrow=1),1,sum)
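Both tips can be checked on a small vector. The sketch below uses a made-up vector v with the same values as Numb; msg is a hypothetical name used only to hold the result of the if statement.

```r
v <- c(2, 4, -1.4)

v > 2                      # three answers, one per element
any(v > 2)                 # TRUE if at least one element passes
all(v > 2)                 # TRUE only if every element passes

# any() collapses the comparison to the single value that if() needs
msg <- if(any(v > 2)) "at least one above 2" else "none above 2"
msg

is.null(dim(v))            # a plain vector has no dimensions, so apply() rejects it
apply(matrix(v, nrow=1), 1, sum)   # wrapped as a one-row matrix, it works
```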
Finding errors in Your Program
R does not have great debugging mechanisms and the error messages are ... cryptic. Here are a couple of things that can be helpful
- Source your function
  Save your function by itself in a text/.R file. Then when you want to load it into R, use the source command. This reads the file and executes the file. For a file with just a function, this will load your function or changes, and most importantly, will tell you the line number of a syntax error.
- Traceback
  If you are calling functions within functions, as we did in calling mean and sd, traceback() tells you what function had the error
  
  > myerror<-function(){sum(fac)}
  > myerror()
  Error in Summary.factor(..., na.rm = na.rm) :
          sum not meaningful for factors
  > traceback()
  4: stop(.Generic, " not meaningful for factors")
  3: Summary.factor(..., na.rm = na.rm)
  2: sum(fac)
  1: myerror()
- Debugging
  There are several functions that are supposed to help debug your function. I find the most useful is debug. This allows you to step through the function and figure out what the problem is. My function is supposed to subtract the mean of each column and of each row (there are functions that center matrices, by the way: sweep or scale)
  
  > myCentered<-function(x){
      rsum<-apply(x,1,mean)
      rcentered<-x-rsum
      csum<-apply(x,2,mean)
      ccentered<-x-csum
      return(list(row=rcentered,col=ccentered))
    }
  > myCentered(mat)
  $row
       [,1] [,2]
  [1,] -0.5  0.5
  [2,]  1.0 -1.0
  [3,] -1.0  1.0
  
  $col
            [,1]       [,2]
  [1,] -1.000000 -0.3333333
  [2,]  1.666667  0.0000000
  [3,] -1.000000  0.6666667
  
  There's no error, just not what I was wanting - the row centering worked, but the column centering didn't. I can use debug to go into the function and try it as it is working. Namely, R pauses before each command and waits to execute it until you ask it to. To get R to execute the next line, hit "return" or type "n". Otherwise, you can just type in what you want within the operation of the function using the objects within the function. This is very helpful with large functions. Try to follow along with the code below to get the idea.
  
  > debug(myCentered)
  > myCentered(mat)
  debugging in: myCentered(mat)
  debug: {
      rsum <- apply(x, 1, mean)
      rcentered <- x - rsum
      csum <- apply(x, 2, mean)
      ccentered <- x - csum
      return(list(row = rcentered, col = ccentered))
  }
  Browse[1]> n
  debug: rsum <- apply(x, 1, mean)
  Browse[1]> n
  debug: rcentered <- x - rsum
  Browse[1]> n
  debug: csum <- apply(x, 2, mean)
  Browse[1]> csum #I try to look at my object too soon
  Error: Object "csum" not found
  Browse[1]> n
  debug: ccentered <- x - csum
  Browse[1]> csum #now it's there
  [1] 2.000000 2.333333
  Browse[1]> x
       [,1] [,2]
  [1,]    1    2
  [2,]    4    2
  [3,]    1    3
  Browse[1]> x-csum # So I see this subtracts off across rows...
            [,1]       [,2]
  [1,] -1.000000 -0.3333333
  [2,]  1.666667  0.0000000
  [3,] -1.000000  0.6666667
  Browse[1]> t(x)-csum # the right answer, but transposed...
             [,1]       [,2]       [,3]
  [1,] -1.0000000  2.0000000 -1.0000000
  [2,] -0.3333333 -0.3333333  0.6666667
  Browse[1]> t(t(x)-csum) #there we go, now I can fix it with this code
       [,1]       [,2]
  [1,]   -1 -0.3333333
  [2,]    2 -0.3333333
  [3,]   -1  0.6666667
  Browse[1]> Q #get me out of here
  > undebug(myCentered) #Turns off the debugging
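For comparison, here is a sketch of the same centering done with sweep, which takes the direction as an explicit argument (1 for rows, 2 for columns) and so avoids the recycling problem found in the debugging session. The matrix is the same mat used throughout; the names row.centered and col.centered are made up for the illustration.

```r
mat <- matrix(c(1,2,4,2,1,3), nrow=3, ncol=2, byrow=TRUE)

# sweep(x, MARGIN, STATS) subtracts STATS along the given margin by default
row.centered <- sweep(mat, 1, rowMeans(mat))   # subtract each row's mean
col.centered <- sweep(mat, 2, colMeans(mat))   # subtract each column's mean
col.centered

# Same result as the transpose trick found while debugging
all.equal(col.centered, t(t(mat) - colMeans(mat)))
```

scale(mat, scale=FALSE) also centers the columns, but it attaches the column means to the result as an attribute.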
Another example: A function that for a given matrix, will plot the standard 95% confidence intervals for each column. Give options for line thickness and color. For example,

> myplot.CI(UScereal[,c(3,4,6)],col=c("red","green","blue"),lwd=1:3)

Should give something like this
A couple of hints for a good plot function:
1. Use "..." for a bunch of commands that you don't want to specify but you want passed on to another function, like plot
2. Use invisible for a function to return a value only if it is assigned to a new variable; good for returning the coordinates, etc. only if asked
3. Put in default values for non-essential options
My solution:

> myplot.CI<-function(mat,col="black",lwd=1,...){
    #lower and upper limits for each column: mean +/- 2 standard errors
    lengths<-apply(mat,2,function(x){
        n<-length(x)
        return(c(lw=mean(x)-2*sd(x)/sqrt(n), up=mean(x)+2*sd(x)/sqrt(n)))
    })
    range.len<-range(lengths)
    #plot the column means, one row of the plot per column of mat
    plot(apply(mat,2,mean),1:ncol(mat),xlim=range.len,...)
    segments(lengths[1,],1:ncol(mat),lengths[2,],1:ncol(mat),
        col=col,lwd=lwd)
    invisible(t(lengths))
  }

Elizabeth Anne Purdom 2006-02-22
