This lab is designed so that you can work through it on your own. If you come to the scheduled lab, we will get everyone started and then answer questions as they arise. If you are unfamiliar with R, you should try the commands yourself and not just read the text. The exercises generally point out features of R that might be useful, or common mistakes; almost every exercise introduces a little something new, so I recommend that you at least read through them. The beginning exercises will be needed for later examples. If you are already familiar with R, there are still some handy tips sprinkled throughout.
Also, I have made a summary of useful commands by Category here. These do not describe the commands, but just give lists. Use the help function as described below to find out what they do.
Nuts & Bolts of Running R
NOTE: if you use "tree" you will not be able to run all of the help, because help.start( ) requires HTML.
You can get X programs for your personal computer, and they have various trial periods. These are two I know about:
Getting Started in R
R is a command-line language, which means you type commands at the prompt ">" and the output appears after you hit return. In other words, there are no drop-down menus or mouse commands. Even if you use the "Windows" version, it still works basically this way. This gives you a lot of control, but can make it a little intimidating at first. If you are doing a lot of complicated things, it's often a good idea to have a text editor (like Notepad) open in another window so that you can save your commands.
We are going to use preloaded data in this lab. To access it type the following command:
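The command itself did not survive in this copy of the lab. A minimal sketch, assuming the preloaded data is the UScereal data frame used in the rest of the lab and that it comes from the MASS package (which is normally installed along with R):

```r
library(MASS)       # assumption: UScereal is taken from the MASS package
dim(UScereal)       # number of rows (cereals) and columns (variables)
head(UScereal, 3)   # print only the first three rows
```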
We also are going to want to save files into a folder for future reference. When I refer to mydir you should type in an appropriate directory. An example might be
"C:/Documents and Settings/lelandID/My Documents"
Note that Windows uses a "\" to separate subdirectories, unlike Unix, which uses "/". To give a full path name in R using Windows, you can use the Unix "/" or use "\\". You cannot just use the Windows "\".
In the computer lab, we will give you instructions for what to insert in place of mydir. For more information about finding your working directory, etc., see http://www.stanford.edu/~epurdom/Saving/Saving.html
To save output to some variable name, use "<-" (in very old code you will sometimes see "_" used the same way, but current versions of R no longer accept it).
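A small sketch of assignment; the names x and y are just placeholders:

```r
x <- 5        # store the value 5 under the name x
y <- x * 2    # use x in a calculation and store the result
y             # typing a name at the prompt prints its value
```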
You can make some kinds of vectors quickly using seq and rep
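For instance (the variable names are placeholders, and c() is the basic way to type in a vector by hand):

```r
a <- c(4, -3, 2)                  # combine individual values into a vector
b <- seq(1, 10, by = 3)           # 1 4 7 10
d <- seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00
e <- rep(c(1, 2), times = 3)      # 1 2 1 2 1 2
f <- rep(c("A", "B"), each = 2)   # "A" "A" "B" "B"
```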
You can make matrices from scratch using matrix() or from previously made vectors using rbind(),cbind()
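A short sketch of both routes, with made-up values:

```r
m1 <- matrix(1:6, nrow = 2, ncol = 3)       # filled column by column
m2 <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled row by row instead

v1 <- c(1, 2, 3)
v2 <- c(10, 20, 30)
r  <- rbind(v1, v2)   # 2 x 3 matrix: each vector becomes a row
cc <- cbind(v1, v2)   # 3 x 2 matrix: each vector becomes a column
dim(r)
dim(cc)
```

Note that matrix() fills down the columns unless you ask for byrow=TRUE, which is a common source of surprises.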
Dataframes generally act like matrices, but allow different columns to be vectors of different types, like character, factor, numeric, etc.
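As an illustration, here is a tiny data frame with made-up columns of three different types, plus a factor added afterwards:

```r
df <- data.frame(name   = c("a", "b", "c"),       # character
                 score  = c(90, 85.5, 70),        # numeric
                 passed = c(TRUE, TRUE, FALSE))   # logical
df$group <- factor(c("x", "y", "x"))              # add a factor column
str(df)            # shows the type of each column
df[2, "score"]     # matrix-style indexing still works
```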
Lists allow you to put together any kind of data to keep track of it.
$vec
[1]  4 -3  2
$afac
[1] C B C C D A
Levels: A B C D
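The command that produced this output is not shown here. A list that prints this way could be built roughly as follows; the list name mylist is my own placeholder, while the element names vec and afac are taken from the output:

```r
mylist <- list(vec  = c(4, -3, 2),
               afac = factor(c("C", "B", "C", "C", "D", "A")))
mylist             # prints each element under its $name
mylist$vec         # pull out one element by name
mylist[["afac"]]   # the same idea using double brackets
```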
You can use the functions of the format is.xx to figure out what type object you have.
You can also try to convert objects to different types using functions as.xx.
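A quick sketch with made-up values, including a conversion that often catches people out:

```r
x <- c("1", "2", "3")
is.character(x)      # TRUE
is.numeric(x)        # FALSE
y <- as.numeric(x)   # now the numbers 1 2 3

f <- factor(c("10", "20", "10"))
as.numeric(f)                 # 1 2 1: the internal level codes, not the labels
as.numeric(as.character(f))   # 10 20 10: convert to character first
```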
c | seq | rep | names |
matrix | rbind | cbind | rownames |
colnames | factor | levels | data.frame |
as.xx | is.xx |
NOTE: If you later reassign something else to that same name, you lose the old information, and there is no warning before you do it, so use ls() to see if you have already used the name. You also should not give a name to your variable that is already a name of an R function because there will be unexpected consequences.
To quit from R, type q() at the prompt.
This part is useful but not necessary, so feel free to skip this as needed.
If you want to save one particular object that you made, say to transfer to another computer or to back it up, you can use the command save. If you give your file the extension ".rdata", Windows recognizes it as an R data file and will launch R automatically when you open it.
If you exit R and look at the files in your directory, you will see Numb.Rdata. You can move this file around to different directories - this is a way of saving your information. dump() is another way, but not as nice. If you move it to a different directory you can get it back within R by using load()
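A sketch of the round trip. The contents of Numb are made up, and I use R's temporary folder in place of mydir so the example does not touch your own files:

```r
Numb <- c(1, 2, 3)                           # made-up object to save
f <- file.path(tempdir(), "Numb.Rdata")      # temporary folder stands in for mydir
save(Numb, file = f)                         # write the object to disk

rm(Numb)                                     # remove it from the workspace
load(f)                                      # bring it back under its original name
Numb
```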
Of course, since you saved when you exited, you don't need to load the information back in. You can actually access all of your data that you saved when you exited by loading the .RData file.
You can also remove objects using the rm() command, but it's permanent (there is no question asking if you really meant it!):
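For example, with a throwaway object (temp1 is just a placeholder name):

```r
temp1 <- 1:3
"temp1" %in% ls()   # TRUE: the name is now in use
rm(temp1)           # gone immediately, with no confirmation
"temp1" %in% ls()   # FALSE
```

The same kind of check before you assign helps avoid overwriting something you wanted to keep.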
ls | q | save | load | rm |
You can look at all of your data as said before by just typing the name at the prompt
But with a large data set, it will be too big to be displayed like this. Instead you want to look at a portion through indexing.
Datasets are thought of like matrices, so you can pick off pieces of the dataset by specifying the row or column entry of part of the data by typing data[row,columns]. So UScereal[3,7] would list the entry in the 3rd row and 7th column. UScereal[ ,4] gives the entire 4th column, and so on. The following are examples of pulling out parts of the data.
1. > UScereal[2,2:5]
2. > UScereal[c(1,5,6) , ]
3. > UScereal$mfr
4. > UScereal$calories[1:5]
5. > UScereal[c("All-Bran","Bran Chex"),]
You can save a portion of your data, say to experiment on or to reduce the number of variables, by assigning it to a variable name
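A sketch, assuming UScereal comes from the MASS package; the name cereal.small and the choice of rows and columns are mine:

```r
library(MASS)   # assumption: source of UScereal
cereal.small <- UScereal[1:10, c("calories", "protein", "fat")]
dim(cereal.small)   # 10 rows, 3 columns
cereal.small
```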
1. What does names(UScereal) tell you? What about colnames(UScereal)?
2. Make another copy of UScereal to experiment on:
What happens to uscer.temp when you do
OR
3. How do you find the ratio of fat to protein for each cereal? (i.e. fat/protein for each entry)
Here is a link to a summary of indexing in R. Note that list elements are indexed by xx$name if there are names. Lists are the common form of output from functions such as lm (the regression function), which need to store many different kinds of output for the user.
If you are going to be frequently using a dataset with many variables/columns, instead of constantly typing UScereal$protein, and so forth, you can "attach" the data set. This means the names at the top of the columns become variable names that you can use directly (but they won't show up when you type ls()).
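A sketch of attaching and detaching, assuming UScereal comes from the MASS package:

```r
library(MASS)       # assumption: source of UScereal
attach(UScereal)    # column names can now be used directly
m <- mean(protein)  # same as mean(UScereal$protein)
m
detach(UScereal)    # undo the attach when you are finished
```

Detaching when you are done avoids confusion if you later create a variable with the same name as a column.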
You can evaluate the truth of statements element-wise in R using traditional logical commands
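A short sketch with a made-up vector; the logical results can be used directly as an index:

```r
x <- c(3, 8, 1, 10)
x > 2               # TRUE TRUE FALSE TRUE
x > 2 & x < 9       # TRUE TRUE FALSE FALSE  (element-wise "and")
x == 1 | x == 10    # FALSE FALSE TRUE TRUE  (element-wise "or")
!(x > 2)            # FALSE FALSE TRUE FALSE (negation)
x[x > 2]            # keep only the entries where the statement is TRUE
sum(x > 2)          # TRUE counts as 1, so this counts the matches
```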
You can also find the indices directly using which
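For example, with a made-up vector:

```r
x <- c(3, 8, 1, 10)
idx <- which(x > 2)   # positions of the entries that satisfy the condition
idx                   # 1 2 4
x[idx]                # the entries themselves: 3 8 10
which.max(x)          # position of the largest entry: 4
which.min(x)          # position of the smallest entry: 3
```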
Another related indexing is by factor variables, so that you can have the following
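The original example is not shown in this copy. One way to index by a factor, sketched under the assumption that UScereal comes from the MASS package and that mfr (the manufacturer) is stored as a factor:

```r
library(MASS)                           # assumption: source of UScereal
first <- levels(UScereal$mfr)[1]        # first manufacturer code, whatever it is
cal.first <- UScereal$calories[UScereal$mfr == first]   # calories for that manufacturer only
cal.first

# apply a function separately within each level of the factor
by.mfr <- tapply(UScereal$calories, UScereal$mfr, mean)
by.mfr
```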
Many things that you will be doing in R will be calling an already created function. Simple examples are mean( ), scan( ), plot( ), sd( ), median( ), sort( ). ( ) means the information (usually data) that you need to feed to the function.
NOTE: sd( ) is the standard deviation, dividing by n-1
Example: finding the mean/average of the column "protein" in the dataset UScereal
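A sketch of the call, assuming UScereal comes from the MASS package; both lines pick out the same column:

```r
library(MASS)                        # assumption: source of UScereal
m1 <- mean(UScereal$protein)         # column picked out by name with $
m2 <- mean(UScereal[, "protein"])    # same column, matrix-style indexing
m1
```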
1. Now find the mean, standard deviation, and median of the data column "sugars".
2. What happens if you type the following
3. What are the five smallest values of potassium? (You should not have to search for them manually)
Functions may have many different options you can set when you call the function. You can find out about a function by typing help(FunctionName).
If you just want to remember what the possible options are you can use args:
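For instance:

```r
args(sd)            # lists the arguments x and na.rm with their defaults
args(read.table)    # a function with many more options
```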
mean | sd | median | summary | cor |
range | max | min | which.max | which.min |
length | dim | sort | unique | rowMeans |
colMeans | rowSums | colSums | cumsum |
prod | round | zapsmall |
The function read.table reads in text where each row of the data is on a separate line and the columns of the data are separated by a fixed character. The default separator is ANY white space. Generally files will be tab delimited ("\t") or comma delimited (","), and you can specify this explicitly with the sep argument. You must give a file name or a URL:
> read.table("…~epurdom/state.txt", sep="\t", header=T, row.names=1)
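The resulting object is a data.frame. In versions of R before 4.0, non-numeric values are made into factor variables; from R 4.0 on they are kept as character strings unless you set stringsAsFactors=TRUE. If header=T, the first row is taken to contain the names of the columns, not data.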
You can write data to files using write.table
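A sketch of writing a small table and reading it back. The data, the file name example.txt, and the use of R's temporary folder are all made up for illustration:

```r
df <- data.frame(pop    = c(100, 250),
                 region = c("West", "South"),
                 row.names = c("StateA", "StateB"))
f <- file.path(tempdir(), "example.txt")   # temporary folder stands in for mydir

# tab-delimited, no quotes; col.names = NA leaves a blank header cell above the row names
write.table(df, file = f, sep = "\t", quote = FALSE, col.names = NA)

# first line holds the column names, first column holds the row names
back <- read.table(f, sep = "\t", header = TRUE, row.names = 1,
                   stringsAsFactors = TRUE)
back
```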
If you are working on the Leland prompt, it's important that you have already set up the display environment; otherwise nothing will happen when you try to plot.
1. Try each of the plots with the UScereal data.
2. What is happening with the following commands and why?
The following functions add to an existing plot:
lines | points | curve | rect |
segments | abline |
You must have already set up a plotting command with a function such as plot or hist to use these commands. You can set up the coordinates/axes/range without actually plotting anything using the option type="n"
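A sketch of setting up empty axes and then adding to them, assuming UScereal comes from the MASS package. I send the plot to a temporary PDF file only so that the example runs the same way with or without a screen; at the prompt you would normally leave out the pdf() and dev.off() lines:

```r
library(MASS)                                   # assumption: source of UScereal
f <- file.path(tempdir(), "empty-axes.pdf")
pdf(f)                                          # only needed when there is no screen device
plot(UScereal$protein, UScereal$calories, type = "n",
     xlab = "protein", ylab = "calories")       # axes and labels only, nothing drawn
points(UScereal$protein, UScereal$calories, pch = 16)   # now add the points
abline(h = mean(UScereal$calories), lty = 2)            # dashed line at the mean
dev.off()
```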
When you are looking at a graph, you can save your existing plot by going to "File-Save As" (you can also use the command savePlot to save the existing plot). In this same way you can also copy the plot to a metafile/bitmap and paste the graph into another program, like Word. For larger projects, though, it's generally better to save plots using the written commands below to control the final format and have a record of the name of plots.
Also check out the "Recording" option under the "History" menu (or recordPlot(), replayPlot()). If you turn this option on, R will remember the plots made on that screen and you can use the "Page Up" and "Page Down" commands to scroll between your plots.
While R saves the variables you name, in order to save your plot to print later, you need to save it separately. The easiest is to save the plot into .PDF format (i.e. Adobe Acrobat format). The following saves the x-y plot into a file "protein.pdf" in the directory you started R in.
NOTE: the file is not finished until you call dev.off(); until then, any further plots you make will go into the file you are trying to save instead of to the screen.
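A sketch of the sequence, assuming UScereal comes from the MASS package. Which two variables the original x-y plot used is not shown, so protein against calories is my choice, and I write into R's temporary folder instead of the starting directory:

```r
library(MASS)                                # assumption: source of UScereal
out <- file.path(tempdir(), "protein.pdf")   # temporary folder stands in for your directory
pdf(out)                                     # open the PDF file as the plotting device
plot(UScereal$protein, UScereal$calories)    # the plot goes into the file, not the screen
dev.off()                                    # close the file so it can be opened or printed
file.exists(out)
```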
Similarly to save in postscript format (in portrait this time, so I say horizontal=F)
When you type in a graphing command, a plotting window comes up automatically. Sometimes you would like to have multiple plotting windows for different graphs.
The command win.graph() (Windows) or x11() (Unix/Windows/Mac?) brings up another graphing window. To pick one, use the numbers at the top of the window as the argument for dev.set().
(see help(par) for more details) - you will find sometimes these are quite easy to implement, but other times some of the settings don't want to work with the plotting function you are using. It takes a good bit of experimenting.
Some commands you call independently, through the function par(), and they affect all graphs:
par(mfrow=c(x,y)): creates a grid of plots with x rows and y columns.
Most are options that you put in the plot command just for a particular plot:
xlab, ylab: x-axis or y-axis labels
lty: Line type (0=blank, 1=solid, 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash", where "blank" uses 'invisible lines' (i.e., doesn't draw them).
pch: Style of points in graph
bg, col.lab, col.main, col.sub: The color for the background, labels, main title, and subtitle respectively. Usually use values like "red" or "tan" to pick color. Type colors() to see all options.
las: Style of axis labels. (0=parallel, 1=all horizontal, 2=all perpendicular to axis, 3=all vertical)
font: 1=plain, 2=bold, 3=italic, 4=bold italic
You can also set some of these things after you have already made your main plot.
Example:
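The original example is not shown in this copy. As a rough sketch, assuming UScereal comes from the MASS package, the following uses par() for a grid of plots, sets a few options inside plot(), and adds a title afterwards. The particular variables and colors are my own choices, and the plot is sent to a temporary PDF only so the example runs without a screen:

```r
library(MASS)                                  # assumption: source of UScereal
f <- file.path(tempdir(), "options.pdf")
pdf(f)                                         # leave out pdf()/dev.off() to plot on screen
old <- par(mfrow = c(1, 2))                    # 1 row, 2 columns of plots

plot(UScereal$protein, UScereal$calories,
     xlab = "Protein", ylab = "Calories",      # axis labels
     pch = 3, col = "red", las = 1)            # point style, color, horizontal tick labels
title(main = "Calories vs. protein", col.main = "blue")   # set after the plot is made

plot(sort(UScereal$sugars), type = "l", lty = 2,           # dashed line
     ylab = "Sugars (sorted)")

par(old)                                       # restore the previous settings
dev.off()
```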
4. Try the following command using matplot() and figure out what it does:
How would you make the x-axis values equally spaced, rather than dependent on the values of sodium?
You define a function in R using the command function. The following function returns the mean, standard deviation, and upper and lower 95% confidence interval limits in the form of a list.
$sd
[1] 180.2886
$uppconf
[1] 203.8438
$lowconf
[1] 114.3957
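The function definition itself is missing from this copy. A function returning a list of this shape might look like the sketch below. The name meanci is a placeholder, the element names are taken from the output, and using 2 standard errors for the 95% limits is my assumption:

```r
# placeholder name; the original definition is not shown
meanci <- function(x) {
  m <- mean(x)
  s <- sd(x)
  half <- 2 * s / sqrt(length(x))   # assumption: about 2 standard errors for 95%
  list(mean = m, sd = s, uppconf = m + half, lowconf = m - half)
}

res <- meanci(c(2, 4, 4, 6))   # made-up data
res$mean                       # pick out one piece of the returned list
res
```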
Basic programming functions are,
if | else | while |
break | next | for |
stop and warning are functions that allow the user to check that certain conditions are satisfied. You can comment your code using the # symbol.
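A small sketch with a made-up function, safe_log, showing the difference between the two:

```r
safe_log <- function(x) {
  # stop() ends the function immediately with an error message
  if (!is.numeric(x)) stop("x must be numeric")
  # warning() prints a message but lets the function carry on
  if (any(x <= 0)) {
    warning("non-positive values set to NA")
    x[x <= 0] <- NA
  }
  log(x)
}

safe_log(c(1, 10))   # fine: no message
```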
Note that for loops are generally slow in R, and using apply or sapply is preferable if the function is not actually recursive. For example, the following code that finds the upper confidence interval for each
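The original code is not shown in this copy. As a sketch of the comparison, assuming UScereal comes from the MASS package and that the limit is wanted for each numeric column (both are my assumptions, as is the helper upp with its 2-standard-error limit):

```r
library(MASS)                                   # assumption: source of UScereal
upp <- function(x) mean(x) + 2 * sd(x) / sqrt(length(x))   # hypothetical helper
num <- UScereal[, sapply(UScereal, is.numeric)]             # keep only numeric columns

# version 1: an explicit for loop
res.loop <- numeric(ncol(num))
for (i in seq_len(ncol(num))) {
  res.loop[i] <- upp(num[[i]])
}
names(res.loop) <- names(num)

# version 2: sapply applies the function to each column in one line
res.apply <- sapply(num, upp)

all.equal(res.loop, res.apply)   # both give the same answers
```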
2. What's the problem with the following code? (This is a very annoying feature of R to watch for in programming... )
How would you fix this? (there are 2 obvious ways, depending on the circumstances - one of which uses sapply or lapply)
Some useful logical functions for programming and other things can be found here.
R does not have great debugging mechanisms and the error messages are ... cryptic. Here are a couple of things that can be helpful
Save your function by itself in a text/.R file. Then when you want to load it into R, use the source command. This reads the file and executes the file. For a file with just a function, this will load your function or changes, and most importantly, will tell you the line number of a syntax error.
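A sketch of the workflow. Normally you would write the file in your text editor; here I create a made-up file, myfun.R, from within R (in the temporary folder) just to keep the example self-contained:

```r
f <- file.path(tempdir(), "myfun.R")     # made-up file name
writeLines(c("# contents of myfun.R",
             "square <- function(x) {",
             "  x^2",
             "}"), f)

source(f)      # reads the file and executes it, which defines square()
square(4)
```

After editing the file, calling source again replaces the old version of the function with the new one.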
If you are calling functions within functions, as we did in calling mean and sd, traceback() tells you what function had the error
There are several functions that are supposed to help debug your function. I find the most useful is debug. This allows you to step through the function line by line and figure out what the problem is. My function is supposed to subtract both the mean of each column and the mean of each row (there are functions that center matrices, by the way: sweep or scale).
Should give something like this
A couple of hints for a good plot function: