To Download R to Personal Computer
Getting Started in R
R is a command-line language, which means you type commands at the prompt ">" and the output appears after you hit return. In other words, there are no drop-down menus or mouse commands. Even if you use the "Windows" version, it still works basically this way. This gives you a lot of control, but can make it a little intimidating at first. If you are doing a lot of complicated things, it is often a good idea to have a text editor (like Notepad) open in another window so that you can save your commands.
We also are going to want to save files into a folder for future reference. When I refer to mydir you should type in an appropriate directory. An example might be
"C:/Documents and Settings/lelandID/My Documents"
Note that Windows uses a "\" to separate subdirectories, unlike Unix, which uses "/". To give a full path name in R using Windows, you can use the Unix "/" or use "\\". You cannot just use the Windows "\".
For more information about finding your working directory, etc., see
http://www.stanford.edu/~epurdom/Saving/Saving.html
To save output to some variable name, use "<-" (in very old code you will sometimes see "_" used the same way, but current versions of R no longer accept it).
You can create descriptive names, as you like, though remember you may have to type them many times, so try to think of short, descriptive names.
You will often be working with vectors, particularly if you are fine-tuning options. The function c( ) creates a vector
If you want to look and see the value of what you have created, you simply type its name at the prompt.
Mathematical formulas are elementwise:
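A minimal sketch of the last few points, using made-up numbers (the names x and y are only for illustration):

```r
# Build two small numeric vectors with c()
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

# Typing a name at the prompt prints its value
x

# Arithmetic is applied element by element
x + y    # 11 22 33 44
x * y    # 10 40 90 160
x^2      # 1 4 9 16
```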
You can make some kinds of vectors quickly using seq and rep
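For instance, a few typical calls might look like this (the variable names are just placeholders):

```r
# seq() builds regular sequences
s1 <- seq(1, 10, by = 2)           # 1 3 5 7 9
s2 <- seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00

# rep() repeats values
r1 <- rep(3, times = 4)            # 3 3 3 3
r2 <- rep(c("a", "b"), each = 2)   # "a" "a" "b" "b"
```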
You can also give names to your vector to keep track of what they correspond to
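A small sketch with invented data (the vector heights and its names are hypothetical):

```r
heights <- c(150, 162, 171)
names(heights) <- c("ann", "bob", "cat")

heights           # prints the values with their names on top
heights["bob"]    # pick out an element by its name
```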
You can make matrices from scratch using matrix() or from previously made vectors using rbind(),cbind()
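As a rough illustration, with made-up vectors a and b:

```r
# From scratch: matrix() fills column by column by default
m1 <- matrix(1:6, nrow = 2, ncol = 3)

# From existing vectors
a <- c(1, 2, 3)
b <- c(4, 5, 6)
m2 <- rbind(a, b)   # each vector becomes a row    (2 x 3)
m3 <- cbind(a, b)   # each vector becomes a column (3 x 2)

dim(m2)
dim(m3)
```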
Matrix operations:
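The common operations can be sketched on two small made-up matrices. Note that * is elementwise, while %*% is the usual matrix product:

```r
A <- matrix(c(2, 0, 1, 3), nrow = 2)   # columns are (2,0) and (1,3)
B <- matrix(c(1, 1, 0, 2), nrow = 2)   # columns are (1,1) and (0,2)

A * B        # elementwise product
A %*% B      # matrix product
t(A)         # transpose
solve(A)     # inverse (A must be square and invertible)
diag(2)      # 2 x 2 identity matrix
```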
Factors are the type of object that allows R to deal with categorical variables. The different possible categories are called levels.
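A short sketch of a factor, using made-up letter grades (the name grade is hypothetical):

```r
grade <- factor(c("C", "B", "C", "C", "D", "A"))

grade               # prints the values and the levels
levels(grade)       # "A" "B" "C" "D"
table(grade)        # how many observations fall in each level
as.integer(grade)   # the underlying integer codes: 3 2 3 3 4 1
```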
Dataframes generally act like matrices, but allow different columns to be vectors of different types, like character, factor, numeric, etc., whereas a matrix forces every entry to be of the same type.
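A small sketch of the difference between a data frame and a matrix, with made-up columns (all names here are invented for illustration):

```r
# A data frame keeps a separate type for each column
df <- data.frame(id    = 1:3,
                 group = factor(c("a", "b", "a")),
                 score = c(2.5, 3.1, 4.0))
sapply(df, class)

# A matrix built from the same pieces converts everything to character
mat <- cbind(id = 1:3, group = c("a", "b", "a"))
mat
```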
Lists allow you to put together any kind of data to keep track of it.
$vec
[1]  4 -3  2

$afac
[1] C B C C D A
Levels: A B C D
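A list that prints like the output above could be built as follows. The name mylist is a guess, since the original call is not shown:

```r
mylist <- list(vec  = c(4, -3, 2),
               afac = factor(c("C", "B", "C", "C", "D", "A")))

mylist             # prints every component
mylist$vec         # pull out one component by name
mylist[["afac"]]   # the same idea with double brackets
names(mylist)
```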
You can use the functions of the format is.xx to figure out what type object you have.
You can also try to convert objects to different types using functions as.xx.
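For example, with a made-up character vector v:

```r
v <- c("1", "2", "3")

is.character(v)    # TRUE
is.numeric(v)      # FALSE

n <- as.numeric(v) # convert the strings to numbers
f <- as.factor(v)  # or to a factor
is.numeric(n)      # TRUE
is.factor(f)       # TRUE

# A conversion that makes no sense gives NA (normally with a warning)
bad <- suppressWarnings(as.numeric("abc"))
```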
NOTE: If you later reassign something else to that same name, you lose the old information, and there is no warning before you do it, so use ls() to see if you have already used the name. You also should not give a name to your variable that is already a name of an R function because there will be unexpected consequences.
To quit from R, type q() at the prompt. R asks whether to save the workspace image.
Type y, and then you should be out of R. If you saved your session, all of these saved variables will still be there to work with when you come back, as long as you start R in the same directory (in Windows/Mac, this is the default directory that you get by opening R).
This part is useful but not necessary, so feel free to skip this as needed.
If you want to save one particular object that you made, say to transfer it to another computer or to back it up, you can use the command save. If you give your file the extension ".rdata", Windows recognizes it as an R data file and opens it in R when you double-click it.
If you exit R and look at the files in your directory, you will see Numb.Rdata. You can move this file around to different directories - this is a way of saving your information. dump() is another way, but not as nice. If you move it to a different directory you can get it back within R by using load()
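A sketch of the round trip with save() and load(). The contents of Numb are made up, and the file is written to R's temporary directory here only so the example does not clutter your working directory; normally you would give a path in mydir:

```r
Numb <- c(3, 1, 4, 1, 5)                 # hypothetical object to keep

f <- file.path(tempdir(), "Numb.Rdata")  # where to put the file
save(Numb, file = f)

rm(Numb)        # the object is now gone from the session
load(f)         # ... and this brings it back under its old name
Numb
```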
Of course, since you saved when you exited, you don't need to load the information back in. You can actually access all of your data that you saved when you exited by loading the .RData file.
You can also remove objects using the rm() command, but it's permanent (there is no question asking if you really meant it!):
This is fairly basic, and is largely applicable to users of the Windows GUI, though similar GUI capabilities are available with macs. Libraries are additional functions that are available in R, usually more specialized. If the library is already on your computer (i.e. it's one of the standard libraries included in R or you've downloaded it) then you can just type:
> library(MASS)
This brings the library up so you can access its functions. You can find out what is included in the package with:
> help(package=MASS)
> data(package=MASS)
I can make a dataset available, such as UScereal, a dataset about American cereals:
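Loading the dataset and taking a first look could go like this (assuming the MASS package is installed, which it is in a standard R installation):

```r
library(MASS)      # makes the package's functions and data visible
data(UScereal)     # loads the dataset into the workspace

dim(UScereal)      # number of rows and columns
names(UScereal)    # the column names
head(UScereal, 3)  # the first three rows
```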
To download a package/library with Windows - using the example of the multiple testing procedure package, multtest:
You can also do all of this from the command line with commands like download.packages and install.packages. If you do this, you can choose to download the package to somewhere other than the default location, among other options. Downloading a package by hand from the CRAN webpage and installing it from your download can be tricky in Windows, because the file may not be built for your system or may not be the right zip file. Use the commands provided by R if at all possible.
You can look at all of your data as said before by just typing the name at the prompt
But with a large data set, it will be too big to be displayed like this. Instead you want to look at a portion through indexing.
Datasets are thought of like matrices, so you can pick off pieces of the dataset by specifying the row or column entry of part of the data by typing data[row,columns]. So UScereal[3,7] would list the entry in the 3rd row and 7th column. UScereal[ ,4] gives the entire 4th column, and so on. The following are examples of pulling out parts of the data.
Some Examples:
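A few indexing calls of the kind described above, sketched on UScereal (the particular rows and columns chosen are arbitrary):

```r
library(MASS)

UScereal[3, 7]                               # 3rd row, 7th column
UScereal[1:5, ]                              # first five rows, all columns
UScereal[, 4]                                # the entire 4th column
UScereal[1:5, c("calories", "protein")]      # columns picked by name
UScereal[-(1:60), 2:3]                       # negative indices drop rows
UScereal$protein[1:5]                        # $ pulls out a column by name
```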
A single row/column is a vector, but you can force it to remain a matrix:
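One way to see this is with the drop argument, sketched here on a numeric matrix made from three columns of UScereal (the choice of columns is arbitrary):

```r
library(MASS)
m <- as.matrix(UScereal[, 2:4])

r1 <- m[1, ]                  # a plain vector: the dimensions are dropped
r2 <- m[1, , drop = FALSE]    # still a matrix, with 1 row and 3 columns

is.matrix(r1)
dim(r2)
```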
You can save a portion of your data, say to experiment on or to reduce the number of variables, by assigning it to a variable name
You can evaluate the truth of statements element-wise in R using traditional logical commands
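For example, on a made-up vector x:

```r
x <- c(5, 12, 7, 20)

x > 6                 # FALSE TRUE TRUE TRUE
x == 7                # equality uses a double equals sign
x != 7                # not equal
(x > 6) & (x < 15)    # elementwise "and"
(x < 6) | (x > 15)    # elementwise "or"
!(x > 6)              # negation
```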
You can use these T/F values to index a vector or matrix
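A short sketch with a made-up vector and matrix:

```r
x <- c(5, 12, 7, 20)
big <- x[x > 6]            # keeps only the elements where the test is TRUE

m <- matrix(1:6, nrow = 3) # rows are (1,4), (2,5), (3,6)
sub <- m[m[, 1] > 1, ]     # keeps the rows whose first column exceeds 1
```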
You can also find the indices directly using which
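Using the same kind of made-up vector:

```r
x <- c(5, 12, 7, 20)

which(x > 6)     # 2 3 4: the positions where the test is TRUE
which.max(x)     # 4: the position of the largest value
which.min(x)     # 1: the position of the smallest value
```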
You can also find out if the elements of a vector are contained in another vector
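The %in% operator does this; a minimal example with invented numbers:

```r
found <- c(1, 5, 9) %in% c(5, 6, 7, 8, 9)
found                            # FALSE TRUE TRUE

# The result is logical, so it can be used for indexing too
keep <- c(1, 5, 9)[found]        # 5 9
```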
Another related indexing is by factor variables, so that you can have the following
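The original example is not shown, so the following is only one reading of what indexing by a factor can look like: a factor used as an index is treated as its integer codes, and functions like tapply use a factor to group values.

```r
library(MASS)

# A factor used as an index acts through its integer codes
f <- factor(c("b", "a", "b"))     # levels "a" "b", codes 2 1 2
cols <- c("red", "blue")[f]       # "blue" "red" "blue"

# Grouping by a factor: mean protein for each manufacturer
prot_by_mfr <- tapply(UScereal$protein, UScereal$mfr, mean)
prot_by_mfr
```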
Many things that you will be doing in R will be calling an already created function. Simple examples are mean( ), scan( ), plot( ), sd( ), median( ), sort( ). ( ) means the information (usually data) that you need to feed to the function.
NOTE: sd( ) is the standard deviation, dividing by n-1
Example: finding the mean/average of the column "protein" in the dataset UScereal
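This could be done in either of the following equivalent ways (assuming MASS is installed):

```r
library(MASS)

m1 <- mean(UScereal$protein)        # column picked with $
m2 <- mean(UScereal[, "protein"])   # column picked by name in brackets
m1

sd(UScereal$protein)                # standard deviation of the same column
```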
Functions often have "smart" defaults for different kinds of objects:
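summary() is a handy illustration, since it gives a different kind of answer depending on what it is handed:

```r
library(MASS)

s_num <- summary(UScereal$protein)  # numeric: min, quartiles, median, mean, max
s_fac <- summary(UScereal$mfr)      # factor: a count for each level
s_num
s_fac

summary(UScereal)                   # data frame: one summary per column
```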
Functions may have many different options you can set when you call the function. You can find out about a function by typing help(FunctionName).
If you just want to remember what the possible options are you can use args:
You should save the outputs from your functions as a new variable so you can access them again
Base R has no direct import from Excel. Excel files need to be saved as text files (tab or comma delimited).
The function read.table reads in text where each row of the data is on a separate line and the columns of the data are separated by a fixed character. The default separator is any white space. Generally files will be tab delimited ("\t") or comma delimited (","), and you can specify this explicitly with the sep argument. You must give a file name or a URL:
> read.table(".../~epurdom/state.txt",sep="\t",header=T,row.names=1)
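The resulting object is a data.frame. In versions of R before 4.0.0, non-numeric columns are made into factor variables by default; from 4.0.0 on they are kept as character vectors unless you set stringsAsFactors=TRUE. If header=T, the first row is taken to contain the names of the columns, not data.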
You can write data to files using write.table
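A sketch of writing a small made-up data frame to a tab-delimited file and reading it back. The file goes to R's temporary directory here, and the file name example.txt is arbitrary:

```r
df <- data.frame(name = c("a", "b"), value = c(1.5, 2.5))

f <- file.path(tempdir(), "example.txt")
write.table(df, file = f, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back in with the matching separator
back <- read.table(f, sep = "\t", header = TRUE)
back
```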
Use t.test or prop.test for standard tests of the mean
... ~ mfr.small)
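The object mfr.small is not defined in the text that survives, so as a stand-in this sketch compares protein between two manufacturers (levels "G" and "K" of UScereal$mfr). The numbers passed to prop.test are made up:

```r
library(MASS)

# Two-sample t-test: keep only two manufacturers (an assumption)
small <- droplevels(UScereal[UScereal$mfr %in% c("G", "K"), ])
tt <- t.test(protein ~ mfr, data = small)
tt

# One-sample t-test against a hypothesised mean
t.test(UScereal$protein, mu = 3)

# Test of a proportion: 15 successes out of 50, null value 0.5
pt <- prop.test(x = 15, n = 50, p = 0.5)
pt$p.value
```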
You can calculate power with power.t.test and power.prop.test:
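In both functions you leave out exactly one quantity (such as power or n) and R solves for it. The effect sizes and sample sizes below are invented for illustration:

```r
# Power of a two-sample t-test with 20 per group and a 1 sd difference
pw <- power.t.test(n = 20, delta = 1, sd = 1, sig.level = 0.05)
pw$power

# Sample size per group needed for 90% power at the same difference
need <- power.t.test(delta = 1, sd = 1, power = 0.9)
need$n

# Power for comparing two proportions with 50 per group
pp <- power.prop.test(n = 50, p1 = 0.5, p2 = 0.75)
pp$power
```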
Use the command lm and the syntax "Y ~ X1+X2+...+Xn" to describe the model:
... ~ UScereal$calories)
... ~ UScereal$calories)
... ~ as.factor(UScereal$shelf))
If you do not make "shelf" a factor, R treats it as a continuous variable in a standard regression:
... ~ UScereal$shelf)
... ~ UScereal$shelf)
... ~ as.factor(UScereal$shelf))
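Since the left-hand sides of the calls above are lost, here is a sketch with responses I picked myself (protein and calories). It uses the data= argument rather than typing UScereal$ each time, which is only a convenience:

```r
library(MASS)

# Simple regression of one numeric variable on another
fit1 <- lm(protein ~ calories, data = UScereal)
summary(fit1)

# shelf as a number: a single slope
fit2 <- lm(calories ~ shelf, data = UScereal)

# shelf as a factor: a separate mean for each shelf
fit3 <- lm(calories ~ as.factor(shelf), data = UScereal)

coef(fit2)
coef(fit3)
anova(fit3)
```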
R has a large number of standard distributions. They all have the same format, for the command: a root describing the distribution (e.g. norm for the normal distribution) and a prefix indicating what feature you want from the distribution (e.g. r for random number generation):
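For the normal distribution the four prefixes look like this; other roots (unif, binom, t, and so on) follow the same pattern:

```r
set.seed(1)                      # makes the random draws repeatable

x <- rnorm(5, mean = 0, sd = 1)  # r: random numbers
d <- dnorm(0)                    # d: density at a point
p <- pnorm(1.96)                 # p: cumulative probability P(X <= 1.96)
q <- qnorm(0.975)                # q: quantile, the inverse of p

runif(3)                         # uniform random numbers on (0, 1)
rbinom(3, size = 10, prob = 0.5) # binomial random numbers
```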
To get random samples or permutations use sample
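A quick sketch on the integers 1 to 10:

```r
set.seed(42)

perm <- sample(1:10)                            # a random permutation
sub  <- sample(1:10, size = 3)                  # 3 values without replacement
boot <- sample(1:10, size = 10, replace = TRUE) # with replacement
```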
Plots
First, to reduce typing,
Histograms:
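A sketch using the calories column of UScereal; the choice of column, number of breaks, and labels are mine:

```r
library(MASS)

h <- hist(UScereal$calories,
          breaks = 10,                  # a suggestion; R may adjust it
          main   = "Calories in US cereals",
          xlab   = "Calories")

h$counts    # hist() also returns the bin counts it plotted
```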
Boxplots:
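For instance, one box per manufacturer, using the formula syntax (the variables are my own choice):

```r
library(MASS)

b <- boxplot(calories ~ mfr, data = UScereal,
             xlab = "Manufacturer", ylab = "Calories")

b$names     # the group labels, in plotting order
```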
Barplots:
The following functions add to an existing plot:
lines, points, curve, rect, segments, abline
Note that lines just connects points in the order given
You must have already set up a plotting command with a function such as plot or hist to use these commands. You can set up the coordinates/axes/range without actually plotting anything using the option type="n"
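A sketch of setting up an empty plot and then adding to it, with a made-up sine curve:

```r
x <- seq(0, 2 * pi, length.out = 50)
y <- sin(x)

# Set up the axes and ranges only; nothing is drawn inside the plot
plot(x, y, type = "n", xlab = "x", ylab = "sin(x)")

# Now add pieces one at a time
lines(x, y)                                # connects the points in order
every5 <- seq(1, 50, by = 5)
points(x[every5], y[every5], pch = 19)     # every 5th point as a solid dot
abline(h = 0, lty = 2)                     # dashed horizontal line at 0
```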
I don't actually need to do this, because I could instead have:
When you type in a graphing command, a plotting window comes up automatically. Sometimes you would like to have multiple plotting windows for different graphs.
The command win.graph() (Windows) or x11() (Unix/Windows/Mac?) brings up another graphing window. To pick one, use the numbers at the top of the window as the argument for dev.set().
(see help(par) for more details) - you will find sometimes these are quite easy to implement, but other times some of the settings don't want to work with the plotting function you are using. It takes a good bit of experimenting.
Some settings you call independently, through the function par(), and they affect all graphs:
mfrow=c(x,y): creates a grid of plots with x rows and y columns.
Most are options that you put in the plot command just for a particular plot:
xlab, ylab: x-axis or y-axis labels
lty: Line type (0=blank, 1=solid, 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or one of the character strings "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash", where "blank" uses invisible lines (i.e., doesn't draw them).
pch: Style of points in graph
bg, col.lab, col.main, col.sub: The color for the background, labels, main title, and subtitle respectively. Usually use values like "red" or "tan" to pick a color. Type colors() to see all options.
las: Style of axis labels. (0=parallel, 1=all horizontal, 2=all perpendicular to axis, 3=all vertical)
font: 1=plain, 2=bold, 3=italic, 4=bold italic
You can also set some of these things after you have already made your main plot.
If 'asp' is a finite positive value then the window is set up so that one data unit in the x direction is equal in length to 'asp' * one data unit in the y direction. The special case 'asp == 1' produces plots where distances between points are represented accurately on screen.
Try help on par, plot, plot.default, plot.window, points, lines to really find everything!
Examples:
We'll use microarray data with the multtest package:
More elaborate:
or against another value:
When you are looking at a graph, you can save your existing plot by going to "File-Save As" (you can also use the command savePlot to save the existing plot). In this same way you can also copy the plot to a metafile/bitmap and paste the graph into another program, like Word. For larger projects, though, it's generally better to save plots using the written commands below to control the final format and have a record of the name of plots.
Also check out the "Recording" option under the "History" menu (or recordPlot(), replayPlot()). If you turn this option on, R will remember the plots made on that screen and you can use the "Page Up" and "Page Down" commands to scroll between your plots.
While R saves the variables you name, in order to save your plot to print later, you need to save it separately. The easiest is to save the plot into .PDF format (i.e. Adobe Acrobat format). The following saves the x-y plot into a file "protein.pdf" in the directory you started R in.
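The original call is not shown, so this is a sketch that assumes the x-y plot is protein against calories from UScereal. It writes to R's temporary directory; use file="protein.pdf" instead to put the file in the directory you started R in:

```r
library(MASS)

out <- file.path(tempdir(), "protein.pdf")

pdf(file = out, width = 6, height = 6)     # open the PDF device (size in inches)
plot(UScereal$calories, UScereal$protein,
     xlab = "Calories", ylab = "Protein")  # the plot goes to the file, not the screen
dev.off()                                  # close the device to finish the file

file.exists(out)
```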
NOTE: if you don't call dev.off(), any further plots you make will also be written to that file instead of the screen, and the file you are trying to save is not finished until the device is closed.
Similarly to save in postscript format (in portrait this time, so I say horizontal=F). This would be the preferred format for journals or for further editing in Adobe Illustrator as it seems to save the most information.
Writing Functions
You define a function in R using the command function. The following function returns the mean, standard deviation, and upper and lower 95% confidence interval limits in the form of a list.
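The body of the original function is not shown, so the following is only a sketch. The name summ is made up, the component names are taken from the printed output in this section, and the interval is computed as the mean plus or minus two standard errors, which is an assumption on my part:

```r
summ <- function(x) {
  m <- mean(x)
  s <- sd(x)
  n <- length(x)
  half <- 2 * s / sqrt(n)        # approximate 95% half-width (assumed formula)
  list(mean    = m,
       sd      = s,
       uppconf = m + half,
       lowconf = m - half)
}

# Try it on a few made-up numbers
res <- summ(c(2, 4, 4, 5, 7))
res$mean
res$uppconf
```

Because the result is a list, each piece can be picked out afterwards with the $ notation.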
$sd
[1] 180.2886

$uppconf
[1] 203.8438

$lowconf
[1] 114.3957
Basic programming functions are if, else, while, break, next, and for.
stop and warning are functions that allow the user to check that certain conditions are satisfied. You can comment your code using the # symbol.
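A toy function that uses most of these pieces at once. The function name and what it computes are invented purely for illustration:

```r
# Sum the square roots of the values in x, skipping negatives
# and stopping early at the first value above 100
sqrt_sum <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")        # halts with an error
  if (any(x < 0)) warning("negative values are skipped")
  total <- 0
  for (v in x) {
    if (v < 0) next       # jump to the next value
    if (v > 100) break    # leave the loop entirely
    total <- total + sqrt(v)
  }
  total
}

sqrt_sum(c(4, 9, 16))     # 2 + 3 + 4 = 9

# A while loop runs until its condition is FALSE
i <- 0
while (i < 3) {
  i <- i + 1
}
```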
Note that for loops are generally slow in R, and using apply or sapply is preferable if the function is not actually recursive. For example, the following code that finds the upper confidence interval for each
could be written as
If the function is already defined, then apply is even easier:
finds the row means.
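The comparison can be sketched on a small random matrix. I am assuming here that the interval is wanted for each row and that the upper limit is the mean plus two standard errors; neither detail is given in the text:

```r
set.seed(1)
m <- matrix(rnorm(20), nrow = 4)       # 4 rows, 5 columns of made-up data

# With a for loop
upp_loop <- numeric(nrow(m))
for (i in 1:nrow(m)) {
  upp_loop[i] <- mean(m[i, ]) + 2 * sd(m[i, ]) / sqrt(ncol(m))
}

# The same thing with apply (1 = work along rows)
upp_apply <- apply(m, 1, function(r) mean(r) + 2 * sd(r) / sqrt(length(r)))

# If the function already exists, just hand it over
row_means <- apply(m, 1, mean)
```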
Tips:
R does not have great debugging mechanisms and the error messages are ... cryptic. Here are a couple of things that can be helpful
Save your function by itself in a text/.R file. Then when you want to load it into R, use the source command. This reads the file and executes the file. For a file with just a function, this will load your function or changes, and most importantly, will tell you the line number of a syntax error.
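As a sketch, here a one-line function is written to a file from within R so the example is self-contained; normally you would create the file in your text editor. The file name myfun.R and the function center are hypothetical:

```r
f <- file.path(tempdir(), "myfun.R")

# Stand-in for saving the file from a text editor
writeLines(c("# subtract the mean from a vector",
             "center <- function(x) x - mean(x)"),
           f)

source(f)               # reads and executes the file, defining center()
center(c(1, 2, 3))      # -1 0 1
```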
If you are calling functions within functions, as we did in calling mean and sd, traceback() tells you what function had the error
There are several functions that are supposed to help debug your function. I find the most useful is debug. This allows you to step through the function and figure out what the problem is. My function is supposed to subtract both the mean of each column and the mean of each row (there are functions that center matrices, by the way: sweep or scale)
There's no error, just not what I wanted: the row centering worked, but the column centering didn't. I can use debug to go into the function and try it as it is working. Namely, R pauses before each command and waits to execute it until you ask it to. To get R to execute the next line, hit "return" or type "n". Otherwise, you can just type in what you want within the operation of the function using the objects within the function. This is very helpful with large functions. Try to follow along with the code below to get the idea.
Should give something like this
A couple of hints for a good plot function:
My solution: