Introduction to R

R is an interactive, interpreted language. By interpreted, we mean that we can give an instruction and have it evaluated immediately, and then give another command. In non-interpreted languages, by contrast, we must write an entire program made up of a sequence of commands that are specified before we run the program. Once it is running, we cannot change the commands; all we can do is wait for it to complete, or terminate it and re-run it with altered commands or different inputs. This interactivity is very important for us in statistics.

We need to be able to visualize data, look at numerical summaries and the output from fitting a model, or subset the data based on previous observations, and then decide what to do next. This is Exploratory Data Analysis (EDA). It is a highly iterative process in which we attempt to let the data direct us as to what to do next. We try different things as we go along different branches or paths of action, sometimes leading to useful insights that we want to report and, at other times, verifying that certain assumptions are justified, or trying different methods to understand the data better.

The ability to dynamically specify what we want to do next is important. R also allows us to combine the commands into a script or "program" that we can then re-run on new or different data to recreate our analyses. This is often termed BATCH programming since we are running several commands in a single run. This gives us the best of both worlds: interactive facilities during exploration, and programming facilities when the exploration is more "complete".

Systems or environments like SAS provide either a BATCH or an interactive interface. The interactive view is a point-and-click interface, as in SAS' JMP product. While this supports exploratory, interactive data analysis, it does not allow us to readily manage the intermediate results from each step, feed them into the next steps, and generally branch in different directions. And the BATCH system only allows us to do interactive work in very coarse-grained increments: run this sequence of commands on this data and produce this output; then look at the output and write some more code. While superficially this is the same sequence of steps we might follow in EDA with R, the interface is much less convenient.

The point-and-click, drag-and-drop interfaces like the one provided by Excel are very useful for specific tasks such as editing values in cells, quickly creating plots, etc. Managing results and output from different tasks (e.g. regression, ANOVA, summary statistics) across sheets can be hard work. The visual interface that makes Excel, and GUIs in general, so convenient is a hindrance here. We must put these values somewhere in the spreadsheet. We might prefer instead to give them a name and be able to refer to them later; in other words, to assign them to a variable. While we can put different results in their own worksheets, this soon gets cluttered and we spend time navigating the tabs in the workbook.

Another complexity in this world of point-and-click for EDA is that specifying precisely what we mean can be difficult. In some cases, we might want to customize a particular methodology when applying it to data, or we might want to draw a plot slightly differently, or use a dataset that we derive in a complex way from the original source. For these common but non-standard situations, a sequence of dialogs provided by a "wizard" can be frustrating. It is tediously long, especially if we are doing it several times with different subsets of the data or different datasets that we wish to compare. And in addition to the unnecessary repetitiveness, we also cannot specify everything we may want to. The dialogs provide access only to the common options. In the interest of keeping them simple, the designers have identified what they believe are the important elements one can specify and change in the task. It is not possible for us to create our own modifications of these tasks, or at least doing so is a major project.

The purpose of the rest of this document is to give you an understanding of how R works.

Starting R

We first start R by invoking the command R. An R session will begin and provide you with the following information about the particular version of R being run, licensing information, and how to get help.


R : Copyright 2003, The R Foundation for Statistical Computing
Version 1.9.0 Under development (unstable) (2003-12-28), ISBN 3-900051-00-3

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for a HTML browser interface to help.
Type 'q()' to quit R.

[Previously saved workspace restored]



Also note that the opening message tells us how to exit an R session: invoke the command q().

> q()
Save workspace image? [y/n/c]:
Answering y will save this session. The saved session will be reloaded the next time you invoke R from the current directory. Note that the line that says


[Previously saved workspace restored]

means that R is loading up the data that we had saved from the last session. More on saving sessions later.

R has lots of functions (over 1500 immediately available to you and thousands more in add-on packages). It is impossible to remember all of these and their details (e.g. what arguments they take, what they do in all situations and what they return), so to make effective use of R, you need to get into the habit of using the help system. (Just type help("topic") to get help on the specified topic.) What we want to do in this document is to get you to think about what R is doing and why it does it, and to understand the basic building blocks that you have available to you. They are similar to those of other languages such as Matlab and packages like Stata, SPSS and Excel. R is more of a programming language than these environments and is much more statistically focussed than Matlab. It is a good tool to know for doing any kind of data analysis. To use it effectively, you will need to understand these basics and to sit down and gain experience with the engine and the numerous functions.
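
For example, any of the following will bring up documentation (substitute whatever topic you are interested in):
> help("seq")              # documentation for the seq function
> ?seq                     # shorthand for the same thing
> help.search("sequence")  # search the help system for a keyword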

Using R as a Calculator

We can use R as a heavyweight calculator:
> 1+2
[1] 3

The operators are + for addition, - for subtraction, * for multiplication, / for division and ^ for exponentiation. The order of operations is what you expect, with parentheses overriding the defaults.
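
For example:
> 1 + 2 * 3
[1] 7
> (1 + 2) * 3
[1] 9
> 2^3
[1] 8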

We can use built-in values, such as pi.

> 1+pi
[1] 4.141593

What is the "thing" pi in this computation? It is a variable. By this we mean it is a name by which we refer to a value. We can associate new values with this name by assigning a value to it. For example, we can give pi the value 1 and then use that.
> pi = 1
> 1+pi
[1] 2

So pi is not a constant in this world. Mathematically, it is. But in the programming world, it is merely a variable to which we can bind or assign new values. Of course, this is not necessarily a good idea: if we later use pi expecting its usual mathematical value, we will get strange results!

There are several ways to assign a value to a variable in S. They differ only in syntax.
> x = 1+2
> x <- 1+2
> 1+2 -> x
> x
[1] 3

These three forms (=, <-, and ->) can all be used; however, the last is very rarely seen. It arose when one was typing a long command and realized only at the end that the result should be assigned to a variable. At one time, we couldn't go back to the beginning of the line without deleting all the intervening text and so losing the command. Nowadays, we can jump to the beginning of the line, add the assignment and continue on.

Saving Variables from an R Session

Now that we can create variables, we can do some useful things. And we may want to ensure that we don't lose anything we do. So it is a good time to think about how we might save our data. Each time we run R, we create a new R session. Then we can do some work, create new variables and potentially want to save some or all of them. We can save our entire workspace, i.e. all the variables we have created, by calling the function save.image at any time. This puts all the objects in our session workspace into a file named .RData. If we start R again in that directory, the contents of that .RData file are loaded into the new session and are immediately available to us again. If we start in a different directory, we can still load the values into the R session, but we must do this ourselves using the function load and giving it the fully qualified name of the file to load (i.e. full directory path and file name).
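
For example (the directory path below is only a placeholder for wherever a saved workspace lives):
> save.image()                        # write every object in the workspace to .RData in the current directory
> load("/home/me/otherdir/.RData")    # hypothetical path: load a workspace that was saved elsewhere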

When we end the R session using the q() function, we will normally be asked whether we want to save the session or not. This calls save.image implicitly.

If we don't want to store all the variables, but only specific ones, we can explicitly save one or more objects to a file (or more generally a connection). This is convenient when we create a big dataset and then want to ensure that it gets saved before we do anything else. Or, if we want to make an object available to another R session, e.g. to somebody we are working with, without terminating ours, we can simply write the object to disk and then send it to that person in an entirely portable format.
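
A minimal sketch using the save and load functions (the file name here is ours, chosen purely for illustration):
> save(x, y, file = "myresults.rda")   # write just the objects x and y to a portable file
> load("myresults.rda")                # restore them later, here or in a colleague's session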

Note that R uses "copying" semantics. When I assign the value of x to y, y gets the value of x. It is not "linked" to x so that when x is changed, y would see that change. Instead, we copy the value of x in the assignment and the two variables are unrelated after that.
> x
[1] 3
> y = x
> x = 10
> x
[1] 10
> y
[1] 3

We have seen how we can store the results of computations or simple values in variables. We can think of these as being stored in our workspace. This is like our desk with pieces of paper storing different information. We would put different pieces of paper in different places so that we can easily find them again when we need them. The place we put them allows us to quickly find them and is analogous to the variable name which allows us to easily refer to the values.

In the same way that we might overload our desk with pieces of paper as we move from task to task, or just have too much information, we need to manage the variables we have in our work area or desktop. S provides functions which we can use to dynamically manage the variables and the contents of our workspace. The function objects gives us the names of the variables we have in our workspace.
> objects()
[1] "x"  "y"  "pi"

We can remove values using remove, passing it the name of the variable we want to remove.
> remove("x")  

and we can verify that the variable has been removed using objects again.
> objects()
[1] "y"  "pi"

We can give more than one name. So we can remove both y and pi, the last two remaining variables in our session's workspace.
> remove("y", "pi")

Before we leave this topic, we should ask what happened to the original version of pi. We assigned a new value to it - 1 - and used that in our computations. Now that we have removed it, is pi defined at all? Is the old value put back? The answer is that the old value is now in effect again, but it wasn't "put back". R did not remember the old value and restore it when we removed our version of pi. The explanation is a little more complicated, and a lot richer. It relates to where we were finding the variable named pi.

When we issued the command
> pi = 1

we were telling R to associate the value 1 with the variable name pi. This puts it in our workspace. But before we did this, we managed to find pi also, and then it had the usual value of 3.141593. So where did it come from? It wasn't in our workspace, yet it was still available.

The answer involves understanding how R finds variables when we refer to them. R actually keeps a collection of places in which to search for variables. This is called the search path. This is an ordered collection of workspaces containing variables and their associated values. At any point during an R session, we can ask R what this collection of workspaces is. We do this using the function search. In my session, I get
> search()
[1] ".GlobalEnv"       "package:Rbits"    "package:methods"  "package:stats"   
[5] "package:graphics" "package:utils"    "Autoloads"        "package:base"    

The first entry is our own personal workspace. When we quit, this disappears. The other entries are packages or libraries of functions and data that are available to us.

Now, when we implicitly cause R to look for a variable, it walks along this collection and asks each entry whether it has the relevant variable. After we defined our own version of pi, when we used pi in a computation such as 1 + pi, R started its search for pi. It started in the first element of the search path and found it there. That is our workspace, where we put pi.

When the session started and we did not yet have our own version of pi, the search for pi was rather different. R looked through each element of the search path and found pi only in the last entry, "package:base". This contains the built-in variables provided by the R system itself (rather than add-ons).

How could we know where R would find a variable? We can use the function find. So in the following, we define pi, and ask R where we can find it.
> pi = 1
> find("pi")
[1] ".GlobalEnv"   "package:base"

Now, we remove our version of pi and then R can only find the one in "package:base".
> remove("pi")
> find("pi")
[1] "package:base"

All the functions we have seen so far, and functions in general, are simply values assigned to variables. R finds them in the same way when we refer to them in a computation: it looks through the search path until it finds the variable. It is slightly smarter for functions: if it knows we are calling the value of the variable as a function, it will only look for a function, skipping over other types of values.
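
We can see this by (unwisely) creating our own variable named c and then calling the function c anyway; R skips over our string-valued c when it needs a function:
> c = "not a function"
> c(1, 2)
[1] 1 2
> find("c")
[1] ".GlobalEnv"   "package:base"
> remove("c")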

What if we look for a variable that doesn't exist? For example, suppose we use a variable named duncan in a computation
> duncan^2

What happens? R looks through each element of the search path and eventually gives up, giving the error message:
Error: Object "duncan" not found
We can determine whether a variable is defined using find or, in some cases more conveniently, using the function exists. For example,
> exists('duncan')
[1] FALSE

(Note that I can use single or double quotes for a string, i.e. "duncan" or 'duncan'.)

The Basic Data Types

In S, everything is an object. We have seen this already: we have variables that refer to values, and functions, which do things, are accessed as regular variables. So we have a commonality between data and functions. This is different from many languages such as C/C++, Java, etc. For interpreted languages, it is quite common, and it is very powerful.

The basic or primitive types of objects are vectors. These are simply collections of values grouped together into a single container. The basic types are integer, numeric, logical and character vectors. And a very important characteristic of these vector types is that they can only store values of the same type. In other words, a vector has homogeneous data types. We cannot use a vector to store both an integer and a string in their basic forms. (We'll see that we can put them into a vector and the integer will become a string. And we can use what is called a "list" to store them both in their original form.)
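
For example, if we combine a number and a string into a single vector, the number is converted to a string:
> c(1, "abc")
[1] "1"   "abc"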

As we just said, there are 4 basic types of vectors: integer, character, numeric and logical. Integer vectors store integer values, numeric vectors store real numbers, logical vectors store values that are either TRUE or FALSE and character vectors store strings. In C and Java, we can work on characters individually. However, in S there is no way to store a single character except as a simple string with only one character. This is very rarely a problem.

Essentially, vectors are like arrays in C and Java. In those languages, there is a large difference between a scalar, or basic built-in value, and arrays of such values. In S, there are no scalars. By this, we mean that there are no individual number objects, logical values, or strings. Instead, such individual values are actually vectors of length 1, i.e. special cases of general vectors with multiple elements. And this makes lots of computations convenient.
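
We can see this by asking for the length of an individual value:
> length(3)
[1] 1
> length("hello")
[1] 1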

Creating Vectors

An important function for creating vectors is the c function. The 'c' stands for concatenate, and all it does is take one or more values and put them into a new vector. For example,
> c(1.2, 4.5, 3.2)
[1] 1.2 4.5 3.2
> c(TRUE, FALSE, FALSE, TRUE)
[1]  TRUE FALSE FALSE  TRUE
> c("Abc", 'def', "ghikllm", "z")
[1] "Abc"     "def"     "ghikllm" "z"      

What about the integer vector? Well, in R, all numbers that we type are made into real numbers. So when we type
> c(1, 2, 3)

we get a numeric vector as the individual values are actually numeric. (This is different in S-Plus, version 5 and higher.)

There are many cases in which we want integers and they arise naturally. One of them, as we shall see in the subsetting section, is a sequence of integers. The built-in syntax for creating the integer sequence a, a+1, a+2, ..., b is a:b. For example,
> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> 4 : 5
[1] 4 5
> 10:3
[1] 10  9  8  7  6  5  4  3
> 3:3
[1] 3

This is a very specific version of the more general seq function. This allows us to create sequences with different strides (differences between successive elements), of specific length, and so on. See the help pages.
> seq
function (...) 
UseMethod("seq")
<environment: namespace:base>
> seq(1, length = 10, by = 2)
 [1]  1  3  5  7  9 11 13 15 17 19

An important characteristic of any vector is its length. We can always find out how many elements a vector contains using the function length.
> x = 1:10
> length(x)
[1] 10
> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> length(letters)
[1] 26
Note that the return value from calling length is itself a vector. It is an integer vector of length 1. The system uses its own built-in types to provide functionality.

Often we want to combine two vectors. We can also do this using c.
> x = c(1, 2, 3)
> y = c(4, 5, 6)
> c(x, y)
[1] 1 2 3 4 5 6

We can also use the function append. Look at the help for c and append and try to discover the difference.
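
As a hint, append lets us say where the new values should be inserted:
> append(c(1, 2, 5), c(3, 4), after = 2)
[1] 1 2 3 4 5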

In many situations, it is convenient to associate names with elements in a vector. For example, suppose we have IP addresses of machines stored as strings. We might also want to associate the human-readable name along with it. For example,
          wald          anson         fisher 
"169.237.46.2" "169.237.46.9" "169.237.46.3" 
Here, we have associated the names wald, anson and fisher with the elements of the character vector.

For any vector, we can ask for the names of the elements. Suppose the vector of IP addresses above was assigned to the variable ip, then we could get the character vector of names using the function names.
> names(ip)
[1] "wald"   "anson"  "fisher"

If the vector has no names, we get back NULL. This is a special symbol in R, and has length 0. We can check if a value is NULL using is.null:
> is.null(names(ip))

There are several ways to specify the names for a vector (of any type, i.e. integer, numeric, logical or character). If we are explicitly creating the vector (using c), then we can put the names directly in the call, giving each element in the form name = value:
> c(wald="169.237.46.2", anson = "169.237.46.9", fisher = "169.237.46.3")

If we already have a vector, then we can assign names to the elements using the names function (or technically the names<- function).
> x = 1:26
> names(x) <- letters
> x
 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z 
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
> y = c(0, 0, 256)
> names(y) = c("R", "G", "B")
> y
  R   G   B 
  0   0 256 

Useful general facilities for operating on vectors are rep, rev and sort. rep allows us to replicate a vector in convenient ways.
> rep(1, 2)
[1] 1 1
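
A few more illustrations (rev reverses a vector and sort orders one):
> rep(c(1, 2), 3)
[1] 1 2 1 2 1 2
> rep(c(1, 2), times = c(2, 3))
[1] 1 1 2 2 2
> rev(1:5)
[1] 5 4 3 2 1
> sort(c(3, 1, 2))
[1] 1 2 3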

For character vectors, paste is convenient for combining strings together. strsplit can be used for splitting strings by user-specified delimiters. substring can be used to extract a sub-part of a string. And we can match and substitute text using regular expressions with the functions grep and gsub.
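
For example:
> paste("stat", "141", sep = "-")
[1] "stat-141"
> substring("169.237.46.2", 1, 3)
[1] "169"
> gsub("o", "0", "foo bar")
[1] "f00 bar"
> grep("an", c("wald", "anson", "fisher"))
[1] 2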

Vectorized Operations

In lower-level languages like C/C++ and Java, we operate on entire arrays by iterating over each element. We have code something like:
 for(i = 0; i < n; i++) {
   f(x[i]);
 }

where f is some function to do something on the individual element of the array.

In S, since vectors are the basic types, and because in statistics we typically want to work on groups of observations or experimental units, the philosophy is that operations work on an entire vector. This means users don't have to write loops for many operations. A simple example is the + function. We can add two vectors together elementwise using the + operation:
> c(1, 2, 3) + c(4, 5, 6)
[1] 5 7 9

The first elements of each vector are added together to get 5. Similarly, we get 7 and 9 by adding the second elements and the third elements.

This is very powerful and convenient. It allows us to express computations at a high level, indicating what we mean rather than hiding it in a loop. Many functions in S are vectorized, meaning that if you give them a vector of length n, they will operate on all n elements rather than just the first one. strsplit is an example. If we give it the vector of IP addresses and ask it to break the strings into sub-parts separated by ".", then we get
> strsplit(ip, "\\.")
$wald
[1] "169" "237" "46"  "2"  

$anson
[1] "169" "237" "46"  "9"  

$fisher
[1] "169" "237" "46"  "3"  

Here, we get back a collection of character vectors. The collection has the same names as the original input vector (wald, anson, fisher) and each element is a string with the particular part of the IP address. The actual data type of the result is a list which we shall see shortly.

When you write your own functions, you should try to make them vectorized so that they take in a vector and give back a value for each element. Of course, if these are aggregator functions (e.g. sum, prod, lm), then they should work on all of the elements and combine them into a single result.
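
As a small illustration (the function name is our own, invented for this example), the following converts a whole vector of temperatures from Celsius to Fahrenheit in one call, because the arithmetic operators are themselves vectorized:
> celsiusToFahrenheit = function(temp) temp * 9/5 + 32   # our own example function
> celsiusToFahrenheit(c(0, 100, 37))
[1]  32.0 212.0  98.6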

The Recycling Rule

What if we add two vectors with different lengths? For example, what happens with c(1, 2) + 2? We would like S to be smart enough to add 2 to each element of c(1, 2). And that is what happens:
> c(1, 2) + 2
[1] 3 4

What about c(1, 10) + c(100, 200, 300, 400), where the second vector has twice as many elements as the first?
> c(1, 10) + c(100, 200, 300, 400)
[1] 101 210 301 410

R does the right thing, depending on what you think the right thing is! But what did it do? It appears to have created the vector c(1 + 100, 10 + 200, 1 + 300, 10 + 400), and indeed that is what it did. This is a general concept in S: it recycles or replicates the smaller vector to have the same length as the larger one. So, in this case, we recycle c(1, 10) to have length 4. We do this as the function rep would, basically by concatenating several copies of the original vector to get the right length. So we get c(1, 10, 1, 10), which has length 4, the same as the larger vector, and then we can do the basic arithmetic as before.

We can now understand how c(1, 2) + 2 works: the value 2 is a vector of length 1, and it is recycled to c(2, 2) to match the length of c(1, 2) before the addition.
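
That is, the two expressions below are equivalent:
> c(1, 2) + 2
[1] 3 4
> c(1, 2) + c(2, 2)
[1] 3 4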

What about the following expression, c(1, 2) + c(10, 11, 12), i.e. using vectors of length 2 and length 3?
> c(1, 2) + c(10, 11, 12)
[1] 11 13 13
Warning message: 
longer object length
	is not a multiple of shorter object length in: c(1, 2) + c(10, 11, 12) 

The first thing to note is that R generates a warning telling you that you may want to check whether the result is as you expected. The problem is that recycling the smaller vector did not naturally yield a vector of the same length as the larger one. That is why R gave a warning. But it went ahead and did the addition using c(1, 2, 1) + c(10, 11, 12): it recycled the smaller vector to have the same length as the larger one and threw away any left-over elements.