Introduction to R

1 Data in R

While R can handle many types of data, the three main varieties that we'll be using are numeric, character and logical. In R, you can identify what type of object you're dealing with by using the mode function. For example:

> name = 'phil'
> number = 495
> happy = TRUE
> mode(name)
[1] "character"
> mode(number)
[1] "numeric"
> mode(happy)
[1] "logical"

Note that when we enter character data, it needs to be surrounded by quotes (either double or single), but the symbols TRUE and FALSE (without quotes) are recognized as values of a logical variable.

Another important characteristic of an object in R is its class, because many functions know how to treat objects of different classes in a special way. You can find the class of an object with the class function.
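As a small illustration of the difference between mode and class, here is a sketch that reuses the three variables from the example above. The variable counts is introduced only for this illustration.

```r
# The same three objects as in the earlier example
name = 'phil'
number = 495
happy = TRUE

class(name)      # "character"
class(number)    # "numeric"
class(happy)     # "logical"

# mode and class can differ: whole numbers made with the colon
# operator have class "integer", while their mode is still "numeric"
counts = 1:3     # hypothetical variable, for illustration only
mode(counts)     # "numeric"
class(counts)    # "integer"
```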

2 Vectors

Occasionally it may be useful for a variable (like name or happy in the above example) to hold only a single value (like 'phil' or TRUE), sometimes referred to as a scalar, but usually we'll want to store more than a single value in a variable. A vector is a collection of objects, all of the same mode, that can be stored in a single variable and accessed through subscripts. For example, consider the minimum temperature in Berkeley for the first 10 days of January, 2006:

50.7 52.8 48.6 53.0 49.9 47.9 54.1 47.6 43.6 45.5

We could create a variable called mintemp as follows:

> mintemp = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5)

The c function is short for catenate or combine, and it's used to put individual values together into vectors. You can find the number of elements in a vector using the length function.
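For instance, the following sketch checks the length of mintemp and shows that c will also accept an existing vector as one of its arguments. The variable longer and the value 46.2 are made up for illustration.

```r
mintemp = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5)
length(mintemp)              # 10

# c accepts vectors as well as single values, so it can be used
# to extend an existing vector (46.2 is a made-up extra reading)
longer = c(mintemp, 46.2)
length(longer)               # 11
```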

Once you've created a vector, you can refer to the elements of the vector using subscripts. Numerical subscripts in R start at 1, and continue up to the length of the vector. Subscripts of 0 are silently ignored. To refer to multiple elements in a vector, simply use the c function to create a vector of the indexes you're interested in. So to extract the first, third, and fifth values of the mintemp vector, we could use:

> mintemp[c(1,3,5)]
[1] 50.7 48.6 49.9
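The behavior of zero subscripts mentioned above, and of subscripts beyond the end of the vector, can be checked with a short sketch like this one:

```r
mintemp = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5)

mintemp[0]          # an empty numeric vector: the 0 selects nothing
mintemp[c(0, 2)]    # 52.8: the 0 is dropped, the 2 is used
mintemp[11]         # NA: there is no eleventh element
```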

If all of the subscripts are negative, R will leave out the elements at those positions and return just the remaining elements. To extract all of the elements of mintemp except the first and last (tenth), use:

> mintemp[-c(1,10)]
[1] 52.8 48.6 53.0 49.9 47.9 54.1 47.6 43.6
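Two details about negative subscripts are easy to trip over, so here is a hedged sketch of both. The variable names rest and result are hypothetical, and tryCatch is used only so that the error does not stop the script.

```r
mintemp = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5)

# Drop the first three elements. The parentheses matter, because
# -1:3 would mean the sequence -1, 0, 1, 2, 3
rest = mintemp[-(1:3)]
length(rest)        # 7

# Positive and negative subscripts can't be mixed; R signals an error,
# which is caught here and replaced by the string 'error'
result = tryCatch(mintemp[c(-1, 2)], error = function(e) 'error')
result              # "error"
```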

In most programming languages, once you're dealing with vectors, you also have to start worrying about loops and other programming problems. Not so in R! For example, suppose we want to use the conversion formula

C = 5/9 (F - 32)

to convert our Fahrenheit temperatures into Celsius. We can act as if mintemp is just a single number, and R will do the hard part:

> mintempC = 5/9 * (mintemp - 32)
> mintempC
 [1] 10.388889 11.555556  9.222222 11.666667  9.944444  8.833333 12.277778
 [8]  8.666667  6.444444  7.500000

In fact, most similar operations in R are vectorized; that is, they operate on entire vectors at once, without the need for loops or other programming.
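To make the contrast concrete, this sketch writes out the explicit loop that the vectorized conversion replaces, and then applies a couple of other vectorized operations. The names looped, rounded and warm are hypothetical.

```r
mintemp = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5)
mintempC = 5/9 * (mintemp - 32)

# The loop we would need if arithmetic were not vectorized
looped = numeric(length(mintemp))
for (i in 1:length(mintemp)) {
    looped[i] = 5/9 * (mintemp[i] - 32)
}
all.equal(looped, mintempC)     # TRUE

# Functions and comparisons are vectorized in the same way
rounded = round(mintempC, 1)    # every element rounded to one decimal
warm = mintemp > 50             # a logical vector, one value per day
```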

There are some shortcuts to generate vectors. The colon operator lets you generate sequences of integers from one value to another. For example,

> x = 1:10
> x
 [1]  1  2  3  4  5  6  7  8  9 10

For more control, see the help page for the seq function.
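As a quick taste of what seq offers, here are a few calls using its by and length.out arguments. The variable names are chosen just for this sketch.

```r
by3 = seq(1, 10, by = 3)              # 1 4 7 10
quarters = seq(0, 1, by = 0.25)       # 0.00 0.25 0.50 0.75 1.00
down = seq(10, 1, by = -3)            # 10 7 4 1
five = seq(0, 100, length.out = 5)    # 0 25 50 75 100

# The colon operator can also count down
countdown = 10:1
```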

You can repeat values using the rep function. This function is very flexible; if called with scalars, it does the obvious:

> rep(5,3)
 [1] 5 5 5

with a vector and a scalar, it creates a new vector by repeating the old one:

> y = 3:7
> rep(y,3)
 [1] 3 4 5 6 7 3 4 5 6 7 3 4 5 6 7

Finally, if you call rep with two vectors of equal length, it repeats each element of the first vector as many times as the corresponding element of the second vector specifies:

> rep(1:4,c(2,3,3,4))
 [1] 1 1 2 2 2 3 3 3 4 4 4 4
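The same behaviors can be requested explicitly through rep's named arguments times, each and length.out, which may be easier to read than relying on argument position. A brief sketch, with hypothetical variable names:

```r
y = 3:7

twice = rep(y, times = 2)        # 3 4 5 6 7 3 4 5 6 7
doubled = rep(y, each = 2)       # 3 3 4 4 5 5 6 6 7 7
seven = rep(y, length.out = 7)   # 3 4 5 6 7 3 4
```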

One surprising thing about vectors in R is that R will often carry out an operation on two vectors that aren't the same size by simply recycling the values in the shorter vector. For example, suppose we try to add a vector with four numbers to one with just two numbers:

> c(1,2,3,4) + c(1,2)
[1] 2 4 4 6

Notice that, for the last two elements, R simply recycled the 1 and 2 from the second vector to perform the addition. R is silent when this happens, but if the length of the longer vector isn't a whole multiple of the length of the shorter one, R will print a warning:

> c(1,2,3,4) + c(1,2,3)
[1] 2 4 6 5
Warning message:
longer object length is not a multiple of shorter object 
length in: c(1, 2, 3, 4) + c(1, 2, 3)
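Recycling is also what makes arithmetic between a vector and a single number work, since the single number is just a vector of length 1. The sketch below uses made-up values in a hypothetical vector called prices, and catches the warning with tryCatch so that it can be inspected instead of printed.

```r
prices = c(10, 20, 30, 40)        # made-up values

doubled = prices * 2              # the 2 is recycled four times
mixed = prices * c(1, 0.5)        # 10 10 30 20

# Capture the recycling warning as a string rather than printing it
msg = tryCatch(c(1,2,3,4) + c(1,2,3),
               warning = function(w) conditionMessage(w))
is.character(msg)                 # TRUE: a warning was raised
```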

It's possible to provide names for the elements of a vector. Suppose we were working with purchases in a number of states, and we needed to know the sales tax rate for a given state. We could create a named vector as follows:

> taxrate = c(AL=4,CA=7.25,IL=6.25,KS=5.3,NY=4.25,TN=7)
> taxrate
  AL   CA   IL   KS   NY   TN
4.00 7.25 6.25 5.30 4.25 7.00

To add names to a vector after the fact, you can use the names function:

> taxrate = c(4,7.25,6.25,5.3,4.25,7)
> taxrate
[1] 4.00 7.25 6.25 5.30 4.25 7.00
> names(taxrate) = c('AL','CA','IL','KS','NY','TN')
> taxrate
  AL   CA   IL   KS   NY   TN
4.00 7.25 6.25 5.30 4.25 7.00

If you have a named vector, you can access the elements with either numeric subscripts or by using the name of the element you want:

> taxrate[3]
  IL
6.25
> taxrate['KS']
  KS
5.3
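Character subscripts can be combined with c in the same way as numeric ones. The following sketch also shows what happens when we ask for a name that isn't there; 'TX' is deliberately a state that is not in the vector.

```r
taxrate = c(AL=4,CA=7.25,IL=6.25,KS=5.3,NY=4.25,TN=7)

names(taxrate)                  # "AL" "CA" "IL" "KS" "NY" "TN"
two = taxrate[c('CA','NY')]     # CA 7.25 and NY 4.25
ks = unname(taxrate['KS'])      # 5.3, with the name removed
missing = taxrate['TX']         # NA: no element has that name
```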

One of the most powerful tools in R is the ability to use logical expressions to extract or modify elements in the way that numeric subscripts are traditionally used. While there are (of course) many cases where we're interested in accessing information based on the numeric or character subscript of an object, being able to use logical expressions gives us a much wider choice in the way we can study our data. For example, suppose we want to find all of the observations in taxrate with a tax rate less than 6. First, let's look at the result of just asking whether taxrate is less than 6:

> taxrate < 6
   AL    CA    IL    KS    NY    TN
 TRUE FALSE FALSE  TRUE  TRUE FALSE

The result is a logical vector of the same length as the vector we were asking about. If we use such a vector to extract values from the taxrate vector, it will give us all the ones that correspond to TRUE values, discarding the ones that correspond to FALSE.

> taxrate[taxrate < 6]
  AL   KS   NY
4.00 5.30 4.25
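Logical expressions can be combined, and they can also appear on the left-hand side of an assignment to modify just the matching elements. Here is a minimal sketch; the variables middle and capped are hypothetical, and capped is a copy so that taxrate itself is left alone.

```r
taxrate = c(AL=4,CA=7.25,IL=6.25,KS=5.3,NY=4.25,TN=7)

# Combine conditions with & (and) or | (or)
middle = taxrate[taxrate > 4 & taxrate < 7]    # IL, KS and NY

# Modify only the elements that meet a condition
capped = taxrate
capped[capped > 7] = 7      # CA drops from 7.25 to 7
```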

Another important use of logical variables is counting the number of elements of a vector that meet a particular condition. When a logical vector is passed to the sum function, TRUEs count as 1 and FALSEs count as 0. So we can count the number of TRUEs in a logical expression by passing it to sum:

> sum(taxrate > 6)
[1] 3

This tells us three observations in the taxrate vector had values greater than 6.
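The same trick works with other summary functions. As a sketch, mean gives the proportion of TRUEs, while any and all answer yes-or-no questions about the whole vector:

```r
taxrate = c(AL=4,CA=7.25,IL=6.25,KS=5.3,NY=4.25,TN=7)

sum(taxrate > 6)      # 3: how many rates are above 6
mean(taxrate > 6)     # 0.5: the proportion of rates above 6
any(taxrate > 7)      # TRUE: at least one rate (CA) is above 7
all(taxrate >= 4)     # TRUE: every rate is at least 4
```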

As another example, suppose we want to find which of the states we have information about has the highest sales tax. The max function will find the largest value in a vector. (Once again, note that we don't have to worry about the size of the vector or looping over individual elements.)

> max(taxrate)
[1] 7.25

We can find the state which has the highest tax rate as follows:

> taxrate[taxrate == max(taxrate)]
  CA
7.25

Notice that we use two equal signs when testing for equality, and one equal sign when we are assigning an object to a variable.

Another useful tool for these kinds of queries is the which function. It converts a logical vector into the numeric subscripts of its TRUE elements. For example, if we wanted to know the index of the largest element in the taxrate vector, we could use:

> which(taxrate == max(taxrate))
CA
 2

In fact, this is such a common operation that R provides two functions, which.min and which.max, which return the index of the (first) minimum or maximum element of a vector:

> which.max(taxrate)
CA 
 2
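The two approaches differ when the maximum is tied: which reports every matching position, while which.max reports only the first. A small sketch with a made-up vector called scores:

```r
scores = c(a=3, b=9, c=9, d=1)          # made-up values with a tie

every = which(scores == max(scores))    # positions 2 and 3 (b and c)
first = which.max(scores)               # position 2 only (b)
names(first)                            # "b"
```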

While it's certainly not necessary to examine every function that we use in R, it might be interesting to see what which.max is doing beyond our straightforward solution. As always, we can type the name of the function to see what it does:

> which.max
function (x) 
.Internal(which.max(x))
<environment: namespace:base>

.Internal means that the function that actually finds the index of the maximum value is compiled inside of R. Generally functions like this will be faster than pure R solutions like the first one we tried. We can use the system.time function to see how much faster which.max will be. Because functions use the equal sign (=) to name their arguments, we'll use the alternative assignment operator, <-, in our call to system.time:

> system.time(one <- which(taxrate == max(taxrate)))
   user  system elapsed 
      0       0       0

It's not surprising to see a time of 0 when operating on such a small vector. It doesn't mean that it required no time to do the operation, just that the amount of time it required was smaller than the granularity of the system clock. (The granularity of the clock is simply the smallest interval of time that can be measured by the computer.) To get a good comparison, we'll need to create a larger vector. To do this, we'll use the rnorm function, which generates random numbers from the normal distribution with mean 0 and standard deviation 1. To get times that we can trust, I'll use a vector with 10 million elements:

> x = rnorm(10000000)
> system.time(one <- which(x == max(x)))
   user  system elapsed 
  0.276   0.016   0.292 
> system.time(two <- which.max(x))
   user  system elapsed 
  0.068   0.000   0.071

While the pure R solution seems pretty fast (0.292 seconds to find the index of the largest element in a vector of 10 million numbers), the compiled (internal) version is actually around 4 times faster!

Of course none of this matters if they don't get the same answers:

> one
[1] 8232773
> two
[1] 8232773

The two methods do agree.
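Rather than comparing the printed values by eye, we could let R do the comparison with the identical function. This sketch uses a small made-up vector so that the result doesn't depend on random numbers:

```r
x = c(2.5, -1.0, 7.3, 0.4)     # made-up values

one = which(x == max(x))
two = which.max(x)
identical(one, two)            # TRUE: both give position 3
```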

If you try this example on your own computer, you'll see a different value for the index of the maximum. This is due to the way random numbers are generated in R, and we'll see how to take more control of this later in the semester.
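As a brief preview, one common way to make random results repeatable is the set.seed function. In this sketch the seed value 42 is arbitrary, and the names a and b are hypothetical; any fixed number would serve.

```r
set.seed(42)           # 42 is an arbitrary choice of seed
a = rnorm(5)

set.seed(42)           # resetting the same seed...
b = rnorm(5)           # ...reproduces the same random numbers

identical(a, b)        # TRUE
```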
