Introduction to R
1 Data in R
While R can handle many types of data, the three main varieties that
we'll be using are numeric, character and logical.
In R, you can identify
what type of object you're dealing with by using the mode function.
For example:
> name = 'phil'
> number = 495
> happy = TRUE
> mode(name)
[1] "character"
> mode(number)
[1] "numeric"
> mode(happy)
[1] "logical"
Note that when we enter character data, it needs to be surrounded by
quotes (either double or single), but the symbols TRUE and
FALSE (without quotes) are recognized as values of a logical
variable.
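As a small sketch of the quoting rule (the variable names name1, name2, same and quoted are made up for this illustration), the two kinds of quotes give identical character values, while putting quotes around TRUE turns it into character data:

```r
# Single and double quotes produce the same character value.
name1 = 'phil'
name2 = "phil"
same = name1 == name2      # TRUE

# Quoting TRUE makes it character data rather than a logical value.
quoted = "TRUE"
mode(quoted)               # "character"
mode(TRUE)                 # "logical"
print(same)
```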
Another important characteristic of an object in R is its class, because
many functions know how to treat objects of different classes in a special
way. You can find the class of an object with the class function.
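To see how class can differ from mode, here is a short sketch. The date used is only an illustration and the variable name day is an assumption, not something from the examples above:

```r
number = 495
day = as.Date("2006-01-01")   # a date chosen only for illustration

mode(number)    # "numeric"
class(number)   # "numeric"

# A Date is stored as a number, so its mode is numeric, but its class
# tells functions such as print to display it as a date.
mode(day)       # "numeric"
class(day)      # "Date"
```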
2 Vectors
Occasionally it is useful for a variable (like name or
happy in the above example) to hold only a single value (like
'phil' or TRUE), but usually we'll want to store more
than a single value (sometimes referred to as a scalar) in a variable.
A vector is a collection of objects, all of the same mode, that can be
stored in a single variable, and accessed through subscripts. For example,
consider the minimum temperature in Berkeley for the first 10 days of
January, 2006:
50.7 52.8 48.6 53.0 49.9 47.9 54.1 47.6 43.6 45.5
We could create a variable called mintemp as follows:
> mintemp = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5)
The c function is short for catenate or combine, and it's used to put
individual values together into vectors. You can find the number of
elements in a vector using the length function.
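As a brief sketch of c and length together (the names a, b and both are made up here, and the values are taken from the temperatures above), note that c will also accept vectors, so it can join existing vectors end to end:

```r
a = c(50.7, 52.8, 48.6)
b = c(53.0, 49.9)

# c joins the two vectors into one longer vector.
both = c(a, b)
both            # 50.7 52.8 48.6 53.0 49.9
length(both)    # 5
```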
Once you've created a vector, you can refer to the elements of the vector
using subscripts. Numerical subscripts in R start at 1, and continue up
to the length of the vector. Subscripts of 0 are silently ignored. To refer to
multiple elements in a vector, simply use the c function to create
a vector of the indexes you're interested in. So to extract the first, third,
and fifth values of the mintemp vector, we could use:
> mintemp[c(1,3,5)]
[1] 50.7 48.6 49.9
If all of the subscripts of an object are negative, R will omit the elements at
those positions, and use just the remaining elements. To extract all of the elements of
mintemp except the first and last (tenth), use:
> mintemp[-c(1,10)]
[1] 52.8 48.6 53.0 49.9 47.9 54.1 47.6 43.6
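A few edge cases of subscripting can be sketched as follows. The variable name mixed is an assumption for this illustration, and tryCatch is used only so that the error does not stop the script:

```r
mintemp = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5)

mintemp[c(0, 2)]     # the 0 is ignored, leaving just 52.8
mintemp[11]          # a subscript past the end gives NA
length(mintemp[-1])  # 9: everything except the first element

# Mixing positive and negative subscripts is an error, so we catch it.
mixed = tryCatch(mintemp[c(-1, 2)], error = function(e) "error")
mixed                # "error"
```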
In most programming languages, once you're dealing with vectors, you also
have to start worrying about loops and other programming problems. Not
so in R! For example, suppose we want to use the conversion formula
C = 5/9 (F - 32)
to convert our Fahrenheit temperatures (F) into Celsius (C).
We can act as if mintemp is just a single number, and R will do the hard
part:
> mintempC = 5/9 * (mintemp - 32)
> mintempC
[1] 10.388889 11.555556 9.222222 11.666667 9.944444 8.833333 12.277778
[8] 8.666667 6.444444 7.500000
In fact, most similar operations in R are vectorized; that is, they
operate on entire vectors at once, without the need for loops or other
programming.
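To make the point concrete, here is a sketch comparing the vectorized conversion with the explicit loop it replaces. The name loopC is made up for this comparison:

```r
mintemp = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5)

# Vectorized: one expression handles every element.
mintempC = 5/9 * (mintemp - 32)

# The same conversion written as an explicit loop.
loopC = numeric(length(mintemp))
for (i in 1:length(mintemp)) {
    loopC[i] = 5/9 * (mintemp[i] - 32)
}

all.equal(mintempC, loopC)   # TRUE

# Many functions accept whole vectors in the same way.
round(mintempC, 1)
mean(mintemp)                # 49.37
```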
There are some shortcuts to generate vectors. The colon operator lets you
generate sequences of integers from one value to another. For example,
> x = 1:10
> x
[1] 1 2 3 4 5 6 7 8 9 10
For more control, see the help page for the seq function.
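As a quick sketch of what seq offers beyond the colon operator (the variable names odds, quarters and down are made up here), the by argument sets the step size and length.out sets the number of values:

```r
odds = seq(1, 10, by = 2)              # 1 3 5 7 9
quarters = seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
down = 10:1                            # the colon operator also counts down

odds
quarters
down
```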
You can repeat values using the rep function. This function is
very flexible; if called with scalars, it does the obvious:
> rep(5,3)
[1] 5 5 5
with a vector and a scalar, it creates a new vector by repeating
the old one:
> y = 3:7
> rep(y,3)
[1] 3 4 5 6 7 3 4 5 6 7 3 4 5 6 7
Finally, if you call rep with two equal length vectors,
it repeats the elements of the first vector as many times as the corresponding
element of the second vector:
> rep(1:4,c(2,3,3,4))
[1] 1 1 2 2 2 3 3 3 4 4 4 4
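The second argument of rep can also be given by name, which may make the intent clearer. In this sketch, times repeats the whole vector and each repeats every element in turn; the variable names twice, doubled and letters2 are made up:

```r
y = 3:7

twice = rep(y, times = 2)     # 3 4 5 6 7 3 4 5 6 7
doubled = rep(y, each = 2)    # 3 3 4 4 5 5 6 6 7 7

# rep works on character vectors too.
letters2 = rep(c('a', 'b'), times = c(2, 1))   # "a" "a" "b"

twice
doubled
letters2
```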
One surprising thing about vectors in R is that many times it will carry
out an operation with two vectors that aren't the same size by simply
recycling the values in the shorter vector. For example, suppose we
try to add a vector with four numbers to one with just two numbers:
> c(1,2,3,4) + c(1,2)
[1] 2 4 4 6
Notice that, for the last two elements, it simply recycled the 1 and 2
from the second vector to perform the addition. R will be silent when
things like this happen, but if the length of the longer vector isn't a
whole multiple of the length of the shorter vector, R will print a warning:
> c(1,2,3,4) + c(1,2,3)
[1] 2 4 6 5
Warning message:
longer object length is not a multiple of shorter object
length in: c(1, 2, 3, 4) + c(1, 2, 3)
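Because silent recycling can hide mistakes, it may be worth checking lengths yourself. The sketch below uses the remainder operator %% to predict whether R will warn, and defines add_strict, a hypothetical helper (not part of R) that refuses to add vectors of different lengths:

```r
long = c(1, 2, 3, 4)

# 4 is a multiple of 2, so the shorter vector is recycled silently.
length(long) %% length(c(1, 2))       # 0
# 4 is not a multiple of 3, so R recycles but also warns.
length(long) %% length(c(1, 2, 3))    # 1

# Hypothetical helper: add two vectors only if their lengths match,
# so that recycling can never happen by accident.
add_strict = function(a, b) {
    stopifnot(length(a) == length(b))
    a + b
}

ok = add_strict(long, c(1, 2, 1, 2))      # 2 4 4 6
bad = tryCatch(add_strict(long, c(1, 2)),
               error = function(e) "lengths differ")
ok
bad                                       # "lengths differ"
```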
It's possible to provide names for the elements of a vector.
Suppose we were working with purchases in a number of states, and we
needed to know the sales tax rate for a given state. We could create
a named vector as follows:
> taxrate = c(AL=4,CA=7.25,IL=6.25,KS=5.3,NY=4.25,TN=7)
> taxrate
AL CA IL KS NY TN
4.00 7.25 6.25 5.30 4.25 7.00
To add names to a vector after the fact, you can use the
names function:
> taxrate = c(4,7.25,6.25,5.3,4.25,7)
> taxrate
[1] 4.00 7.25 6.25 5.30 4.25 7.00
> names(taxrate) = c('AL','CA','IL','KS','NY','TN')
> taxrate
AL CA IL KS NY TN
4.00 7.25 6.25 5.30 4.25 7.00
If you have a named vector, you can access the elements with
either numeric subscripts or by using the name of the element you want:
> taxrate[3]
IL
6.25
> taxrate['KS']
KS
5.3
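Name subscripts can be vectors too, which makes a named vector work as a lookup table. In this sketch the purchases (the state and amount vectors) are invented purely for illustration:

```r
taxrate = c(AL=4,CA=7.25,IL=6.25,KS=5.3,NY=4.25,TN=7)

names(taxrate)              # "AL" "CA" "IL" "KS" "NY" "TN"
taxrate[c('CA','NY')]       # several names at once

# Hypothetical purchases: the state and the pre-tax amount of each one.
state = c('CA', 'TN', 'CA')
amount = c(100, 40, 20)

# Look up each purchase's rate by state name, then compute the tax owed.
tax = unname(amount * taxrate[state] / 100)
tax                         # 7.25 2.80 1.45
```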
One of the most powerful tools in R is the ability to use logical expressions
to extract or modify elements in the way that numeric subscripts
are traditionally used. While there are (of course) many cases where we're
interested in accessing information based on the numeric or character
subscript of an object, being able to use logical expressions gives us a
much wider choice in the way we can study our data. For example, suppose
we want to find all of the observations in taxrate with a tax rate less
than 6. First, let's look at the result of just asking whether
taxrate is less than 6:
> taxrate < 6
AL CA IL KS NY TN
TRUE FALSE FALSE TRUE TRUE FALSE
The result is a logical vector of the same length as the vector
we were asking about. If we use such a vector to extract values from
the taxrate vector, it will give us all the ones that correspond
to TRUE values, discarding the ones that correspond to FALSE.
> taxrate[taxrate < 6]
AL KS NY
4.00 5.30 4.25
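Logical subscripts can also appear on the left side of an assignment, which covers the "modify" half of what was described above. In this sketch the rule of raising low rates to 6 is invented for illustration, and adjusted and middle are made-up names:

```r
taxrate = c(AL=4,CA=7.25,IL=6.25,KS=5.3,NY=4.25,TN=7)

# Work on a copy so that the original vector is left alone.
adjusted = taxrate
adjusted[adjusted < 6] = 6     # only AL, KS and NY are changed
adjusted

# Conditions can be combined with & (and) or | (or).
middle = taxrate[taxrate > 5 & taxrate < 7]
middle                         # IL 6.25 and KS 5.30
```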
Another important use of logical variables is counting the number of elements
of a vector that meet a particular condition. When a logical vector is passed
to the sum function, TRUEs count as one and FALSEs
count as 0. So we can count the number of TRUEs in a logical
expression by passing it to sum:
> sum(taxrate > 6)
[1] 3
This tells us three observations in the taxrate vector
had values greater than 6.
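The same idea extends to a few related functions. As a short sketch, mean gives the proportion of TRUEs, while any and all test whether at least one or every element meets the condition:

```r
taxrate = c(AL=4,CA=7.25,IL=6.25,KS=5.3,NY=4.25,TN=7)

sum(taxrate > 6)     # 3 of the states
mean(taxrate > 6)    # 0.5, that is, half of the states
any(taxrate > 7)     # TRUE, because CA is 7.25
all(taxrate > 4)     # FALSE, because AL is exactly 4
```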
As another example, suppose we want to find which of the states we have
information about has the highest sales tax. The max function will
find the largest value in a vector. (Once again, note that we don't have
to worry about the size of the vector or looping over individual elements.)
> max(taxrate)
[1] 7.25
We can find the state which has the highest tax rate as follows:
> taxrate[taxrate == max(taxrate)]
CA
7.25
Notice that we use two equal signs when testing for equality, and one
equal sign when we are assigning an object to a variable.
Another useful tool for these kinds of queries is the which
function. It converts logical subscripts into numeric ones.
For example, if we wanted to know the index of the element in the
taxrate vector that was the biggest, we could use:
> which(taxrate == max(taxrate))
CA
2
In fact, this is such a common operation that R provides
two functions called which.min and which.max which
will return the index of the minimum or maximum element of a vector:
> which.max(taxrate)
CA
2
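The two approaches differ when the maximum occurs more than once: which reports every matching position, while which.max reports only the first. A sketch with a made-up vector v:

```r
v = c(3, 7, 7, 1)     # the maximum, 7, appears twice

allmax = which(v == max(v))
firstmax = which.max(v)

allmax       # 2 3
firstmax     # 2
```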
While it's certainly not necessary to examine every function that
we use in R, it might be interesting to see what which.max
is doing beyond our straightforward solution. As always, we can
type the name of the function to see what it does:
> which.max
function (x)
.Internal(which.max(x))
<environment: namespace:base>
.Internal means that the function that actually finds
the index of the maximum value is compiled inside of R. Generally functions
like this will be faster than pure R solutions like the first one we tried.
We can use the system.time function to see how much faster
which.max will be. Because functions use the equal sign (=)
to name their arguments, we'll use the alternative assignment operator,
<-, in our call to system.time:
> system.time(one <- which(taxrate == max(taxrate)))
user system elapsed
0 0 0
It's not surprising to see a time of 0 when operating on such
a small vector. It doesn't mean that it required no time to do the operation,
just that the amount of time it required was smaller than the granularity
of the system clock. (The granularity of the clock is simply the smallest
interval of time that can be measured by the computer.) To get a good
comparison, we'll need to create a larger vector. To do this, we'll use
the rnorm function, which generates random numbers from the normal
distribution with mean 0 and standard deviation 1. To get times that we
can trust, we'll use a vector with 10 million elements:
> x = rnorm(10000000)
> system.time(one <- which(x == max(x)))
user system elapsed
0.276 0.016 0.292
> system.time(two <- which.max(x))
user system elapsed
0.068 0.000 0.071
While the pure R solution seems pretty fast (0.292 seconds to
find the index of the largest element in a vector of 10 million numbers),
the compiled (internal) version is actually around 4 times faster!
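If you want to compute such a ratio rather than read it off the screen, the result of system.time can be saved and its elapsed component extracted by name. This is only a sketch: the names t1 and t2 are made up, a smaller vector is used so that it runs quickly, and the timings will differ from machine to machine.

```r
x = rnorm(1000000)

# Save the timings instead of just printing them.
t1 = system.time(one <- which(x == max(x)))
t2 = system.time(two <- which.max(x))

# Each result is a named numeric vector; "elapsed" is the wall-clock time.
t1['elapsed']
t2['elapsed']

# The ratio is unreliable when the faster time is close to zero.
t1['elapsed'] / t2['elapsed']
```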
Of course none of this matters if they don't get the same answers:
> one
[1] 8232773
> two
[1] 8232773
The two methods do agree.
If you try this example on your own computer, you'll see a different value
for the index of the maximum. This is due to the way random numbers are
generated in R, and we'll see how to take more control of this later in the
semester.
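As a preview, one common way to make such an example repeatable is the set.seed function, which fixes the starting point of the random number generator. In this sketch the seed value 133 is arbitrary and x2 is a made-up name:

```r
set.seed(133)               # an arbitrary seed; any integer would do
x = rnorm(100000)
one = which(x == max(x))
two = which.max(x)
one == two                  # the two methods agree

# Resetting the same seed reproduces exactly the same random numbers.
set.seed(133)
x2 = rnorm(100000)
identical(x, x2)            # TRUE
```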
File translated from TeX by TtH, version 3.67. On 25 Jan 2011, 14:49.