> q()
Save workspace image? [y/n/c]:

Answering y will save this session. The saved session will be reloaded the next time you invoke R from the current directory. Note that the line that says
[Previously saved workspace restored]
help("topic") to get help on the specified topic.)

What we want to do in this document is
to get you to think about what R is doing and why it does it and to
understand the basic building blocks that you have available to you.
They are similar to those of other languages such as Matlab and
packages like Stata, SPSS and Excel. R is more of a programming language
than these packages and is much more statistically focussed than
Matlab. It is a good tool to know for doing any kind of data analysis.
To use it effectively, you will need to understand these basics and to
sit down and gain experience with the engine and the numerous
functions.
> 1+2
[1] 3

The operators are + for addition, - for subtraction, * for multiplication, / for division and ^ for exponentiation. The order of operations is what you expect, with parentheses overriding the defaults. We can use built-in values, such as pi.
> 1+pi
[1] 4.141593

What is the "thing" pi in this computation? It is a variable. By this we mean it is a name by which we refer to a value. We can associate new values with this name by assigning a value to it. For example, we can give pi the value 1 and then use that.
> pi = 1
> 1+pi
[1] 2

So pi is not a constant in this world. Mathematically, it is. But in the programming world, it is merely a variable to which we can bind or assign new values. Of course, this is not necessarily a good idea. If we use this new value, we will get strange results!

There are several ways to assign a value to a variable in S. They differ only in syntax.
> x = 1+2
> x <- 1+2
> 1+2 ->x
> x
[1] 3

These three forms (=, <-, and ->) can all be used; however, the last is very rarely seen. It arose because one would type a long command and only then realize that the result had not been assigned. At one time, we couldn't go back to the beginning of the line without deleting all the intervening text and so removing the command. Nowadays, we can jump to the beginning of the line, add the assignment, and continue on.
save.image at any time. This puts all the objects in our session workspace into a file named .RData. If we start R again in that directory, the contents of that .RData are loaded into the new session and are immediately available to us again. If we start in a different directory, we can still load the values into the R session, but we must do this ourselves using the function load and giving it the fully qualified name of the file to load (i.e. full directory path and file name).

When we end the R session using the q() function, we will normally be asked whether we want to save the session or not. This calls save.image implicitly.
If we don't want to store all the variables, but only specific ones, we can explicitly save one or more objects to a file (or generally a connection). This is convenient when we create a big dataset and then want to ensure that it gets saved before we do anything else. Or, if we want to make an object available to another R session, e.g. to somebody we are working with, without terminating ours, we can simply write the object to disk and then send it to that person in an entirely portable format.
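As a minimal sketch of saving and reloading a single object, the following uses save and load with a temporary file. The data frame big_data and the file name are hypothetical, chosen only for illustration.

```r
# Hypothetical data and file name, for illustration only.
big_data <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
rdata_file <- file.path(tempdir(), "big_data.RData")

# Write just this one object (not the whole workspace) to disk.
save(big_data, file = rdata_file)

# Simulate another session: remove the object, then load it back.
rm(big_data)
loaded_names <- load(rdata_file)   # load() returns the names of the objects it restored
```

The file written this way can be copied to another machine and read there with load, which recreates big_data under its original name.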
Note that R uses "copying" semantics. When I assign the value of x to
y, y gets the value of x. It is not "linked" to x so that when x is
changed, y would see that change. Instead, we copy the value of x in
the assignment and the two variables are unrelated after that.
> x
[1] 3
> y = x
> x = 10
> x
[1] 10
> y
[1] 3

We have seen how we can store the results of computations or simple values in variables. We can think of these as being stored in our workspace. This is like our desk with pieces of paper storing different information. We would put different pieces of paper in different places so that we can easily find them again when we need them. The place we put them allows us to quickly find them and is analogous to the variable name which allows us to easily refer to the values.

In the same way that we might overload our desk with pieces of paper as we move from task to task, or just have too much information, we need to manage the variables we have in our work area or desktop. S provides functions which we can use to dynamically manage the variables and the contents of our workspace. The function objects gives us the names of the variables we have in our workspace.
> objects()
[1] "x"  "y"  "pi"

We can remove values using remove, passing it the name of the variable we want to remove.
> remove("x")

and we can verify that the variable has been removed using objects again.
> objects()
[1] "y"  "pi"

We can give more than one name. So we can remove both y and pi, the last two remaining variables in our session's workspace.
> remove("y", "pi")

Before we leave this topic, we should ask what happened to the original version of pi. We assigned a new value to it (1) and used that in our computations. Now that we have removed it, is pi defined at all? Is the old value put back? The answer is that the old value is now in effect again, but it wasn't "put back". R did not remember the old value and restore it when we removed our version of pi. The explanation is a little more complicated, and a lot richer. It relates to where we were finding the variable named pi. When we issued the command
> pi = 1

we were telling R to associate the value 1 with the variable name pi. This puts it in our workspace. But before we did this, we managed to find pi also, and then it had the usual value of 3.141593. So where did it come from? It wasn't in our workspace, yet it was still available. The answer involves understanding how R finds variables when we refer to them.

R actually keeps a collection of places in which to search for variables. This is called the search path. This is an ordered collection of workspaces containing variables and their associated values. At any point during an R session, we can ask R what this collection of workspaces is. We do this using the function search. In my session, I get
> search()
[1] ".GlobalEnv"       "package:Rbits"    "package:methods"  "package:stats"
[5] "package:graphics" "package:utils"    "Autoloads"        "package:base"

The first entry is our own personal workspace. When we quit, this disappears. The other entries are packages or libraries of functions and data that are available to us. Now, when we implicitly cause R to look for a variable, it walks along this collection and asks each entry whether it has the relevant variable. After we defined our own version of pi, when we used pi in a computation such as 1 + pi, R started its search for pi. It started in the first element of the search path, and found it there. That is our workspace, where we put pi.
When the session started and we did not yet have our own
version of pi, the search for
pi was rather different. R looked through
each element of the search path and found pi
only in the last entry "package:base". This contains the built-in
variable provided by the R system itself (rather than add-ons).
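The lookup order described above can be illustrated with a short sketch. It assumes a fresh session; the variable names masked, builtin and restored are introduced here only for illustration.

```r
pi <- 1                  # our own pi now lives in the workspace and masks the built-in one
masked <- 1 + pi         # the search stops at the workspace, so this uses 1

builtin <- base::pi      # ask package:base directly, bypassing the search path

rm(pi)                   # remove our version from the workspace
restored <- 1 + pi       # the search now falls through to package:base
```

Nothing is "put back" by rm. The built-in value was in package:base all along and simply becomes visible again once our workspace no longer holds a variable of the same name.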
How could we know where R would find a variable? We can use the function find. So in the following, we define pi, and ask R where we can find it.
> pi = 1
> find("pi")
[1] ".GlobalEnv"   "package:base"

Now, we remove our version of pi and then R can only find the one in "package:base".
> remove("pi")
> find("pi")
[1] "package:base"

All the functions we have seen so far, and functions in general, are simply values assigned to variables. R finds them in the same way when we refer to them in a computation. It looks through the search path until it finds the variable. It is slightly smarter for functions. If it knows we are calling the value of the variable as a function, it will only look for a function, skipping over other types of values.

What if we look for a variable that doesn't exist? For example, suppose we use a variable named duncan in a computation
> duncan^2

What happens? R looks through each element of the search path and eventually gives up, giving the error message:

Error: Object "duncan" not found

We can determine whether a variable is defined using find or, more conveniently in some cases, using the function exists. For example,

> exists('duncan')
[1] FALSE

(Note that I can use single or double quotes for a string, i.e. "duncan" or 'duncan'.)
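The remark that R only looks for functions when a name is used in a call can be checked with a small sketch. The variable names combined and has_duncan are hypothetical.

```r
c <- 5                           # a non-function value that happens to be named c
combined <- c(1, 2, c)           # in call position R skips our 5 and finds the function c;
                                 # as an argument, c refers to our value 5
has_duncan <- exists("duncan")   # FALSE unless duncan is defined in your session
rm(c)                            # tidy up so the name c refers only to the function again
```

Reusing the name of a common function for data works, as the sketch shows, but it makes code harder to read, so it is usually better avoided.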
c
function.
The 'c' stands for concatenate, and all it does
is take one or more values and put them into
a new vector.
For example,
> c(1.2, 4.5, 3.2)
[1] 1.2 4.5 3.2
> c(TRUE, FALSE, FALSE, TRUE)
[1]  TRUE FALSE FALSE  TRUE
> c("Abc", 'def', "ghikllm", "z")
[1] "Abc"     "def"     "ghikllm" "z"

What about the integer vector? Well, in R, all numbers that we type are made into real numbers. So when we type
> c(1, 2, 3)

we get a numeric vector as the individual values are actually numeric. (This is different in S-Plus, version 5 and higher.) There are many cases in which we want integers and they arise naturally. One of them, as we shall see in the subsetting section, is a sequence of integers. The built-in syntax for creating the integer sequence a, a+1, a+2, ..., b is a:b. For example,
> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> 4 : 5
[1] 4 5
> 10:3
[1] 10  9  8  7  6  5  4  3
> 3:3
[1] 3

This is a very specific version of the more general seq function. This allows us to create sequences with different strides (differences between successive elements), of specific length, and so on. See the help pages.
> seq
function (...)
UseMethod("seq")
<environment: namespace:base>
> seq(1, length = 10, by = 2)
 [1]  1  3  5  7  9 11 13 15 17 19

An important characteristic of any vector is its length. We can always find out how many elements a vector contains using the function length.
> x = 1:10
> length(x)
[1] 10
> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> length(letters)
[1] 26

Note that the return value from calling length is itself a vector. It is an integer vector of length 1. The system uses its own built-in types to provide functionality.
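The distinction between numeric and integer vectors can be inspected with typeof. This is a small sketch; the variable names typed, ranged and typed_as_int are hypothetical.

```r
typed  <- c(1, 2, 3)       # numbers we type are stored as real (double) values
ranged <- 1:3              # the colon operator yields integers

typeof(typed)              # "double"
typeof(ranged)             # "integer"
typeof(length(letters))    # "integer": length() returns an integer vector of length 1

typed_as_int <- as.integer(typed)   # explicit conversion when integers are needed
```

Both vectors print identically, so typeof (or is.integer) is the way to tell them apart.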
Often we want to combine two vectors. We can also do this using the c function.
> x = c(1, 2, 3)
> y = c(4, 5, 6)
> c(x, y)
[1] 1 2 3 4 5 6

We can also use the function append. Look at the help for c and append and try to discover the difference.
In many situations, it is convenient to associate names with elements
in a vector. For example, suppose we have IP addresses of machines
stored as strings. We might also want to associate the human-readable
name along with it.
For example,
          wald          anson         fisher
"169.237.46.2" "169.237.46.9" "169.237.46.3"

Here, we have associated the names wald, anson and fisher with the elements of the character vector. For any vector, we can ask for the names of the elements. Suppose the vector of IP addresses above was assigned to the variable ip, then we could get the character vector of names using the function names.
> names(ip)
[1] "wald"   "anson"  "fisher"

If the vector has no names, we get back NULL. This is a special symbol in R, and has length 0. We can check if a value is NULL using is.null:
is.null(names(ip))

There are several ways to specify the names for a vector (of any type, i.e. integer, numeric, logical or character). If we are explicitly creating the vector (using c), then we can put the names in the expression. So instead of c("169.237.46.2", "169.237.46.9", "169.237.46.3"), we write

> c(wald="169.237.46.2", anson = "169.237.46.9", fisher = "169.237.46.3")

If we already have a vector, then we can assign names to the elements using the names function (or technically the names<- function).
> x = 1:26
> names(x) <- letters
> x
 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
> y = c(0, 0, 256)
> names(y) = c("R", "G", "B")
> y
  R   G   B
  0   0 256

Useful general facilities for operating on vectors are rep, rev and sort. rep allows us to replicate a vector in convenient ways.
> rep(1, 2)

For character vectors, paste is convenient for combining strings together. strsplit can be used for splitting strings by user-specified delimiters. substring can be used to extract a sub-part of a string. And we can match and substitute text using regular expressions with the functions grep and gsub.
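A brief sketch of the facilities just listed, applied to the IP address vector from earlier. The result variable names are hypothetical, and the outputs noted in the comments ignore any element names carried along.

```r
ip <- c(wald = "169.237.46.2", anson = "169.237.46.9", fisher = "169.237.46.3")

twice   <- rep(c(1, 2), times = 2)    # repeat the whole vector: 1 2 1 2
each2   <- rep(c(1, 2), each = 2)     # repeat each element:     1 1 2 2
flipped <- rev(1:3)                   # reverse the order:       3 2 1
ordered <- sort(c(3, 1, 2))           # increasing order:        1 2 3

labels <- paste(names(ip), ip, sep = " = ")   # first element: "wald = 169.237.46.2"
prefix <- substring(ip, 1, 3)                 # characters 1 to 3 of each string: "169"
ends9  <- grep("9$", ip)                      # index of the addresses ending in 9: 2
dashed <- gsub(".", "-", ip, fixed = TRUE)    # replace every literal dot with a dash
```

The fixed = TRUE argument tells gsub to treat "." as an ordinary character rather than as the regular expression that matches any character.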
for(i = 0; i < n; i++) {
    f(x[i])
}
> c(1, 2, 3) + c(4, 5, 6)
[1] 5 7 9

The first elements of the two vectors are added together to get 5. Similarly, we get 7 and 9 by adding the second elements, and the third elements. This is very powerful and convenient. It allows us to express computations at a high-level, indicating what we mean rather than hiding it in a loop.

Many functions in S are vectorized, meaning that if you give them a vector of length n, they will operate on all n elements rather than just the first one. strsplit is an example. If we give it the vector of IP addresses and ask it to break the strings into sub-parts separated by ., then we get
> strsplit(ip, "\\.")
$wald
[1] "169" "237" "46"  "2"

$anson
[1] "169" "237" "46"  "9"

$fisher
[1] "169" "237" "46"  "3"

Here, we get back a collection of character vectors. The collection has the same names as the original input vector (wald, anson, fisher) and each element is a character vector holding the parts of the corresponding IP address. The actual data type of the result is a list, which we shall see shortly.

When you write your own functions, you should try to make them vectorized so that they take in a vector and give back a value for each element. Of course, if these are aggregator functions (e.g. sum, prod, lm), then they should work on all of the elements and combine them into a single result.
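As a minimal sketch of this advice, the hypothetical function to_fahrenheit below is vectorized simply because it is built from vectorized arithmetic, while mean plays the role of an aggregator.

```r
# Hypothetical helper: converts every element at once, with no explicit loop.
to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}

temps_f <- to_fahrenheit(c(0, 37, 100))   # one result per input element

# An aggregator, by contrast, combines all the elements into a single value.
mean_f <- mean(temps_f)
```

Because the body uses only element-wise operators, the function works unchanged on a single number or on a vector of any length.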
c(1, 2) + 2? We would like S to be smart enough to add 2 to each element. And that is what happens
> c(1, 2) + 2
[1] 3 4

What about c(1, 10) + c(100, 200, 300, 400), where the second vector has two more elements than the first?
> c(1, 10) + c(100, 200, 300, 400)
[1] 101 210 301 410

R does the right thing, depending on what you think the right thing is! But what did it do? It appears to have created the vector c(1 + 100, 10 + 200, 1 + 300, 10 + 400) and indeed that is what it did. This is a general concept in S; it recycles or replicates the smaller vector to have the same length as the larger one. So, in this case, we recycle c(1, 10) to have length 4. We do this as the function rep would, basically by concatenating several copies of the original vector to get the right length. So we get c(1, 10, 1, 10), which has length 4, the same as the larger vector, and then we can do the basic arithmetic as before.

We can now understand how c(1, 2) + 2 works. The single value 2 is a vector of length 1, and it is recycled to c(2, 2) before the addition.
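The recycling rule can be reproduced by hand with rep and its length.out argument. This is only a sketch of the idea, not a description of how R implements arithmetic internally, and the variable names are hypothetical.

```r
short <- c(1, 10)
long  <- c(100, 200, 300, 400)

expanded  <- rep(short, length.out = length(long))   # 1 10 1 10
by_hand   <- expanded + long                         # element-wise, equal lengths
automatic <- short + long                            # R recycles short for us

# The scalar case is the same rule applied to a vector of length 1.
scalar_case <- rep(2, length.out = 2) + c(1, 2)      # 3 4
```

Both by_hand and automatic hold the same four values, which is the sense in which recycling behaves as rep would.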
What about the following expression c(1, 2) + c(10, 11, 12), i.e. using vectors of length 2 and length 3?

> c(1, 2) + c(10, 11, 12)
[1] 11 13 13
Warning message:
longer object length is not a multiple of shorter object length in: c(1, 2) + c(10, 11, 12)

The first thing to note is that R generates a warning telling you that you may want to check whether the result is as you expected. The problem is that recycling the smaller vector did not naturally yield a vector of the same length as the larger one. That is why R gave a warning. But it went ahead and did the addition using c(1, 2, 1) + c(10, 11, 12) as it recycled the smaller vector to have the same length as the larger one and threw away any left over elements.
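The partial recycling in this last case can also be spelled out with rep. This sketch uses hypothetical variable names, and suppressWarnings is used only to keep the comparison quiet.

```r
short <- c(1, 2)
long  <- c(10, 11, 12)

truncated <- rep(short, length.out = length(long))   # 1 2 1: the second copy is cut off
by_hand   <- truncated + long                        # equal lengths, so no warning
automatic <- suppressWarnings(short + long)          # same values; the warning is hidden
```

Writing the rep call explicitly is one way to show that the uneven recycling is intended rather than an accident.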