Vectors and Matrices
1 Modes and Classes
It was mentioned earlier that all the elements of a vector must
be of the same mode. To see the mode of an object, you can use
the mode function. What happens if we try to combine
objects of different modes using the c function?
The answer is that R will find a common mode that can accommodate
all the objects, resulting in the mode of some of the objects changing.
For example, let's try combining some numbers and some character strings:
> both = c('dog',3,'cat','mouse',7,12,9,'chicken')
> both
[1] "dog" "3" "cat" "mouse" "7" "12" "9"
[8] "chicken"
> mode(both)
[1] "character"
You can see that the numbers have been changed to characters because
they are now displayed surrounded by quotes. They also will no longer
behave like numbers:
> both[2] + both[5]
Error in both[2] + both[5] : non-numeric argument to binary operator
The error message means that the two values can no longer be
added. If you really need to treat character strings like numbers, you
can use the as.numeric function:
> as.numeric(both[2]) + as.numeric(both[5])
[1] 10
Of course, the best thing is to avoid combining objects of different
modes with the c function. We'll see later that R provides an object
known as a list that can store different types of objects without having to
change their modes.
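As a small sketch of the coercion described above, the following shows how c settles on a common mode, and gives a brief preview of how a list keeps each element in its original mode. The variable names a, b and mixed are only illustrative.

```r
# Logical and numeric values combine to numeric (TRUE becomes 1)
a = c(TRUE, 2, 3.5)
mode(a)

# Adding a character string forces every element to character
b = c(TRUE, 2, "dog")
mode(b)

# A list keeps each element in its original mode
mixed = list("dog", 3, TRUE)
sapply(mixed, mode)
```

Roughly speaking, logical values can be promoted to numbers and numbers to character strings, but not the other way around, which is why one character string is enough to change the whole vector.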
2 Reading Vectors
Once you start working with larger amounts of data,
it becomes very tedious to enter data into the c function,
especially considering the need to put quotes around character values and
commas between values.
To read data from a file or from the terminal without the need for
quotes and commas, you can use the scan function. To read from
a file (or a URL), pass it a quoted string with the name of the file or
URL you wish to read; to read from the terminal, call scan() with no
arguments, and enter a completely blank line when you're done entering
your data. Additionally, on Windows or Mac OS X, you can substitute a call to the
file.choose() function for the quoted string with the file name,
and you'll be presented with the familiar file chooser used by most
programs on those platforms.
Suppose there's a file called numbers in your working directory. (You
can get your working directory by calling the getwd() function,
or set it using the setwd function or File -> Change dir selection in the R console.)
Let's say the contents of this file look like this:
12 7
9 8 14 10
17
The scan function can be used to read these numbers as follows:
> nums = scan('numbers')
Read 7 items
> nums
[1] 12 7 9 8 14 10 17
The optional what= argument to scan can be used to
read vectors of character or logical values, but remember that all the
elements of a vector must be of the same mode.
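Here is a minimal sketch of reading character values with the what= argument. The temporary file and its contents are made up for illustration; in practice you would pass the name of your own file.

```r
# Write a small example file to a temporary location (hypothetical data)
f = tempfile()
writeLines(c("dog cat", "mouse chicken"), f)

# what="" tells scan to read character values instead of numbers;
# no quotes or commas are needed in the file itself
animals = scan(f, what = "")
animals
mode(animals)

unlink(f)  # remove the temporary file
```

Passing an empty string as what= is just a convenient way of giving scan an example of the kind of value it should expect.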
3 Missing Values
No matter how carefully we collect our data, there will always be
situations where we don't know the value of a particular variable. For
example, we might conduct a survey where we ask people 10 questions,
and occasionally we forget to ask one, or people don't know the proper
answer. We don't want values like this to enter into calculations, but
we can't just eliminate them because then observations that have
missing values won't "fit in" with the rest of the data.
In R, missing values are represented by the symbol NA. For example,
suppose we have a vector of 9 values, but the fourth one is missing. We can
enter a missing value by passing NA to the c function just
as if it were a number (no quotes needed):
x = c(1,4,7,NA,12,19,15,21,20)
R will also recognize the unquoted string NA as a missing
value when data is read from a file or URL.
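To illustrate, this sketch writes a line containing NA to a temporary file (a made-up example, not one of the files used elsewhere in these notes) and reads it back with scan:

```r
# Create a small file in which the fourth value is missing
f = tempfile()
writeLines("1 4 7 NA 12", f)

# scan reads NA as a missing value, not as a character string
y = scan(f)
y
mode(y)   # still numeric

unlink(f)  # remove the temporary file
```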
Missing values are different from other values in R in two ways:
- Most computations involving a missing value will return a missing value.
- Unlike other quantities in R, we can't directly test to see if something is
equal to a missing value with the equality operator (==). We must use
the built-in is.na function, which will return TRUE if a value
is missing and FALSE otherwise.
Here are some simple R statements that illustrate these points:
> x = c(1,4,7,NA,12,19,15,21,20)
> mean(x)
[1] NA
> x == NA
[1] NA NA NA NA NA NA NA NA NA
Fortunately, these problems are fairly easy to solve. In the first
case, many functions (like mean, min, max, sd,
quantile, etc.) accept an na.rm=TRUE argument, which tells the
function to remove any missing values before performing the computation:
> mean(x,na.rm=TRUE)
[1] 12.375
In the second case, we just need to remember to always use is.na
whenever we are testing to see if a value is a missing value.
> is.na(x)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
By combining a call to is.na with the logical "not" operator (!),
we can filter out missing values in cases where no na.rm= argument is
available:
> x[!is.na(x)]
[1] 1 4 7 12 19 15 21 20
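A few related idioms build on is.na. The sketch below counts and locates the missing values in x, and applies cumsum (used here only as an example of a function with no na.rm= argument) to the filtered vector:

```r
x = c(1,4,7,NA,12,19,15,21,20)

sum(is.na(x))     # number of missing values (each TRUE counts as 1)
which(is.na(x))   # positions of the missing values
NA + 1            # arithmetic with a missing value gives a missing value

# cumsum has no na.rm= argument, so filter the vector first
cumsum(x[!is.na(x)])
```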
4 Matrices
A very common way of storing data is in a matrix, which is basically a
two-way generalization of a vector. Instead of a single index, we can
use two indexes, one representing a row and the second representing a
column. The matrix function takes a vector and makes it into
a matrix in a column-wise fashion. For example,
> mymat = matrix(1:12,4,3)
> mymat
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
The last two arguments to matrix tell it the number
of rows and columns the matrix should have. If you use a named argument,
you can specify just one dimension, and R will figure out the other:
> mymat = matrix(1:12,ncol=3)
> mymat
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
To create a matrix by rows instead of by columns, the
byrow=TRUE argument can be used:
> mymat = matrix(1:12,ncol=3,byrow=TRUE)
> mymat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
When data is being read from a file, you can simply embed a call to scan
into a call to matrix. Suppose we have a file called matrix.dat
with the following contents:
7 12 19 4
18 7 12 3
9 5 8 42
We could create a 3×4 matrix, read in by rows, with the
following command:
matrix(scan('matrix.dat'),nrow=3,byrow=TRUE)
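The following self-contained sketch reproduces this step. It writes the same three lines to a temporary file (standing in for matrix.dat, so that the example runs anywhere) and then builds the matrix:

```r
# Stand-in for matrix.dat: same contents, written to a temporary file
f = tempfile()
writeLines(c("7 12 19 4", "18 7 12 3", "9 5 8 42"), f)

# scan returns one long vector; matrix reshapes it row by row
mat = matrix(scan(f), nrow = 3, byrow = TRUE)
mat
dim(mat)   # number of rows, then number of columns

unlink(f)  # remove the temporary file
```

Without byrow=TRUE, the values would be placed down the columns instead, and the rows of the matrix would no longer match the lines of the file.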
To access a single element of a matrix, we need to specify both the row
and the column we're interested in. Consider the following matrix,
containing the numbers from 1 to 10:
> m = matrix(1:10,5,2)
> m
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
Now suppose we want the element in row 4 and column 1:
> m[4,1]
[1] 4
If we leave out either one of the subscripts, we'll get the entire row
or column of the matrix, depending on which subscript we leave out:
> m[4,]
[1] 4 9
> m[,1]
[1] 1 2 3 4 5
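Subscripts can also be vectors, and a single extracted row or column is returned as a plain vector unless you ask otherwise. This short sketch, using the same matrix m, shows both points. The drop=FALSE argument is a standard part of R's subscripting, though it is not covered elsewhere in these notes.

```r
m = matrix(1:10, 5, 2)

m[c(2, 4), 2]          # rows 2 and 4 of column 2

m[4, ]                 # row 4 as an ordinary vector
m[4, , drop = FALSE]   # row 4 kept as a matrix with one row
dim(m[4, , drop = FALSE])
```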
5 Data Frames
One shortcoming of vectors and matrices is that they can only hold one
mode of data; they don't allow us to mix, say, numbers and character strings.
If we try to do so, R will change the mode of the other elements in the
vector to conform. For example:
> c(12,9,"dog",7,5)
[1] "12" "9" "dog" "7" "5"
Notice that the numbers got changed to character values so that
the vector could accommodate all the elements we passed to the c
function. In R, a special object known as a data frame resolves this problem.
A data frame is like a matrix in that it represents a rectangular array of
data, but each column in a data frame can be of a different mode, allowing
numbers, character strings and logical values to coexist in a single object
in their original forms. Since most interesting data problems involve a
mixture of character variables and numeric variables, data frames are
usually the best way to store information in R. (It should be mentioned that
if you're dealing with data of a single mode, a matrix may be more efficient
than a data frame.) Data frames correspond to the traditional
"observations and variables" model that most statistical software uses,
and they are also similar to database tables. Each row of a data frame
represents an observation; the elements in a given row represent information
about that observation. Each column, taken as a whole, has all the information
about a particular variable for the data set.
For small datasets, you can enter each of the columns (variables) of your
data frame using the data.frame function. For example, let's
extend our temperature example by creating a data frame that has the
day of the month, the minimum temperature and the maximum temperature:
> temps = data.frame(day=1:10,
+ min = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5),
+ max = c(59.5,55.7,57.3,71.5,69.8,68.8,67.5,66.0,66.1,61.7))
> head(temps)
  day  min  max
1   1 50.7 59.5
2   2 52.8 55.7
3   3 48.6 57.3
4   4 53.0 71.5
5   5 49.9 69.8
6   6 47.9 68.8
Note that the names we used when we created the data frame are
displayed with the data. (You can add names after the fact with the
names function.)
Also, instead of typing the name temps
to see the data frame, we used a call to the head function.
This shows just the first six observations (by default) of
the data frame, and is a handy way to check that a large data frame really
looks the way you think it does. (There's a function called tail that
shows the last lines of an object as well.)
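As a sketch of these points, the following builds the same data frame without names, adds the names afterwards with the names function, and passes a second argument to head and tail to control how many observations are shown:

```r
# Create the data frame without naming the columns
temps = data.frame(1:10,
    c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5),
    c(59.5,55.7,57.3,71.5,69.8,68.8,67.5,66.0,66.1,61.7))

# Assign the column names after the fact
names(temps) = c("day", "min", "max")

head(temps, 3)   # first three observations
tail(temps, 2)   # last two observations
```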
If we try to look at the class or mode of a data frame, it's not that
informative:
> class(temps)
[1] "data.frame"
> mode(temps)
[1] "list"
We'll see the same results for every data frame we use. To
look at the modes of the individual columns of a data frame, we can use
the sapply function. This function applies another function to each
element of an object (for a data frame, each column), a task that would
require a loop in many other languages, and simplifies the results
to a vector where possible. To use sapply on a data frame,
pass the data frame as the first argument to sapply, and the function
you wish to use as the second argument. So to find the modes of the
individual columns of the temps data frame, we could use
> sapply(temps,mode)
      day       min       max
"numeric" "numeric" "numeric"
Notice that sapply even labeled the result with the name of
each column.
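The result is more interesting when the columns differ in mode. The pets data frame below is a made-up example, not part of the temperature data. stringsAsFactors=FALSE is given explicitly because versions of R before 4.0 would otherwise convert the character column to a factor.

```r
# A hypothetical data frame with three columns of different modes
pets = data.frame(name = c("dog", "cat", "mouse"),
                  weight = c(30, 4.5, 0.02),
                  indoor = c(FALSE, TRUE, TRUE),
                  stringsAsFactors = FALSE)

sapply(pets, mode)    # one mode per column
sapply(pets, class)   # sapply works with other functions too
```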
Suppose we want to concentrate on the maximum daily temperature (which
we've called max in our data frame) among the days recorded. There are several
ways we can refer to the columns of a data frame:
- Probably the easiest way to refer to this column is to use a special notation that eliminates the need to put
quotes around the variable names (unless they contain blanks or other
special characters). Separate the data frame name from the variable name
with a dollar sign ($):
> temps$max
[1] 59.5 55.7 57.3 71.5 69.8 68.8 67.5 66.0 66.1 61.7
- We can treat the data frame as if it were a matrix. Since the maximum
temperature is in the third column, we could say
> temps[,3]
[1] 59.5 55.7 57.3 71.5 69.8 68.8 67.5 66.0 66.1 61.7
- Since we named the columns of temps, we can use a character subscript:
> temps[,"max"]
[1] 59.5 55.7 57.3 71.5 69.8 68.8 67.5 66.0 66.1 61.7
- When you use a single subscript with a data frame, it returns a data frame
consisting of just that column. R also provides a special subscripting
method (double brackets) to extract the actual data (in this case a vector)
from the data frame:
> temps['max']
    max
1  59.5
2  55.7
3  57.3
4  71.5
5  69.8
6  68.8
7  67.5
8  66.0
9  66.1
10 61.7
> temps[['max']]
[1] 59.5 55.7 57.3 71.5 69.8 68.8 67.5 66.0 66.1 61.7
Notice that this second form is identical to temps$max.
We could also use the equivalent numerical subscript (in this case 3)
with single or double brackets.
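The relationships among these forms can be checked directly. This sketch recreates temps and uses the identical and class functions to compare the results:

```r
temps = data.frame(day = 1:10,
    min = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5),
    max = c(59.5,55.7,57.3,71.5,69.8,68.8,67.5,66.0,66.1,61.7))

# These forms all extract the column as a plain numeric vector
identical(temps$max, temps[["max"]])
identical(temps[[3]], temps[, "max"])

# Single brackets with one subscript return a one-column data frame
class(temps["max"])
class(temps[3])
```

The distinction matters when the result is passed to a function that expects a vector, since a one-column data frame will not always behave the same way.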
File translated from TeX by TTH, version 3.67, on 26 Jan 2011, 09:02.