In this class, we won't look very much at matrices. Instead, we will
focus on data frames which are a richer, more appropriate way to think
about observed data. However, since matrices arise within
computational statistics a lot, it is good to understand the basics.
We can create a matrix in S using the
matrix
function. We give this values
for the elements, the number of rows and the number of columns.
> matrix(1:6, 2, 3)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
Note that the values (1, 2, 3, 4, 5, 6) are arranged column-wise.
If we wanted to arrange them row-wise, we could use
the byRow argument.
> x = matrix(1:6, 2, 3, byrow = TRUE)
Because of the recycling rule for vectors,
we can give fewer values than are needed.
matrix
uses the number of rows and columns
to determine how many values it needs and recycles
appropriately.
For example,
> matrix(NA, 2, 3)
creates a matrix of NAs, i.e. missing values.
We can even omit either dimension if we specify the correct
number of values.
For example,
> matrix(1:6, , 3)
omits the number of rows, and S infers that it is 2.
Now that we know one way to create a matrix, let's think about what it
is in S. Essentially, it is a vector with additional information
about the dimensions of the matrix. The dimensions are stored as an
integer vector of length 2. This additional information is stored in
a general "attributes" field that we can associate with any S object.
We can query this value
using the
dim
function.
> dim(matrix(1:6, , 3))
[1] 2 3
We can even assign a value to the dimension to change the
"shape" of the matrix.
> x = matrix(1:6, , 3)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> x
> dim(x) <- c(3, 2)
x
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
This is another way to create matrices.
From the perspective of a matrix, the fact that it is a vector
has a very strong implication. Being a vector, a matrix in S can only
contain values or elements of the same type. This is like its
mathematical counterpart, and it means that we cannot mix character
elements with numbers or logicals. As we saw for vectors, if we
combine elements of different basic types, they get coerced into the
common form that doesn't lose information. This is rarely desirable
and this is where data frames are more appropriate and useful
as we shall see.
We do get nice things from a matrix being a vector with dimensions.
We can perform arithmetic, etc. simply on two matrices of the right
dimensions using the basic vector operations.
> matrix(1:4, 2, 2) + matrix(101:104, 2, 2)
[,1] [,2]
[1,] 102 106
[2,] 104 108
> matrix(1:4, 2, 2)^3
[,1] [,2]
[1,] 1 27
[2,] 8 64
As with vectors, it is often convenient to supply names to label the
elements. In the case of matrices, we have two dimensions and so two
sets of names, one for the rows and one for the columns. We can
access the names using
rownames
and
colnames
. These are actually stored
together as a list containing two character vectors of the appropriate
lengths. They are again stored in the attributes part of the object.
We can access them together using the
dimnames
function meaning, of course,
the dimension names.
One of the things we sometimes want to do on a matrix is to apply a
function to each of the rows or each of the columns. This is like a
vectorized operation that operates on the entire collection of
elements, but for a matrix there are times we want to respect the two
dimensional structure and apply the function to the different
collection of elements given by the rows or the columns. The function
apply
does this for us. We give it the
matrix of interest, the dimension we want to sweep over (i.e. the rows
or the columns) and a function that takes in a vector and returns a
value.
Let's take the simple matrix we have above
> x = matrix(1:6, 2, 3)
Now, we can sum the elements in each row
using
apply
in the following way:
> apply(x, 1, sum)
[1] 9 12
One way to remember which dimension to specify is
to think about which one we want left with us.
And 1 refers to rows and 2 refers to columns.
We can use other functions, and indeed these functions
can return complicated objects, not just simple vectors
of length 1.
The apply functions are very powerful. They work on matrices, and we
can use them on simple vectors also. They remove the need for verbose
looping constructs such as
ans <- numeric(nrow(x))
for(i in 1:nrow(x)) {
ans[i] <- sum(x[i,])
}
which is the equivalent to the
apply
command in the previous paragraph. And we shall see that the
different forms of the
apply
functions
are very important for operating on lists.
A little thought about the fact that matrices are simply vectors with
an associated dimension vector might lead to thinking about
multi-dimensional matrices or
arrays. We can use
the same structure and provide a dimension vector of length k to
specify a k dimensional array.
> array(1:60, c(3, 4, 5))
This creates a collection of five 3 by 4 matrices.
Of course, we can look at this along any dimension and
see it also as a collection of three 4 x 5 matrices,
or four 3 x 5 matrices.
And we can use the
apply
function, this time with multiple dimensions.
Again, using the rule that we specify what dimensions
we want to be left with, we perform an operation
along the other dimensions
such as
> apply(array(1:60, c(3, 4, 5)), c(1, 3), sum)
[,1] [,2] [,3] [,4] [,5]
[1,] 22 70 118 166 214
[2,] 26 74 122 170 218
[3,] 30 78 126 174 222
to end up with a 3 x 5 matrix collapse across the
middle dimension of 4 matrices.
Similarly, we collapse the second and third dimension
in the following command.
> apply(array(1:60, c(3, 4, 5)), c(1), sum)
[1] 590 610 630
Some functions to look at for operating on matrices in S
are:
dimnames
,
t
,
eigen
,
diag
,
solve
So far we have looked at vectors and we have emphasized that they must
have the same type of elements. If we try to combine different types
of elements, S coerces them to an appropriate common type.
For example
> c(1, 1.2, TRUE, "abc")
[1] "1" "1.2" "TRUE" "abc"
results in a character vector.
You can try combining different elements
and see what you get,
e.g.
> c(as.integer(1), 1.2)
There are so many situations that we want to be able to group values
with different types. An observation may be made up of an identfier
such as a name or social security number; age; day, month and year of
birth; gender; height; 3 measures of blood pressure; etc. We most
definitely don't want to put these into a vector as then everything
will have to be a string. We will throw away good information about
the fact that some are numbers, some are integers, etc. Instead, we
want to be able to group them together but keep their different types.
Otherwise, if we want to take the mean of the blood pressure
measurements for each person, or find the average height, we would
have to convert the strings to numbers, handle errors in the data that
had snuck in because we were treating numbers as strings and hadn't
verified that they were numbers, and so on. Generally, giving up
information about the type of a value is a bad idea. This
meta-information can be important, and is
becoming much more commonly available in data analysis and good
programming these days.
So what we need is a container to group things together that supports
elements with different types. This is exactly what a list does in S.
A list is essentially a vector but can support elements with different
types. We put things into a list
using the list function:
> x = list(1:10, "B", list(name = "Duncan Temple Lang", ssn = "999-99-9999"), rnorm(20))
This shows that we can use any S object as an element in a list,
including lists themselves. We can use names for elements of lists as
we did for vectors.
We can use the same style of subsetting as we did for vectors also.
For example, we can get the first two elements of our
list above as
> x[1:2]
And we can drop elements using negative indices:
> x[-1]
Similarly, names and logical vectors will work the same way as they
did for vectors. What is important to note is that, just as for
vectors, the
[
returns an object of the same
type as the one being subsetted. So using the
[
on a list means that we will get back a
list. This is true even if the subset only has one element. So if we
want to get an element itself, rather than a list containing that one
element, we are going to need some other mechanism.
For example, consider the simple list
> x = list(1, "a")
containing just two elements.
If we get the subset containing the first element
> x[1]
[[1]]
[1] 1
it is a list of length 1.
And
> x[1][1]
is the same thing!
So we need another operator to extract an individual element. We use
[[
for this.
> x[[1]]
Other than this and the fact that we can hold arbitrary types in a
list, they are like vectors. The same subsetting works. Elements can
have names. One thing we can do is use the
$
to access elements by name.
If we have a list
> x = list(sample1 = rnorm(10), sample2 = rnorm(1000))
we can access the elements as
> x$sample1
> x$sample2
We can apply a function to each element of a list using the function
lapply
, for "list apply". This works
very much the same way as
apply
does
for vectors. We give it the list and the function to apply to each
element. The result is itself a list where each element is the result
of applying the function to the corresponding element in the original
list. For example, with our list above containing two samples
of observations from a random normal distribution, we can calculate
the sum and mean of the values as
> lapply(x, sum)
$sample1
[1] -6.707135
$sample2
[1] 0.0317475
> lapply(x, mean)
$sample1
[1] -0.6707135
$sample2
[1] 0.00317475
Note that the names of the new list are the same
as that of the original list.
If we want to save looping over the list
twice, we could compute both the sum and the mean
in a single call.
> lapply(x, function(x) c(sum = sum(x), mean = mean(x)))
$sample1
sum mean
-6.7071349 -0.6707135
$sample2
sum mean
0.03174750 0.00317475
Lists are very useful when we want to do a number of iterations and
store the results from each. The bootstrap or any form of simulation
is a good example. The example above suggests that if we wanted to
loop over different sample sizes - say 10, 100, 1000, 10000 -
and compute samples of that size, we might do this and store the
results in a list. Many people are included to try to save the
results of each iteration to a variable with a name made up by pasting
a name and the sample size together.
This can be done, but it is not a very good way.
for(i in c(10, 100, 1000, 10000)) {
x = rnorm(i)
assign(paste("sample", i, sep="."))
}
And then we end up with the results in the variables
sample.10,
sample.100,
sample.1000 and
sample.10000. This will overwrite any
existing variables having these names. It is also hard to deal with
the collection of results as a collection. Instead, they are distinct
variables.
A better way to do this is
ans <- list()
for(i in c(10, 100, 1000, 10000)) {
ans[[paste("sample", i, sep=".")]] <- rnorm(i)
}
Now we end up with a list containing the 4 sample vectors.
And one of the nice things we can do is then
use
lapply
to compute on
the elements as a collection: e.g.
> lapply(ans, mean)
as before.
Perhaps the nicest way to do these iterative computations is
to use
lapply
on the vector of sample sizes.
> ans = lapply(c(10, 100, 1000, 10000), rnorm)
We lose the names, but we can put them on ourselves after the
computation. We do avoid having to declare a global variable
(
ans) and then add to it within each
iteration. And the names are very important in some contexts.
Just to show what we might do in a
call to
apply
,
let's create a histogram of each of the 4 samples.
> par(mfrow=c(2,2))
> lapply(ans, hist)
Here we split the graphics screen (which will be created automatically
if necessary) into 2 rows and 2 columns and then use the
hist
to create the individual
histograms. We would probably want to ensure that they had the same
scale and the right title and axis labels.
> lapply(ans, function(x) hist(x, xlab="", main=paste("Sample size", length(x))))
There is one other detail in using
lapply
. We mentioned that it returns a
list containing the new elements. If all the elements have the same
type, it is often much more convenient to have them as a vector and
not a list. Our example of when we computed the mean of the samples
is illustrative. Suppose we had 1000 samples in our list, each of
sample size 200. Then we might want to compute different statistics
on these samples and look at their distributions.
Let's do this by looking at the scatterplot of
means and medians.
First we generate our 1000 samples
> samples = lapply(1:1000, function(x) rnorm(200))
Now, if we use
lapply
to compute the
medians and the means, we will end up with two lists of length 1000.
But to display a scatterplot of these, we need vectors, not
lists. (See the help page for
plot
.)
So what can we do?
One thing to do is call
unlist
.
This will unravel the elements in the list and try to
remove their structure and create a vector.
In this case, this will work nicely.
> unlist(lapply(samples, mean))
In other cases, we have to be careful only to unlist at the top-level.
For example, if we have a list of lists, then we may not want to
unravel the entire two levels, but just the first. (See the
recursive argument for
unlist
.)
But better than
unlist
in many
cases is the function
sapply
. This is
the same as
lapply
but it attempts to
coerce the result into a vector. If it can't, it simply returns the
result as a list, as would
lapply
.
So
sapply
is exactly what we want here.
So our plot is easily created as
> plot(sapply(samples, mean), sapply(samples, median))
and this gives us a sense of the relationship between the mean and the
median for standard normal distributions with samples of size 200.
(Again, fix the axes labels!)