Matrices

In this class, we won't look very much at matrices. Instead, we will focus on data frames which are a richer, more appropriate way to think about observed data. However, since matrices arise within computational statistics a lot, it is good to understand the basics. We can create a matrix in S using the matrix function. We give this values for the elements, the number of rows and the number of columns.
> matrix(1:6, 2, 3)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Note that the values (1, 2, 3, 4, 5, 6) are arranged column-wise. If we wanted to arrange them row-wise, we could use the byRow argument.
> x = matrix(1:6, 2, 3, byrow = TRUE)

Because of the recycling rule for vectors, we can give fewer values than are needed. matrix uses the number of rows and columns to determine how many values it needs and recycles appropriately. For example,
> matrix(NA, 2, 3)

creates a matrix of NAs, i.e. missing values.

We can even omit either dimension if we specify the correct number of values. For example,
> matrix(1:6, , 3)

omits the number of rows, and S infers that it is 2.

Now that we know one way to create a matrix, let's think about what it is in S. Essentially, it is a vector with additional information about the dimensions of the matrix. The dimensions are stored as an integer vector of length 2. This additional information is stored in a general "attributes" field that we can associate with any S object. We can query this value using the dim function.
> dim(matrix(1:6, , 3))
[1] 2 3

We can even assign a value to the dimension to change the "shape" of the matrix.
> x = matrix(1:6, , 3)
   [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> x
> dim(x) <- c(3, 2)
x
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

This is another way to create matrices.

From the perspective of a matrix, the fact that it is a vector has a very strong implication. Being a vector, a matrix in S can only contain values or elements of the same type. This is like its mathematical counterpart, and it means that we cannot mix character elements with numbers or logicals. As we saw for vectors, if we combine elements of different basic types, they get coerced into the common form that doesn't lose information. This is rarely desirable and this is where data frames are more appropriate and useful as we shall see.

We do get nice things from a matrix being a vector with dimensions. We can perform arithmetic, etc. simply on two matrices of the right dimensions using the basic vector operations.
> matrix(1:4, 2, 2) + matrix(101:104, 2, 2)
     [,1] [,2]
[1,]  102  106
[2,]  104  108
> matrix(1:4, 2, 2)^3
     [,1] [,2]
[1,]    1   27
[2,]    8   64

As with vectors, it is often convenient to supply names to label the elements. In the case of matrices, we have two dimensions and so two sets of names, one for the rows and one for the columns. We can access the names using rownames and colnames. These are actually stored together as a list containing two character vectors of the appropriate lengths. They are again stored in the attributes part of the object. We can access them together using the dimnames function meaning, of course, the dimension names.

One of the things we sometimes want to do on a matrix is to apply a function to each of the rows or each of the columns. This is like a vectorized operation that operates on the entire collection of elements, but for a matrix there are times we want to respect the two dimensional structure and apply the function to the different collection of elements given by the rows or the columns. The function apply does this for us. We give it the matrix of interest, the dimension we want to sweep over (i.e. the rows or the columns) and a function that takes in a vector and returns a value. Let's take the simple matrix we have above
> x = matrix(1:6, 2, 3)

Now, we can sum the elements in each row using apply in the following way:
> apply(x, 1, sum)
[1]  9 12

One way to remember which dimension to specify is to think about which one we want left with us. And 1 refers to rows and 2 refers to columns. We can use other functions, and indeed these functions can return complicated objects, not just simple vectors of length 1.

The apply functions are very powerful. They work on matrices, and we can use them on simple vectors also. They remove the need for verbose looping constructs such as
ans <- numeric(nrow(x))
for(i in 1:nrow(x)) {
 ans[i] <- sum(x[i,])
}

which is the equivalent to the apply command in the previous paragraph. And we shall see that the different forms of the apply functions are very important for operating on lists.

A little thought about the fact that matrices are simply vectors with an associated dimension vector might lead to thinking about multi-dimensional matrices or arrays. We can use the same structure and provide a dimension vector of length k to specify a k dimensional array.
> array(1:60, c(3, 4, 5))

This creates a collection of five 3 by 4 matrices. Of course, we can look at this along any dimension and see it also as a collection of three 4 x 5 matrices, or four 3 x 5 matrices.

And we can use the apply function, this time with multiple dimensions. Again, using the rule that we specify what dimensions we want to be left with, we perform an operation along the other dimensions such as
> apply(array(1:60, c(3, 4, 5)), c(1, 3), sum)
     [,1] [,2] [,3] [,4] [,5]
[1,]   22   70  118  166  214
[2,]   26   74  122  170  218
[3,]   30   78  126  174  222

to end up with a 3 x 5 matrix collapse across the middle dimension of 4 matrices.

Similarly, we collapse the second and third dimension in the following command.
> apply(array(1:60, c(3, 4, 5)), c(1), sum)
[1] 590 610 630

Some functions to look at for operating on matrices in S are: dimnames, t, eigen, diag, solve

Lists

So far we have looked at vectors and we have emphasized that they must have the same type of elements. If we try to combine different types of elements, S coerces them to an appropriate common type. For example
> c(1, 1.2, TRUE, "abc")
[1] "1"    "1.2"  "TRUE" "abc" 

results in a character vector. You can try combining different elements and see what you get, e.g.
> c(as.integer(1), 1.2)

There are so many situations that we want to be able to group values with different types. An observation may be made up of an identfier such as a name or social security number; age; day, month and year of birth; gender; height; 3 measures of blood pressure; etc. We most definitely don't want to put these into a vector as then everything will have to be a string. We will throw away good information about the fact that some are numbers, some are integers, etc. Instead, we want to be able to group them together but keep their different types. Otherwise, if we want to take the mean of the blood pressure measurements for each person, or find the average height, we would have to convert the strings to numbers, handle errors in the data that had snuck in because we were treating numbers as strings and hadn't verified that they were numbers, and so on. Generally, giving up information about the type of a value is a bad idea. This meta-information can be important, and is becoming much more commonly available in data analysis and good programming these days.

So what we need is a container to group things together that supports elements with different types. This is exactly what a list does in S. A list is essentially a vector but can support elements with different types. We put things into a list using the list function:
> x = list(1:10, "B", list(name = "Duncan Temple Lang", ssn = "999-99-9999"), rnorm(20))

This shows that we can use any S object as an element in a list, including lists themselves. We can use names for elements of lists as we did for vectors.

We can use the same style of subsetting as we did for vectors also. For example, we can get the first two elements of our list above as
> x[1:2]

And we can drop elements using negative indices:
> x[-1]

Similarly, names and logical vectors will work the same way as they did for vectors. What is important to note is that, just as for vectors, the [ returns an object of the same type as the one being subsetted. So using the [ on a list means that we will get back a list. This is true even if the subset only has one element. So if we want to get an element itself, rather than a list containing that one element, we are going to need some other mechanism. For example, consider the simple list
> x = list(1, "a")

containing just two elements. If we get the subset containing the first element
> x[1]
[[1]]
[1] 1

it is a list of length 1. And
> x[1][1]

is the same thing!

So we need another operator to extract an individual element. We use [[ for this.
> x[[1]]

Other than this and the fact that we can hold arbitrary types in a list, they are like vectors. The same subsetting works. Elements can have names. One thing we can do is use the $ to access elements by name. If we have a list
> x = list(sample1 = rnorm(10), sample2 = rnorm(1000))

we can access the elements as
> x$sample1
> x$sample2

We can apply a function to each element of a list using the function lapply, for "list apply". This works very much the same way as apply does for vectors. We give it the list and the function to apply to each element. The result is itself a list where each element is the result of applying the function to the corresponding element in the original list. For example, with our list above containing two samples of observations from a random normal distribution, we can calculate the sum and mean of the values as
> lapply(x, sum)
$sample1
[1] -6.707135

$sample2
[1] 0.0317475
> lapply(x, mean)
$sample1
[1] -0.6707135

$sample2
[1] 0.00317475

Note that the names of the new list are the same as that of the original list.

If we want to save looping over the list twice, we could compute both the sum and the mean in a single call.
> lapply(x, function(x) c(sum = sum(x), mean = mean(x)))
$sample1
       sum       mean 
-6.7071349 -0.6707135 

$sample2
       sum       mean 
0.03174750 0.00317475 

Lists are very useful when we want to do a number of iterations and store the results from each. The bootstrap or any form of simulation is a good example. The example above suggests that if we wanted to loop over different sample sizes - say 10, 100, 1000, 10000 - and compute samples of that size, we might do this and store the results in a list. Many people are included to try to save the results of each iteration to a variable with a name made up by pasting a name and the sample size together. This can be done, but it is not a very good way.
for(i in c(10, 100, 1000, 10000)) {
  x = rnorm(i)
  assign(paste("sample", i, sep="."))
}

And then we end up with the results in the variables sample.10, sample.100, sample.1000 and sample.10000. This will overwrite any existing variables having these names. It is also hard to deal with the collection of results as a collection. Instead, they are distinct variables.

A better way to do this is
ans <- list()
for(i in c(10, 100, 1000, 10000)) {
 ans[[paste("sample", i, sep=".")]] <- rnorm(i)
}

Now we end up with a list containing the 4 sample vectors. And one of the nice things we can do is then use lapply to compute on the elements as a collection: e.g.
> lapply(ans, mean)

as before.

Perhaps the nicest way to do these iterative computations is to use lapply on the vector of sample sizes.
> ans = lapply(c(10, 100, 1000, 10000), rnorm)

We lose the names, but we can put them on ourselves after the computation. We do avoid having to declare a global variable (ans) and then add to it within each iteration. And the names are very important in some contexts.

Just to show what we might do in a call to apply, let's create a histogram of each of the 4 samples.
> par(mfrow=c(2,2))
> lapply(ans, hist)

Here we split the graphics screen (which will be created automatically if necessary) into 2 rows and 2 columns and then use the hist to create the individual histograms. We would probably want to ensure that they had the same scale and the right title and axis labels.
> lapply(ans, function(x) hist(x, xlab="", main=paste("Sample size", length(x))))

There is one other detail in using lapply. We mentioned that it returns a list containing the new elements. If all the elements have the same type, it is often much more convenient to have them as a vector and not a list. Our example of when we computed the mean of the samples is illustrative. Suppose we had 1000 samples in our list, each of sample size 200. Then we might want to compute different statistics on these samples and look at their distributions. Let's do this by looking at the scatterplot of means and medians. First we generate our 1000 samples
> samples = lapply(1:1000, function(x) rnorm(200))

Now, if we use lapply to compute the medians and the means, we will end up with two lists of length 1000. But to display a scatterplot of these, we need vectors, not lists. (See the help page for plot.) So what can we do?

One thing to do is call unlist. This will unravel the elements in the list and try to remove their structure and create a vector. In this case, this will work nicely.
> unlist(lapply(samples, mean))

In other cases, we have to be careful only to unlist at the top-level. For example, if we have a list of lists, then we may not want to unravel the entire two levels, but just the first. (See the recursive argument for unlist.)

But better than unlist in many cases is the function sapply. This is the same as lapply but it attempts to coerce the result into a vector. If it can't, it simply returns the result as a list, as would lapply. So sapply is exactly what we want here. So our plot is easily created as
> plot(sapply(samples, mean), sapply(samples, median))

and this gives us a sense of the relationship between the mean and the median for standard normal distributions with samples of size 200. (Again, fix the axes labels!)