Subsets

A lot of what we do in statistics and exploratory data analysis is to look at subgroups of a sample or population. We determine characteristics about that subset and compare them to other groups or the same characteristic of the overall group. We might look at how height and weight are related for both men and women separately. We might look at milk yield for cows of different breeds. We might look at stock prices within a particular week and so look at that particular subset of time. We might also look at stock prices every Friday rather than consecutive days. For Web page "hits" on a server, we might look at the other requests from the site of the requester. These are all examples of how we look at different parts of our data using categorical or continuous variables to "zoom in" on a subgroup. The criteria we use might be known ahead of time (type of cow, male/female) or might depend on the data itself (e.g. other web hits from the most frequent downloading site).

Being able to compute subgroups easily within our data is one of the most powerful features of S, but also one that takes some time to get used to. The flexibility comes from the fact that there are many ways to specify the subset of interest, and this can be confusing. You should sit down and work with R to try to understand what is happening and master the concepts. They are very useful.

There are essentially 5 different ways to subset a vector in R. They all use the [ function or operator; the only difference is what you specify to identify the particular subset of interest. We'll use the built-in vector of lower case letters of the alphabet as our simple vector to illustrate the ideas.
> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"

Let's start with the most obvious and simple one. Suppose we have a vector with n entries. A common thing to ask is for one or more elements identified by position. For example, I can ask for the 2nd, 5th and 10th elements. We do this by passing the positions of the elements we want.
> letters[c(2, 5, 10)]
[1] "b" "e" "j"

As we might have expected, we get back the specified elements of our original vector. Note that what we get back is also a vector of the same type as our original one, in this case a character vector. The result has as many elements as we asked for in our specification of the subset.

This is very simple: we ask for the values we want by identifying their position. What if we give a position that makes no sense, e.g. one that is larger than the length of the starting vector? For example, let's ask for the 30th element of the letters object.
> letters[30]
[1] NA

The result is a missing value, NA. This makes sense in many contexts. It is something we should be aware of so that we can understand how NAs might be introduced into our computations.
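
As a small sketch of how such NAs can creep in, the following mixes valid and out-of-range positions in one request. The variable name res is only for illustration.

```r
# Positions 25 and 26 exist in letters; 27 and 30 do not.
res <- letters[c(25, 26, 27, 30)]

res               # the two valid letters followed by two NAs
length(res)       # 4: one result element per requested position
is.na(res)        # flags which results are missing
sum(is.na(res))   # counts how many NAs were introduced
```

Checking with is.na after a positional subset is one simple way to notice that some of the requested positions did not exist.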

There are two other values that might be considered meaningless. What if we ask for the 0-th element?
> letters[0]
character(0)
> letters[c(0, 1)]
[1] "a"

Essentially, S ignores a request for the 0-th element and doesn't include a value in the result for that element. This means that the result may not have as many elements as we asked for.

And what if I ask for a negative index? For example,
> letters[-c(1, 3)]

is outside the range of the indices of the elements of letters. What does S do with such a request? (Try it and see what happens.)

Negative numbers for subsetting mean to drop those elements. What happens in the above example is that we get a new vector derived from letters with the first and third elements dropped or removed.
> letters[-c(1, 3)]
 [1] "b" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u"
[20] "v" "w" "x" "y" "z"

There are some restrictions on this. We cannot mix positive indices and negative indices in a single subsetting call. In other words, we cannot include some and omit others in one action. So
> letters[c(-1, -3, 5, 6, 7)]

might seem reasonable to drop the first and third elements and include the fifth, sixth and seventh. But if we give such a command, we get an error
> letters[c(-1, -3, 5, 6, 7)]
Error: only 0's may mix with negative subscripts

and it makes sense. We are saying "I want you to only include these, but also exclude those". That is not a good way to give an instruction.
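
One way to see why the mixed request is ambiguous is to do the two steps separately. This is only a sketch: the names keep and drop are made up for illustration, and the drop set is chosen here to overlap with the keep set so that the two readings give different answers.

```r
keep <- c(5, 6, 7)   # positions we would like to include
drop <- c(1, 3, 6)   # positions we would like to exclude (6 is also in keep)

# Reading 1: all positions refer to the original vector,
# so we remove the dropped positions from the ones we keep.
by_original <- letters[setdiff(keep, drop)]
by_original    # "e" "g"

# Reading 2: drop first, then count positions in the shortened vector.
by_shortened <- letters[-drop][keep]
by_shortened   # "h" "i" "j"
```

Because the two readings disagree, we have to spell out which one we mean rather than hand both sets of indices over in a single call.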

So we have seen two ways to get subsets so far. Both involve identifying the elements of interest by position or index in the original vector and either including them in the result or excluding them. One of the problems is that we have to know the indices of the desired elements. And this brings us to the next two subsetting approaches.

We have seen that vectors can have names. If we are subsetting a vector with names, we can refer to the elements we want in the subset using these names. Let's suppose we have our vector of IP addresses
> ip = c(wald="169.237.46.2", anson = "169.237.46.9", fisher = "169.237.46.3")

To get only wald and fisher, we pass a vector giving these names.
> ip[c("wald", "fisher")]
          wald         fisher 
"169.237.46.2" "169.237.46.3" 

So far so good. Note that we are passing a vector to the [. We can't just put in names like
> ip["wald", "fisher"]

This would be two arguments to [, which makes no sense for a simple, linear, one-dimensional vector.
> ip["wald", "fisher"]
Error in ip["wald", "fisher"] : incorrect number of dimensions

The error gives a hint that we might be able to do two dimensional subsetting on other types of objects. See matrices and data frames below.

Again, if we ask for a non-existent element, we will get an NA in the result.
> ip[c("wald", "fishesr")]
          wald           <NA> 
"169.237.46.2"             NA 

And we cannot use this style of subsetting to exclude elements. Think about what
ip[-c("wald", "fisher")]

means when R interprets the command. While we can understand that we mean to drop the wald and fisher elements, R first evaluates
-c("wald", "fisher")

This is meaningless as the negative of a string doesn't make sense. So the error comes from this part of the computation.
> ip[-c("wald", "fisher")]
Error in -c("wald", "fisher") : Invalid argument to unary operator

What's the unary operator? It is the - operator.
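
If we do want to drop elements by name, we can get there with tools we have already seen. The following is a sketch of two possibilities, not the only ways to do it; the variable names unwanted, rest_by_name and rest_by_pos are made up for illustration. It uses the base R functions names, setdiff and match.

```r
ip <- c(wald = "169.237.46.2", anson = "169.237.46.9", fisher = "169.237.46.3")
unwanted <- c("wald", "fisher")

# Option 1: stay with names and ask for every name that is not unwanted.
rest_by_name <- ip[setdiff(names(ip), unwanted)]

# Option 2: look up the positions of the names, then exclude by position.
# This assumes every unwanted name is actually present in names(ip).
rest_by_pos <- ip[-match(unwanted, names(ip))]

rest_by_name   # only the anson element remains
rest_by_pos    # the same result
```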

So now we have covered three types of subsetting: indexing by position, exclusion by position, and indexing by name. The next one is to use a logical vector to index the elements we want. Like names, this is used when we don't know the position of an element but know what we are looking for. We give the [ a logical vector and R returns the subset of the original vector containing the elements corresponding to TRUE values in our logical "indexer". Basically, this is like superimposing our logical vector over the vector being subset, dropping all the values under the FALSE elements, and keeping all the values under the TRUE elements. In this way, it works like a "mask". A couple of examples may make this clearer. The simplest and least interesting is the following:
> x = c("a", "b", "c", "b")
> x[c(TRUE, FALSE, TRUE, FALSE)]
[1] "a" "c"

Here, we just extract the first and third elements.

Suppose we wanted to get all the elements that were equal to "b". Remember that S is a vectorized language with the recycling rule, so the single value "b" is compared with every element of x. The command
x == "b"

returns a logical vector with as many values as there are in x and the result contains TRUEs and FALSEs according to the condition.
> x == "b"
[1] FALSE  TRUE FALSE  TRUE

Now we can use this to subset x to get all the "b" elements:
> x[x == "b"]

This reads as "get all the elements of x such that x is equal to 'b'".

There are several other ways to do this subsetting. We could find the positions of all the "b" elements and then use the positions as our subsetting vector. This can be done in one command as
> x[(1:length(x))[x == "b"]]

Think about what this is doing to make certain you understand it. We can do the computations separately and look at the intermediate results to see what is happening.
> x == "b"
[1] FALSE  TRUE FALSE  TRUE
> 1:length(x)
[1] 1 2 3 4
> c(1, 2, 3, 4)[c(FALSE, TRUE, FALSE,  TRUE)]
[1] 2 4
> x[c(2,4)]
[1] "b" "b"

So we see that it does give us the same result. But compare the two commands; the first is much shorter and says directly what we want.
> x[x == "b"]
> x[(1:length(x))[x == "b"]]

By the way, why do we put the parentheses around (1:length(x))? Try it with and without and see what you get.
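
When we really do want the positions rather than the elements, base R offers a more direct route than subsetting 1:length(x). The sketch below uses the base functions which and seq_along; the variable name pos is made up for illustration.

```r
x <- c("a", "b", "c", "b")

# which() returns the positions of the TRUE values in a logical vector.
pos <- which(x == "b")
pos      # 2 4
x[pos]   # "b" "b"

# seq_along(x) is a safer spelling of 1:length(x); it also behaves
# sensibly when x has length 0.
seq_along(x)[x == "b"]   # 2 4
```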

Let's look at another example. R has many functions to generate random values from different probability distributions. One of the distributions it doesn't support is what is called the "truncated normal". This is a regular Normal distribution that is limited to values between a and b, where these are parameters specifying the distribution. Suppose we want to generate values from such a distribution. How would we do it? One approach is to sample from the associated Normal distribution using the rnorm function and then discard any values that are less than a or greater than b. In other words, we keep only the values in the interval [a, b]. We can do this by simple subsetting using a logical vector. Let's suppose we use a standard normal, N(0, 1), and a and b are -.1 and .3 respectively.
> x = rnorm(100, mean = 0, sd = 1)
> x[x < .3 & x > -.1] 

Make certain you use the element-wise operator & and not the other form, &&, which is meant for comparing single values.
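
One drawback of the two commands above is that we don't know ahead of time how many of the 100 values will survive. The following sketch wraps the same idea in a loop that keeps sampling until it has enough values. The function name rtrunc_sketch and its arguments are hypothetical, not part of R, and this simple discard-and-retry approach can be slow when the interval [a, b] is far out in the tails.

```r
# Hypothetical helper: draw n values from N(mean, sd) restricted to [a, b]
# by sampling with rnorm and discarding values outside the interval.
rtrunc_sketch <- function(n, a, b, mean = 0, sd = 1) {
  stopifnot(a < b)
  out <- numeric(0)
  while (length(out) < n) {
    draw <- rnorm(n, mean = mean, sd = sd)
    # Logical subsetting: keep only the draws inside [a, b].
    out <- c(out, draw[draw >= a & draw <= b])
  }
  out[seq_len(n)]   # positional subsetting: trim any surplus
}

set.seed(1)   # only to make the illustration reproducible
vals <- rtrunc_sketch(50, a = -0.1, b = 0.3)
length(vals)  # 50
range(vals)   # both ends lie within [-0.1, 0.3]
```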

Note that we can readily use logical vectors to exclude certain elements rather than include them. Just like we negate the indices giving positions to exclude values when subsetting, we can negate the TRUEs and FALSEs easily. The ! does exactly this. So if we want to drop elements specified by a logical vector i, we need do only the following:
> x[ !i ]

Again, go through the intermediate computations, looking at i and !i to see what is actually happening.
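
Here is one concrete set of values to work through. The vector and the condition are made up for illustration.

```r
x <- c(5, -2, 8, -1, 3)
i <- x < 0    # TRUE where the element is negative

i             # FALSE  TRUE FALSE  TRUE FALSE
!i            #  TRUE FALSE  TRUE FALSE  TRUE

x[i]          # the negative values: -2 -1
x[!i]         # everything else:      5  8  3
```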

So now we have seen 4 ways to subset: inclusion and exclusion by position, names and logical "masks". We said at the outset there were 5, so we only have one remaining and this is a special, degenerate one. What if I pass no value for the indexing vector, i.e.
> x[ ]

The result is x itself, i.e. the original vector. This is not the same as passing in a vector with length 0:
> x[integer()]

That gives back a subset of x with the same length as the indexing vector, and so is empty:
> x[ integer() ]
numeric(0)
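
The difference between the two can be checked directly. This is a small sketch with a made-up numeric vector.

```r
x <- c(1.5, 2.5, 3.5)

identical(x[], x)                     # TRUE: empty subsetting returns everything
length(x[integer()])                  # 0: a zero-length index returns nothing
identical(x[integer()], numeric(0))   # TRUE: empty, but still a numeric vector
```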

Why is the empty subsetting (x[]) useful and why are we making a big deal of it? There are several reasons. One of the things we haven't mentioned about subsetting until now is that not only can we access sub-vectors using these 5 techniques, but we can also modify the elements in the original vector by simply assigning values to the specified subset. We use the same subsetting on the left-hand side of an assignment as we did earlier, but specify an object on the right side, and R replaces the selected elements with the new values.
> x = c(1, 2, 3)
> x[c(1, 3)] <- 10
> x
[1] 10  2 10

Similarly, if we want to replace all the "G"'s in a character vector with a string "GG", we can do this simply
> x = c("A", "G", "C", "C", "G", "G", "A")
> x[x == "G"] <- "GG"

And if we realized that we had made a mistake and erroneously switched the IP addresses of anson and wald, we could switch them back via
> ip = c(wald="169.237.46.2", anson = "169.237.46.9", fisher = "169.237.46.3")
> ip[c("wald", "anson") ] <- ip[c("anson", "wald")]
> ip
          wald          anson         fisher 
"169.237.46.9" "169.237.46.2" "169.237.46.3" 

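Note that the recycling rule is in effect in all of these cases. The values on the right are matched up with the selected elements on the left-hand side, and if there are fewer values on the right than elements selected, they are recycled to fill the subset. This is how the single value 10 was assigned to both the first and third elements above.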
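
To see the recycling at work with more than one value on the right, consider this small sketch with a made-up vector.

```r
x <- c(1, 2, 3, 4, 5, 6)

# Four elements are selected on the left, two values are given on the right,
# so the pair (0, 10) is used twice.
x[1:4] <- c(0, 10)
x   # 0 10 0 10 5 6
```

If the number of selected elements is not a multiple of the number of values on the right, R still does the assignment but issues a warning, so it is worth making the lengths fit.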

So what does this have to do with the empty subsetting capabilities? Well, what's the difference between
> x <- 0
> y[] <- 0

In the first case, we are assigning the value 0 to the name "x", replacing whatever x held before with a single value. In the second case, we are assigning 0 to each element of the existing vector y, which keeps its length.
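
A short sketch makes the difference visible. Both vectors start out the same; the names a, b and c are made up for illustration.

```r
x <- c(a = 1, b = 2, c = 3)
y <- c(a = 1, b = 2, c = 3)

x <- 0     # x now refers to a brand new object: the single value 0
y[] <- 0   # every element of y is replaced, but y itself is kept

x          # a single 0
y          # three 0s, still carrying the names a, b and c
```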

Another reason why the empty subsetting is useful is when we deal with multi-dimensional vectors, i.e. matrices and arrays. For these, we can say "give me all the columns for the first four rows" as
> m[1:4, ]

The same subsetting rules apply for each dimension and so we need a convenient way to say "everything" in this dimension. And that is the empty subsetting operation.
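
As a brief sketch of how the vector rules carry over, here is a made-up 5 by 4 matrix subset with a different style of index in each dimension.

```r
m <- matrix(1:20, nrow = 5, ncol = 4)   # filled column by column

first_four <- m[1:4, ]   # rows by position, empty index for "all columns"
dim(first_four)          # 4 4

second_col <- m[, 2]     # empty index for "all rows", one column by position
second_col               # 6 7 8 9 10

# Exclusion by position for rows, a logical mask for columns.
sub <- m[-1, c(TRUE, FALSE, TRUE, FALSE)]
dim(sub)                 # 4 2
```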