A lot of what we do in statistics and exploratory data analysis is to
look at subgroups of a sample or population. We determine
characteristics about that subset and compare them to other groups or
the same characteristic of the overall group. We might look at how
height and weight are related for both men and women separately. We
might look at milk yield for cows of different breeds. We might look
at stock prices within a particular week and so look at that
particular subset of time. We might also look at stock prices every
Friday rather than consecutive days. For Web page "hits" on a server,
we might look at the other requests from the site of the requester.
These are all examples of how we look at different parts of our data
using categorical or continuous variables to "zoom in" on a subgroup.
The criteria we use might be known ahead of time (type of cow,
male/female) or might depend on the data itself (e.g. other web hits
from the most frequent downloading site).
Being able to compute subgroups easily within our data is one
of the things that is most powerful in S, but also one that takes some
time to get used to. The flexibility comes from the fact that there
are many ways to specify the subset of interest and this can be
confusing. You should sit down and work with R to try to understand
what is happening and master the concepts. They are very useful.
There are essentially 5 different ways to subset a vector
in R.
They all use the [ function or operator and the only differences
are what you specify as the value to use to identify the
particular subset of interest.
We'll use the built-in vector of lower case letters
of the alphabet as our simple vector to illustrate the ideas.
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
Let's start with the most obvious and simple one. Suppose we
have a vector with n entries. A common thing to ask is for one or
more elements identified by position. For example, I can ask for the
2nd, 5th and 10th elements.
We do this by passing the positions of the elements we want.
> letters[c(2, 5, 10)]
[1] "b" "e" "j"
As we might have expected, we get back the specified elements of our
original vector. Note that what we get back is also a vector of the
same type as our original one, in this case a character vector. The
result has as many elements as we asked for in our specification of
the subset.
This is very simple: we ask for the values we want by identifying
their position.
What if we give a position that makes no sense, e.g. one
that is larger than the length of the starting vector?
For example, let's ask for the 30th element of the
letters object.
> letters[30]
[1] NA
The result is a missing value, NA. This makes sense
in many contexts. It is something we should
be aware of so that we can understand how NAs might
be introduced into our computations.
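As a small sketch of how such an NA can creep into a later computation, consider the following. The vector scores and its values are made up purely for illustration.

```r
# Hypothetical data: three values, but we ask for positions 1, 3 and 5.
scores <- c(10, 20, 30)
picked <- scores[c(1, 3, 5)]     # position 5 does not exist
picked                           # 10 30 NA
is.na(picked)                    # FALSE FALSE TRUE

# The NA propagates into summaries unless we deal with it explicitly.
total_all <- sum(picked)                 # NA
total_ok  <- sum(picked, na.rm = TRUE)   # 40
```

Checking is.na() on the result of a subset is a cheap way to find out whether we asked for something that was not there.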
There are two other values that might be considered meaningless.
What if we ask for the 0-th element?
> letters[0]
character(0)
> letters[c(0, 1)]
[1] "a"
Essentially, S ignores a request for the 0-th element and doesn't
include a value in the result for that element.
This means that the result may not have as many elements as
we asked for.
And what if I ask for a negative index?
For example, the indices in
> letters[-c(1, 3)]
are outside the range of the indices of the elements of
letters.
What does S do with such a request?
(Try it and see what happens.)
Negative numbers for subsetting mean to drop those elements.
What happens in the above example is that we get a new vector derived
from letters with the first and third elements dropped or removed.
> letters[-c(1, 3)]
[1] "b" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u"
[20] "v" "w" "x" "y" "z"
There are some restrictions on this. We cannot mix positive indices
and negative indices in a single subsetting call. In other words, we
cannot include some and omit others in one action.
So
> letters[c(-1, -3, 5, 6, 7)]
might seem reasonable to drop the first and third elements
and include the fifth, sixth and seventh.
But if we give such a command, we get an error
> letters[c(-1, -3, 5, 6, 7)]
Error: only 0's may mix with negative subscripts
and it makes sense.
We are saying "I want you to only include these, but also
exclude those". That is not a good way to give an instruction.
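If we really do want to drop some elements and then pick from what is left, one possible approach (a sketch, with variable names of our own choosing) is to do it in two steps, or to work out the positions to keep before subsetting.

```r
# Mixing signs fails; try() lets us capture the error instead of stopping.
res <- try(letters[c(-1, -3, 5, 6, 7)], silent = TRUE)
inherits(res, "try-error")        # TRUE

# Two separate steps: first drop, then select from what remains.
dropped <- letters[-c(1, 3)]
first_two <- dropped[c(1, 2)]     # positions now refer to the shorter vector

# Or compute the positions to keep up front with setdiff().
keep <- setdiff(1:7, c(1, 3))     # 2 4 5 6 7
kept <- letters[keep]             # "b" "d" "e" "f" "g"
```

Note that after the first step the positions refer to the shortened vector, not to letters.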
So we have seen two ways to get subsets so far. Both involve
identifying the elements of interest by position or index in the
original vector and either including them in the result or excluding
them. One of the problems is that we have to know the indices of the
desired elements. And this brings us to the next two subsetting
approaches.
We have seen that vectors can have names. If we are subsetting
a vector with names, we can refer to the elements we want in the
subset using these names.
Let's suppose we have our vector of IP addresses
> ip = c(wald="169.237.46.2", anson = "169.237.46.9", fisher = "169.237.46.3")
To get only wald and fisher, we pass a vector giving these names.
> ip[c("wald", "fisher")]
wald fisher
"169.237.46.2" "169.237.46.3"
So far so good.
Note that we are passing a vector to the [ operator.
We can't just put in names like
> ip["wald", "fisher"]
This would be two arguments to [ and this confuses it for a simple,
linear, one dimensional vector.
> ip["wald", "fisher"]
Error in ip["wald", "fisher"] : incorrect number of dimensions
The error gives a hint that we might be able to do two dimensional
subsetting on other types of objects.
See matrices and data frames below.
Again, if we ask for a non-existent element, we will
get an NA in the result.
> ip[c("wald", "fishesr")]
wald <NA>
"169.237.46.2" NA
And we cannot use this style of subsetting to
exclude elements.
Think about what ip[-c("wald", "fisher")] means when R interprets
the command.
While we can understand that we mean
to drop the wald and fisher elements,
R first evaluates -c("wald", "fisher").
This is meaningless as the negative of a string doesn't make sense.
So the error comes from this part of the computation.
> ip[-c("wald", "fisher")]
Error in -c("wald", "fisher") : Invalid argument to unary operator
What's the unary operator? It is the - operator.
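We can still exclude by name indirectly. Two possible ways, sketched here with a helper variable drop_names that we introduce for illustration, are to convert the names to positions with match() and negate those, or to compute the names we want to keep with setdiff().

```r
ip <- c(wald = "169.237.46.2", anson = "169.237.46.9", fisher = "169.237.46.3")
drop_names <- c("wald", "fisher")     # hypothetical: the names to exclude

# Turn names into positions, then exclude by position.
pos <- match(drop_names, names(ip))   # 1 3
by_position <- ip[-pos]

# Or work out which names remain and index by name.
by_name <- ip[setdiff(names(ip), drop_names)]

by_position                            # only the anson element is left
```

The match() version assumes every name in drop_names is actually present; a misspelled name gives an NA position, and the subsetting then fails.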
So now we have covered three types of subsetting: indexing by
position, exclusion by position, indexing by name. The next one is to
use a logical vector to index the elements we want. Like names, this
is used when we don't know the position of an element but know what
we are looking for. We give the [ operator a logical
vector and R returns the subset of the original vector containing the
elements corresponding to TRUE values in our logical "indexer".
Basically, this is like superimposing our logical vector over the
vector being subset, dropping all the values under the FALSE
elements, and keeping all the elements under the TRUE values. In
this way, it works like a "mask".
A couple of examples may make this clearer.
The simplest and least interesting is
the following:
> x = c("a", "b", "c", "b")
> x[c(TRUE, FALSE, TRUE, FALSE)]
[1] "a" "c"
Here, we just extract
the first and third elements.
Suppose we wanted to get all the elements that
were equal to "b".
Remember that S is a vectorized language with
the recycling rule.
The command x == "b" returns a logical vector with as many values
as there are in x, and the result contains TRUEs and FALSEs
according to the condition.
> x == "b"
[1] FALSE TRUE FALSE TRUE
Now we can use this to subset x to get all the "b" elements:
> x[x == "b"]
This reads as
"get all the elements of x such that x is equal to 'b'".
There are several other ways to do this subsetting.
We could find the positions of all the "b" elements
and then use the positions as our subsetting vector.
This can be done in one command as
> x[(1:length(x))[x == "b"]]
Think about what this is doing to make certain you understand it.
We can do the computations separately and look at the intermediate
results to see what is happening.
> x == "b"
[1] FALSE TRUE FALSE TRUE
> 1:length(x)
[1] 1 2 3 4
> c(1, 2, 3, 4)[c(FALSE, TRUE, FALSE, TRUE)]
[1] 2 4
> x[c(2,4)]
[1] "b" "b"
So we see that it does give us the same result.
But compare the two commands
> x[x == "b"]
> x[(1:length(x))[x == "b"]]
By the way, why do we put the parentheses around (1:length(x))?
Try it with and without and see what you get.
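R also has built-in functions that do the "find the positions" step for us. The sketch below uses which() to turn a logical vector into positions and seq_along() as a safer alternative to 1:length(x). The variable empty is ours, added only to show the edge case.

```r
x <- c("a", "b", "c", "b")

# which() returns the positions of the TRUE values.
pos <- which(x == "b")     # 2 4
from_pos <- x[pos]         # "b" "b"

# seq_along() gives 1, 2, ..., length(x), and behaves well for empty vectors.
seq_along(x)               # 1 2 3 4
empty <- character(0)
1:length(empty)            # 1 0, usually not what was intended
seq_along(empty)           # integer(0)
```

In everyday use, x[x == "b"] is still the most direct form. The positions are mainly useful when we need them for something else.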
Let's look at another example. R has many functions to
generate random values from different probability distributions. One
of the distributions it doesn't support is what is called the
"truncated normal". This is a regular Normal distribution that is
limited to values between a and b, where these are parameters
specifying the distribution. Suppose we want to generate values from
such a distribution. How would we do it? One approach is to sample
from the associated Normal distribution using the rnorm function and
then discard any values that are less than a or greater than b. In
other words, we keep only the values in the interval [a, b].
We can do this by simple subsetting using a logical vector.
Let's suppose we use a standard normal, N(0, 1),
and a and b are -.1 and .3 respectively.
> x = rnorm(100, mean = 0, sd = 1)
> x[x < .3 & x > -.1]
Make certain you use the element-wise operator &
and not the other form, &&, which does not operate element by element.
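One drawback of the command above is that we don't know in advance how many values survive the filtering. A minimal sketch of a function that keeps sampling until it has n values follows. The name rtnorm_reject is hypothetical and not part of base R, and the seed is arbitrary.

```r
# Hypothetical helper: draw n values from a Normal(mean, sd) restricted
# to [a, b] by repeatedly sampling and discarding values outside it.
rtnorm_reject <- function(n, a, b, mean = 0, sd = 1) {
  stopifnot(a < b)
  out <- numeric(0)
  while (length(out) < n) {
    draw <- rnorm(n, mean = mean, sd = sd)
    # Logical mask: keep only the draws inside [a, b].
    out <- c(out, draw[draw >= a & draw <= b])
  }
  out[seq_len(n)]    # trim any surplus so we return exactly n values
}

set.seed(1)          # arbitrary seed, only to make the run repeatable
z <- rtnorm_reject(100, a = -0.1, b = 0.3)
length(z)            # 100
range(z)             # both ends lie inside [-0.1, 0.3]
```

This is simple but can be slow when [a, b] covers very little of the distribution, since most draws are then thrown away.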
Note that we can readily use logical vectors to
exclude certain elements rather than include them.
Just like we negate the indices giving positions to exclude
values when subsetting, we can negate the TRUEs and FALSEs easily.
The ! operator does exactly this.
So if we want to drop elements specified by
a logical vector i, we need do only the following:
> x[ !i ]
Again, go through the intermediate computations, looking at
i and !i to see what is actually happening.
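Here is a small worked example of dropping with !, using made-up numbers. It also shows one reason to prefer ! over converting the logical vector to negative positions: the two differ when nothing matches.

```r
x <- c(5, 12, 7, 20)        # hypothetical values
i <- x > 10                 # FALSE TRUE FALSE TRUE
!i                          # TRUE FALSE TRUE FALSE
small <- x[!i]              # 5 7

# When no element satisfies the condition, !none keeps everything...
none <- x > 100             # all FALSE
keep_all <- x[!none]        # 5 12 7 20

# ...but negating the positions from which() selects nothing at all,
# because -integer(0) is still an empty index vector.
keep_none <- x[-which(none)]   # numeric(0)
```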
So now we have seen 4 ways to subset: inclusion and exclusion by
position, names and logical "masks". We said at the outset there were
5, so we only have one remaining and this is a special, degenerate
one. What if I pass no value for the indexing vector, i.e.
> x[ ]
The result is x itself, i.e. the original vector.
This is not the same as passing in a vector with
length 0.
That gives back a subset of x with the same length
as the indexing vector, and so is empty:
> x[ integer() ]
numeric(0)
Why is the empty subsetting (x[]) useful
and why are we making a big deal of it? There are several reasons.
One of the things we haven't mentioned about subsetting until now is
that not only can we access sub-vectors using these 5 techniques, but
we can also modify the elements in the original vector by simply
assigning elements to the specified subset.
We use the same subsetting on the left-hand side
of an assignment as we did earlier but specify an object
on the right side, and the selected elements are replaced by its values.
> x = c(1, 2, 3)
> x[c(1, 3)] <- 10
> x
[1] 10 2 10
Similarly, if we want to replace all the "G"'s in
a character vector with a string "GG",
we can do this simply
> x = c("A", "G", "C", "C", "G", "G", "A")
> x[x == "G"] <- "GG"
And if we realized that we had made a mistake
and erroneously switched the IP addresses of anson and wald,
we could switch them back via
> ip = c(wald="169.237.46.2", anson = "169.237.46.9", fisher = "169.237.46.3")
> ip[c("wald", "anson") ] <- ip[c("anson", "wald")]
> ip
wald anson fisher
"169.237.46.9" "169.237.46.2" "169.237.46.3"
Note that the recycling rule is in effect in all of these cases. If
there are fewer values on the right than elements selected on the
left-hand side, the values on the right are recycled to make the
lengths match.
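To see the recycling at work in an assignment, here is a short sketch with made-up vectors. The last step wraps the assignment in suppressWarnings() only so that the example runs quietly. Without it, R warns that the number of items to replace is not a multiple of the replacement length.

```r
x <- c(1, 2, 3, 4, 5, 6)
x[c(1, 3)] <- 10          # one value recycled to fill two slots
x[3:6] <- c(0, 1)         # two values recycled to fill four slots
x                         # 10 2 0 1 0 1

# Three slots, two values: R recycles 7, 8, 7 but issues a warning.
y <- 1:6
suppressWarnings(y[1:3] <- c(7, 8))
y                         # 7 8 7 4 5 6
```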
So what does this have to do with the empty subsetting capabilities?
Well, what's the difference between
> x <- 0
> y[] <- 0
In the first case, we are assigning the value 0 to the name "x".
In the second case, we are assigning 0 to each element of the
vector y.
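The difference is easy to see with a named vector. In this sketch, both vectors start out identical (the values are made up), and we apply one form of assignment to each.

```r
x <- c(a = 1, b = 2, c = 3)
y <- c(a = 1, b = 2, c = 3)

x <- 0       # x now refers to a brand new object: a single 0
y[] <- 0     # y keeps its length and names; every element becomes 0

x            # 0
length(y)    # 3
names(y)     # "a" "b" "c"
```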
Another reason why the empty subsetting is useful is when
we deal with multi-dimensional vectors, i.e. matrices and arrays.
For these, we can say "give me all the columns for the first four rows"
as
> m[1:4, ]
The same subsetting rules apply for each dimension
and so we need a convenient way to say "everything" in this
dimension. And that is the empty subsetting operation.
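As a brief illustration, assuming a small matrix m that we build ourselves, the empty index can stand in for either dimension, and it combines with the other styles of subsetting we have seen.

```r
m <- matrix(1:20, nrow = 5, ncol = 4)   # filled column by column

top  <- m[1:4, ]         # first four rows, all columns
col2 <- m[, 2]           # all rows of column 2, returned as a plain vector
sub  <- m[-1, c(1, 3)]   # drop row 1, keep columns 1 and 3

dim(top)                 # 4 4
col2                     # 6 7 8 9 10
dim(sub)                 # 4 2
```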