Vectorized Operations

In lower-level languages such as C, C++, and Java, we operate on an entire array by iterating over its elements, with code something like:

for (i = 0; i < n; i++) { f(x[i]); }

where f is some function that operates on an individual element of the array.

In S, vectors are the basic type, and in statistics we typically want to work on groups of observations or experimental units, so the philosophy is that operations work on an entire vector. This means users don't have to write loops for many operations. A simple example is the + function. We can add two vectors together elementwise using the + operation:

> c(1, 2, 3) + c(4, 5, 6)
[1] 5 7 9

The first elements of the two vectors are added together to get 5. Similarly, we get 7 and 9 by adding the second elements and the third elements.
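
To make the contrast with the explicit loop concrete, here is a minimal sketch that computes the same sum both ways. The names x, y, z_loop and z_vec are illustrative only and do not come from the text above.

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)

# Loop version: visit each index and add the matching elements.
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- x[i] + y[i]
}

# Vectorized version: + works on the whole vectors at once.
z_vec <- x + y

print(z_vec)                      # [1] 5 7 9
print(identical(z_loop, z_vec))   # [1] TRUE
```

Both versions give the same answer; the vectorized one simply states the intent directly.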

This is very powerful and convenient. It allows us to express computations at a high level, stating what we mean rather than hiding it in a loop. Many functions in S are vectorized, meaning that if you give them a vector of length n, they will operate on all n elements rather than just the first one. strsplit is an example. If we give it a vector of IP addresses, ip, and ask it to break each string into the sub-parts separated by ., then we get

> strsplit(ip, "\\.")
$wald
[1] "169" "237" "46"  "2"  

$anson
[1] "169" "237" "46"  "9"  

$fisher
[1] "169" "237" "46"  "3"

Here, we get back a collection of character vectors, one for each input string. The collection has the same names as the original input vector (wald, anson, fisher), and each element is a character vector whose strings are the individual parts of that IP address. The actual data type of the result is a list, which we shall see shortly.
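
The call above can be reproduced with a small sketch. The vector ip is not defined in this section, so the definition below is an assumption reconstructed from the names and values in the printed output. The variable name parts is likewise only illustrative.

```r
# Assumed definition of ip, reconstructed from the output shown above.
ip <- c(wald = "169.237.46.2", anson = "169.237.46.9", fisher = "169.237.46.3")

# "\\." is a regular expression matching a literal dot.
parts <- strsplit(ip, "\\.")

class(parts)    # "list"
names(parts)    # "wald" "anson" "fisher"
parts$wald      # "169" "237" "46" "2"

# fixed = TRUE is an alternative that treats "." as a plain character.
identical(parts, strsplit(ip, ".", fixed = TRUE))   # TRUE
```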

When you write your own functions, you should try to make them vectorized so that they take in a vector and give back a value for each element. Of course, if they are aggregator functions (e.g. sum, prod, lm), then they should work on all of the elements and combine them into a single result.
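
As a rough illustration of the two kinds of function, the sketch below defines one of each. Both clamp and spread are hypothetical names made up for this example; pmin, pmax, min and max are standard R functions.

```r
# Hypothetical element-wise function: one output value per input element.
# pmin() and pmax() are themselves vectorized, so no loop is needed.
clamp <- function(x, lower = 0, upper = 1) {
  pmin(pmax(x, lower), upper)
}

# Hypothetical aggregator: combines all elements into a single value.
spread <- function(x) {
  max(x) - min(x)
}

v <- c(-0.5, 0.25, 1.75)
clamp(v)    # [1] 0.00 0.25 1.00
spread(v)   # [1] 2.25
```

Building a function out of operations that are already vectorized is usually the easiest way to make the function itself vectorized.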

The Recycling Rule

What if we add two vectors with different lengths? For example, what happens to c(1, 2) + 2? We would like S to be smart enough to add 2 to each element. And that is what happens:

> c(1, 2) + 2
[1] 3 4

What about c(1, 10) + c(100, 200, 300, 400), where the second vector has two more elements than the first?

> c(1, 10) + c(100, 200, 300, 400)
[1] 101 210 301 410

R does the right thing, depending on what you think the right thing is! But what did it do? It appears to have created the vector c(1 + 100, 10 + 200, 1 + 300, 10 + 400), and indeed that is what it did. This is a general concept in S: it recycles, or replicates, the shorter vector to have the same length as the longer one. So, in this case, we recycle

c(1, 10)

to have length 4. We do this as the function rep would, basically by concatenating several copies of the original vector to get the right length. So we get c(1, 10, 1, 10), which has length 4, the same as the longer vector, and then we can do the basic arithmetic as before.

We can now understand how c(1, 2) + 2 works: the 2 is a vector of length 1, so it is recycled to c(2, 2) before the addition.
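
The recycling described here can be mimicked by hand with rep and its length.out argument. This is only a sketch of the idea, not a description of how R implements arithmetic internally, and the names short, long and recycled are illustrative.

```r
short <- c(1, 10)
long  <- c(100, 200, 300, 400)

# Stretch the shorter vector to the length of the longer one.
recycled <- rep(short, length.out = length(long))
recycled                                    # [1]  1 10  1 10

# Adding the stretched vector matches what R does on its own.
identical(recycled + long, short + long)    # TRUE

# A single number is a vector of length 1, so it is recycled the same way.
identical(c(1, 2) + 2, c(1, 2) + rep(2, length.out = 2))   # TRUE
```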

What about the following expression, c(1, 2) + c(10, 11, 12), which uses vectors of length 2 and length 3?

> c(1, 2) + c(10, 11, 12)
[1] 11 13 13
Warning message: 
longer object length
	is not a multiple of shorter object length in: c(1, 2) + c(10, 11, 12)

The first thing to note is that R generates a warning, telling you that you may want to check whether the result is what you expected. The problem is that repeating the shorter vector a whole number of times does not yield a vector of the same length as the longer one. That is why R gave a warning. But it went ahead and did the addition using

c(1, 2, 1) + c(10, 11, 12)

since it recycled the shorter vector to the length of the longer one and threw away any leftover elements (here, the final 2 of the second copy).
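
The same hand-rolled recycling with rep shows where the c(1, 2, 1) comes from. The sketch below also captures the warning as a value rather than printing it; the names short, long, recycled and msg are illustrative, and the exact wording of the message may differ between R versions and locales.

```r
short <- c(1, 2)
long  <- c(10, 11, 12)

# length.out = 3 cuts the second copy of short off after one element.
recycled <- rep(short, length.out = length(long))
recycled           # [1] 1 2 1

# The lengths now match, so this addition produces no warning.
recycled + long    # [1] 11 13 13

# Capture the warning text from the direct addition instead of printing it.
msg <- tryCatch(short + long, warning = function(w) conditionMessage(w))
msg
```

If the mismatch in lengths is intentional, wrapping the expression in suppressWarnings() keeps the result and silences the message.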