August 2013, UC Berkeley
Chris Paciorek
As we've seen, R has many functions that allow you to operate on each element of a vector all at once.
vals <- rnorm(1000)
chi2vals <- vals^2
chi2_df1000 <- sum(chi2vals)
# imagine if the code above were a loop, or three separate loops
Advantages: vectorized code is more compact and readable than the equivalent loops, and it is generally much faster, since the looping happens in compiled code rather than in R.
Sometimes there are surprises in terms of what is fast, as well as tricks for vectorizing things in unexpected ways:
vals <- rnorm(1e+06)
system.time(trunc <- ifelse(vals > 0, vals, 0))
## user system elapsed
## 0.592 0.008 0.601
system.time(vals <- vals * (vals > 0))
## user system elapsed
## 0.008 0.000 0.010
Vectorized operations also work on character strings; here substring() pulls the lower bound out of each age range, element by element:
tmp <- as.character(vote$age60)
vote$ageMin <- substring(tmp, 1, 2)
vote$ageMin[1:5]
## [1] "18" "18" "30" "30" "30"
What am I doing with vals * (vals > 0)? What happens behind the scenes in R?
If you use a trick like this, having a comment in your code is a good idea.
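Here's a small sketch of the coercion at work, with a few made-up values:
x <- c(-1.5, 2, -0.3, 4)
x > 0              # FALSE TRUE FALSE TRUE -- a logical vector
as.numeric(x > 0)  # 0 1 0 1 -- logicals coerce to 0/1 in arithmetic
x * (x > 0)        # 0 2 0 4 -- negative values are zeroed out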
R can do essentially any linear algebra you need. It uses system-level packages called BLAS (basic linear algebra subroutines) and LAPACK (linear algebra package). Note that these calculations will be essentially as fast as if you wrote C code because R just calls C and Fortran routines to do the calculations.
The BLAS that comes with R is fairly slow. It's possible to use a faster BLAS, as well as one that uses multiple cores automatically. This can in some cases give you an order of magnitude speedup if your work involves a lot of matrix manipulations/linear algebra. More details in Module 11.
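As a rough way to gauge your BLAS (the matrix size here is arbitrary, and timings vary by machine), time a big matrix multiplication; a fast, multicore BLAS will finish much sooner:
n <- 2000
x <- matrix(rnorm(n * n), n)
system.time(x %*% x)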
Recall that +, -, *, and / do vectorized calculations:
A <- matrix(1:9, 3)
B <- matrix(seq(4, 36, by = 4), 3)
A + B
## [,1] [,2] [,3]
## [1,] 5 20 35
## [2,] 10 25 40
## [3,] 15 30 45
A + B[, 1]
## [,1] [,2] [,3]
## [1,] 5 8 11
## [2,] 10 13 16
## [3,] 15 18 21
A * B
## [,1] [,2] [,3]
## [1,] 4 64 196
## [2,] 16 100 256
## [3,] 36 144 324
A * B[, 1]
## [,1] [,2] [,3]
## [1,] 4 16 28
## [2,] 16 40 64
## [3,] 36 72 108
Matrix/vector multiplication
A %*% B[, 1]
## [,1]
## [1,] 120
## [2,] 144
## [3,] 168
A %*% B
## [,1] [,2] [,3]
## [1,] 120 264 408
## [2,] 144 324 504
## [3,] 168 384 600
identical(t(A) %*% A, crossprod(A))
## [1] TRUE
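One reason to prefer crossprod(): it computes t(A) %*% A without explicitly forming the transpose. A rough timing sketch (dimensions arbitrary; results vary by machine and BLAS):
M <- matrix(rnorm(2000 * 500), 2000)
system.time(t(M) %*% M)
system.time(crossprod(M))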
Some decompositions
require(fields)
times <- seq(0, 1, length = 100)
R <- exp(-rdist(times)/0.2) # a correlation matrix
e <- eigen(R)
range(e$values)
## [1] 0.02525 32.85537
e$vectors[, 1]
## [1] 0.05195 0.05449 0.05699 0.05946 0.06190 0.06431 0.06669 0.06903
## [9] 0.07133 0.07360 0.07583 0.07802 0.08017 0.08227 0.08433 0.08635
## [17] 0.08833 0.09025 0.09213 0.09396 0.09575 0.09748 0.09916 0.10079
## [25] 0.10236 0.10388 0.10535 0.10676 0.10812 0.10942 0.11066 0.11185
## [33] 0.11297 0.11404 0.11505 0.11599 0.11688 0.11770 0.11846 0.11916
## [41] 0.11980 0.12038 0.12089 0.12134 0.12172 0.12204 0.12230 0.12249
## [49] 0.12262 0.12269 0.12269 0.12262 0.12249 0.12230 0.12204 0.12172
## [57] 0.12134 0.12089 0.12038 0.11980 0.11916 0.11846 0.11770 0.11688
## [65] 0.11599 0.11505 0.11404 0.11297 0.11185 0.11066 0.10942 0.10812
## [73] 0.10676 0.10535 0.10388 0.10236 0.10079 0.09916 0.09748 0.09575
## [81] 0.09396 0.09213 0.09025 0.08833 0.08635 0.08433 0.08227 0.08017
## [89] 0.07802 0.07583 0.07360 0.07133 0.06903 0.06669 0.06431 0.06190
## [97] 0.05946 0.05699 0.05449 0.05195
sv <- svd(R)
U <- chol(R)
devs <- rnorm(100)
Rinvb <- solve(R, devs) # R^{-1} b
Rinv <- solve(R) # R^{-1} -- try to avoid this
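As a sketch of why these decompositions are handy: the Cholesky factor can generate correlated random draws, and solving the system directly matches the explicit-inverse approach while being faster and more numerically stable:
corrDevs <- t(U) %*% devs  # draws with covariance t(U) %*% U = R, since U is the upper-triangular Cholesky factor
all.equal(Rinvb, drop(Rinv %*% devs))  # same answer both ways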
Pre-allocating memory
This is slow:
vals <- 0
n <- 50000
system.time({
for (i in 1:n) vals <- c(vals, i)
})
## user system elapsed
## 2.968 0.044 3.022
The same holds for using rbind(), cbind(), or adding to a list, one element at a time.
This is slow and unclear:
vals <- 0
n <- 50000
system.time({
for (i in 1:n) vals[i] <- i
})
## user system elapsed
## 1.984 0.004 1.994
Thoughts on why these are so slow? Think about what R might be doing behind the scenes.
This is not so slow. (Please ignore the for-loop hypocrisy and the fact that I could do this as vals <- 1:n.)
n <- 50000
system.time({
vals <- rep(NA, n)
# alternatively: vals <- as.numeric(NA); length(vals) <- n
for (i in 1:n) vals[i] <- i
})
## user system elapsed
## 0.068 0.000 0.067
Here's how to pre-allocate an empty list:
vals <- list()
length(vals) <- n
head(vals)
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
Some functions aren't vectorized, or you may want to use a function on every row or column of a matrix/data frame, every element of a list, etc.
For this we use the apply() family of functions.
mat <- matrix(rnorm(100 * 1000), nr = 100)
row_min <- apply(mat, MARGIN = 1, FUN = min)
col_max <- apply(mat, MARGIN = 2, FUN = max)
There are actually some even faster specialized functions:
row_mean <- rowMeans(mat)
col_sum <- colSums(mat)
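A quick timing sketch (results will vary by machine) suggests how much faster the specialized functions are than apply():
system.time(apply(mat, 1, mean))
system.time(rowMeans(mat))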
lapply() and sapply()
myList <- list(rnorm(3), rnorm(3), rnorm(5))
lapply(myList, min)
## [[1]]
## [1] -0.9019
##
## [[2]]
## [1] -1.542
##
## [[3]]
## [1] -0.8389
sapply(myList, min)
## [1] -0.9019 -1.5418 -0.8389
Note that we don't generally want to use apply() on a data frame: it first coerces the data frame to a matrix, which can force all columns to a common type (e.g., character).
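A small illustration, using a hypothetical two-column data frame:
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
apply(df, 2, class)  # both report "character": df was coerced to a character matrix
sapply(df, class)    # "integer" and "character": each column keeps its type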
You can use lapply() and sapply() on regular vectors, such as vectors of indices, which can come in handy, though this is a silly example:
sapply(1:10, function(x) x^2)
## [1] 1 4 9 16 25 36 49 64 81 100
Here's a cool trick to pull off a particular element of a list of lists:
params <- list(a = list(mn = 7, sd = 3), b = list(mn = 6, sd = 1), c = list(mn = 2,
sd = 1))
sapply(params, "[[", 1)
## a b c
## 7 6 2
Think about why this works.
Hint:
test <- list(5, 7, 3)
test[[2]]
## [1] 7
# `[[`(test, 2)  # need it commented or R Markdown processing messes it up...
# `+`(3, 7)
There are a bunch of apply() variants, as well as parallelized versions of them: tapply(), vapply(), mapply(), rapply(), eapply() (see ?clusterApply for the parallelized versions).
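Quick sketches of two of these, with toy inputs (tapply() applies a function within groups; mapply() maps over multiple arguments in parallel):
x <- c(1, 4, 2, 8, 5)
grp <- c("a", "b", "a", "b", "a")
tapply(x, grp, mean)   # group means: a = (1+2+5)/3, b = (4+8)/2
mapply(rep, 1:3, 3:1)  # a list: rep(1, 3), rep(2, 2), rep(3, 1)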
table() is our friend
unique(vote$pres04)
## [1] 1 2 3 9 NA 0 4
tbl <- table(vote$race, vote$pres04)
tbl
##
## 0 1 2 3 4 9
## white 111 26184 33045 417 14 409
## black 18 6183 824 56 0 21
## hispanic/latino 6 2665 1639 34 3 49
## asian 0 626 384 7 1 2
## other 16 1036 653 22 0 32
round(prop.table(tbl, margin = 1), 3)
##
## 0 1 2 3 4 9
## white 0.002 0.435 0.549 0.007 0.000 0.007
## black 0.003 0.871 0.116 0.008 0.000 0.003
## hispanic/latino 0.001 0.606 0.373 0.008 0.001 0.011
## asian 0.000 0.614 0.376 0.007 0.001 0.002
## other 0.009 0.589 0.371 0.013 0.000 0.018
table(vote$race, vote$pres04, vote$sex)
## , , = male
##
##
## 0 1 2 3 4 9
## white 55 11200 15582 230 9 262
## black 6 2440 419 37 0 13
## hispanic/latino 3 1140 784 18 3 28
## asian 0 300 181 2 1 1
## other 7 464 335 9 0 25
##
## , , = female
##
##
## 0 1 2 3 4 9
## white 55 14922 17387 186 5 147
## black 11 3688 401 19 0 8
## hispanic/latino 3 1505 845 15 0 21
## asian 0 325 201 4 0 1
## other 9 566 310 12 0 7
with(vote[vote$sex == "female", ], table(pres04, race))
## race
## pres04 white black hispanic/latino asian other
## 0 55 11 3 0 9
## 1 14922 3688 1505 325 566
## 2 17387 401 845 201 310
## 3 186 19 15 4 12
## 4 5 0 0 0 0
## 9 147 8 21 1 7
Can you figure out what with() does just by example?
Often we want to do individual analyses within subsets or clusters of our data.
As a first step, we might want to just split our dataset by a stratifying variable.
subsets <- split(earnings, earnings$race)
length(subsets)
## [1] 5
subsets[["9"]]
## earn height1 height2 sex race hisp ed yearbn height
## 239 NA NA NA 2 9 9 16 99 NA
## 794 NA 5 5 2 9 2 16 34 65
## 966 0 5 1 2 9 1 16 67 61
## 1027 NA 5 5 2 9 2 12 31 65
## 1054 0 5 3 2 9 2 8 38 63
## 1507 4416 5 2 2 9 2 6 18 62
## 1676 18000 5 6 1 9 1 17 64 66
## 1713 53000 5 3 2 9 1 17 58 63
The %in% operator can also be helpful.
sub <- earnings[earnings$race %in% c(1, 2, 4), ]
Often we want to do individual analyses within subsets or clusters of our data. R has a variety of tools for this; for now we'll look at aggregate() and by(). These are wrappers of tapply().
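For instance, the second aggregate() call below boils down to something like this tapply() sketch:
tapply(earnings$earn, earnings$ed, median, na.rm = TRUE)  # median earn within each level of ed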
aggregate(earnings, by = list(educ = earnings$ed), FUN = median, na.rm = TRUE)
## educ earn height1 height2 sex race hisp ed yearbn height
## 1 2 NA 5 0.0 1 2 2 2 0.0 60.0
## 2 3 1400 5 2.0 2 1 1 3 26.0 62.0
## 3 4 6600 5 7.5 2 1 2 4 21.0 66.0
## 4 5 1200 5 3.0 2 1 2 5 17.0 63.0
## 5 6 4416 5 5.0 1 1 2 6 20.0 67.0
## 6 7 7000 5 5.0 2 1 2 7 22.0 66.0
## 7 8 6250 5 5.0 2 1 2 8 32.0 66.0
## 8 9 7000 5 4.0 2 1 2 9 38.0 65.0
## 9 10 8000 5 4.0 2 1 2 10 35.0 66.0
## 10 11 10000 5 6.0 2 1 2 11 42.5 67.0
## 11 12 13000 5 5.0 2 1 2 12 52.0 66.0
## 12 13 15000 5 4.0 2 1 2 13 53.0 66.0
## 13 14 20000 5 5.0 2 1 2 14 53.0 66.5
## 14 15 17000 5 4.0 2 1 2 15 51.5 66.0
## 15 16 24000 5 5.0 2 1 2 16 55.0 66.0
## 16 17 25000 5 6.0 2 1 2 17 50.0 67.0
## 17 18 30000 5 6.0 1 1 2 18 47.0 68.0
## 18 98 NA 5 6.0 2 2 2 98 18.0 66.0
## 19 99 NA 6 0.0 1 1 2 99 37.0 72.0
aggregate(earn ~ ed, data = earnings, FUN = median)
## ed earn
## 1 3 1400
## 2 4 6600
## 3 5 1200
## 4 6 4416
## 5 7 7000
## 6 8 6250
## 7 9 7000
## 8 10 8000
## 9 11 10000
## 10 12 13000
## 11 13 15000
## 12 14 20000
## 13 15 17000
## 14 16 24000
## 15 17 25000
## 16 18 30000
aggregate(earn ~ ed + hisp, data = earnings, FUN = median)
## ed hisp earn
## 1 3 1 1400
## 2 6 1 7000
## 3 8 1 5200
## 4 9 1 0
## 5 10 1 0
## 6 11 1 15000
## 7 12 1 12000
## 8 13 1 17500
## 9 14 1 17000
## 10 15 1 2500
## 11 16 1 17000
## 12 17 1 27000
## 13 18 1 42500
## 14 4 2 6600
## 15 5 2 1200
## 16 6 2 4416
## 17 7 2 7000
## 18 8 2 6250
## 19 9 2 8500
## 20 10 2 9000
## 21 11 2 10000
## 22 12 2 14000
## 23 13 2 15000
## 24 14 2 20000
## 25 15 2 17500
## 26 16 2 24500
## 27 17 2 25000
## 28 18 2 30000
agg <- aggregate(earn ~ ed + hisp, data = earnings, FUN = median)
xtabs(earn ~ ., data = agg)
## hisp
## ed 1 2
## 3 1400 0
## 4 0 6600
## 5 0 1200
## 6 7000 4416
## 7 0 7000
## 8 5200 6250
## 9 0 8500
## 10 0 9000
## 11 15000 10000
## 12 12000 14000
## 13 17500 15000
## 14 17000 20000
## 15 2500 17500
## 16 17000 24500
## 17 27000 25000
## 18 42500 30000
Notice the 'long' vs. 'wide' formats. You'll see more about that sort of thing in Module 6.
You may need to discretize a quantitative variable, e.g., years of education into education levels:
earnings$edLevel <- cut(earnings$ed, breaks = c(0, 11, 12, 15, 16, 18, 99))
levels(earnings$edLevel) <- c("no HS diploma", "HS grad", "some college", "college grad",
"grad study", "other")
boxplot(earn ~ edLevel, data = earnings)
[Figure: boxplots of earn by edLevel]
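cut() in miniature, with made-up ages (breaks define half-open intervals (lo, hi], and labels can be supplied directly):
ages <- c(5, 13, 20, 40, 70)
cut(ages, breaks = c(0, 12, 19, 64, 100), labels = c("child", "teen", "adult", "senior"))
# gives: child teen adult adult senior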
aggregate() works fine when the output is univariate, but what about more complicated analyses than computing the median, such as fitting a set of regressions?
out <- by(earnings, earnings$edLevel, function(x) {
    if (sum(!is.na(x$earn))) lm(earn ~ height, data = x) else NA
})
length(out)
## [1] 6
summary(out[[5]])
##
## Call:
## lm(formula = earn ~ height, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48563 -16907 -5810 6902 157470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -121555 43803 -2.78 0.00623 **
## height 2344 648 3.62 0.00041 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29700 on 148 degrees of freedom
## (75 observations deleted due to missingness)
## Multiple R-squared: 0.0812, Adjusted R-squared: 0.075
## F-statistic: 13.1 on 1 and 148 DF, p-value: 0.000409
What's the business with the if statement? Why is this good practice?
sort() applied to a vector does what you expect.
Sorting a matrix or data frame based on one or more columns is a somewhat manual process, but once you get the hang of it, it's not bad.
ord <- order(earnings$earn, earnings$height, decreasing = TRUE)
# ord <- with(earnings, order(earn, height, decreasing = TRUE))
ord[1:5]
## [1] 1860 583 351 618 314
(earnings$earn[ord])[c(1:5, 2025:2029)] # parentheses for clarity
## [1] 200000 175000 170000 148000 125000 NA NA NA NA NA
earnings_ordered <- earnings[ord, ]
You could of course write your own sort function that uses order(). More in Module 5.
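A sketch of what such a function might look like (sortDataFrame is a hypothetical name, not a built-in):
sortDataFrame <- function(df, col, decreasing = FALSE) {
    # order the rows by the given column, then reindex
    df[order(df[[col]], decreasing = decreasing), ]
}
head(sortDataFrame(earnings, "earn", decreasing = TRUE))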
We often need to combine data across multiple data frames, merging on common fields (i.e., keys). In database terminology, this is a join operation.
Here's an example using the voting data combined with a built-in R dataset on state information. Warning: the state dataset is very old; this is just a toy example.
In this case (as often true) we need to do some machinations to get everything prepared for the merge. The key we use is the state name.
# PRE-PROCESSING: a bit of querying indicates the state numbers are in
# alphabetical order, so we can do this:
numToName <- data.frame(stateNum = 1:50, stateName = row.names(state.x77))
voteWithStateNames <- merge(vote, numToName, by.x = "state", by.y = "stateNum",
all.x = TRUE, all.y = FALSE)
stateInfo <- data.frame(state.x77)
# need the names as column, not as the row names attribute:
stateInfo$name <- row.names(stateInfo)
# ACTUAL DEMONSTRATION
fullVote <- merge(voteWithStateNames, stateInfo[, c("name", "Population", "Income",
"Illiteracy", "Life.Exp")], by.x = "stateName", by.y = "name", all.x = TRUE,
all.y = FALSE)
dim(vote)
## [1] 76205 18
dim(fullVote)
## [1] 76205 23
fullVote[1:2, ]
## stateName state pres04 sex race age9 partyid income
## 1 Alabama 1 2 female white 50-59 republican $30,000-$49,999
## 2 Alabama 1 2 male white 30-39 republican $15,000-$29,999
## relign8 age60 age65 geocode sizeplac brnagain attend
## 1 protestant 45-59 50-64 4 city: 10,000 to 49,999 yes <NA>
## 2 mormon/lds 30-44 30-39 2 rural yes <NA>
## year region y ageMin Population Income Illiteracy Life.Exp
## 1 2004 1 1 45 3615 3624 2.1 69.05
## 2 2004 1 1 30 3615 3624 2.1 69.05
What's the deal with all.x and all.y? They tell R whether to keep all of the x observations, all of the y observations, both, or neither, when some rows in one dataset have no match in the other.
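Here's a toy sketch of the four combinations, using two hypothetical little data frames:
left <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))
merge(left, right)                              # inner join: ids 2, 3
merge(left, right, all.x = TRUE)                # keep all left rows: ids 1, 2, 3
merge(left, right, all.y = TRUE)                # keep all right rows: ids 2, 3, 4
merge(left, right, all.x = TRUE, all.y = TRUE)  # full join: ids 1 through 4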
Suppose we have two categorical variables and we conduct a hypothesis test of independence. The chi-square statistic is:
$$\chi^2 = \sum_{i}\sum_{j} \frac{(y_{ij} - e_{ij})^2}{e_{ij}},$$
where $e_{ij} = \frac{y_{i\cdot} y_{\cdot j}}{y_{\cdot\cdot}}$, with $y_{i\cdot}$ the sum of the values in the i'th row, $y_{\cdot j}$ the sum of values in the j'th column, and $y_{\cdot\cdot}$ the sum of all the values. Suppose I give you a matrix in R with the $y_{ij}$ values.
You can generate a test matrix as: y <- matrix(sample(1:10, 12, replace = TRUE), nrow = 3, ncol = 4).
Compute the statistic without any loops; the outer() function can be used (see the sketch below).
For each combination of sex and education level, find the second largest value of earnings amongst the people in the group, without any looping.
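As a reminder of what outer() does (just the hint, not the full solution):
outer(1:3, 1:2)  # 3 x 2 matrix of all pairwise products: outer(x, y)[i, j] = x[i] * y[j]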