R bootcamp, Module 11: Parallel processing

August 2013, UC Berkeley

Chris Paciorek

Computer architecture

Note to BB: remember to start recording.

Note to participants: I'm having trouble with parallelization in RStudio, so we'll just run the demo code in this module in a command line R session. You can open the basic R GUI for Mac or Windows, or, on a Mac, start R in a terminal window.

We'll focus on shared memory computation here, i.e., multiple cores on a single machine, all with access to the same RAM; distributed computation across multiple machines comes up briefly at the end of the module.

How do I know how many cores a computer has?
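
One quick check is detectCores() from the parallel package (on hyperthreaded machines it may report the number of logical rather than physical cores):

require(parallel)
detectCores()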

To see if multiple cores are being used by your job, you can watch the R process in a monitoring program such as top (Linux/Mac) or the Task Manager (Windows) while the job runs.

How can we make use of multiple cores?

Some basic approaches are:

  1. using a threaded BLAS to parallelize linear algebra calculations
  2. using foreach with a parallel back-end to parallelize for loops
  3. using parallel versions of the apply() family of functions

We'll cover each of these below.

Threaded linear algebra

R comes with a default BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage) that carry out the core linear algebra computations. However, you can generally improve performance (sometimes by an order of magnitude) by using a different BLAS. Furthermore, a threaded BLAS will allow you to use multiple cores.

A 'thread' is a lightweight process, and the operating system sees multiple threads as part of a single process.

We'll show by demonstration that my Mac laptop is using multiple cores for linear algebra operations.

# note to CJP: don't run in RStudio
n <- 5000
x <- matrix(rnorm(n^2), n)  # a 5000 x 5000 matrix
U <- chol(crossprod(x))  # crossprod() and chol() are computed in the BLAS/LAPACK

You should see that your R process is using more than 100% of CPU. Inconceivable!

More details on the BLAS

You can talk with your systems administrator about linking R to a fast BLAS or you can look into it yourself for your personal machine; see the R Installation and Administration manual.

Note that in some cases, in particular for small matrix operations, using multiple threads may actually slow down computation, so you may want to experiment, particularly on Linux. You can force the linear algebra to use only a single core by running export OMP_NUM_THREADS=1 (assuming you're using the bash shell) in the terminal window before starting R in that same terminal.

Finally, note that a threaded BLAS can conflict with foreach and with the parallel versions of apply() and cause R to hang, so if you're doing explicit parallelization you'll likely want to set the number of threads to 1, as above.

What is an embarrassingly parallel (EP) problem?

Do you think you should be embarrassed to ask?

An EP problem is one that can be solved by doing independent computations as separate processes without communication between the processes. You can get the answer by doing separate tasks and then collecting the results.

Examples in statistics include

  1. stratified analyses
  2. cross-validation
  3. simulations with many independent replicates
  4. bootstrapping
  5. random forest models

Can you think of others in your work?

Some things that are not EP:

  1. Markov chain Monte Carlo for fitting Bayesian models
  2. optimization

In both cases each iteration depends on the result of the previous one, so the iterations can't be farmed out as independent tasks.

Using multiple cores for EP problems: foreach

First, make sure your iterations are independent and don't involve sequential calculations!

The foreach package provides a way to do a for loop using multiple cores. It can use a variety of 'back-ends' that handle the nitty-gritty of the parallelization.

To use multiple cores on a single machine, use the parallel back-end from the doParallel package (or the multicore back-end from the doMC package).

# note to CJP: don't run in RStudio
require(parallel)
require(doParallel)
require(foreach)
nCores <- 4  # actually only 2 on my laptop, but appears hyperthreaded
registerDoParallel(nCores)

testFun <- function(i) {
    set.seed(i)
    mn <- mean(rnorm(n))  # n is found in the global environment
    mn
}

n <- 1e+07
nSims <- 100
out <- foreach(i = 1:nSims, .combine = c) %dopar% {
    cat("Starting ", i, "th job.\n", sep = "")
    outSub <- testFun(i)
    cat("Finishing ", i, "th job.\n", sep = "")
    outSub  # this will become part of the out object
}
out[1:5]
## [1]  0.0004037 -0.0002178 -0.0002258  0.0005587  0.0002852

What do you think are the advantages and disadvantages of having many small tasks vs. a few large tasks?

You can leave out the .combine argument; foreach will then return the results in a list. Other useful values of .combine include rbind, cbind, and '+'.
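
For instance, to collect vector-valued results as the rows of a matrix (a small illustration using the setup above):

outMat <- foreach(i = 1:3, .combine = rbind) %dopar% {
    c(mean = mean(rnorm(n)), sd = sd(rnorm(n)))
}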

Using multiple cores for EP problems: parallel apply variants

help(clusterApply) shows the wide variety of parallel versions of the apply family of functions.

require(parallel)
nCores <- 4
cluster <- makeCluster(nCores)
cluster
## socket cluster with 4 nodes on host 'localhost'

clusterExport(cluster, "n")  # socket workers are separate R processes, so
# objects that testFun uses (here, n) must be exported to them
out <- parSapply(cluster, seq_len(nSims), testFun)
out[1:5]
## [1]  0.0004037 -0.0002178 -0.0002258  0.0005587  0.0002852
stopCluster(cluster)  # shut down the worker processes when finished

An alternative is mclapply(), which forks the running R process rather than starting separate worker processes (on Windows it can therefore only use a single core):

require(parallel)
nCores <- 4

testFun <- function(i) {
    set.seed(i)
    mn <- mean(rnorm(n))
    mn
}

n <- 1e+07
nSims <- 100

# forked workers inherit the master's objects, so there's no need to export n
out2 <- mclapply(seq_len(nSims), testFun, mc.cores = nCores)
out2[1:5]
## [[1]]
## [1] 0.0004037
## 
## [[2]]
## [1] -0.0002178
## 
## [[3]]
## [1] -0.0002258
## 
## [[4]]
## [1] 0.0005587
## 
## [[5]]
## [1] 0.0002852

One thing to keep in mind is whether the different tasks all take about the same amount of time or widely different times. In the latter case, you want to dispatch tasks dynamically as earlier tasks finish, rather than pre-assigning a block of tasks to each core. Some of these parallel apply variants allow you to control this, as sketched below.
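
For example, mclapply() pre-assigns blocks of tasks to cores by default; setting mc.preschedule = FALSE instead forks one job at a time as cores free up (clusterApplyLB() is the analogous load-balancing function for socket clusters):

out3 <- mclapply(seq_len(nSims), testFun, mc.cores = nCores, mc.preschedule = FALSE)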

Parallelization and Random Number Generation

A tale of the good, the bad, and the ugly

Random numbers on a computer are not truly random but are generated as a sequence of pseudo-random numbers. The sequence is finite (but very, very, very, very long) and eventually repeats itself.

A random number seed determines where in the sequence one starts when generating random numbers.

set.seed(0)
rnorm(5)
## [1]  1.2630 -0.3262  1.3298  1.2724  0.4146
set.seed(0)
rnorm(5)
## [1]  1.2630 -0.3262  1.3298  1.2724  0.4146
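
The catch in parallel code is that if every worker starts from the same seed, the workers will generate identical random number streams, and haphazardly chosen seeds can produce overlapping streams. The L'Ecuyer-CMRG generator is designed to provide multiple independent streams; here's a minimal sketch using a socket cluster (the cluster size and seed value are arbitrary):

require(parallel)
RNGkind("L'Ecuyer-CMRG")  # generator that supports multiple independent streams
cl <- makeCluster(4)
clusterSetRNGStream(cl, iseed = 1)  # give each worker its own stream
parSapply(cl, 1:4, function(i) rnorm(1))
stopCluster(cl)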

The syntax for using L'Ecuyer is available in my parallel computing workshop notes.

A brief note on distributed computing for advanced users

If you have access to multiple machines within a networked environment, such as the compute servers in the Statistics Department, there are a couple of straightforward ways to parallelize EP jobs across machines.

  1. Use foreach with doMPI as the back-end. You'll need MPI and Rmpi installed on all machines.
  2. Use sockets to make a cluster in R and then use parLapply(), parSapply(), etc., as sketched below. (mclapply() relies on forking, so it only works within a single machine.)
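
Here is a minimal sketch of the second approach. The hostnames are placeholders; this assumes R is installed on each machine and that you have passwordless ssh access to them:

require(parallel)
machines <- c(rep("host1.berkeley.edu", 4), rep("host2.berkeley.edu", 4))  # hypothetical hosts
cl <- makeCluster(machines)  # starts R worker processes on the named machines via ssh
out <- parSapply(cl, 1:100, function(i) mean(rnorm(1e+06)))
stopCluster(cl)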

See my notes on distributed computing for syntax and more details.

Breakout

Figure out how many cores your laptop has.

Fit logistic regression models of preference for Bush/Kerry on income, stratified by state. Use foreach or a parallel version of one of the apply variants. Collect the resulting coefficients (and standard errors, if you're feeling ambitious) in a clean format such as a matrix or data frame. Check to see if multiple cores are being used in the execution.