Note to participants: I'm having trouble with parallelization in RStudio, so we'll just run the demo code in this module in a command-line R session. You can open the basic R GUI for Mac or Windows, or, on a Mac, start R in a terminal window.
We'll focus on shared memory computation here.
To see how many cores you have on a Mac, you can run the following in a terminal:
system_profiler | grep -i 'Cores'
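From within R, the parallel package (which ships with R) can report this as well:
require(parallel)
detectCores() # counts logical cores, so hyperthreading inflates this relative to physical cores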
To see if multiple cores are being used by your job, you can do:
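One common choice (my suggestion; any process monitor will do) is top, run in a separate terminal:
top # watch the %CPU column for your R process; more than 100% means multiple cores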
Some basic approaches are:
- using a threaded BLAS so that linear algebra runs on multiple cores
- foreach with a parallel back-end
- parallel versions of apply() and its variants

R comes with a default BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage) that carry out the core linear algebra computations. However, you can generally improve performance (sometimes by an order of magnitude) by using a different BLAS. Furthermore, a threaded BLAS will allow you to use multiple cores.
A 'thread' is a lightweight process, and the operating system sees multiple threads as part of a single process.
We'll show by demonstration that my Mac laptop is using multiple cores for linear algebra operations.
# run this in a command-line R session, not RStudio (see the note above)
n <- 5000
x <- matrix(rnorm(n^2), n)  # a 5000 x 5000 matrix of N(0,1) draws
U <- chol(crossprod(x))  # crossprod(x) is t(x) %*% x; chol() gives its Cholesky factor
You should see that your R process is using more than 100% of CPU, i.e., more than one core's worth. Inconceivable!
You can talk with your systems administrator about linking R to a fast BLAS, or you can look into it yourself for your personal machine; see the R Installation and Administration manual.
Note that in some cases, in particular for small matrix operations, using multiple threads may actually slow down computation, so you may want to experiment, particularly on Linux. You can force the linear algebra to use only a single core by running (assuming you're using the bash shell)
export OMP_NUM_THREADS=1
in the terminal window before starting R in that same terminal.
Finally, note that a threaded BLAS combined with either foreach or parallel versions of apply() can conflict and cause R to hang, so you'll likely want to set the number of threads to 1, as above, if you're doing explicit parallelization.
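If you'd rather control the number of threads from within R, one option (assuming you have the RhpcBLASctl package installed) is:
require(RhpcBLASctl)
blas_set_num_threads(1) # restrict the BLAS to a single thread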
What is an embarrassingly parallel (EP) problem? (And do you think you should be asking?)
An EP problem is one that can be solved by doing independent computations as separate processes without communication between the processes. You can get the answer by doing separate tasks and then collecting the results.
Examples in statistics include simulation studies with many independent replicates, bootstrapping, and stratified analyses (e.g., fitting the same model separately within each stratum).
Can you think of others in your work?
Some things that are not EP: Markov chain Monte Carlo (each iteration depends on the previous one) and iterative optimization algorithms (each step depends on the last).
First, make sure your iterations are independent and don't involve sequential calculations!
The foreach package provides a way to do a for loop using multiple cores. It can use a variety of 'back-ends' that handle the nitty-gritty of the parallelization.
To use multiple cores on a single machine, use the parallel back-end from the doParallel package (or the multicore back-end from the doMC package).
# run this in a command-line R session, not RStudio (see the note above)
require(parallel)
require(doParallel)
require(foreach)
nCores <- 4 # my laptop has only 2 physical cores, but hyperthreading makes it appear to have 4
registerDoParallel(nCores)
testFun <- function(i) {
set.seed(i)
mn <- mean(rnorm(n))
mn
}
n <- 1e+07
nSims <- 100
out <- foreach(i = 1:nSims, .combine = c) %dopar% {
cat("Starting ", i, "th job.\n", sep = "")
outSub <- testFun(i)
cat("Finishing ", i, "th job.\n", sep = "")
outSub # this will become part of the out object
}
out[1:5]
## [1] 0.0004037 -0.0002178 -0.0002258 0.0005587 0.0002852
What do you think are the advantages and disadvantages of having many small tasks vs. a few large tasks?
You can leave out the .combine argument; foreach will do something reasonably sensible on its own in terms of combining the results from the different tasks.
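For example, if each task returns a vector, you can stack the results into a matrix (a small sketch):
out <- foreach(i = 1:5, .combine = rbind) %dopar% c(i, sqrt(i))
out # a 5 x 2 matrix, one row per task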
help(clusterApply) shows the wide variety of parallel versions of the apply family of functions.
require(parallel)
nCores <- 4
cluster <- makeCluster(nCores) # starts nCores worker R processes that communicate over sockets
cluster
## socket cluster with 4 nodes on host 'localhost'
clusterExport(cluster, "n") # socket workers are separate R processes, so export the global n that testFun uses
out <- parSapply(cluster, seq_len(nSims), testFun)
out[1:5]
## [1] 0.0004037 -0.0002178 -0.0002258 0.0005587 0.0002852
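When you're done with a socket cluster, shut the worker processes down:
stopCluster(cluster)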
require(parallel)
nCores <- 4
testFun <- function(i) {
set.seed(i)
mn <- mean(rnorm(n))
mn
}
n <- 1e+07
nSims <- 100
out2 <- mclapply(seq_len(nSims), testFun, mc.cores = nCores) # mclapply() forks the R process, so workers see the global environment (no export needed)
out2[1:5]
## [[1]]
## [1] 0.0004037
##
## [[2]]
## [1] -0.0002178
##
## [[3]]
## [1] -0.0002258
##
## [[4]]
## [1] 0.0005587
##
## [[5]]
## [1] 0.0002852
One thing to keep in mind is whether the different tasks all take about the same amount of time or widely different times. In the latter case, one wants to sequentially dispatch tasks as earlier tasks finish, rather than dispatching a block of tasks to each core. Some of these parallel apply variants allow you to control this.
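For example (reusing testFun, n, nCores, and nSims from above):
# forked workers: hand tasks out one at a time as workers free up
out3 <- mclapply(seq_len(nSims), testFun, mc.cores = nCores, mc.preschedule = FALSE)
# for socket clusters, parLapplyLB() is the load-balancing analogue of parLapply()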
A tale of the good, the bad, and the ugly
Random numbers on a computer are not truly random but are generated as a sequence of pseudo-random numbers. The sequence is finite (but very, very, very, very long) and eventually repeats itself.
A random number seed determines where in the sequence one starts when generating random numbers.
set.seed(0)
rnorm(5)
## [1] 1.2630 -0.3262 1.3298 1.2724 0.4146
set.seed(0)
rnorm(5)
## [1] 1.2630 -0.3262 1.3298 1.2724 0.4146
The (not so) bad: Use a different seed for each task or each process. It's possible the subsequences will overlap, but it's quite unlikely.
The good: Use the L'Ecuyer algorithm (library(rlecuyer)) to ensure distinct subsequences:
- with foreach, you can use %dorng% from the doRNG package in place of %dopar%
- with mclapply(), use the mc.set.seed argument (see help(mcparallel))

The ugly but good: Generate all your random numbers in the master process and distribute them to the tasks, if feasible.
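Here's a minimal sketch of the 'good' approach (the L'Ecuyer-CMRG generator ships with the parallel package; doRNG is a separate package, and this assumes a back-end is registered as above):
# with mclapply(): switch generators, and each task gets its own stream
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
res <- mclapply(1:4, function(i) rnorm(2), mc.cores = 2, mc.set.seed = TRUE)
# with foreach: use %dorng% from doRNG, seeded via .options.RNG
require(doRNG)
res2 <- foreach(i = 1:4, .options.RNG = 1) %dorng% rnorm(2)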
The syntax for using L'Ecuyer is available in my parallel computing workshop notes.
If you have access to multiple machines within a networked environment, such as the compute servers in the Statistics Department, there are a couple of straightforward ways to parallelize EP jobs across machines.
- foreach with doMPI as the back-end; you'll need MPI and Rmpi installed on all of the machines
- parLapply(), parSapply(), etc., with a socket cluster created via makeCluster() using a vector of machine names (see the sketch below)

See my notes on distributed computing for syntax and more details.
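For the sockets route, a sketch (server1 and server2 are hypothetical host names; this assumes R is installed on each machine and that you have passwordless ssh access to them):
cluster <- makeCluster(c("server1", "server1", "server2", "server2")) # two workers per machine
clusterExport(cluster, "n")
out <- parSapply(cluster, seq_len(nSims), testFun)
stopCluster(cluster)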
Figure out how many cores your laptop has.
Fit logistic regression models of preference for Bush/Kerry on income, stratified by state. Use foreach or a parallel version of one of the apply variants. Collect the resulting coefficients (and standard errors, if you're feeling ambitious) in a clean format such as a matrix or data frame. Check to see if multiple cores are being used during execution.