Parallel Computing on the Cluster using R
Introduction
While many users are content running their analyses and simulations sequentially using the departmental computing utilities, many of us wish to take advantage of the true power of the cluster by using multiple cores for a single analysis. Luckily, there is software available to make this feasible for seasoned R users.
To start, we will need a copy of R with both the Rmpi and snow packages installed. For this, I suggest using the HPSCC build of R maintained by Kasper Daniel Hansen. Instructions on how to set this up are available at
http://www.biostat.jhsph.edu/~khansen/hpscc.html.
There are two easy-to-use methods for parallel processing in R that we will consider. The first, using the multicore package, is restricted to processors on a single node. Although this may seem to be a major disadvantage, it can actually be significantly faster, since communication between processes on the same node is orders of magnitude faster than communication across nodes. The second is the snow package, which allows MPI subclusters (spanning multiple nodes) to be used for calculations.
Using multicore within nodes
The multicore package is very effective at speeding up simpler calculations by using more than one processing core at a time. It is very easy to use and fast to implement.
The first step is to log in to Enigma. Then, start a process on a cluster node that uses multiple processors by typing:
qrsh -l mcmc -pe local 10-12
Note that here, we are asking for 10-12 processors on a node in the mcmc queue. The number of cores that are acquired is found in the NSLOTS environment variable, accessible in R with as.integer(Sys.getenv("NSLOTS")). The next step is to start R and load the multicore library. Finally, one proceeds by using the mclapply command instead of lapply in R (passing the number of cores to be used as the mc.cores argument).
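For example, a minimal multicore session might look like the following (the toy computation shown here is purely illustrative):
n.cores <- as.integer(Sys.getenv("NSLOTS")) # number of cores granted by qrsh
library(multicore)
results <- mclapply(1:100, function(i) mean(rnorm(1e5)), mc.cores=n.cores)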
Using snow across nodes
The following tutorial is a simple example of how to open R with an MPI cluster running and how to use the cluster for a simple calculation.
First, log in to Enigma. I would recommend setting up password-less logins for the cluster nodes before continuing, as this will save some time (it is not, however, necessary).
In order to start an MPI cluster with 12 nodes (cores), we type:
qrsh -V -l cegs -pe orte 12 /opt/openmpi/bin/mpirun -np 12 ~hcorrada/bioconductor/Rmpi/bstRMPISNOW
This should open 12 instances of R, with one running as the master. Note that here we are starting a process on the cegs queue. Then, you can grab the nodes that you have set up with mpirun by typing into R:
cl <- getMPIcluster()
Then, you can go through and use the snow commands that you would like to run. For example,
clusterCall(cl, function() Sys.info()[c("nodename","machine")])
will give you a list of the nodes in your cluster (more precisely, the 11 running slave nodes). A list of the available snow commands is available at http://www.sfu.ca/~sblay/R/snow.html.
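For example, a common pattern is to export an object to the slave nodes and then apply a function across the cluster in parallel; a minimal sketch (the object x and the toy function used here are purely illustrative) is:
x <- rnorm(1000)
clusterExport(cl, "x") # make x available on every slave node
clusterEvalQ(cl, summary(x)) # evaluate an expression on each slave
parSapply(cl, 1:20, function(i) mean(x)+i) # parallel version of sapply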
When you have finished your calculations with the cluster, you can close the nodes with:
stopCluster(cl)
NOTE: be sure that you don't cancel R processes with Ctrl-C, as this can cause issues with snow.
Example: Comparison between using parallel and sequential processing
> f.long<-function(n) {
+ xx<-rnorm(n)
+ log(abs(xx))+xx^2
+ }
#Using multicore
############
> system.time(mclapply(rep(5E6,11),f.long,mc.cores=11))
user system elapsed
26.271 3.514 5.516
#Sequential (for comparison)
############
> system.time(sapply(rep(5E6,11),f.long))
user system elapsed
17.975 1.325 19.303
#Using snow via MPI
############
> system.time(parSapply(cl,rep(5E6,11),f.long))
user system elapsed
4.224 4.113 8.338
Note that here, the snow parallel processing routines give us an improvement of over 50% in computation time relative to the sequential run. Although one might expect a 10/11 = 91% improvement with 11 workers, it is important to remember that the processors do not necessarily reside on the same node, and communication between nodes can be quite slow. This communication is so slow, in fact, that the multicore procedure, which avoids it entirely, reduces the computation time by roughly a further third.
It is thus important to remember that the computational gains of using more than one cluster node can be severely outweighed by the time required to communicate between multiple nodes. Of course, this depends on the kind of calculations you are conducting and the way they are implemented.
This tutorial was drafted by Taki Shinohara based closely on notes from Dr. Hector Corrada Bravo and Dr. Kasper Daniel Hansen. A special thanks is extended to them, as well as to Marvin Newhouse for his help.