R
R is a programming language that continues to grow in popularity for data analysis. Code is quick to write in R, but the resulting software is much slower than equivalent C or Fortran; be wary of doing too much compute-intensive work in R.
There is a dizzying amount of documentation available for programming in R on the internet. SciNet has given a mini-course of 8 lectures on Research Computing with Python in the Fall of 2013.
Running R on Niagara
We currently have two families of R installed on Niagara.
- Anaconda R
- regular R
Here we describe the differences between these packages.
Anaconda R
Anaconda is a pre-assembled set of commonly-used data science tools, which recently added R to its suite of packages. The source for this collection is here.
As of 30 July 2018 the following Anaconda modules are available:
    $ module avail anaconda
    ----------------- /scinet/niagara/software/2018a/modules/base ------------------
    anaconda2/5.1.0   python/2.7.14-anaconda5.1.0   r/3.4.3-anaconda5.1.0
    anaconda3/5.1.0   python/3.6.4-anaconda5.1.0
Note that there is a single Anaconda R module available, and that none of these modules requires a compiler to be loaded. The Anaconda R module is R version 3.4.3, which comes from Anaconda version 5.1.0.
You load the module in the usual way:
    $ module load r/3.4.3-anaconda5.1.0
    $ R
    >
Regular R
The base R program has also been installed from source. This installation comes with no R packages installed other than the base installation.
    $ module spider r
    --------------------------------------------------------------------------------
      r:
    --------------------------------------------------------------------------------
         Versions:
            r/3.4.3-anaconda5.1.0
            r/3.5.0
    $
    $ module spider r/3.5.0
    --------------------------------------------------------------------------------
      r: r/3.5.0
    --------------------------------------------------------------------------------
        You will need to load all module(s) on any one of the lines below before the "r/3.5.0" module is available to load.

          intel/2018.2
    $
    $ module load intel/2018.2 r/3.5.0
    $ R
    >
Many optional packages are available for R which add functionality for specific domains; they are available through the Comprehensive R Archive Network (CRAN).
R provides an easy way for users to install the packages they need in their home directories rather than having them installed system-wide. Because there are so many optional packages that users could want, we recommend that users who need additional packages proceed this way. This is almost certainly the easiest way to deal with the wide range of packages, ensure they're up to date, and ensure that users' package choices don't conflict.
In general, you can install those that you need yourself in your home directory; e.g.,
    $ R
    > install.packages("package-name", dependencies = TRUE)
will download and compile the source for the packages you need in your home directory under ${HOME}/R/x86_64-unknown-linux-gnu-library/2.11/ (you can specify another directory with a lib= option). Then take a look at help(".libPaths") to make sure that R knows where to look for the packages you've compiled. Note that you must install packages while logged into a development node, as write access to the library folder is not available from a standard node on the cluster.
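If the default location is inconvenient, one option is to point R at a personal library through the R_LIBS_USER environment variable before starting R. The following is a minimal sketch, not a site-recommended setup: the directory name ${HOME}/R/library is an assumption chosen for illustration, and the Rscript check only runs when R is already on the PATH (for example after loading one of the modules above).

```shell
#!/bin/bash
# Hypothetical location for a personal R package library; adjust to taste.
export R_LIBS_USER="${HOME}/R/library"

# R only adds library directories that already exist, so create it first.
mkdir -p "${R_LIBS_USER}"

# With R available (after "module load"), the directory should now be listed
# by .libPaths(). The check is skipped when Rscript is not on the PATH.
if command -v Rscript >/dev/null 2>&1; then
    Rscript -e '.libPaths()'
fi
```

The same export line would also need to appear in job scripts, so that R running on the compute nodes looks in the same place.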
Running serial R jobs
As with all serial jobs, if your R computations do not use multiple cores, you should bundle them up so that all 8 cores of a node are performing work. Examples of this can be found on the User_Serial page.
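The bundling idea can be sketched in a few lines of bash: start one background process per core, then wait for all of them. This is only an illustration under simple assumptions (all tasks take a similar amount of time; the helper name run_bundle, the log file names and the script analysis.R are hypothetical); the User_Serial page remains the reference for production use.

```shell
#!/bin/bash
# Run NTASKS copies of a command in the background, passing the task number
# (1..NTASKS) as the last argument, then wait until all of them have finished.
# Usage: run_bundle NTASKS COMMAND [ARGS...]
run_bundle() {
    local ntasks="$1"; shift
    local i
    for ((i = 1; i <= ntasks; i++)); do
        # Each task gets its own log file so that output does not interleave.
        "$@" "$i" > "task_${i}.log" 2>&1 &
    done
    # Without "wait" the job script would end while the tasks are still running.
    wait
}

# Example for a node with 8 cores; analysis.R is a hypothetical script that
# reads its task number with commandArgs(trailingOnly = TRUE):
# run_bundle 8 Rscript --no-restore analysis.R
```

If the tasks differ widely in run time, some cores will sit idle while the slowest task finishes, so this simple pattern fits best when the work is evenly divided.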
Saving images from R in compute jobs
To make use of the graphics capability of R, R insists on having an X server running, even if you're just writing to a file. There is no X server on the compute nodes, and you'd get a message like
unable to open connection to X11 display
To get around this issue, you can run a 'virtual' X server on the compute nodes by adding the following commands at the start of your job script:
    # Make virtual X server command called Xvfb available:
    module load Xlibraries
    # Select a unique display number:
    let DISPLAYNUM=$UID%65274
    export DISPLAY=":$DISPLAYNUM"
    # Start the virtual X server
    Xvfb $DISPLAY -fp $SCINET_FONTPATH -ac 2>/dev/null &
After this, run R or Rscript as usual. The virtual X server will be running in the background and will get killed when your job is done. Alternatively, you may want to kill it explicitly at the end of your job script using
    # Kill any remaining Xvfb server
    pkill -u $UID Xvfb
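A variation on the explicit clean-up above, shown here as a hedged sketch rather than the documented procedure, is to remember the process ID of the background server and register a trap so that it is stopped whenever the script exits, including after an error. The helper name start_background and the variable BG_PID are assumptions made for this example.

```shell
#!/bin/bash
# Start a command in the background and make sure it is stopped when this
# script exits, whether it finishes normally or aborts early.
start_background() {
    "$@" &
    BG_PID=$!
    # The trap fires on script exit and kills only the process started here.
    trap 'kill "$BG_PID" 2>/dev/null' EXIT
}

# In a job script this would wrap the Xvfb line, for example:
# start_background Xvfb "$DISPLAY" -fp "$SCINET_FONTPATH" -ac
```

Unlike pkill -u, this only touches the server started by the current script, which matters if several of your jobs could share a node.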
Rmpi (R with MPI)
All the newer R installations on the GPC have Rmpi installed by default using OpenMPI. Be sure to load the OpenMPI module if you wish to use Rmpi.
Installing Rmpi, version 2.13.1
Version 2.13.1 does not have the Rmpi library as a standard package, which means you have to install it yourself if you are using that version. The same is true if you want to use IntelMPI instead of OpenMPI.
Installing the Rmpi package can be a bit challenging, since some additional parameters need to be given to the installation, which contain the paths to various header files and libraries. These paths differ depending on what MPI version you are using.
The various MPI versions on the GPC are loaded with the module command. So the first thing to do is to decide what MPI version to use (openmpi or intelmpi), and to type the corresponding "module load" command on the command line (as well as in your job scripts).
Because the MPI modules define all the paths in environment variables, the following line seems to work for installations of all openmpi versions.
    > install.packages("Rmpi",
                       configure.args =
                       c(paste("--with-Rmpi-include=",Sys.getenv("SCINET_MPI_INC"),sep=""),
                         paste("--with-Rmpi-libpath=",Sys.getenv("SCINET_MPI_LIB"),sep=""),
                         "--with-Rmpi-type=OPENMPI"))
For intelmpi, you only need to change OPENMPI to MPICH2 in the last line.
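Because the install line reads SCINET_MPI_INC and SCINET_MPI_LIB from the environment, an empty value (for instance when no MPI module has been loaded) would pass blank paths to the configure step. A small pre-flight check can catch that before R is started. This is a sketch only; the helper name check_mpi_env is hypothetical.

```shell
#!/bin/bash
# Return success only if both variables used by the Rmpi install line are set.
check_mpi_env() {
    local var missing=0
    for var in SCINET_MPI_INC SCINET_MPI_LIB; do
        # ${!var} is bash indirect expansion: the value of the variable named by $var.
        if [ -z "${!var}" ]; then
            echo "$var is not set; load an MPI module first" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Typical use before starting R to install Rmpi:
# check_mpi_env && R
```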
Running Rmpi
To start using R with Rmpi, make sure you have all required modules loaded (e.g. module load intel openmpi R/2.14.1), then launch it with
$ mpirun -np 1 R --no-save
which starts one master MPI process, but also sets up the infrastructure needed to spawn additional processes.
Creating an R cluster
The 'parallel' package allows you to use R to launch individual serial subjobs across multiple nodes. This section describes how this is accomplished.
Creating your Rscript wrapper
The first thing to do is create a wrapper for Rscript. This needs to be done because the R module needs to be loaded on all nodes, but the submission script only loads modules on the head node of the job. The wrapper script, let's call it MyRscript.sh, is short:
    #!/bin/bash
    module load intel/13.1.1 R/3.0.1
    ${SCINET_R_BIN}/Rscript --no-restore "$@"
The "--no-restore" flag prevents Rscript from loading your "workspace image", if you have one saved. Loading the image causes problems for the cluster.
Once you've created your wrapper, make it executable:
$ chmod u+x MyRscript.sh
Your wrapper is now ready to be used.
The cluster R code
The R code which we will run consists of two parts: the code which launches the cluster and does pre- and post-analysis, and the code which is run on the individual cluster "nodes". Here is some code which demonstrates this functionality. Let's call it MyClusterCode.R.
    ######################################################
    #
    # worker code
    #

    # first define the function which will be run on all the cluster nodes.
    # This is just a test function.
    # Put your real worker code here.
    testfunc <- function(a) {

      # this part is just to waste time
      b <- 0
      for (i in 1:10000) {
        b <- b + 1
      }

      s <- Sys.info()['nodename']
      return(paste0(s, " ", a[1], " ", a[2]))

    }

    ######################################################
    #
    # head node code
    #

    # Create a bunch of index pairs to feed to the worker function. These could be parameters,
    # or whatever your code needs to vary across jobs. Note that the worker function only
    # takes a single argument; each entry in the list must contain all the information
    # that the function needs to run. In this example, each entry contains a list which
    # contains two pieces of information, a pair of indices.
    indexlist <- list()
    index <- 1
    for (i in 1:10) {
      for (j in 1:10) {
        indexlist[index] <- list(c(i,j))
        index <- index +1
      }
    }

    # Now set up the cluster.

    # First load the parallel library.
    library(parallel)

    # Next find all the nodes which the scheduler has given to us.
    # These are listed in the file which is indicated by the PBS_NODEFILE
    # environment variable.
    nodefile <- Sys.getenv("PBS_NODEFILE")
    hostnames <- readLines(nodefile)

    # Now launch the cluster, using the list of nodes and our Rscript
    # wrapper.
    cl <- makePSOCKcluster(names = hostnames, rscript = "/path/to/your/MyRscript.sh")

    # Now run the worker code, using the parameter list we created above.
    result <- clusterApplyLB(cl, indexlist, testfunc)

    # The results of all the jobs will now be put in the 'result' variable,
    # in the order they were specified in the 'indexlist' variable.

    # Don't forget to stop the cluster when you're finished.
    stopCluster(cl)
You can, of course, add any post-processing code you need to the above code.
Submitting an R cluster job
You are now ready to submit your job to the GPC queue. The submission script is like most others:
    #!/bin/bash
    #PBS -l nodes=3:ppn=8
    #PBS -l walltime=5:00:00
    #PBS -N MyRCluster

    cd $PBS_O_WORKDIR
    module load intel/13.1.1 R/3.0.1
    ${SCINET_R_BIN}/Rscript --no-restore MyClusterCode.R
Be sure to use whatever number of nodes, length of time, etc., is appropriate for your job.
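Since MyClusterCode.R starts one worker per line of the node file, it can be useful to check what the scheduler handed out before launching R. The sketch below assumes the usual PBS layout, in which each hostname appears once per requested core (so nodes=3:ppn=8 would be expected to give 24 lines); the function name summarize_nodefile is hypothetical.

```shell
#!/bin/bash
# Print how many worker slots and distinct hosts a PBS node file contains.
summarize_nodefile() {
    local nodefile="$1"
    local slots hosts
    # Every non-empty line is one slot; one R worker is started per line.
    slots=$(grep -c . "$nodefile")
    # Distinct hostnames give the number of physical nodes.
    hosts=$(sort -u "$nodefile" | grep -c .)
    echo "${slots} worker slots on ${hosts} distinct hosts"
}

# In the job script, before the Rscript line:
# summarize_nodefile "$PBS_NODEFILE"
```

If the reported number of slots does not match nodes times ppn from the submission script, the cluster will start with fewer or more workers than intended.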