R

From SciNet Users Documentation

R is a programming language that continues to grow in popularity for data analysis. Code is quick to write in R, but the resulting software is much slower than equivalent C or Fortran; be wary of doing too much compute-intensive work in R.

Running R on Niagara

We currently have two families of R installed on Niagara.

  • Anaconda R
  • regular R

Here we describe the differences between these packages.

Anaconda R

Anaconda is a pre-assembled set of commonly-used data science tools, which recently added R to its suite of packages. The source for this collection is here.

As of 30 July 2018 the following Anaconda modules are available:

   $ module avail anaconda
   ----------------- /scinet/niagara/software/2018a/modules/base ------------------
    anaconda2/5.1.0    python/2.7.14-anaconda5.1.0    r/3.4.3-anaconda5.1.0
    anaconda3/5.1.0    python/3.6.4-anaconda5.1.0

Note that there is a single Anaconda R module available, and that none of these modules requires a compiler to be loaded. The Anaconda R module provides R version 3.4.3, which comes from Anaconda version 5.1.0.

You load the module in the usual way:

    $ module load r/3.4.3-anaconda5.1.0
    $ R
    >

Regular R

The base R program has also been installed from source. This installation comes with no R packages installed other than the base installation.

    $ module spider r
    --------------------------------------------------------------------------------
    r:
    --------------------------------------------------------------------------------
    Versions:
       r/3.4.3-anaconda5.1.0
       r/3.5.0
    $
    $ module spider r/3.5.0
    --------------------------------------------------------------------------------
    r: r/3.5.0
    --------------------------------------------------------------------------------
    You will need to load all module(s) on any one of the lines below before the "r/3.5.0" module is available to load.
    intel/2018.2
    $
    $ module load intel/2018.2 r/3.5.0
    $ R
    >


Many optional packages are available for R which add functionality for specific domains; they are available through the Comprehensive R Archive Network (CRAN).

R provides an easy way for users to install the libraries they need in their home directories rather than having them installed system-wide. Because there are so many optional packages that R users could potentially want, we recommend that users who need additional packages proceed this way. This is almost certainly the easiest way to deal with the wide range of packages, keep them up to date, and ensure that users' package choices don't conflict.

In general, you can install the packages that you need yourself in your home directory; e.g.,

$ R 
> install.packages("package-name", dependencies = TRUE)

will download and compile the source for the packages you need in your home directory, under a directory such as ${HOME}/R/x86_64-unknown-linux-gnu-library/2.11/ (the exact name depends on the platform and on the R version in use; you can specify another directory with the lib= option). Then take a look at help(".libPaths") to make sure that R knows where to look for the packages you've compiled. Note that you must install packages while logged into a development node, as write access to the library folder is not available from a standard node on the cluster.
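
As a minimal sketch of the lib= route mentioned above, the snippet below creates a personal library directory and puts it at the front of R's library search path. The helper name use_personal_library and the throwaway directory used in the demonstration are hypothetical; on Niagara you would more likely point it at a directory under your home.

```r
# Hypothetical helper: make `path` the first place R looks for packages.
use_personal_library <- function(path) {
  # Create the directory if it does not exist yet.
  dir.create(path, recursive = TRUE, showWarnings = FALSE)
  # Prepend it to the library search path; existing entries are kept.
  .libPaths(c(path, .libPaths()))
  invisible(.libPaths())
}

# Demonstration with a throwaway directory (an assumption for illustration).
demo_lib <- file.path(tempdir(), "Rlibs")
use_personal_library(demo_lib)
print(.libPaths())

# Packages can then be installed into that directory explicitly, e.g.
# install.packages("package-name", lib = demo_lib, dependencies = TRUE)
```

Printing .libPaths() afterwards is a quick way to confirm that R will actually search the directory you installed into.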


Running serial R jobs

As with all serial jobs, if your R computations do not use multiple cores, you should bundle them up so that all 40 cores of a node are performing work. Examples of this can be found on this page.
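
The bundling approach referred to above runs many independent R processes side by side. As an alternative sketch, and not the method described on the linked page, independent tasks can also be spread over the cores of one node from inside a single R session with mclapply from the parallel package. The function simulate_one and the task count are placeholders for your own work.

```r
library(parallel)

# Placeholder for one independent serial computation.
simulate_one <- function(seed) {
  set.seed(seed)
  mean(rnorm(1000))
}

# Number of worker processes; on a Niagara node this would be 40.
# A small value is used here so the sketch runs on any machine.
n_workers <- 2

# mclapply forks the R session, so it works on Linux and macOS only.
results <- mclapply(1:8, simulate_one, mc.cores = n_workers)
print(unlist(results))
```

Because each task sets its own seed, the results do not depend on which worker happens to run which task.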


Using a Jupyter Notebook

You may develop your R scripts in a Jupyter Notebook on Niagara. A node has been set aside as a Jupyter Hub. See this page for details on how to access that node, and develop your code.

Rmpi (R with MPI)

None of the R installations on Niagara have Rmpi installed by default.

Installing Rmpi for R version 3.5.0

R version 3.5.0 does not include the Rmpi library as a standard package, which means you have to install it yourself if you are using that version. The same is true if you want to use IntelMPI instead of OpenMPI.

Installing the Rmpi package can be a bit challenging, since some additional parameters need to be given to the installation, which contain the paths to various header files and libraries. These paths differ depending on what MPI version you are using.

The various MPI versions on Niagara are loaded with the module command. So the first thing to do is to decide what MPI version to use (OpenMPI or IntelMPI), and to type the corresponding "module load" command on the command line (as well as in your job scripts).

Because the MPI modules define all the paths in environment variables, the following lines seem to work for installations with all OpenMPI versions.

$ module load intel/2018.2
$ module load openmpi/3.1.0
$ module load r/3.5.0
$
$ R
>
> install.packages("Rmpi",
                   configure.args =
                   c(paste("--with-Rmpi-include=", Sys.getenv("SCINET_OPENMPI_ROOT"), "/include", sep=""),
                     paste("--with-Rmpi-libpath=", Sys.getenv("SCINET_OPENMPI_ROOT"), "/lib", sep=""),
                     "--with-Rmpi-type=OPENMPI"))

For IntelMPI, you only need to change OPENMPI to MPICH2 in the last line.
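
Because these configure arguments are plain strings, it can help to build them with a small helper and print them before installing, so that the include and library paths can be checked by eye. The sketch below is only an illustration: the function name rmpi_configure_args and the prefix /opt/openmpi are made up, and only the three --with-Rmpi-* options shown above are assumed.

```r
# Hypothetical helper that assembles the configure arguments for Rmpi.
# `mpi_root` is the MPI installation prefix; `mpi_type` is OPENMPI or MPICH2.
rmpi_configure_args <- function(mpi_root, mpi_type = "OPENMPI") {
  c(paste0("--with-Rmpi-include=", file.path(mpi_root, "include")),
    paste0("--with-Rmpi-libpath=", file.path(mpi_root, "lib")),
    paste0("--with-Rmpi-type=", mpi_type))
}

# Inspect the arguments before installing; the prefix here is made up.
cfg <- rmpi_configure_args("/opt/openmpi")
print(cfg)

# On Niagara the prefix would come from the loaded module, e.g.
# install.packages("Rmpi",
#   configure.args = rmpi_configure_args(Sys.getenv("SCINET_OPENMPI_ROOT")))
```

Using file.path to join the prefix and the subdirectory avoids having to remember the separating slash by hand.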

Running Rmpi

To start using R with Rmpi, make sure you have all required modules loaded (e.g. module load intel/2018.2 openmpi/3.1.0 r/3.5.0), then launch it with

$ mpirun -np 1 R --no-save

which starts a single master MPI process, and also sets up the infrastructure needed to spawn additional processes.

Creating an R cluster

The 'parallel' package allows you to use R to launch individual serial subjobs across multiple nodes. This section describes how this is accomplished.

Creating your Rscript wrapper

The first thing to do is create a wrapper for Rscript. This needs to be done because the R module needs to be loaded on all nodes, but the submission script only loads modules on the head node of the job. The wrapper script, let's call it MyRscript.sh, is short:

#!/bin/bash

module load intel/2018.2 r/3.5.0
${SCINET_R_ROOT}/bin/Rscript --no-restore "$@"

The "--no-restore" flag prevents Rscript from loading your "workspace image", if you have one saved. Loading the image causes problems for the cluster.

Once you've created your wrapper, make it executable:

$ chmod u+x MyRscript.sh

Your wrapper is now ready to be used.

The cluster R code

The R code which we will run consists of two parts: the code which launches the cluster and does pre- and post-analysis, and the code which is run on the individual cluster "nodes". Here is some code which demonstrates this functionality. Let's call it MyClusterCode.R.


######################################################
#
#  worker code
#

# first define the function which will be run on all the cluster nodes.  This is just a test function.  
# Put your real worker code here.
testfunc <- function(a) {

  # this part is just to waste time
  b <- 0
  for (i in 1:10000) {
      b <- b + 1
  }

  s <- Sys.info()['nodename']
  return(paste0(s, " ", a[1], " ", a[2]))

}


######################################################
#
#  head node code
#

# Create a bunch of index pairs to feed to the worker function.  These could be parameters,
# or whatever your code needs to vary across jobs.  Note that the worker function only
# takes a single argument; each entry in the list must contain all the information
# that the function needs to run.  In this example, each entry is a vector which
# contains two pieces of information, a pair of indices.
indexlist <- list()
index <- 1
for (i in 1:10) {
  for (j in 1:10) {
    indexlist[[index]] <- c(i, j)
    index <- index + 1
  }
}
 

# Now set up the cluster.

# First load the parallel library.
library(parallel)

# Next find all the nodes which the scheduler has given to us.
# These are given by the SLURM_JOB_NODELIST environment variable.
nodelist <- Sys.getenv("SLURM_JOB_NODELIST")

node_ids <- unlist(strsplit(nodelist,split="[^a-z0-9-]"))[-1]

if (length(node_ids)>0) {
  expanded_ids <- lapply(node_ids, function (id) {
    ranges <- as.numeric(
      unlist(strsplit(id, split="[-]"))
    )
    if (length(ranges)>1) seq(ranges[1], ranges[2], by=1) else ranges
  })
  
  nodelist <- sprintf("nia%04d", unlist(expanded_ids))
}

# Now launch the cluster, using the list of nodes and our Rscript
# wrapper.
cl <- makePSOCKcluster(names = nodelist, rscript = "/path/to/your/MyRscript.sh")

# Now run the worker code, using the parameter list we created above.
result <- clusterApplyLB(cl, indexlist, testfunc)

# The results of all the jobs will now be put in the 'result' variable,
# in the order they were specified in the 'indexlist' variable.

# Don't forget to stop the cluster when you're finished.
stopCluster(cl)

You can, of course, add any post-processing code you need to the above code.
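
The node-list handling in the script above can be pulled out into a function and checked on its own, without a running job. The sketch below repeats the same parsing logic; the function name expand_nodelist is hypothetical, and it assumes node names of the form niaNNNN and a compact list such as nia[0001-0003,0005].

```r
# Hypothetical helper: expand a compact SLURM node list into host names.
expand_nodelist <- function(nodelist) {
  # Split on anything that is not a lowercase letter, digit or hyphen,
  # then drop the leading "nia" prefix.
  node_ids <- unlist(strsplit(nodelist, split = "[^a-z0-9-]"))[-1]

  # A single node (e.g. "nia0042") has no bracketed part; return it as is.
  if (length(node_ids) == 0) return(nodelist)

  expanded_ids <- lapply(node_ids, function(id) {
    # "0001-0003" becomes the range 1..3; "0005" stays a single number.
    ranges <- as.numeric(unlist(strsplit(id, split = "-", fixed = TRUE)))
    if (length(ranges) > 1) seq(ranges[1], ranges[2], by = 1) else ranges
  })

  # Rebuild the zero-padded host names.
  sprintf("nia%04d", unlist(expanded_ids))
}

print(expand_nodelist("nia[0001-0003,0005]"))
print(expand_nodelist("nia0042"))
```

The script starts one worker per host name. If more workers per node are wanted, one option is to repeat each name in the vector passed to makePSOCKcluster, for example with rep(nodes, each = k), where k is a worker count of your choosing.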

Submitting an R cluster job

You are now ready to submit your job to the Niagara queue. The submission script is like most others:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=5:00:00
#SBATCH --job-name MyRCluster

cd $SLURM_SUBMIT_DIR

module load intel/2018.2 r/3.5.0

${SCINET_R_ROOT}/bin/Rscript --no-restore MyClusterCode.R

Be sure to use whatever number of nodes, length of time, etc., is appropriate for your job.
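
Before submitting, it can be worth checking the worker function on a small local cluster. The sketch below is a suggested workflow rather than part of the instructions above: passing an integer to makePSOCKcluster starts that many workers on the current machine, so no node list or wrapper script is needed. The worker and the short parameter list are cut-down stand-ins for the ones in MyClusterCode.R.

```r
library(parallel)

# Same shape as the worker above: one argument holding a pair of indices.
testfunc <- function(a) {
  paste0(Sys.info()["nodename"], " ", a[1], " ", a[2])
}

# A short parameter list for the trial run.
indexlist <- list(c(1, 1), c(1, 2), c(2, 1), c(2, 2))

# Two workers on the local machine.
cl <- makePSOCKcluster(2)
result <- clusterApplyLB(cl, indexlist, testfunc)
stopCluster(cl)

print(unlist(result))
```

Once the worker behaves as expected locally, the same function can be dropped into the full script and run across the nodes assigned by the scheduler.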

SciNet's R Classes

There is a dizzying amount of documentation available for programming in R; consult your favourite search engine. That being said, SciNet runs several classes each year on using R for research:

  • MSC1090: Introduction to Computational BioStatistics with R. This graduate-level, IMS-sponsored class is open to graduate students in IMS or other fields. This class is intended for those with little-to-no programming experience who wish to use R in scientific research.
  • EES1137: Quantitative Applications for Data Analysis. This class is similar to MSC1090, but takes place at UTSC, and is sponsored by the Department of Physical and Environmental Sciences.