Octave


GNU Octave is a popular interactive data analysis language. It has been specifically designed to be an open-source alternative to MATLAB, and as such most MATLAB code can be run successfully using Octave. Octave code is fast to write, but the resulting programs run much more slowly than equivalent C or Fortran; be wary of doing too much compute-intensive work in Octave.

Running Octave on Niagara

The available Octave modules can be found using the usual module commands. As of December 2018 the following Octave module is available:

   $ module spider octave
   ----------------------------------------------------------------------------
   octave: octave/4.4.0
   ----------------------------------------------------------------------------

The octave module depends upon the gcc and mkl modules. You load the module in the usual way:

    $ module load gcc/7.3.0 mkl/2018.2
    $ module load octave/4.4.0
    $ octave
    octave:1>

Running serial Octave jobs

As with all serial jobs, if your Octave computations do not use multiple cores, you should bundle them up so that all 40 cores of a node are performing work; a minimal sketch is given below, and further examples can be found on this page.
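
A sketch of such a bundled submission script, assuming a hypothetical serial Octave script myscript.m which takes a case number as its argument:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=1:00:00
#SBATCH --job-name OctaveSerialBundle

cd $SLURM_SUBMIT_DIR

module load gcc/7.3.0 mkl/2018.2 octave/4.4.0

# Launch 40 serial Octave computations, one per core, in the background,
# each writing to its own output file, then wait for all of them to finish.
for i in $(seq 1 40); do
  octave myscript.m $i > output.$i 2>&1 &
done
wait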

Using a Jupyter Notebook

You may develop your Octave scripts in a Jupyter Notebook on Niagara. A node has been set aside as a Jupyter Hub. See this page for details on how to access that node and develop your code.

Rmpi (R with MPI)

None of the R installations on Niagara have Rmpi installed by default.

Installing Rmpi for R 3.5.0

R version 3.5.0 does not include the Rmpi library as a standard package, which means you have to install it yourself if you are using that version. The same is true if you want to use IntelMPI instead of OpenMPI.

Installing the Rmpi package can be a bit challenging, since some additional parameters need to be given to the installation, which contain the path to various header files and libraries. These paths differ depending on what MPI version you are using.

The various MPI versions on Niagara are loaded with the module command. So the first thing to do is to decide what MPI version to use (OpenMPI or IntelMPI), and to type the corresponding "module load" command on the command line (as well as in your job scripts).

Because the MPI modules define all the necessary paths in environment variables, the following lines seem to work for installations against all OpenMPI versions.

$ module load intel/2018.2
$ module load openmpi/3.1.0
$ module load r/3.5.0
$
$ R
>
> install.packages("Rmpi",
                   configure.args =
                   c(paste("--with-Rmpi-include=", Sys.getenv("SCINET_OPENMPI_ROOT"), "/include", sep=""),
                     paste("--with-Rmpi-libpath=", Sys.getenv("SCINET_OPENMPI_ROOT"), "lib", sep=""),
                     "--with-Rmpi-type=OPENMPI"))

For IntelMPI, change OPENMPI to MPICH2 in the last line, and point the include and library paths at the IntelMPI installation instead of the OpenMPI one.
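
A hedged sketch of the IntelMPI variant, assuming the intelmpi module sets an analogous SCINET_INTELMPI_ROOT environment variable (this variable name is an assumption; verify it with "module show intelmpi" or "env | grep -i mpi"):

> install.packages("Rmpi",
                   configure.args =
                   c(paste("--with-Rmpi-include=", Sys.getenv("SCINET_INTELMPI_ROOT"), "/include", sep=""),
                     paste("--with-Rmpi-libpath=", Sys.getenv("SCINET_INTELMPI_ROOT"), "/lib", sep=""),
                     "--with-Rmpi-type=MPICH2"))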

Running Rmpi

To start using R with Rmpi, make sure you have all required modules loaded (e.g. module load intel/2018.2 openmpi/3.1.0 r/3.5.0), then launch it with

$ mpirun -np 1 R --no-save

which starts a single master MPI process, along with the infrastructure needed to spawn additional processes from within R.
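
A minimal sketch of a session that uses this infrastructure, via Rmpi's standard spawning functions (the worker count of 4 is an arbitrary example); save it to a file and run it with "mpirun -np 1 R --no-save < yourfile.R":

library(Rmpi)

# Spawn four additional worker processes from the master process.
mpi.spawn.Rslaves(nslaves = 4)

# Run a command on every worker and collect the results on the master.
result <- mpi.remote.exec(paste("Worker", mpi.comm.rank(), "of", mpi.comm.size() - 1))
print(result)

# Shut down the workers and exit cleanly.
mpi.close.Rslaves()
mpi.quit()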

Creating an R cluster

The 'parallel' package allows you to use R to launch individual serial subjobs across multiple nodes. This section describes how this is accomplished.

Creating your Rscript wrapper

The first thing to do is create a wrapper for Rscript. This needs to be done because the R module needs to be loaded on all nodes, but the submission script only loads modules on the head node of the job. The wrapper script, let's call it MyRscript.sh, is short:

#!/bin/bash

module load intel/2018.2 r/3.5.0
${SCINET_R_ROOT}/bin/Rscript --no-restore "$@"

The "--no-restore" flag prevents Rscript from loading your "workspace image", if you have one saved. Loading the image causes problems for the cluster.

Once you've created your wrapper, make it executable:

$ chmod u+x MyRscript.sh

Your wrapper is now ready to be used.

The cluster R code

The R code which we will run consists of two parts: the code which launches the cluster and performs pre- and post-analysis, and the worker code which is run on the individual cluster "nodes". Here is some code which demonstrates this functionality. Let's call it MyClusterCode.R.


######################################################
#
#  worker code
#

# first define the function which will be run on all the cluster nodes.  This is just a test function.  
# Put your real worker code here.
testfunc <- function(a) {

  # this part is just to waste time
  b <- 0
  for (i in 1:10000) {
      b <- b + 1
  }

  s <- Sys.info()['nodename']
  return(paste0(s, " ", a[1], " ", a[2]))

}


######################################################
#
#  head node code
#

# Create a bunch of index pairs to feed to the worker function.  These could be parameters,
# or whatever your code needs to vary across jobs.  Note that the worker function only 
# takes a single argument; each entry in the list must contain all the information 
# that the function needs to run.  In this example, each entry contains a list which
# contains two pieces of information, a pair of indices.
indexlist <- list()
index <- 1
for (i in 1:10) {
  for (j in 1:10) {
     indexlist[index] <- list(c(i,j))
     index <- index +1
   }
}
 

# Now set up the cluster.

# First load the parallel library.
library(parallel)

# Next find all the nodes which the scheduler has given to us.
# These are given by the SLURM_JOB_NODELIST environment variable.
nodelist <- Sys.getenv("SLURM_JOB_NODELIST")

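# SLURM_JOB_NODELIST uses Slurm's compressed hostlist format, e.g. "nia0123" for
# a single node or "nia[0123-0125,0130]" for several; split out the numeric parts.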
node_ids <- unlist(strsplit(nodelist,split="[^a-z0-9-]"))[-1]

if (length(node_ids)>0) {
  expanded_ids <- lapply(node_ids, function (id) {
    ranges <- as.numeric(
      unlist(strsplit(id, split="[-]"))
    )
    if (length(ranges)>1) seq(ranges[1], ranges[2], by=1) else ranges
  })
  
  nodelist <- sprintf("nia%04d", unlist(expanded_ids))
}

# Now launch the cluster, using the list of nodes and our Rscript
# wrapper.
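# Note that makePSOCKcluster starts one R worker per entry in 'names'; to run
# more than one worker per node, repeat the node names, e.g.
# names = rep(nodelist, each = 40).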
cl <- makePSOCKcluster(names = nodelist, rscript = "/path/to/your/MyRscript.sh")

# Now run the worker code, using the parameter list we created above.
result <- clusterApplyLB(cl, indexlist, testfunc)

# The results of all the jobs will now be put in the 'result' variable,
# in the order they were specified in the 'indexlist' variable.

# Don't forget to stop the cluster when you're finished.
stopCluster(cl)

You can, of course, add any post-processing code you need to the above code.

Submitting an R cluster job

You are now ready to submit your job to the Niagara queue. The submission script is like most others:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=5:00:00
#SBATCH --job-name MyRCluster

cd $SLURM_SUBMIT_DIR

module load intel/2018.2 r/3.5.0

${SCINET_R_ROOT}/bin/Rscript --no-restore MyClusterCode.R

Be sure to use whatever number of nodes, length of time, etc., is appropriate for your job.

SciNet's R Classes

There is a dizzying amount of documentation available for programming in R; consult your favourite search engine. That being said, SciNet runs several classes each year on using R for research:

  • MSC1090: Introduction to Computational BioStatistics with R. This graduate-level IMS-sponsored class is open to graduate students in IMS and other fields. It is intended for those with little-to-no programming experience who wish to use R in scientific research.
  • EES1137: Quantitative Applications for Data Analysis. This class is similar to MSC1090, but takes place at UTSC, and is sponsored by the Department of Physical and Environmental Sciences.