SOSCIP GPU

From SciNet Users Documentation
Installed: September 2017
Operating System: Ubuntu 16.04 le
Number of Nodes: 14x Power 8 with 4x NVIDIA P100
Interconnect: Infiniband EDR
Ram/Node: 512 GB
Cores/Node: 2 x 10core (20 physical, 160 SMT)
Login/Devel Node: sgc01
Vendor Compilers: xlc/xlf, nvcc

SOSCIP

The SOSCIP GPU Cluster is a Southern Ontario Smart Computing Innovation Platform (SOSCIP) resource located at the University of Toronto's SciNet HPC facility. The SOSCIP multi-university/industry consortium is funded by the Ontario Government and the Federal Economic Development Agency for Southern Ontario [1].

Support Email

Please use <soscip-support@scinet.utoronto.ca> for SOSCIP GPU specific inquiries.

Specifications

The SOSCIP GPU Cluster consists of 14 IBM Power 822LC "Minsky" servers, each with 2x10core 3.25GHz Power8 CPUs and 512GB of RAM. Similar to Power 7, the Power 8 utilizes Simultaneous MultiThreading (SMT), but extends the design to 8 threads per core, allowing the 20 physical cores to support up to 160 threads. Each node has 4x NVIDIA Tesla P100 GPUs, each with 16GB of RAM and CUDA Capability 6.0 (Pascal), connected using NVLink.
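
The per-node and cluster-wide totals implied by these figures can be checked with a little shell arithmetic. This is only a sketch; the variable names are illustrative and do not correspond to any system tool.

```shell
# Figures taken from the specifications above; variable names are illustrative.
nodes=14
sockets_per_node=2
cores_per_socket=10
smt_threads_per_core=8
gpus_per_node=4
gpu_mem_gb=16

cores_per_node=$(( sockets_per_node * cores_per_socket ))
threads_per_node=$(( cores_per_node * smt_threads_per_core ))
total_gpus=$(( nodes * gpus_per_node ))
gpu_mem_per_node_gb=$(( gpus_per_node * gpu_mem_gb ))

echo "physical cores per node: ${cores_per_node}"
echo "SMT threads per node:    ${threads_per_node}"
echo "GPUs in the cluster:     ${total_gpus}"
echo "GPU memory per node:     ${gpu_mem_per_node_gb} GB"
```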

Access and Login

In order to obtain access to the system, you must request access to the SOSCIP GPU Platform. Instructions will have been sent to your sponsoring faculty member via E-mail at the beginning of your SOSCIP project.

Access to the SOSCIP GPU Platform is provided through the BGQ login node, bgqdev.scinet.utoronto.ca, using ssh; from there you can proceed to the GPU development node sgc01-ib0 via ssh. Your user name and password are the same as for other SciNet systems.
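
As a rough illustration, the two-step login can be written out as below. The user name myuser is a placeholder. The one-step variant uses the -J (ProxyJump) option of OpenSSH 7.3 or newer on your own machine; whether the login node permits this kind of forwarding is an assumption, so fall back to the two-step form if it fails. The snippet only prints the commands rather than running them.

```shell
# Placeholder user name; replace with your SciNet user name.
scinet_user="myuser"
login_host="bgqdev.scinet.utoronto.ca"
gpu_host="sgc01-ib0"

# Two-step login as described above (the second command is run on bgqdev).
step1="ssh ${scinet_user}@${login_host}"
step2="ssh ${gpu_host}"

# One-step alternative via ProxyJump (assumes the login node allows forwarding).
jump_cmd="ssh -J ${scinet_user}@${login_host} ${scinet_user}@${gpu_host}"

echo "$step1"
echo "$step2"
echo "$jump_cmd"
```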

Filesystem

The filesystem is shared with the BGQ system. See here for details.

Job Submission

The SOSCIP GPU cluster uses SLURM as its job scheduler, and jobs are scheduled by node, i.e. 20 cores and 4 GPUs each. Jobs are submitted from the login/development node sgc01. The maximum walltime per job is 12 hours (except in the 'long' queue, see below), and a job can use up to 8 nodes.

$ sbatch myjob.script

Where myjob.script is

#!/bin/bash
#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=20  # MPI tasks (needed for srun/mpirun) 
#SBATCH --time=00:10:00  # H:M:S
#SBATCH --gres=gpu:4     # Ask for 4 GPUs per node

cd $SLURM_SUBMIT_DIR

hostname
nvidia-smi

More information about the sbatch command is found here.


You can query job information using

squeue

To see only your own jobs, run

squeue -u <userid>

Once your job is running, SLURM creates a file, usually named slurm-<jobid>.out, in the directory from which you issued the sbatch command. This file contains the console output from your job. You can monitor the output of your job by using the tail -f <file> command.
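
A minimal sketch of finding the output file for a job you have just submitted, assuming the default SLURM output name slurm-<jobid>.out and the usual "Submitted batch job <jobid>" message from sbatch. A sample message is used here so the parsing can be tried without a scheduler; the job ID 12345 is made up.

```shell
# Sample text in the format sbatch prints on success; in a real session use:
#   sbatch_output=$(sbatch myjob.script)
sbatch_output="Submitted batch job 12345"

# The job ID is the fourth whitespace-separated field.
jobid=$(echo "$sbatch_output" | awk '{print $4}')

# Default SLURM output file name for that job.
outfile="slurm-${jobid}.out"
echo "Follow the job output with: tail -f ${outfile}"
```

sbatch also accepts a --parsable option that prints the job ID without the surrounding text, which avoids the awk step.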


To cancel a job use

scancel $JOBID

Longer jobs

If your job requests more than 12 hours, the sbatch command will not let you submit it. There is, however, a way to run jobs up to 24 hours long: specify "-p long" as an option (i.e., add #SBATCH -p long to your job script). The priority of such jobs may be throttled in the future if we see that the 'long' queue is having a negative effect on turnover time in the queue.
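
The walltime rule can be summarized in a small helper. This is a hypothetical function written for illustration (partition_option_for_hours is not a system command); it takes a walltime in whole hours and prints the partition option that sbatch would need.

```shell
# Hypothetical helper: print the sbatch partition option needed for a
# walltime given in whole hours, following the limits described above.
partition_option_for_hours() {
  hours=$1
  if [ "$hours" -le 12 ]; then
    echo ""            # default queue, no extra option needed
  elif [ "$hours" -le 24 ]; then
    echo "-p long"     # the 'long' queue allows up to 24 hours
  else
    echo "error: walltimes above 24 hours are not accepted" >&2
    return 1
  fi
}

partition_option_for_hours 8     # prints an empty line
partition_option_for_hours 18    # prints: -p long
```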

Interactive

For an interactive session use

salloc --gres=gpu:4

After executing this command, you may have to wait in the queue until a system is available.

More information about the salloc command is here.

Automatic Re-submission and Job Dependencies

You may often have a job that you know will take longer to run than the queue permits. As long as your program has checkpoint or restart capability, you can have one job automatically submit the next. The following example assumes that the program finishes before the requested time limit; the job then resubmits itself by logging into the development node. A job dependency and a maximum number of re-submissions are used to ensure sequential operation and to end the chain.

#!/bin/bash 

#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=20  # MPI tasks (needed for srun) 
#SBATCH --time=00:10:00  # H:M:S
#SBATCH --gres=gpu:4     # Ask for 4 GPUs per node

cd $SLURM_SUBMIT_DIR

: ${job_number:="1"}           # set job_number to 1 if it is undefined
job_number_max=3

echo "hi from ${SLURM_JOB_ID}"

#RUN JOB HERE


# SUBMIT NEXT JOB
if [[ ${job_number} -lt ${job_number_max} ]]
then
  (( job_number++ ))
  # \$4 is escaped so that the local shell does not expand it before awk runs remotely
  next_jobid=$(ssh sgc01-ib0 "cd $SLURM_SUBMIT_DIR; /opt/slurm/bin/sbatch --export=job_number=${job_number} -d afterok:${SLURM_JOB_ID} thisscript.sh | awk '{print \$4}'")
  echo "submitted ${next_jobid}"
fi
 
sleep 15

echo "${SLURM_JOB_ID} done"
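
The counter logic that ends the chain can be tried without a scheduler. The sketch below replaces the ssh/sbatch call with a plain loop, so it only illustrates how job_number and job_number_max bound the number of submissions; the variable submitted is introduced for this example only.

```shell
# Simulate the chain: each loop iteration stands in for one job in the sequence.
unset job_number
job_number_max=3
submitted=0

while true; do
  : ${job_number:="1"}               # first job: job_number defaults to 1
  echo "job ${job_number} runs"
  if [ "${job_number}" -lt "${job_number_max}" ]; then
    job_number=$(( job_number + 1 ))
    submitted=$(( submitted + 1 ))   # stands in for the sbatch call
    echo "  submitted follow-up job ${job_number}"
  else
    echo "  job_number_max reached, chain ends"
    break
  fi
done
```

With job_number_max=3, three jobs run in total and two follow-up submissions are made.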

Packing single-GPU jobs within one SLURM job submission

Jobs are scheduled by node (4 GPUs) on the SOSCIP GPU cluster. If your program cannot use all 4 GPUs, you can use the GNU Parallel tool to pack 4 or more single-GPU jobs into one SLURM job. Below is an example of running 4 single-GPU Python programs within one job. (When using GNU Parallel for a publication, please cite it as per parallel --citation.)

#!/bin/bash
#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=20  # MPI tasks (needed for srun) 
#SBATCH --time=00:10:00  # H:M:S
#SBATCH --gres=gpu:4     # Ask for 4 GPUs per node

module load gnu-parallel/20180422
cd $SLURM_SUBMIT_DIR

parallel -a jobname-params.input --colsep ' ' -j 4 'CUDA_VISIBLE_DEVICES=$(( {%} - 1 )) numactl -N $(( ({%} -1) / 2 )) python {1} {2} {3} &> jobname-{#}.out'

The jobname-params.input file contains:

code-1.py --param1=a --param2=b
code-2.py --param1=c --param2=d
code-3.py --param1=e --param2=f
code-4.py --param1=g --param2=h
  • In the above example, GNU Parallel reads the jobname-params.input file and splits each row into parameters on spaces (--colsep ' '). Each row must contain exactly 3 parameters for python; the script name code-N.py counts as one of them. To pass a different number of parameters, change the placeholders in the parallel command ({1} {2} {3}...).
  • The "-j 4" flag limits the number of simultaneous jobs to 4. The input file can have more rows, but GNU Parallel runs at most 4 of them at the same time.
  • "CUDA_VISIBLE_DEVICES=$(( {%} - 1 ))" assigns one GPU to each job. "numactl -N $(( ({%} -1) / 2 ))" binds 2 jobs to CPU socket 0 and the other 2 jobs to socket 1. {%} is the job slot, which is replaced by 1, 2, 3 or 4 in this case.
  • Outputs are written to jobname-1.out, jobname-2.out, jobname-3.out, jobname-4.out, and so on. {#} is the job sequence number, which corresponds to the row number in the input file.
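
The slot-to-GPU and slot-to-socket arithmetic can be checked on its own. In this sketch a plain loop stands in for the {%} job slot that GNU Parallel would substitute; the variable mapping exists only to collect the results.

```shell
# {%} in GNU Parallel is the job slot (1..4 with -j 4); emulate it with a loop.
mapping=""
for slot in 1 2 3 4; do
  gpu=$(( slot - 1 ))              # value given to CUDA_VISIBLE_DEVICES
  socket=$(( (slot - 1) / 2 ))     # value given to numactl -N (integer division)
  echo "slot ${slot}: GPU ${gpu}, CPU socket ${socket}"
  mapping="${mapping}${gpu}:${socket} "
done
```

Slots 1 and 2 land on socket 0 with GPUs 0 and 1, and slots 3 and 4 land on socket 1 with GPUs 2 and 3.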

Software Installed

IBM PowerAI

The PowerAI platform contains popular open machine learning frameworks such as Caffe, TensorFlow, and Torch. Run the module avail command for a complete listing. More information is available at this link: https://developer.ibm.com/linuxonpower/deep-learning-powerai/releases/. Release 4.0 is currently installed.

GNU Compilers

The system default compiler is GCC 5.4.0. More recent versions of the GNU Compiler Collection (C/C++/Fortran) are provided in the IBM Advance Toolchain, which includes enhancements for the POWER8 CPU. To load a newer Advance Toolchain version, use:

Advance Toolchain V10.0

module load gcc/6.4.1

Advance Toolchain V11.0

module load gcc/7.3.1

More information about the IBM Advance Toolchain can be found here: https://developer.ibm.com/linuxonpower/advance-toolchain/

IBM XL Compilers

To load the native IBM xlc/xlc++ and xlf (Fortran) compilers, run

module load xlc/13.1.5
module load xlf/15.1.5

IBM XL Compilers are enabled for use with NVIDIA GPUs, including support for OpenMP 4.5 GPU offloading and integration with NVIDIA's nvcc command to compile host-side code for the POWER8 CPU.

Information about the IBM XL Compilers can be found at the following links:

IBM XL C/C++

IBM XL Fortran

NVIDIA GPU Driver

The current NVIDIA driver version is 396.26

CUDA

The currently installed CUDA Toolkit versions are 8.0, 9.0 and 9.2.

module load cuda/<version>


The CUDA driver is installed locally; the CUDA Toolkit versions are installed in:

/scinet/sgc/Libraries/CUDA/8.0
/scinet/sgc/Libraries/CUDA/9.0
/usr/local/cuda-9.2

Documentation and API reference information for the CUDA Toolkit can be found here: http://docs.nvidia.com/cuda/index.html

OpenMPI

Currently, OpenMPI has been set up on the 14 nodes connected over EDR Infiniband.

$ module load openmpi/2.1.1-gcc-5.4.0
$ module load openmpi/2.1.1-XL-13_15.1.5

Other Software

Other software packages can be installed onto the SOSCIP GPU Platform. It is best to try installing new software in your own home directory, which will give you control of the software (e.g. exact version, configuration, installing sub-packages, etc.).

In the following subsections are instructions for installing several common software packages.

Anaconda (Python)

Anaconda is a popular distribution of the Python programming language. It contains several common Python libraries such as SciPy and NumPy as pre-built packages, which eases installation.

Anaconda can be downloaded from here: https://www.anaconda.com/download/#linux

NOTE: Be sure to download the Power8 installer.

TIP: If you plan to use TensorFlow within Anaconda, download the Python 2.7 version of Anaconda and run conda install tensorflow-gpu.

Bazel

Bazel is provided as modules on the system:

bazel/0.10.0
bazel/0.15.0

cuDNN

The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN accelerates widely used deep learning frameworks, including Caffe2, MATLAB, Microsoft Cognitive Toolkit, TensorFlow, Theano, and PyTorch. If you need a specific version of cuDNN, you can download it from https://developer.nvidia.com/cudnn by choosing "cuDNN [VERSION] Library for Linux (Power8/Power9)".

The default cuDNN installed on the system is version 6 with CUDA-8 from IBM PowerAI. More recent cuDNN versions are installed as modules:

cudnn/cuda9.0/7.0.5
cudnn/cuda9.2/7.1.4

Horovod

Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. The goal of Horovod is to make distributed Deep Learning fast and easy to use.

To use Horovod on the SOSCIP GPU cluster, first install TensorFlow or PyTorch, then load the modules:

module load openmpi/2.1.1-gcc-5.4.0 cuda/9.2 cudnn/cuda9.2/7.1.4 nccl/2.2.13

Horovod can be installed by pip with the following configuration:

HOROVOD_CUDA_HOME=/usr/local/cuda-9.2/ HOROVOD_NCCL_HOME=/scinet/sgc/Libraries/NCCL/cuda9.2/2.2.13/ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod

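A multi-node example using the TensorFlow benchmarks is below. The job script that follows is run from inside the cloned benchmarks directory:

git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks
git reset --hard c12839f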

A 2-node job script:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=20  # MPI tasks (needed for srun/mpirun)
#SBATCH --cpus-per-task=8 # One physical core has 8 logical cores
#SBATCH --time=00:10:00  # H:M:S
#SBATCH --gres=gpu:4     # Ask for 4 GPUs per node

module load openmpi/2.1.1-gcc-5.4.0 cuda/9.2 cudnn/cuda9.2/7.1.4 nccl/2.2.13

export OMP_NUM_THREADS=1 #numpy package on ppc64le with OpenBLAS multithreading may lead to incorrect answers
#The TensorFlow environment also needs to be set up here

mpirun -np $((SLURM_NTASKS/5)) -bind-to core -map-by slot:PE=5 -report-bindings -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x OMP_NUM_THREADS -mca pml ob1 -mca btl ^openib python -u scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 32 --variable_update horovod

20 tasks are required per node, which gives MPI 20 slots per node. However, mpirun launches only 4 MPI ranks per node, because -np is set to $((SLURM_NTASKS/5)). Each rank is given 5 slots (-bind-to core -map-by slot:PE=5), i.e. 5 physical cores (40 threads). The -report-bindings flag prints the detailed binding information.
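
The rank and core counts for the 2-node job script above can be worked through as follows. SLURM sets SLURM_NTASKS at run time; it is set by hand here so the arithmetic can be tried anywhere, and the other variable names are illustrative.

```shell
# Values from the 2-node job script above.
nodes=2
ntasks_per_node=20
cpus_per_task=8                              # SMT threads per physical core
SLURM_NTASKS=$(( nodes * ntasks_per_node ))  # normally set by SLURM

ranks_total=$(( SLURM_NTASKS / 5 ))          # value passed to mpirun -np
ranks_per_node=$(( ranks_total / nodes ))    # one rank per GPU
cores_per_rank=5                             # from -map-by slot:PE=5
threads_per_rank=$(( cores_per_rank * cpus_per_task ))

echo "MPI slots in total:      ${SLURM_NTASKS}"
echo "MPI ranks launched:      ${ranks_total} (${ranks_per_node} per node)"
echo "hardware threads / rank: ${threads_per_rank}"
```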

  • Benchmarking results in images/sec: (ResNet-50 with 32 batch size per GPU, 128 batch size per node)
1-node: 805.85
2-node: 1572.88
4-node: 3105.03
8-node: 6153.69
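
To put these numbers in perspective, the scaling efficiency relative to a single node (measured throughput divided by node count times the 1-node throughput) can be computed with awk. This is a derived calculation on the figures listed above, not an additional measurement.

```shell
# Throughput figures (images/sec) copied from the list above: "<nodes> <images/sec>".
results="1 805.85
2 1572.88
4 3105.03
8 6153.69"

efficiency=$(echo "$results" | awk '
  NR == 1 { base = $2 }    # single-node throughput is the reference
  { printf "%d-node: %.1f%%\n", $1, 100 * $2 / ($1 * base) }
')
echo "$efficiency"
# 1-node: 100.0%
# 2-node: 97.6%
# 4-node: 96.3%
# 8-node: 95.5%
```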

The -mca pml ob1 and -mca btl ^openib flags force the use of TCP for MPI communication. This avoids many multiprocessing issues that Open MPI has with RDMA, which typically result in segmentation faults. Using TCP for MPI has no noticeable performance impact, since most of the heavy communication is done by NCCL, which uses RDMA via InfiniBand. Notable exceptions are models that make heavy use of hvd.broadcast() and hvd.allgather() operations. To make those operations use RDMA, disable multithreading by adding the -x HOROVOD_MPI_THREADS_DISABLE=1 option and remove -mca btl ^openib.
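
As a sketch of that change, the snippet below derives the RDMA variant of the mpirun options from the ones used in the job script above. It only manipulates and prints the option string; the variable names tcp_flags and rdma_flags are illustrative, and the suffix removal relies on -mca btl ^openib being the last option in the string.

```shell
# MPI options from the job script above (between "-np ..." and "python").
tcp_flags="-bind-to core -map-by slot:PE=5 -report-bindings -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x OMP_NUM_THREADS -mca pml ob1 -mca btl ^openib"

# Variant for models that rely heavily on hvd.broadcast()/hvd.allgather():
# drop the trailing "-mca btl ^openib" and disable Horovod MPI multithreading.
rdma_flags="${tcp_flags%" -mca btl ^openib"}"
rdma_flags="${rdma_flags} -x HOROVOD_MPI_THREADS_DISABLE=1"

echo "$rdma_flags"
```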

Keras

Keras (https://keras.io/) is a popular high-level deep learning software development framework. It runs on top of other deep-learning frameworks such as TensorFlow.

  • The easiest way to install Keras is to install Anaconda first, then install Keras with the conda install command. Keras uses TensorFlow underneath to run neural network models. Before running code that uses Keras, be sure to load the PowerAI TensorFlow module and the cuda module, OR install a customized TensorFlow.
  • Keras can also be installed into a Python virtual environment by using pip. You can install the optimized numpy and scipy wheels (built with OpenBLAS) before installing Keras.

In a virtual environment (python2.7 + pip example):

pip install --upgrade /scinet/sgc/Libraries/numpy/numpy-1.14.3-cp27-cp27mu-linux_ppc64le.whl
pip install --upgrade /scinet/sgc/Libraries/scipy/scipy-1.1.0-cp27-cp27mu-linux_ppc64le.whl
pip install keras

NCCL

The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance optimized for NVIDIA GPUs. NCCL is provided as modules on the system:

module load cuda/9.2 nccl/2.2.13

NumPy/SciPy (built with OpenBLAS)

Optimized NumPy and SciPy are provided as Python wheels located in /scinet/sgc/Libraries/numpy and /scinet/sgc/Libraries/scipy and can be installed by pip. Please uninstall old numpy/scipy before installing the new ones.

PyTorch

PyTorch is provided as a prebuilt Python wheel that users can install with pip in user space. Custom Python wheels are stored in /scinet/sgc/Applications/PyTorch_wheels. It is highly recommended to install custom PyTorch wheels into a Python virtual environment.

Installing with Python2.7:

  • Create a virtual environment pytorch-env-py2 that can also use the system-installed packages:
virtualenv --python=python2.7 --system-site-packages pytorch-env-py2
  • Activate the virtual environment:
source pytorch-env-py2/bin/activate
  • Install PyTorch into the virtual environment (the custom NumPy built with OpenBLAS can be installed first):
pip install --upgrade --force-reinstall /scinet/sgc/Libraries/numpy/numpy-1.14.3-cp27-cp27mu-linux_ppc64le.whl
pip install /scinet/sgc/Applications/PyTorch_wheels/torch-0.4.0a0+3749c58-cp27-cp27mu-linux_ppc64le.whl

Installing with Python3.5:

  • Create a virtual environment pytorch-env-py3 that can also use the system-installed packages:
virtualenv --python=python3.5 --system-site-packages pytorch-env-py3
  • Activate the virtual environment:
source pytorch-env-py3/bin/activate
  • Install PyTorch into the virtual environment (the custom NumPy built with OpenBLAS can be installed first):
pip3 install --upgrade --force-reinstall /scinet/sgc/Libraries/numpy/numpy-1.14.3-cp35-cp35m-linux_ppc64le.whl
pip3 install /scinet/sgc/Applications/PyTorch_wheels/torch-0.4.0a0+3749c58-cp35-cp35m-linux_ppc64le.whl

Submitting jobs

The above myjob.script file needs to be modified to run custom PyTorch: the cuda/9.2 and cudnn/cuda9.2/7.1.4 modules must be loaded, and the virtual environment must be activated.

#!/bin/bash
#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=20  # MPI tasks (needed for srun/mpirun) 
#SBATCH --time=00:10:00  # H:M:S
#SBATCH --gres=gpu:4     # Ask for 4 GPUs per node

module purge
module load cuda/9.2 cudnn/cuda9.2/7.1.4
source pytorch-env-py2/bin/activate #change this to the location where virtual environment is created

export OMP_NUM_THREADS=1 #numpy package on ppc64le with OpenBLAS multithreading may lead to incorrect answers

cd $SLURM_SUBMIT_DIR
python code.py

TensorFlow (new versions and python3)

The TensorFlow version included in PowerAI or Anaconda may not be the most recent one. Newer versions of TensorFlow are provided as prebuilt Python wheels that users can install with pip in user space. Custom Python wheels are stored in /scinet/sgc/Applications/TensorFlow_wheels. It is highly recommended to install custom TensorFlow wheels into a Python virtual environment.

Installing with Python2.7:

  • Create a virtual environment tensorflow-1.8-py2 that can also use the system-installed packages:
virtualenv --python=python2.7 --system-site-packages tensorflow-1.8-py2
  • Activate the virtual environment:
source tensorflow-1.8-py2/bin/activate
  • Install TensorFlow into the virtual environment (the custom NumPy built with OpenBLAS can be installed first):
pip install --upgrade --force-reinstall /scinet/sgc/Libraries/numpy/numpy-1.14.3-cp27-cp27mu-linux_ppc64le.whl
pip install /scinet/sgc/Applications/TensorFlow_wheels/tensorflow-1.8.0-cp27-cp27mu-linux_ppc64le.whl

Installing with Python3.5:

  • Create a virtual environment tensorflow-1.8-py3 that can also use the system-installed packages:
virtualenv --python=python3.5 --system-site-packages tensorflow-1.8-py3
  • Activate the virtual environment:
source tensorflow-1.8-py3/bin/activate
  • Install TensorFlow into the virtual environment (the custom NumPy built with OpenBLAS can be installed first):
pip3 install --upgrade --force-reinstall /scinet/sgc/Libraries/numpy/numpy-1.14.3-cp35-cp35m-linux_ppc64le.whl
pip3 install /scinet/sgc/Applications/TensorFlow_wheels/tensorflow-1.8.0-cp35-cp35m-linux_ppc64le.whl

Submitting jobs

The above myjob.script file needs to be modified to run custom TensorFlow: the cuda/9.0 and cudnn/cuda9.0/7.0.5 modules must be loaded, and the virtual environment must be activated.

#!/bin/bash
#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=20  # MPI tasks (needed for srun/mpirun) 
#SBATCH --time=00:10:00  # H:M:S
#SBATCH --gres=gpu:4     # Ask for 4 GPUs per node

module purge
module load cuda/9.0 cudnn/cuda9.0/7.0.5
source tensorflow-1.8-py2/bin/activate #change this to the location where virtual environment is created

export OMP_NUM_THREADS=1 #numpy package on ppc64le with OpenBLAS multithreading may lead to incorrect answers
cd $SLURM_SUBMIT_DIR
python code.py

LINKS

Summit Dev System at ORNL

DOCUMENTATION

  1. GPU Cluster Introduction: SOSCIP GPU Platform