Difference between revisions of "FAQ"

From SciNet Users Documentation
Revision as of 16:09, 30 July 2018

The Basics

Whom do I contact for support?

Whom do I contact if I have problems or questions about how to use the SciNet systems?

Answer:

E-mail <support@scinet.utoronto.ca>

In your email, please include the following information:

  • your username on SciNet
  • the cluster that your question pertains to (NIA, BGQ, GPU, ...; SciNet is not a cluster!),
  • any relevant error messages
  • the commands you typed before the errors occurred
  • the path to your code (if applicable)
  • the location of the job scripts (if applicable)
  • the directory from which it was submitted (if applicable)
  • a description of what it is supposed to do (if applicable)
  • if your problem is about connecting to SciNet, the type of computer you are connecting from.

Note that your password should never, never, never be sent to us, even if your question is about your account.

Avoid sending email only to specific individuals at SciNet. Your chances of a quick reply increase significantly if you email our team! (support@scinet.utoronto.ca)

What does code scaling mean?

Answer:

Please see A Performance Primer

What do you mean by throughput?

Answer:

Please see A Performance Primer.

Here is a simple example:

Suppose you need to do 10 computations. Say each of these runs for 1 day on 8 cores, but takes "only" 18 hours on 16 cores. What is the fastest way to get all 10 computations done: as 8-core jobs or as 16-core jobs? Let us assume you have two 8-core nodes (16 cores in total) at your disposal. As 8-core jobs, two computations run side by side, so the 10 jobs finish in 5 rounds of 1 day, or 5 days. As 16-core jobs, only one computation runs at a time, so the 10 jobs take 10 x 18 hours, or 7.5 days. Draw your own conclusions...
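The arithmetic in this example can be reproduced with a few lines of shell. This is only a sketch: the function name makespan_hours is made up for illustration, and it assumes that all jobs are identical, that each job fits within the available cores, and that jobs run in complete "rounds".

```shell
#!/bin/sh
# Wall-clock hours needed to finish a batch of identical jobs when only a
# fixed number of cores is available (jobs are run in "rounds").
# Arguments: number of jobs, cores per job, hours per job, total cores.
makespan_hours() {
    njobs=$1; cores_per_job=$2; hours_per_job=$3; total_cores=$4
    at_once=$(( total_cores / cores_per_job ))       # jobs that fit side by side
    rounds=$(( (njobs + at_once - 1) / at_once ))    # ceiling division
    echo $(( rounds * hours_per_job ))
}

echo "8-core jobs:  $(makespan_hours 10 8 24 16) hours"    # 5 rounds of 24 h
echo "16-core jobs: $(makespan_hours 10 16 18 16) hours"   # 10 rounds of 18 h
```

This prints 120 hours (5 days) for the 8-core jobs and 180 hours (7.5 days) for the 16-core jobs, matching the numbers above.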

I changed my .bashrc/.bash_profile and now nothing works

The default startup scripts provided by SciNet, and guidelines for them, can be found here. Certain things, like sourcing /etc/profile and /etc/bashrc, are required for various SciNet routines to work!

If the situation is so bad that you cannot even log in, please send an email to support@scinet.utoronto.ca.

Could I have my login shell changed to (t)csh?

The login shell used on our systems is bash. While tcsh is available on the GPC and the TCS, we do not support it as the default login shell at present. So "chsh" will not work, but you can always run tcsh interactively. Also, csh scripts will be executed correctly provided that they have the correct "shebang" #!/bin/tcsh at the top.

How can I run Matlab / IDL / Gaussian / my favourite commercial software at SciNet?

Answer:

Because SciNet serves such a disparate group of user communities, there is just no way we can buy licenses for everyone's commercial package. The only commercial software we have purchased is that which in principle can benefit everyone -- fast compilers and math libraries (Intel's on GPC, and IBM's on TCS).

If your research group requires a commercial package that you already have or are willing to buy licenses for, contact us at support@scinet and we can work together to find out if it is feasible to implement the package's licensing arrangement on the SciNet clusters, and if so, what is the best way to do it.

Note that it is important that you contact us before installing commercially licensed software on SciNet machines, even if you have a way to do it in your own directory without requiring sysadmin intervention. It puts us in a very awkward position if someone is found to be running unlicensed or invalidly licensed software on our systems, so we need to be aware of what is being installed where.

Do you have a recommended ssh program that will allow scinet access from Windows machines?

Answer:

The SSH programs we recommend for Windows users are:

  • MobaXterm is a tabbed ssh client with some Cygwin tools, including ssh and X, all wrapped up into one executable.
  • PuTTY - this is a terminal for windows that connects via ssh. It is a quick install and will get you up and running quickly.
    WARNING: Make sure you download putty from the official website, because there are "trojanized" versions of putty around that will send your login information to a site in Russia (as reported here).
    To set up your passphrase protected ssh key with putty, see here.
  • CygWin - this is a whole linux-like environment for windows, which also includes an X window server so that you can display remote windows on your desktop. Make sure you include the openssh and X window system in the installation for full functionality. This is recommended if you will be doing a lot of work on Linux machines, as it makes a very similar environment available on your computer.
    To set up your ssh keys, follow the Linux instructions on the Ssh keys page.



My ssh key does not work! WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!

Answer:

Testing Your Key

  • If this doesn't work, you should be able to log in using your password and investigate the problem. For example, if during a login session you get a message similar to the one below, just follow the instructions and delete the offending key on line 3 (you can use vi to jump to that line with ESC plus : plus 3). That only means that you may have logged in from your home computer to SciNet in the past, and that key is obsolete.
$ ssh USERNAME@login.scinet.utoronto.ca

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle
attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
53:f9:60:71:a8:0b:5d:74:83:52:fe:ea:1a:9e:cc:d3.
Please contact your system administrator.
Add correct host key in /home/<user>/.ssh/known_hosts to get rid of
this message.
Offending key in /home/<user>/.ssh/known_hosts:3
RSA host key for login.scinet.utoronto.ca has
changed and you have requested
  • If you get the message below, you may need to log out of your GNOME session and log back in, since the ssh-agent needs to be restarted with the new passphrase-protected ssh key.

$ ssh USERNAME@login.scinet.utoronto.ca

Agent admitted failure to sign using the key.
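For the "REMOTE HOST IDENTIFICATION HAS CHANGED" warning above, an alternative to editing known_hosts by hand is OpenSSH's own ssh-keygen -R option, which removes all stored keys for a given host. The sketch below wraps it in a small helper; the function name forget_host is made up for illustration, and it assumes the standard OpenSSH client tools are installed on the machine you connect from. Only do this when you have reason to believe the key change is legitimate.

```shell
#!/bin/sh
# Remove all stored keys for one host from a known_hosts file.
# ssh-keygen keeps a backup of the original file with the suffix ".old".
forget_host() {
    host=$1
    known_hosts=${2:-$HOME/.ssh/known_hosts}   # default OpenSSH location
    ssh-keygen -R "$host" -f "$known_hosts"
}

# Example (run on the machine you connect from, not on SciNet):
#   forget_host login.scinet.utoronto.ca
```

The next time you connect, ssh will ask you to confirm the new host key.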

Can't get graphics: "Can't open display/DISPLAY is not set"

To use graphics on SciNet machines and have them displayed on your machine, you need to have an X server running on your computer (an X server is the standard way graphics is done on linux). Once an X server is running, you can log in with the "-Y" option to ssh ("-X" sometimes also works).

How to get an X server running on your computer depends on the operating system. On linux machines with a graphical interface, X will already be running. On windows, the easiest solution is using MobaXterm, which comes with an X server (alternatives, such as cygwin with the x11 server installed, or running putty+Xming, can also work, but are a bit more work to set up). For Macs, you will need to install Xquartz.
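A quick way to see whether X forwarding is active after logging in is to look at the DISPLAY environment variable, which ssh sets when forwarding succeeds. The helper below is a minimal sketch; the name check_display is made up for illustration.

```shell
#!/bin/sh
# Log in with X forwarding first, e.g.:
#   ssh -Y USERNAME@login.scinet.utoronto.ca
# then run this on the remote side to see whether a display is available.
check_display() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "X11 forwarding looks active: DISPLAY=$DISPLAY"
    else
        echo "DISPLAY is not set: log in again with 'ssh -Y' and check that your local X server is running"
    fi
}

check_display
```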

Remote graphics stops working after a while: "Can't open display"

If you still cannot get graphics, or it works only for a while and then suddenly it "can't open display localhost:....", your X11 graphics connection may have timed out (Macs seem to be particularly prone to this). You'll have to tell your own computer to allow the X11 graphics connection, and not to time it out.

The following should fix it. The ssh configuration settings are in a file called /etc/ssh/ssh_config (or /etc/ssh_config in older OS X versions, or $HOME/.ssh/config for specific users). In the config file, find (or create) the section "Host *" (meaning all hosts) and add the following lines:

 Host *
  ServerAliveInterval 60
  ServerAliveCountMax 3
  ForwardX11 yes
  ForwardX11Trusted yes
  ForwardX11Timeout 596h

(The Host * is only needed if there was no Host section yet to append these settings to.)

If this does not resolve it, try it again with "ssh -vvv -Y ....". The "-vvv" spews out a lot of diagnostic messages. Look for anything resembling a timeout, and let us know (support AT scinet DOT utoronto DOT ca).

Can't forward X: "Warning: No xauth data; using fake authentication data", or "X11 connection rejected because of wrong authentication."

I used to be able to forward X11 windows from SciNet to my home machine, but now I'm getting these messages; what's wrong?

Answer:

This very likely means that ssh/xauth can't update your ${HOME}/.Xauthority file.

The simplest possible reason for this is that you've filled your 10GB /home quota and so can't write anything to your home directory. Use

$ module load extras
$ diskUsage

to check how close you are to your disk quota on ${HOME}.

Alternatively, this could mean your .Xauthority file has somehow become broken, corrupted or confused, in which case you can delete that file. When you next log in you'll get a similar warning message about creating .Xauthority, but things should work.

I have a CCDB account, but I can't login to SciNet. How can I get a SciNet account?

Answer:

You must extend your CCDB application process to also get a SciNet account:

https://wiki.scinet.utoronto.ca/wiki/index.php/Application_Process

https://www.scinethpc.ca/getting-a-scinet-account/


How can I reset the password for my Compute Canada account?

Answer:

You can reset your password for your Compute Canada account here:

https://ccdb.computecanada.ca/security/forgot


How can I change or reset the password for my SciNet account?

Answer:

To reset your password at SciNet, please go to the Password reset page.

If you know your old password and want to change it, that can be done here after logging in on the portal:

https://portal.scinet.utoronto.ca

Why am I getting the error "Permission denied (publickey,gssapi-with-mic,password)"?

This error can pop up in a variety of situations: when trying to log in, or after a job has finished, when the error and output files fail to be copied (there are other possible reasons for this failure as well -- see My GPC job died, telling me: Copy Stageout Files Failed). In most cases, the "Permission denied" error is caused by incorrect permissions on the (hidden) .ssh directory. Ssh is used for logging in as well as for copying the standard error and output files after a job.

For security reasons, the directory .ssh should only be writable and readable to you, but yours has read permission for everybody, and thus it fails. You can change this by

   chmod 700 ~/.ssh

And to be sure, also do

   chmod 600 ~/.ssh/id_rsa ~/.ssh/authorized_keys
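The two chmod commands can be combined into a small helper that also prints the resulting permissions, so you can confirm the change took effect. This is a sketch: the function name fix_ssh_perms is made up for illustration, and it relies on GNU stat (the -c option), as found on Linux systems.

```shell
#!/bin/sh
# Tighten permissions on an ssh directory and report the result.
# Argument: the ssh directory (defaults to ~/.ssh).
fix_ssh_perms() {
    dir=${1:-$HOME/.ssh}
    chmod 700 "$dir"                      # directory: owner only
    stat -c '%a %n' "$dir"
    for f in "$dir/id_rsa" "$dir/authorized_keys"; do
        if [ -f "$f" ]; then
            chmod 600 "$f"                # key files: owner read/write only
            stat -c '%a %n' "$f"
        fi
    done
}

# Example:
#   fix_ssh_perms            # acts on ~/.ssh
```

After running it, the directory should be listed with mode 700 and the key files with mode 600.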

ERROR:102: Tcl command execution failed? when loading modules

Modules sometimes require other modules to be loaded first. Module will let you know if you didn’t. For example:

$ module purge
$ module load python
python/2.6.2(11):ERROR:151: Module ’python/2.6.2’ depends on one of the module(s) ’gcc/4.4.0’
python/2.6.2(11):ERROR:102: Tcl command execution failed: prereq gcc/4.4.0
$ module load gcc python
$


How do I compute the core-years usage of my code?

The "core-years" quantity is a way to account for the time your code runs, by considering the total number of cores and the time used, relative to the total number of hours in a year. For instance, if your code runs for HH hours on NN nodes, where each node has CC cores, then "core-years" can be computed as follows:

HH*(NN*CC)/(365*24)

If you have several independent instances (batches) running on different nodes, with BB batches each running for HH hours on NN nodes, then your core-years usage can be computed as

BB*HH*(NN*CC)/(365*24)

As a general rule, in our GPC system, each node has only 8 cores, so CC will always be 8.
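The formulas above are easy to evaluate on the command line. The following is a minimal sketch using awk for the floating-point arithmetic; the function name core_years and the sample numbers (48 hours, 4 nodes, 10 batches) are made up for illustration.

```shell
#!/bin/sh
# core-years = BB * HH * (NN * CC) / (365 * 24)
# Arguments: hours (HH), nodes (NN), cores per node (CC), batches (BB, default 1).
core_years() {
    awk -v hh="$1" -v nn="$2" -v cc="$3" -v bb="${4:-1}" \
        'BEGIN { printf "%.4f\n", bb * hh * nn * cc / (365 * 24) }'
}

core_years 48 4 8       # one 48-hour run on 4 eight-core nodes
core_years 48 4 8 10    # ten such batches
```

For these sample inputs the two calls print 0.1753 and 1.7534 core-years, respectively.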

Compiling your Code

How do I link against the Intel Math Kernel Library?

If you need to link to the Intel Math Kernel Library (MKL) with the intel compilers, just add the -mkl flag. There are in fact three flavours: -mkl=sequential, -mkl=parallel and -mkl=cluster, for the serial version, the threaded version and the mpi version, respectively. (Note: The cluster version is available only when using the intelmpi module and mpi compilation wrappers.)

If you need to link in the Intel Math Kernel Library (MKL) libraries to gcc/gfortran/c++, you are well advised to use the Intel(R) Math Kernel Library Link Line Advisor: http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/ for help in devising the list of libraries to link with your code.

Note that this gives the link line for the command line. When using this in Makefiles, replace $MKLPATH by ${MKLPATH}.

Note too that, unless the integer arguments you will be passing to the MKL libraries are actually 64-bit integers, rather than the normal int or INTEGER types, you want to specify 32-bit integers (lp64).


Testing your Code

Can I run something for a short time on the development nodes?

I am in the process of playing around with the mpi calls in my code to get it to work. I do a lot of tests and each of them takes a couple of seconds only. Can I do this on the development nodes?

Answer:

Yes, as long as it's very brief (a few minutes). People use the development nodes for their work, and you don't want to bog the nodes down for them; testing a real code can chew up a lot more resources than compiling, etc. The procedures differ depending on what machine you're using.


Submitting your jobs

How do I charge jobs to my RAC allocation?

Answer:


How can I automatically resubmit a job?

Running your jobs

My job can't write to /home

Can I use hybrid codes consisting of MPI and OpenMP on the GPC?

Answer:


IB Memory Errors, eg reg_mr Cannot allocate memory

Infiniband requires more memory than ethernet; it can use RDMA (remote direct memory access) transport for which it sets aside registered memory to transfer data.

In our current network configuration, it requires a _lot_ more memory, particularly as you go to larger process counts; unfortunately, that means you can't get around the "I need more memory" problem the usual way, by running on more nodes. Machines with different memory or network configurations may exhibit this problem at higher or lower MPI task counts.

Right now, the best workaround is to reduce the number and size of OpenIB queues, using XRC: with OpenMPI, add the following options to your mpirun command:

-mca btl_openib_receive_queues X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32 -mca btl_openib_max_send_size 12288

With Intel MPI, you should be able to do

module load intelmpi/4.0.3.008
mpirun -genv I_MPI_FABRICS=shm:ofa  -genv I_MPI_OFA_USE_XRC=1 -genv I_MPI_OFA_DYNAMIC_QPS=1 -genv I_MPI_DEBUG=5 -np XX ./mycode

to the same end.

For more information see GPC MPI Versions.


My compute job fails, saying libpng12.so.0: cannot open shared object file or libjpeg.so.62: cannot open shared object file

Answer:

To maximize the amount of memory available for compute jobs, the compute nodes have a less complete system image than the development nodes. In particular, since graphics libraries like matplotlib and gnuplot are usually used interactively, the libraries they need are included in the devel nodes' image but not in the compute nodes' image.

Many of these extra libraries are, however, available in the "extras" module. So adding a "module load extras" to your job submission script - or, for overkill, to your .bashrc - should enable these scripts to run on the compute nodes.

Monitoring jobs in the queue

Why hasn't my job started?

Answer:

Use the moab command

checkjob -v jobid

and the last couple of lines should explain why a job hasn't started.

Please see Job Scheduling System (Moab) for more detailed information

How do I figure out when my job will run?

Answer:

Please see Job Scheduling System (Moab)



Running checkjob on my job gives me messages about JobFail and rejected

Running checkjob on my job gives me messages that suggest my job has failed, as below: what did I do wrong?

AName: test
State: Idle 
Creds:  user:xxxxxx  group:xxxxxxxx  account:xxxxxxxx  class:batch_ib  qos:ibqos
WallTime:   00:00:00 of 8:00:00
BecameEligible: Wed Jul 23 10:39:27
SubmitTime: Wed Jul 23 10:38:22
  (Time Queued  Total: 00:01:47  Eligible: 00:01:05)

Total Requested Tasks: 8

Req[0]  TaskCount: 8  Partition: ALL  
Opsys: centos6computeA  Arch: ---  Features: ---


Notification Events: JobFail

IWD:            /scratch/x/xxxxxxxx/xxxxxxx/xxxxxxx
Partition List: torque,DDR
Flags:          RESTARTABLE
Attr:           checkpoint
StartPriority:  76
rejected for Opsys        - (null)
rejected for State        - (null)
rejected for Reserved     - (null)
NOTE:  job req cannot run in partition torque (available procs do not meet requirements : 0 of 8 procs found)
idle procs: 793  feasible procs:   0

Node Rejection Summary: [Opsys: 117][State: 2895][Reserved: 19]

NOTE:  job violates constraints for partition SANDY (partition SANDY not in job partition mask)

NOTE:  job violates constraints for partition GRAVITY (partition GRAVITY not in job partition mask)

rejected for State        - (null)
NOTE:  

Answer:

The output from checkjob is a little cryptic in places, and if you are wondering why your job hasn't started yet, you might think that "rejected" and "JobFail" suggest that there's something wrong. But the above message is actually normal; you can use the showstart command on your job to get a (preliminary, subject to change) estimate as to when the job will start, and you'll find that it is in fact scheduled to start up in the near future.

In the above message:

  • `Notification Events: JobFail` just means that, if notifications are enabled, you'll get a message if the job fails;
  • `job req cannot run in partition torque` just means that the job cannot run just yet (that's why it's queued);
  • `job req cannot run in dynamic partition DDR now (insufficient procs available: 0 < 8)` says why: there aren't processors available; and
  • `job violates constraints for partition SANDY/GRAVITY` just means that the job isn't eligible to run in those particular (small) sections of the cluster.

That is, the above output is the normal and expected (if somewhat cryptic) explanation as to why the job is waiting: nothing to worry about.


How can I monitor my running jobs on TCS?

How can I monitor the load of TCS jobs?

Answer:

You can get more information with the command

/xcat/tools/tcs-scripts/LL/jobState.sh

which you can alias as:

alias llq1='/xcat/tools/tcs-scripts/LL/jobState.sh'

If you run "llq1 -n" you will see a listing of jobs together with a lot of information, including the load.


How can I check the memory usage from my jobs?


Answer:

On many occasions it can be really useful to look at how much memory your job is using while it is running. There are a couple of ways to do so:

1) Use the command line utilities we have developed, e.g. jobperf or jobtop, which let you check the job performance and the head node's utilization, respectively.

2) ssh into the nodes where your job is running and check the memory usage and system stats right there, for instance with the 'top' or 'free' commands.

Also, it is always a good idea, and strongly encouraged, to inspect the standard output and error logs generated for your job submissions. These files are named JobName.{o|e}jobIdNumber, where JobName is the name you gave to the job (via the '-N' PBS flag) and jobIdNumber is the id number of the job. These files are saved in the working directory after the job is finished, but they can also be accessed in real time using the jobError and jobOutput command line utilities, available after loading the extras module.

Other related topics to memory usage:
Using Ram Disk
Different Memory Configuration nodes
Monitoring Jobs in the Queue
Tech Talk on Monitoring Jobs


Can I run cron jobs on devel nodes to monitor my jobs?


Answer:

No, we do not permit cron jobs to be run by users. To monitor the status of your jobs using a cron job running on your own machine, use the command

ssh myusername@login.scinet.utoronto.ca "qstat -u myusername"

or some variation of this command. Of course, you will need to have SSH keys set up on the machine running the cron job, so that password entry won't be necessary.


How does one check the amount of used CPU-hours in a project, and how does one get statistics for each user in the project?

Answer:

This information is available on the SciNet portal, https://portal.scinet.utoronto.ca. See also SciNet Usage Reports.


I couldn't find the .o output file in the .pbs_spool directory as I used to

On Feb 24 2011, the temporary location of standard input and output files was moved from the shared file system ${SCRATCH}/.pbs_spool to the node-local directory /var/spool/torque/spool (which resides in ram). The final location after a job has finished is unchanged, but to check the output/error of running jobs, users will now have to ssh into the (first) node assigned to the job and look in /var/spool/torque/spool.

This alleviates access contention to the temporary directory, especially for those users that are running a lot of jobs, and reduces the burden on the file system in general.

Note that it is good practice to redirect output to a file rather than to count on the scheduler to do this for you.

My GPC job died, telling me `Copy Stageout Files Failed'

Answer:

When a job runs on GPC, the script's standard output and error are redirected to $PBS_JOBID.gpc-sched.OU and $PBS_JOBID.gpc-sched.ER in /var/spool/torque/spool on the (first) node on which your job is running. At the end of the job, those .OU and .ER files are copied to where the batch script tells them to be copied, by default $PBS_JOBNAME.o$PBS_JOBID and $PBS_JOBNAME.e$PBS_JOBID. (You can set those filenames to be something clearer with the -e and -o options in your PBS script.)

When you get errors like this:

An error has occurred processing your job, see below.
request to copy stageout files failed on node

it means that the process of copying the files back has failed in some way. There could be a few reasons for this. The first thing to check is that your .bashrc does not produce any output, as the output stageout is performed by bash and any extra output can cause it to fail. But it could also have been a random filesystem error, or it could be that your job failed spectacularly enough to short-circuit the normal job-termination process (e.g. it ran out of memory very quickly) and those files just never got copied.
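One common way to keep a .bashrc silent for non-interactive sessions is to print messages only when the shell is interactive, which can be detected from the shell flags in $-. The fragment below is a sketch of that idea, not SciNet's recommended startup script; the function name greet_if_interactive and the message text are made up for illustration.

```shell
# Fragment for ~/.bashrc: print messages only in interactive shells, so that
# non-interactive sessions (such as the one that copies job output) stay silent.
greet_if_interactive() {
    case $- in
        *i*) echo "Settings loaded on $(hostname)" ;;   # interactive shell
        *)   : ;;                                       # non-interactive: print nothing
    esac
}
greet_if_interactive
```

Any other echo or printing command in your .bashrc can be protected in the same way.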

Write to <support@scinet.utoronto.ca> if your input/output files got lost, as we will probably be able to retrieve them for you (please supply at least the jobid, and any other information that may be relevant).

Note that it is good practice to redirect output to a file rather than depending on the job scheduler to do this for you.




Data on SciNet disks

How do I find out my disk usage?

Answer:

The standard unix/linux utilities for finding the amount of disk space used by a directory are very slow, and notoriously inefficient on the GPFS filesystems that we run on the SciNet systems. There are utilities that very quickly report your disk usage:

The diskUsage command, available on the login nodes and datamovers, provides information in a number of ways on the home, scratch, and project file systems. For instance, it can show how much disk space is being used by yourself and your group (with the -a option), or how much your usage has changed over a certain period ("delta information"), or it can generate plots of your usage over time. This information is updated every 3 hours!

More information about these filesystems is available on the Data_Management page.

How do I transfer data to/from SciNet?

Answer:

All incoming connections to SciNet go through relatively low-speed connections to the nagara.scinet gateways, so using scp to copy files the same way you ssh in is not an effective way to move lots of data. Better tools are described in our page on Moving data.

My group works with data files of size 1-2 GB. Is this too large to transfer by scp to login.scinet.utoronto.ca ?

Answer:

Generally, occasional transfers of less than 10GB of data are perfectly acceptable to do through the login nodes. See Moving data.

How can I check if I have files in /scratch that are scheduled for automatic deletion?

Answer:

Please see Scratch Disk Purging Policy

How to allow my supervisor to manage files for me using ACL-based commands?

Answer:

Please see File/Ownership Management

Can I transfer files between BGQ and HPSS?

Answer: Yes, however for now you'll need to do this in 2 steps:

  • transfer from BGQ to Niagara SCRATCH
  • then from Niagara SCRATCH to HPSS

Keep 'em Coming!

Next question, please

Send your question to <support@scinet.utoronto.ca>; we'll answer it asap!