FAQs for Queueing and Running on XC40

 

Where can I find the details of a job submission?

Details of the job submission are recorded in the <jobid>.cobaltlog. This file contains the qsub command and environment variables. The location of this file can be controlled with the ‘qsub --debuglog <path>’ that defaults to the same place as the .output and .error files.

Why is my job stuck in "starting" state?

If you submit a job and qstat shows it in "starting" state for 5 minutes or more, most likely your memory/numa mode selection requires rebooting some or all of the nodes your job was assigned. This process takes about 15 minutes, during which your job appears to be in the "starting" phase. When no reboots are required, the "starting" phase only lasts a matter of seconds.

What are the "utime" and "stime" values printed at the bottom of the <jobid>.output file?

At the bottom of a <jobid>.ouput file, there is usually a line like:

Application 3373484 resources: utime ~6s, stime ~6s, Rss ~5036, inblocks ~0, outblocks ~8

The "utime" and "stime" values are user CPU time and system CPU time from the aprun and getrusage commands. They are rounded aggregate numbers scaled by the number of resources used, and are approximate. The aprun man page has more information about them.