Running Jobs on XC40


Job Submission

The batch scheduler used on Theta is Cobalt. Below is a basic introduction to submitting jobs using Cobalt. More details on Cobalt options and how to query, alter, and delete submitted jobs can be found in the section on Cobalt Job Control (Cray XC40).

There are two main types of jobs one can submit: script and interactive. In a script job, a script is given to Cobalt and, when scheduled, it is run on one of the service nodes. The script can then run executables on the compute nodes via aprun. Interactive jobs are useful when you need to execute aprun invocations directly, such as when debugging rapidly.

Submitting a Script Job

If an executable program is invoked from within a script, use the special script mode when invoking qsub:

qsub -A <project_name> -t <mins> -n <nodes> --mode script <script_name>

The script must include the aprun command to run the executable on the compute nodes:

#!/bin/bash
#COBALT -t 30
#COBALT -n 128
#COBALT --attrs mcdram=cache:numa=quad
#COBALT -A Catalyst
echo "Starting Cobalt job script"
export n_nodes=$COBALT_JOBSIZE
export n_mpi_ranks_per_node=32
export n_mpi_ranks=$(($n_nodes * $n_mpi_ranks_per_node))
export n_openmp_threads_per_rank=4
export n_hyperthreads_per_core=2
export n_hyperthreads_skipped_between_ranks=4
aprun -n $n_mpi_ranks -N $n_mpi_ranks_per_node \
  --env OMP_NUM_THREADS=$n_openmp_threads_per_rank -cc depth \
  -d $n_hyperthreads_skipped_between_ranks \
  -j $n_hyperthreads_per_core \
  <executable> <executable args>

Here, -n is the total number of MPI ranks in the job, and -N is the number of ranks per node. Using $COBALT_JOBSIZE and a $n_mpi_ranks_per_node factor allows the same script to be run at any supported size on Theta. The -cc argument specifies CPU affinity binding. When using aprun and the -d argument, you should always use -cc depth. Enter aprun -h or man aprun from the command prompt for more information about aprun and its arguments. This script must be executable from the “MOM” nodes, also known as “launch” nodes, and will run from a launch node.
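
As a concrete illustration (the node and rank counts are hypothetical), a job submitted with qsub -n 8 that runs 32 ranks per node, 4 OpenMP threads per rank, and 2 hyperthreads per core would use:

aprun -n 256 -N 32 --env OMP_NUM_THREADS=4 -cc depth -d 4 -j 2 <executable> <executable args>

Here -n 256 is 8 nodes x 32 ranks per node, and -d 4 leaves 4 hardware threads between rank starting positions to hold each rank's OpenMP threads.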

To submit this script, you might use the following (on the login node):

qsub myscript.sh

Any arguments passed on the command line to qsub will override the corresponding #COBALT directives in the script itself, with the exception of the --env argument, which concatenates with the existing environment list.
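
For example, the following submission overrides the #COBALT -t 30 directive in the script above with a 60-minute walltime (the value is illustrative):

qsub -t 60 myscript.sh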

Additional details:

  • The arguments to aprun differ from those of the qsub command; type "aprun -h" or "man aprun" for a complete list of arguments. Environment variables are specified with the --env flag. Multiple --env flags can be used to pass multiple variables, and variables may also be passed in a ':'-delimited list to a single --env.
  • The aprun -N argument specifies the number of ranks per node. The number of nodes allocated for the job is determined by the qsub option -n; an aprun may specify any value for -n that fits within that number of nodes when combined with its per-node rank count.
  • The --cwd argument may be used to set the working directory within the script to be different from the working directory it was invoked from. Note that the argument to --cwd must be an absolute path (i.e., starting with "/"); relative paths are not currently supported.
  • The job script will be executed on a dedicated launch node, also known as a MOM node. These Intel Broadwell nodes are distinct from the KNL compute nodes. All script jobs share these nodes, so it is important to take their shared capacity into consideration when deciding what to run in the script.
  • The entire time a job is running, a compute node partition is reserved exclusively for the user, regardless of whether the user is executing aprun or not. Important: job charges cover the entire script job duration, not just the portions during which aprun is actually running.
  • Redirection of stdin, stdout, or stderr (e.g., ">") on the aprun command should behave as expected. However, only PE 0 will receive stdin.
  • The exit status of the script determines whether Cobalt considers the job script execution successful. This is important if the user is using --dependencies (see Job Dependencies). Normally, a script's exit status is the status of the last command executed, unless there is an explicit "exit" command, as in the sketch below.
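
A minimal sketch of propagating an aprun failure through the script's exit status, so that dependent jobs are held when a run fails (the program name is a placeholder):

#!/bin/bash
aprun -n 128 -N 64 myprogram.exe
status=$?
echo "aprun exited with status $status"
# Make the aprun status the script's exit status, which Cobalt checks
exit $status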

Interactive Jobs

Apruns may be invoked interactively against a Cobalt resource allocation by passing qsub the -I flag and omitting the executable script. These jobs provide the user a shell on a launch node, from which aprun may be run against the resource allocation. The user shell will not terminate at the end of the allocated time, although all currently running apruns will be terminated and subsequent apruns from that session will fail. This allows the user to take whatever actions are needed at the end of the session.

Example:

qsub -A <project_name> -t <mins> -n <nodes> -I

When Cobalt starts your job, you will get a command prompt on a launch node, from which you may issue aprun commands:

frodo@thetalogin6:~ $ qsub -I -n 1 -t 10 --queue <debug-cache-quad|debug-flat-quad> -A Project
Connecting to thetamom1 for interactive qsub...
Job routed to queue "<debug-cache-quad|debug-flat-quad>".
Wait for job 97442 to start...
Opening interactive session to 3208
frodo@thetamom1:/gpfs/theta-fs1/home/frodo $ aprun -n 1 -N 1 myprogram.exe <arguments to myprogram>

When you are finished with your interactive runs, you may end your job by exiting the shell that Cobalt spawned on your behalf. If the requested wallclock time of the interactive session is exceeded, then all currently running apruns in your job will terminate and all subsequent aprun invocations will fail due to the job’s resources being released and returned to the pool. Your shell session will continue, however, allowing you to take whatever cleanup actions you require before exiting the session.

Multiple Runs within a Script

If all runs require the same-sized partition, the user can submit a single Cobalt job script and conduct multiple runs within it. The advantage is that the user waits in the queue only once. Users should reference the previous section on basic Cobalt script submission and its restrictions. The following code is an example of completing multiple runs within a script:

#!/bin/bash
echo "Starting Cobalt job script"
aprun -n 128 -N 64 myprogram.exe arg1
aprun -n 128 -N 64 myprogram.exe arg2
aprun -n 128 -N 64 myprogram.exe arg3

Multiple apruns may be invoked within a script. Together, they may use up to the number of nodes requested by the job at the same time. The aprun command blocks until task completion, at which point it exits, providing a convenient way to run multiple short jobs together. In addition, if a subset of nodes is requested, aprun will place jobs on nodes in the script’s reservation until the pool of inactive nodes is exhausted. Should the number of nodes requested by an aprun cause the number of reserved nodes to be exceeded, that aprun will fail to execute and an error will be returned.

Multiple simultaneous apruns may be accommodated by backgrounding each aprun in the script and then waiting for completion. An aprun frontend process should never be paused (for example, with SIGSTOP): ALPS exchanges internal keep-alive messages with the frontend aprun process and will terminate the job if the frontend cannot be reached. A short sleep between apruns is recommended to avoid a potential race condition when starting a large number of apruns.

-----myjob.sh------
#!/bin/sh
echo "Starting Cobalt job script"
aprun -n 128 -N 64 run1.exe arg1 &
sleep 1
aprun -n 256 -N 64 run1.exe arg1 &
sleep 1
aprun -n 512 -N 64 run1.exe arg1 &
wait

-----end myjob.sh---

Submit the job using qsub:

qsub -A <project_name> -t 60 -n 14 myjob.sh
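
Note that a bare wait discards the individual aprun exit statuses. A variant that waits on each backgrounded aprun's PID can propagate failures through the script's exit status (a sketch under the same assumptions as myjob.sh above):

#!/bin/sh
echo "Starting Cobalt job script"
aprun -n 128 -N 64 run1.exe arg1 & pid1=$!
sleep 1
aprun -n 256 -N 64 run1.exe arg1 & pid2=$!
status=0
# Wait on each PID individually so a failed aprun is not masked
wait $pid1 || status=1
wait $pid2 || status=1
exit $status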

Job Settings

Environment Variables

Pre-defined

The following environment variables are set in the Cobalt script job environment:

COBALT_PARTNAME - physical nodes assigned by Cobalt (e.g., "340-343" for a 4-node run)
COBALT_PARTSIZE - on the XC40, identical to COBALT_JOBSIZE
COBALT_JOBSIZE - number of nodes requested

The following environment variables are set in the Cobalt script environment, as well as in the compute node environment:

COBALT_JOBID - the job ID assigned by Cobalt (e.g., 130850)
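
A minimal sketch of inspecting these variables from within a script job (the single-node aprun is purely illustrative):

#!/bin/bash
echo "Job $COBALT_JOBID: $COBALT_JOBSIZE node(s), partition $COBALT_PARTNAME"
# COBALT_JOBID is also set in the compute node environment:
aprun -n 1 -N 1 printenv COBALT_JOBID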

User-defined

To pass your own environment variables into a script job, use the --env argument to qsub:

# Pass a single variable
qsub -t 30 -n 64 --env VAR=value myjob.sh

# Pass more than one environment variable using colon separation
qsub -t 30 -n 64 --env VAR1=value1:VAR2=value2:VAR3=value3 myjob.sh

# Multiple --env arguments are additive
qsub -t 30 -n 64 --env VAR1=value1 --env VAR2=value2 myjob.sh

Remember to place the --env argument and the other Cobalt arguments before your script name on the qsub command line.

Within a script mode job, use the -e argument to aprun, as shown in the following example:

# Pass a single variable
aprun -n 64 -N 64 -e VAR=value myprogram.exe

# Pass more than one variable using multiple -e arguments
aprun -n 64 -N 64 -e VAR1=value1 -e VAR2=value2 myprogram.exe

Additional details: environment variables exported in the script job's shell are not automatically passed into myprogram.exe on the compute nodes:

#!/bin/sh
export VAR=value                        # incorrect: VAR does not reach the compute nodes
aprun -n 64 -N 64 myprogram.exe
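
To make the variable visible to the application, pass it explicitly, for example via aprun's -e flag:

#!/bin/sh
aprun -n 64 -N 64 -e VAR=value myprogram.exe   # correct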

Script Environment

The script job will receive your non-interactive login shell environment as it is set at the time the script job is launched. Any changes needed from your default login environment must be placed in the script. Note that any changes made in the configuration files of the default environment after job submission but prior to job execution will show up in the executed job.
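
For example, a script might make its environment changes explicitly rather than relying on login defaults (the module name is a placeholder, and this assumes the module command is initialized in your non-interactive shell):

#!/bin/bash
# Set up the job's environment explicitly inside the script
module load <modulename>
export PATH=$HOME/bin:$PATH
aprun -n 64 -N 64 myprogram.exe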

Program and Argument Length Limit

The total length of the executable and all arguments to your program may not be longer than 4,096 characters (this is a system limitation). The executable must be no more than 512 characters.

Job Dependencies

Cobalt’s job dependency feature is described in the qsub manpage:

--dependencies <jobid1>:<jobid2>

Set the dependencies for the job being submitted. This job won't run until jobs having ids jobid1 and jobid2 have finished successfully.
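
For example, a second job can be held until a first job finishes successfully (the job ID 12345 is hypothetical output from the first qsub):

qsub -A <project_name> -t 30 -n 64 first.sh
12345
qsub -A <project_name> -t 30 -n 64 --dependencies 12345 second.sh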

If a job terminates abnormally, any jobs depending on that one will not be released and will remain in the dep_hold state. To clear this hold:

qalter --dependencies none <jobid>

System Sizes and Limitations

On Theta, job sizes from a single node up to all 3,620 nodes are supported, barring queue limitations. A job must have sufficient nodes available and idle in its queue to run. Within a Cobalt job, at most 1,000 simultaneous aprun invocations are permitted; this limit comes from Cray's ALPS software stack and prevents resource starvation on the launch nodes. When running many small jobs, it is highly advised to bundle them together into a script to mitigate startup costs. This makes more efficient use of node-hours and also improves the job's score accrual, due to the larger overall job size.

Requesting Local SSD Requirements

Theta's compute nodes are equipped with SSDs that are available to projects that request them. You may indicate that your job requires local SSD storage and specify the amount of free space it needs. SSD storage is only for use during your job; all data written to the local SSDs is deleted when your Cobalt job terminates. There is no automated backup of any data on the local node SSDs. If your project has requested the use of local SSDs, the storage is located at /local/scratch.

You may indicate that your job requires local SSD storage by adding ssds=required to your --attrs argument, and you may specify the minimum amount of free space required by adding ssd_size=N, where N is the required size in GB. Any job with these settings will only be run on nodes with enabled SSDs. If there are insufficient SSD-enabled nodes available on the system for a job's node count, the job will not run. Currently, the maximum SSD size available on Theta is 128 GB.

For example:

--attrs ssds=required:ssd_size=128
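
Since --attrs values are colon-delimited, the SSD settings can be combined with the mcdram/numa settings shown earlier. A complete submission might look like the following (project name, node count, and script name are placeholders):

qsub -A <project_name> -t 60 -n 128 --attrs mcdram=cache:numa=quad:ssds=required:ssd_size=128 myjob.sh

Within the job, data can then be staged to and from /local/scratch; remember that it is removed when the job ends.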