Running Jobs on XC40


Job Submission

The batch scheduler used on Theta is Cobalt. Below is a basic introduction to submitting jobs using Cobalt. More details on Cobalt options and how to query, alter, and delete submitted jobs can be found in the section on Cobalt Job Control (Cray XC40). For information on queues on Cray XC40, see XC40 queues. For information on priority and scheduling, see XC40 priority and scheduling.

There are two main types of jobs one can submit: script and interactive.

  • In a script job, a script is given to Cobalt and, when scheduled, the script is run on one of the service nodes. The script can then call "aprun <executable>" to launch executables on the compute nodes.
  • In an interactive job, a "-I" is passed to Cobalt and when scheduled, you are given a shell on a service node. You can then execute aprun to launch jobs to the compute nodes directly. This is useful for rapid debugging.

Overview of Where Jobs Run on Theta

To understand how jobs run, it is useful to keep in mind the different types of nodes (login, service, and compute) that make up Theta, and on which nodes the jobs run. When a user ssh's into Theta, they are given a shell on a login node. The login nodes are typically where a user compiles code and submits jobs to the batch scheduler. To run a job on Theta, the user submits a script (or interactive job) to the batch scheduler from the login node. However, the script (or shell, in an interactive job) does not run directly on the compute nodes; it first runs on a service node. Like the login nodes, service nodes are not part of the main computational resources of the machine; they are intermediate nodes from which the submission script launches executables onto the compute nodes with the aprun command.
 
Note:
  • "Service nodes" is a general term for non-compute nodes. The service nodes which launch jobs are more specifically called “MOM” or “launch” nodes on Theta, and both terms are used below.
 
In overview: the user submits with qsub from a login node, the script (or interactive shell) runs on a MOM/launch node, and aprun launches executables onto the compute nodes.
 

Submitting Script Jobs

General information about submitting script jobs:

To run a script job, first write a batch submission script containing (optional) Cobalt scheduler options and an aprun command, which launches a given executable on the compute nodes. Then submit the script to the batch scheduler with qsub:

qsub -t <mins> -n <nodes> -q <queue> -A <project_name> --mode script myscript.sh

This is equivalent to

qsub -t <mins> -n <nodes> -q <queue> -A <project_name> myscript.sh
  • -t <mins> denotes the maximum time to run the job
  • -n <nodes> denotes the number of compute nodes to reserve
  • -q <queue> denotes the queue to run on
  • -A <project_name> charges the job to a particular project

For information on the available queues and the time and node limits of each queue, please refer to the XC40 Queues page.

To see which projects you are a member of, run sbank on the command line:

   sbank

The projects you belong to are listed in the "Project" column.

You can use the environment variable “COBALT_PROJ” to set your default project. Setting qsub -A at submission time will override the COBALT_PROJ environment variable.
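
For example, to set a default project in bash (MyProject is a placeholder for one of your project names):

   export COBALT_PROJ=MyProject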

Cobalt requires that the script file have the execute bit set. If you try submitting a script and see the following:

command myscript.sh is not executable

then set the executable bit with the following, and resubmit:

chmod +x myscript.sh

Examples of submitting script jobs:

To run myscript.sh with 128 nodes for a maximum of 30 minutes in the default queue and charge the job to MyProject:

   qsub -q default -n 128 -t 30 -A MyProject myscript.sh

To run myscript.sh with 2 nodes for a maximum of 30 minutes in the debug queue for flat memory mode and quad numa mode and charge the job to MyProject:

   qsub -q debug-flat-quad -n 2 -t 30 -A MyProject myscript.sh

General information about writing submission scripts:

A general example submission script is shown below, followed by more specific examples. Note that the script must include the aprun command to run the executable on the compute nodes:

#!/bin/bash
#COBALT -t 30
#COBALT -n 128
#COBALT -q default
#COBALT --attrs mcdram=cache:numa=quad
#COBALT -A Catalyst
echo "Starting Cobalt job script"
export n_nodes=$COBALT_JOBSIZE
export n_mpi_ranks_per_node=32
export n_mpi_ranks=$(($n_nodes * $n_mpi_ranks_per_node))
export n_openmp_threads_per_rank=4
export n_hardware_threads_per_core=2
export n_hardware_threads_skipped_between_ranks=4
aprun -n $n_mpi_ranks -N $n_mpi_ranks_per_node \
  --env OMP_NUM_THREADS=$n_openmp_threads_per_rank -cc depth \
  -d $n_hardware_threads_skipped_between_ranks \
  -j $n_hardware_threads_per_core \
  <executable> <executable args>

This script shows the aprun command and many of the flags that affect job launch. Using $COBALT_JOBSIZE and a $n_mpi_ranks_per_node factor allows the same script to be run at any supported size on Theta.

The -cc, -d, and -j arguments specify CPU affinity binding, which determines how the MPI ranks and OpenMP threads map to physical cores in the compute nodes. This can have a large impact on performance, but the defaults should provide generally reasonable performance.

  • -cc <depth|none|cpu_list> : This sets how the MPI ranks and OpenMP threads are bound to physical cores. Three common options are "none" (the threads are not bound to a core; in this case, OpenMP or Intel (KMP) environment variables can be used to control affinity), "depth" (affinity is set based on the -d and -j flags below; this is probably the most common option), or an explicit list of the cores to bind the ranks/threads to. See the aprun man page for more details and examples.
  • -n <total_number_ranks> : This specifies the number of MPI ranks to run.
  • -N <number_ranks_per_node> : This specifies the total number of MPI ranks per node to run.
  • -d <num_hardware_threads_per_rank> : This specifies the number of hardware threads to assign to each MPI rank. Often this is the number of OpenMP threads.
  • -j <number_hardware_threads_per_core> : This specifies the number of hardware threads to use per physical core. Since there are 4 hardware threads per physical core, the maximum value is 4.

Note that (<num_hardware_threads_per_rank> * <number_ranks_per_node>) is the total number of hardware threads allocated on a node. This number must not exceed the maximum number of hardware threads per node (256 on Theta). For more information on the affinity bindings, see the examples below or see Affinity on Cray XC40.
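
For example, a hypothetical aprun line placing 32 ranks per node, each with 4 hardware threads, using 2 hardware threads per core (myprogram.exe is a placeholder, and the total rank count assumes a 128-node allocation): this uses 32 * 4 = 128 of the 256 hardware threads on each node.

aprun -n 4096 -N 32 --env OMP_NUM_THREADS=4 -cc depth -d 4 -j 2 myprogram.exe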

Some environment variables that might be useful:

  • --env OMP_NUM_THREADS=<number_omp_threads> (to specify the number of OpenMP threads)
  • -cc none --env KMP_AFFINITY=<affinity> (to use Intel's thread affinity settings, similar to the OpenMP environment variable OMP_PROC_BIND)

When using aprun and the -d argument, you should always use -cc depth. Enter aprun -h or man aprun from the command prompt on Theta for more information about aprun and its arguments.

Submission scripts must be executable from the MOM/launch nodes, and will run from a launch node.

To submit this script, you might use the following (on the login node):

qsub myscript.sh

Any Cobalt arguments passed on the qsub command line will override the corresponding #COBALT directives in the script itself, with the exception of the “--env” argument, which will concatenate to the existing environment list.

For example, submitting

qsub -n 256 myscript.sh

will ask the batch scheduler for 256 compute nodes, regardless of what is listed in the "#COBALT -n" directive in the script.

Additional details:

  • The arguments to aprun are different from those of the qsub command; type "aprun -h” or “man aprun” for a complete list of arguments. Environment variables are passed to aprun with the “--env” flag. Multiple --env flags can be used to pass multiple variables, or they may be passed in a single ‘:’-delimited list to --env.
  • The number of nodes allocated for the job is determined by the qsub option "-n". This is different from the "-n" argument to aprun, which specifies the total number of ranks to launch; the -N argument specifies the number of ranks per node. The aprun "-n" value may be any total rank count that, combined with the ranks per node given by "-N", fits within the number of nodes requested with qsub.
  • The --cwd argument may be used to set the working directory within the script to be different than the working directory it was invoked from. Note that the argument to --cwd must be an absolute path (i.e., starting with "/"). Relative paths are not currently supported.
  • The job script will be executed on a dedicated MOM node (also known as a launch node). These Intel Broadwell nodes are distinct from the KNL compute nodes. All script jobs share these nodes, so take their shared capacity into consideration when deciding what to run in the script.
  • The entire time a job is running, a compute node partition is reserved exclusively for that job (regardless of whether aprun is executing or not). Important: the job charges are for the entire script job duration, not just the portion that actually runs with aprun.
  • Redirection of stdin, stdout, or stderr (e.g., ">") on the aprun command should behave as expected. However, only PE 0 will receive stdin.
  • The exit status of the script determines whether Cobalt considers the job script execution successful. This is important if the --dependencies flag is used (see Job Dependencies), since a dependent job will only start if the exit status of the job it depends on is 0. Normally, a script's exit status is the status of the last command executed, unless there is an explicit "exit" command; a sketch of propagating an aprun failure is shown after this list.
  • Depending on the memory mode desired, the job submission command may have extra arguments. The general case is described here, but for more details see XC40 Memory Modes.
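
For example, a minimal sketch of propagating the aprun exit status as the script's exit status, so that a dependent job is only released if the run succeeds (myprogram.exe is a placeholder):

#!/bin/bash
aprun -n 8192 -N 64 -cc depth -d 1 -j 1 myprogram.exe
status=$?
echo "aprun exited with status $status"
exit $status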

Example of MPI-only submission script:

The following is an example script that launches an MPI-only job. It requests 8192 ranks ("-n 8192"), with 64 ("-N 64") on each node. Based on the affinity settings, each MPI rank will be placed on a separate physical core.

#!/bin/bash
#COBALT -t 30
#COBALT -n 128
#COBALT -q default
#COBALT --attrs mcdram=cache:numa=quad
#COBALT -A Catalyst
echo "Starting Cobalt job script on 128 nodes with 64 ranks on each node,"
echo "for a total of 8192 ranks"

aprun -n 8192 -N 64 -cc depth -d 1 -j 1 myprogram.exe 

Example of MPI/OpenMP submission script:

The following is an example script that launches an MPI/OpenMP job. It requests 8192 ranks ("-n 8192"), with 64 ("-N 64") on each node, and 4 OpenMP threads per rank. Based on the affinity settings, each MPI rank will be placed on a separate physical core, and the 4 OpenMP threads of each MPI rank will be assigned to the 4 hardware threads on that physical core.

#!/bin/bash
#COBALT -t 30
#COBALT -n 128
#COBALT -q default
#COBALT --attrs mcdram=cache:numa=quad
#COBALT -A Catalyst
echo "Starting Cobalt job script on 128 nodes with 64 ranks on each node"
echo "with 4 OpenMP threads per rank, for a total of 8192 ranks and 32768 OpenMP threads"

aprun -n 8192 -N 64 --env OMP_NUM_THREADS=4 -cc depth -d 4 -j 4 myprogram.exe 

For details on how to display how the threads and ranks map to physical cores (the affinity), see Affinity on Cray XC40.

Submitting Interactive Jobs

To interactively run aprun and launch executables onto the compute nodes, the -I flag can be passed to qsub (and the executable script omitted). These jobs will provide you with a shell on the launch node where you can run aprun to launch jobs to the compute nodes. You can call aprun with up to the number of resources requested in the initial qsub. When you are finished with your interactive runs, you may end your job by exiting the shell that Cobalt spawned on your behalf. The shell will not terminate at the end of the allocation time, although all currently running apruns will be terminated and other apruns from that session will fail. This allows you to take whatever actions are needed at the end of your session.

This is useful if you have many small debugging runs, and don't want to submit each to the batch system.

Example:

qsub -A <project_name> -t <mins> -q <queue> -n <nodes> -I

When Cobalt starts your job, you will get a command prompt on a launch node, from which you may issue aprun commands:

frodo@thetalogin6:~ $ qsub -I -n 1 -t 10 -q <debug-cache-quad|debug-flat-quad> -A Project
Connecting to thetamom1 for interactive qsub...
Job routed to queue "<debug-cache-quad|debug-flat-quad>".
Wait for job 97442 to start...
Opening interactive session to 3208
frodo@thetamom1:/gpfs/theta-fs1/home/frodo $ aprun -n 1 -N 1 myprogram.exe <arguments>

Bundling Multiple Runs into a Script Job

There are several ways to bundle many jobs together in a single script, and then submit that script to the batch scheduler. The advantage of this process is that you wait in the queue only once.

1. Running many jobs one after another

The simplest way of bundling many apruns in a script is simply to list them one after another. The apruns will run sequentially, one at a time. Each aprun can use up to the number of nodes that were requested in the initial qsub. See the previous section about basic Cobalt script submission and its restrictions. The following code is an example of completing multiple runs within a script, where each aprun requests the same number of nodes:

#!/bin/bash
echo "Starting Cobalt job script"
aprun -n 128 -N 64 myprogram.exe arg1
aprun -n 128 -N 64 myprogram.exe arg2
aprun -n 128 -N 64 myprogram.exe arg3

The aprun command blocks until task completion, at which point it exits, providing a convenient way to run multiple short jobs together. In addition, if a subset of nodes is requested, aprun will place jobs on nodes in the script’s reservation until the pool of inactive nodes is exhausted. If the number of nodes requested by an aprun exceeds the number of nodes reserved by the batch scheduler for the job (through the qsub command), that aprun will fail to execute and an error will be returned.

2. Running many jobs at the same time

Multiple simultaneous apruns can be launched by backgrounding the aprun commands in the script and then waiting for their completion. A short sleep between apruns is recommended to avoid a potential race condition when a large number of apruns start. As an example, the following script launches 3 simultaneous apruns, which execute on the compute nodes at the same time. The first aprun runs on 3 nodes (192/64), the second on 4 nodes (256/64), and the last one on 1 node (64/64). Since the apruns are backgrounded (as denoted by the &), the script must have a "wait" command at the end so that it does not exit before the apruns complete.

#!/bin/bash
echo "Starting Cobalt job script"
aprun -n 192 -N 64 myprogram.exe arg1 &
sleep 1
aprun -n 256 -N 64 myprogram.exe arg1 &
sleep 1
aprun -n 64 -N 64 myprogram.exe arg1 &
wait

Submit the job using qsub:

qsub -A <project_name> -q <queue> -t 60 -n 8 myjob.sh

Since the three apruns in the above example run simultaneously, each on a separate set of nodes, 8 total nodes (3 + 4 + 1) are requested in the qsub command.

Note:

  • Each of the apruns will run on a separate set of nodes. It's currently not possible to run multiple apruns on the same node at the same time.
  • There is a system limitation of 1,000 simultaneous aprun invocations in a job script. If this limit is hit, you will see the error:
apsched: no more claims allowed for this reservation (max 1000)

3. Using a Workflow Manager

There are a variety of workflow managers that can assist with bundling jobs together.

Deleting a Script Job or Interactive Job

To delete a job from the queue, use the qdel command.

Cancel job 34586:

   qdel 34586

Depending on the stage of a job’s lifetime, qdel may not complete immediately, especially if the delete is issued during startup on a job that is changing memory modes and rebooting a node. If the job does not ultimately terminate, contact support@alcf.anl.gov with the jobid so that an administrator can take appropriate cleanup actions and administratively terminate the job.

Querying Partition Availability

To determine which partitions are currently available to the scheduler, use the nodelist command. This command provides a list of node ids, names, queue, and state as well as any backfill windows. For example:

% nodelist
Node_id  Name         Queues   Status             MCDRAM  NUMA   Backfill
================================================================================
[...]
20       c0-0c0s5n0   default  cleanup-pending    flat    quad   4:59:44
21       c0-0c0s5n1   default  cleanup-pending    flat    quad   4:59:44
22       c0-0c0s5n2   default  busy               flat    quad   4:59:44
24       c0-0c0s6n0   default  busy               flat    quad   4:59:44
25       c0-0c0s6n1   default  busy               flat    quad   4:59:44
26       c0-0c0s6n2   default  busy               flat    quad   4:59:44
27       c0-0c0s6n3   default  busy               flat    quad   4:59:44
28       c0-0c0s7n0   default  idle               flat    quad   4:59:44
29       c0-0c0s7n1   default  idle               flat    quad   4:59:44
30       c0-0c0s7n2   default  idle               flat    quad   4:59:44
31       c0-0c0s7n3   default  idle               flat    quad   4:59:44
32       c0-0c0s8n0   default  idle               flat    quad   4:59:44
33       c0-0c0s8n1   default  idle               flat    quad   4:59:44
34       c0-0c0s8n2   default  idle               flat    quad   4:59:44
[...]

Job Settings

Environment Variables

Pre-defined

The following environment variables are set in the Cobalt script job environment:

COBALT_PARTNAME - physical nodes assigned by cobalt (e.g., "340-343" from a 4-node run)
COBALT_PARTSIZE - on XC40, identical to COBALT_JOBSIZE
COBALT_JOBSIZE - number of nodes requested

The following environment variables are set in the Cobalt script environment, as well as in the compute node environment:

COBALT_JOBID - the job ID assigned by cobalt (e.g., 130850)
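
For example, a script might use the job ID to keep output from different runs separate (a minimal sketch; the output file name is arbitrary and myprogram.exe is a placeholder):

aprun -n 8192 -N 64 -cc depth -d 1 -j 1 myprogram.exe > run_${COBALT_JOBID}.out 2>&1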

User-defined

Pass a single variable:

qsub -t 30 -n 128 --env VAR=value myjob.sh

Pass more than one environment variable using colon separation:

qsub -t 30 -n 128 --env VAR1=value1:VAR2=value2:VAR3=value3 myjob.sh

Note that multiple --env arguments are additive:

qsub -t 30 -n 128 --env VAR1=value1 --env VAR2=value2 myjob.sh

Remember to place this argument and the other Cobalt arguments before your executable name.

Within a script mode job, use the -e argument to aprun, as shown in the following example:

# Pass a single variable
aprun -n 64 -N 64 -e VAR=value myprogram.exe

# Pass more than one variable using multiple -e arguments
aprun -n 64 -N 64 -e VAR1=value1 -e VAR2=value2 myprogram.exe

Another way to set environment variables is by setting them in the submission script (using bash as an example):

#!/bin/bash
export VAR=value
aprun -n 64 -N 64 myprogram.exe

Script Environment

The script job will receive your non-interactive login shell environment as it is set at the time the script job is launched. Any changes needed from your default login environment must be placed in the script. Note that any changes made in the configuration files of the default environment after job submission but prior to job execution will show up in the executed job.
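
For example, a script can set up its environment explicitly rather than relying on the login defaults (a sketch; the PATH addition and thread count are only illustrative and depend on what your application needs):

#!/bin/bash
export PATH=$HOME/bin:$PATH    # illustrative environment change made explicitly in the script
export OMP_NUM_THREADS=4       # illustrative; exported variables are available to aprun as shown above
aprun -n 8192 -N 64 -cc depth -d 4 -j 4 myprogram.exe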

Program and Argument Length Limit

The total length of the executable and all arguments to your program may not be longer than 4,096 characters (this is a system limitation). The executable must be no more than 512 characters.

Job Dependencies

Cobalt’s job dependency feature can be used to declare that the submitted job will not run until all jobs listed after "--dependencies" in the qsub command finish running and exit with a status of 0. In the following example, the job will not begin to run until jobs with COBALT_JOBIDs 305998 and 305999 have exited with a status of 0.

qsub -q default -n 128 -t 30 -A MyProject myscript.sh --dependencies 305998:305999
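
Since Cobalt's qsub prints the ID of the newly submitted job, a dependency chain can also be scripted; a minimal sketch (first.sh and second.sh are placeholder script names, and this assumes qsub writes only the job ID to standard output):

# Capture the first job's ID, then make the second job depend on it
JOBID1=$(qsub -q default -n 128 -t 30 -A MyProject first.sh)
qsub -q default -n 128 -t 30 -A MyProject --dependencies $JOBID1 second.sh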

If a job terminates abnormally, any jobs depending on that one will not be released and will remain in the dep_hold state. To clear this hold:

qalter --dependencies none <jobid>

System Sizes and Limitations

On Theta, job sizes from a single node to all nodes are supported, barring queue limitations (See Job Scheduling Policy for XC40 Systems for information about queue limitations). A job must have sufficient nodes available and idle in its queue to run. Within a Cobalt job, there is a limitation of 1,000 simultaneous aprun invocations permitted due to a system limitation in Cray’s ALPS software stack to prevent resource starvation on the launch nodes. When running many small jobs, it is highly advised to mitigate startup costs and resource startup time by bundling them together into a script. This provides more efficient use of node-hours, as well as making the job’s score accrual more favorable due to the larger overall job size.

Requesting Local SSD Requirements

Theta's compute nodes are equipped with SSDs that are available to projects that request their use. You may indicate that your job requires local SSD storage, and specify the amount of free space on the SSDs that your job requires. SSD storage is only for use during your job, and all data written to the local SSDs will be deleted when your Cobalt job terminates. There is no automated backup of any data on the local node SSDs. If your project has requested the use of local SSDs, the storage is located at /local/scratch.

If your project has been granted access to the local SSDs, you can request the use of local SSD storage by adding ssds=required to the --attrs argument of your qsub command. You may indicate the minimum amount of free space required on the local SSDs by adding ssd_size=N to your --attrs argument, where N is the required size in GB. Any job with these settings will be run on nodes that have SSDs enabled. If there are insufficient SSD-enabled nodes available on the system for a job's node count, the job will not run. Currently the maximum size of SSD available on Theta is 128 GB.
 

For example:

--attrs ssds=required:ssd_size=128
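
A full submission requesting SSD-enabled nodes might then look like (MyProject and myscript.sh are placeholders):

qsub -n 128 -t 60 -A MyProject --attrs ssds=required:ssd_size=128 myscript.sh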

Requesting Specific Memory or Clustering Modes on the Compute Nodes

The Intel Xeon Phi compute nodes on Theta can be booted into different memory or cluster modes. The different memory and clustering modes can have an effect on performance. Booting into a specific memory/clustering mode can be selected during job submission using the --attrs option to qsub. See XC40 Memory Modes for more information and examples.
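
For example, to request nodes booted in flat memory mode with quad clustering (a sketch mirroring the --attrs syntax used in the scripts above; MyProject and myscript.sh are placeholders, and XC40 Memory Modes documents the full set of options):

qsub -n 128 -t 30 -A MyProject --attrs mcdram=flat:numa=quad myscript.sh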

Requesting Ability to SSH into the Compute Nodes

To be able to ssh from the MOM/launch nodes on Theta to the compute nodes, pass enable_ssh=1 as part of the --attrs argument (see the example below). Once the job begins on the MOM/launch node, you can ssh (or scp, etc.) from the MOM node to the compute nodes. The compute node name is formed by taking the node number from $COBALT_PARTNAME and prepending "nid", zero-padded to 5 digits (e.g., partition 3835 becomes nid03835).

For example, for an interactive job:

n@thetalogin4:~/> qsub -I -n 1 -t 20 --attrs enable_ssh=1 -A project -q debug-cache-quad
Connecting to thetamom1 for interactive qsub...
Job routed to queue "debug-cache-quad".
Memory mode set to cache quad for queue debug-cache-quad
Wait for job 266815 to start...
Opening interactive session to 3835
n@thetamom1:/gpfs/mira-home/user> echo $COBALT_PARTNAME
3835
n@thetamom1:/gpfs/mira-home/user> ssh nid03835
n@nid03835:~> hostname
nid03835
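
Within a script job submitted with enable_ssh=1, the node name can also be constructed programmatically; a minimal sketch in bash (for multi-node allocations, $COBALT_PARTNAME is a range such as 340-343, so this takes the first node in the range):

#!/bin/bash
# Take the first node ID from $COBALT_PARTNAME and zero-pad it to five digits
first_node=${COBALT_PARTNAME%%-*}
nid=$(printf "nid%05d" $first_node)
ssh $nid hostname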