Running Jobs and Submission Scripts

Theta and ThetaGPU

Job Submission

The batch scheduler used on Theta is Cobalt. Below is a basic introduction to submitting jobs using Cobalt. More details on Cobalt options and how to query, alter, and delete submitted jobs can be found in the section on Theta Cobalt Job Control (Cray XC40). For information on queues on Theta, see Theta queues. For information on priority and scheduling, see Theta priority and scheduling.

NOTE: Everywhere in this document that aprun is referenced, that applies to the KNL nodes. 
For the GPU nodes, replace aprun with mpirun (or mpiexec) and everything else should be the same.

To request an allocation on ThetaGPU, fill out the Allocation request form. ThetaGPU is listed under Theta on the form.

There are two main types of jobs one can submit: script and interactive.

  • In a script job, a script is given to Cobalt and, when scheduled, the script is run and can call "aprun <executable>" (on KNL nodes) or "mpirun <executable>" (on GPU nodes) to run executables on the compute nodes.
  • In an interactive job, a "-I" is passed to Cobalt and when scheduled, you are given a shell on a service node (KNL) or the head node (GPU). You can then execute aprun (KNL) or mpirun (GPU) to launch jobs to the compute nodes directly. This is useful for rapid debugging.

Overview of Where Jobs Run on Theta

To understand how jobs run, it is useful to keep in mind the different types of nodes (login, service, and compute) that make up Theta, and on which nodes the jobs run. When a user ssh's into Theta, they are given a shell on a login node. The login nodes are typically where a user compiles code and submits jobs to the batch scheduler. To run a job on Theta, the user submits a script (or interactive job) to the batch scheduler from the login node.

The rest of the process depends on whether you are running on the KNL nodes or the GPU nodes:

KNL Nodes: The script (or shell in an interactive job) does not run directly on the compute nodes--it first runs on a service node. Like the login nodes, service nodes are not the compute nodes that make up the main computational resources of the machine, but are an intermediate node where the submission script launches executables to the compute nodes with the aprun command.

GPU Nodes: The script (or shell in an interactive job) is executed on the “head node” of the allocated resources. The head node is the first node listed in the nodefile and the location of the nodefile can be obtained from the environment variable COBALT_NODEFILE.  The file is formatted so that it can be passed to mpiexec via the -f option.
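As a minimal sketch of how a GPU job script might use the nodefile described above, the following reads the head node and node count from COBALT_NODEFILE and prints an mpiexec launch line. It assumes the nodefile holds one hostname per line; the fallback hostnames are hypothetical and only let the sketch run outside a Cobalt job.

```shell
# Outside a Cobalt job COBALT_NODEFILE is unset, so build a sample nodefile.
# The hostnames below are hypothetical placeholders, not real ThetaGPU nodes.
if [ -z "${COBALT_NODEFILE:-}" ]; then
    COBALT_NODEFILE=$(mktemp)
    printf '%s\n' gpu-node-01 gpu-node-02 > "$COBALT_NODEFILE"
fi

# The head node is the first entry in the nodefile.
head_node=$(head -n 1 "$COBALT_NODEFILE")

# Assumption: one line per allocated node.
num_nodes=$(wc -l < "$COBALT_NODEFILE" | tr -d ' ')

echo "Head node: $head_node"
echo "Allocated nodes: $num_nodes"
# Print the launch line instead of running it; -f takes the nodefile.
echo "Launch line: mpiexec -f $COBALT_NODEFILE <executable>"
```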


  • "Service nodes" is a general term for non-compute nodes. The service nodes which launch jobs are more specifically called “MOM” or “launch” nodes on Theta, and both terms are used below.


An overview of the process is:

[Figure: running jobs]

Submitting Script Jobs

General information about submitting script jobs:

To run a script job, first write a batch submission script containing Cobalt scheduler options (optional) and an aprun (KNL) or mpiexec (GPU) command, which launches a given executable on the compute nodes. Then send the script to the batch scheduler with qsub:

qsub -t <mins> -n <nodes> -q <queue> -A <project_name> --mode script

This is equivalent to

qsub -t <mins> -n <nodes> -q <queue> -A <project_name>
  • -t <mins> denotes the maximum time to run the job
  • -n <nodes> denotes the number of compute nodes to reserve
  • -q <queue> denotes the queue to run on
  • -A <project_name> charges the job to a particular project

For information on the available queues, time, and node limits of each queue please refer to Theta Queues.

To see which projects you are a member of, type "sbank" on the command line. The output lists your projects in the "Project" column.

You can use the environment variable “COBALT_PROJ” to set your default project. Setting qsub -A at submission time will override the COBALT_PROJ environment variable.

Cobalt requires that the script file has the execute bit set. If you try submitting a script and see the following:

command is not executable

then set the executable bit with the following, and resubmit:

chmod +x

Examples of submitting script jobs:

To run with 128 nodes for a maximum of 30 minutes in the default queue and charge the job to MyProject:

qsub -q default -n 128 -t 30 -A MyProject

To run with 2 nodes for a maximum of 30 minutes in the debug queue with flat memory mode and quad NUMA mode, and charge the job to MyProject:

qsub -q debug-flat-quad -n 2 -t 30 -A MyProject

General information about writing submission scripts:

A very general example submission script is shown below, and specific examples are shown below that. Note that the script must include the aprun command to run the executable on the compute nodes:

#!/bin/bash
#COBALT -t 30
#COBALT -n 128
#COBALT -q default
#COBALT --attrs mcdram=cache:numa=quad
#COBALT -A Catalyst
echo "Starting Cobalt job script on 128 nodes with 64 ranks on each node"
echo "with 4 OpenMP threads per rank, for a total of 8192 ranks and 32768 OpenMP threads"
aprun -n 8192 -N 64 --env OMP_NUM_THREADS=4 -cc depth -d 4 -j 4 myprogram.exe
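The rank and thread totals quoted in the script follow directly from the node count. The sketch below shows that arithmetic and prints the resulting aprun line rather than running it. The variable names are illustrative and are not set by Cobalt.

```shell
# Illustrative variables (not provided by Cobalt)
nodes=128
ranks_per_node=64
threads_per_rank=4

# Total MPI ranks = nodes x ranks per node
total_ranks=$((nodes * ranks_per_node))
# Total OpenMP threads = total ranks x threads per rank
total_threads=$((total_ranks * threads_per_rank))

echo "ranks: $total_ranks, threads: $total_threads"
# Print the launch line only; -d matches the threads per rank, -j 4 is kept from the example
echo "aprun -n $total_ranks -N $ranks_per_node --env OMP_NUM_THREADS=$threads_per_rank -cc depth -d $threads_per_rank -j 4 myprogram.exe"
```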

For details on how to display how the threads and ranks map to physical cores on the KNL nodes (the affinity), see Affinity on KNL nodes.

Submitting Interactive Jobs

To interactively run aprun and launch executables onto the compute nodes, the -I flag can be passed to qsub (and the executable script omitted). These jobs will provide you with a shell on the launch node where you can run aprun to launch jobs to the compute nodes. You can call aprun with up to the number of resources requested in the initial qsub. When you are finished with your interactive runs, you may end your job by exiting the shell that Cobalt spawned on your behalf. The shell will not terminate at the end of the allocation time, although all currently running apruns will be terminated and other apruns from that session will fail. This allows you to take whatever actions are needed at the end of your session.

This is useful if you have many small debugging runs, and don't want to submit each to the batch system.


qsub -A <project_name> -t <mins> -q <queue> -n <nodes> -I

When Cobalt starts your job, you will get a command prompt on a launch node, from which you may issue aprun (KNL) or mpiexec (GPU) commands:

frodo@thetalogin6:~ $ qsub -I -n 1 -t 10 -q <debug-cache-quad|debug-flat-quad> -A Project
Connecting to thetamom1 for interactive qsub...
Job routed to queue "<debug-cache-quad|debug-flat-quad>".
Wait for job 97442 to start...
Opening interactive session to 3208
frodo@thetamom1:/gpfs/theta-fs1/home/frodo $ aprun -n 1 -N 1 myprogram.exe <arguments>

Bundling Multiple Runs into a Script Job

There are several ways to bundle many jobs together in a single script, and then submit that script to the batch scheduler. The advantage of this process is that you wait in the queue only once.

1. Running many jobs one after another

The simplest way of bundling many apruns in a script is to list them one after another. The apruns will run one at a time, sequentially. Each aprun can use up to the number of nodes that were requested in the initial qsub. See the previous section about basic Cobalt script submission and its restrictions. The following code is an example of completing multiple runs within a script, where each aprun requests the same number of nodes:

echo "Starting Cobalt job script"
aprun -n 128 -N 64 myprogram.exe arg1
aprun -n 128 -N 64 myprogram.exe arg2
aprun -n 128 -N 64 myprogram.exe arg3

The aprun command blocks until task completion, at which point it exits, providing a convenient way to run multiple short jobs together. In addition, if a subset of nodes is requested, aprun will place jobs on nodes in the script’s reservation until the pool of inactive nodes is exhausted. If the number of nodes requested by an aprun exceeds the number of nodes reserved by the batch scheduler for the job (through the qsub command), that aprun will fail to execute and an error will be returned.
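One possible variation on the sequential pattern is to loop over the arguments and stop at the first failed run. This is a sketch, not part of the Cobalt documentation: LAUNCHER is a hypothetical variable introduced here so the loop can be dry-run (it defaults to echoing the command), and stopping on failure is a design choice rather than a requirement.

```shell
# LAUNCHER is a hypothetical helper variable for this sketch.
# It defaults to a dry run; set LAUNCHER=aprun inside a real Cobalt script job.
LAUNCHER="${LAUNCHER:-echo aprun}"

run_all() {
    for arg in "$@"; do
        # Each launch blocks until it finishes, so the runs execute in order.
        if ! $LAUNCHER -n 128 -N 64 myprogram.exe "$arg"; then
            echo "run with argument '$arg' failed; stopping" >&2
            return 1
        fi
    done
}

run_all arg1 arg2 arg3
```

With the default dry-run launcher, this prints the three aprun lines instead of executing them.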

2. Running many jobs at the same time

Multiple simultaneous apruns can be launched by backgrounding the aprun commands in the script and then waiting for completion. A short sleep between apruns is recommended to avoid a potential race condition during a large number of aprun starts. As an example, the following script will launch 3 simultaneous apruns, which execute on the compute nodes at the same time. The first aprun listed runs on 3 nodes (192/64), the second on 4 nodes (256/64), and the last on 1 node (64/64). Since the apruns are backgrounded (as denoted by the &), the script must have a "wait" command at the end so that it does not exit before the apruns complete.

echo "Starting Cobalt job script"
aprun -n 192 -N 64 myprogram.exe arg1 &
sleep 1
aprun -n 256 -N 64 myprogram.exe arg1 &
sleep 1
aprun -n 64 -N 64 myprogram.exe arg1 &
wait

Submit the job using qsub:

qsub -A <project_name> -q <queue> -t 60 -n 8

Since the three apruns in the above example run simultaneously, each on a separate set of nodes, 8 total nodes (3 + 4 + 1) are requested in the qsub command.
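The node count for a set of simultaneous apruns can be worked out as the sum of each run's ranks divided by its ranks per node, rounded up. A minimal sketch of that calculation for the three runs above (nodes_for is a hypothetical helper name):

```shell
# Nodes needed by one aprun: ceil(total ranks / ranks per node)
nodes_for() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

total_nodes=0
# Rank counts of the three simultaneous apruns, each with 64 ranks per node
for ranks in 192 256 64; do
    total_nodes=$(( total_nodes + $(nodes_for "$ranks" 64) ))
done

echo "Request at least $total_nodes nodes with qsub -n"
```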


  • Each of the apruns will run on a separate set of nodes. It's currently not possible to run multiple apruns on the same node at the same time.
  • There is a system limitation of 1,000 simultaneous aprun invocations in a job script. If this limit is hit, you will see the error:
apsched: no more claims allowed for this reservation (max 1000)

3. Using a Workflow Manager

There are a variety of workflow managers that can assist with bundling jobs together. A few are listed below:

Deleting a Script Job or Interactive Job

To delete a job from the queue, use the qdel command.

Cancel job 34586:

qdel 34586

Depending on the stage of a job's lifetime, qdel may not complete immediately, especially if the delete is issued during startup on a job that is changing memory modes and rebooting a node. If the job does not ultimately terminate, contact support with the job ID so that an administrator can take appropriate cleanup actions and administratively terminate the job.

Querying Partition Availability

To determine which partitions are currently available to the scheduler, use the nodelist command. This command provides a list of node ids, names, queue, and state as well as any backfill windows. For example:

% nodelist
Node_id  Name        Queues   Status           MCDRAM  NUMA  Backfill
20       c0-0c0s5n0  default  cleanup-pending  flat    quad  4:59:44
21       c0-0c0s5n1  default  cleanup-pending  flat    quad  4:59:44
22       c0-0c0s5n2  default  busy             flat    quad  4:59:44
24       c0-0c0s6n0  default  busy             flat    quad  4:59:44
25       c0-0c0s6n1  default  busy             flat    quad  4:59:44
26       c0-0c0s6n2  default  busy             flat    quad  4:59:44
27       c0-0c0s6n3  default  busy             flat    quad  4:59:44
28       c0-0c0s7n0  default  idle             flat    quad  4:59:44
29       c0-0c0s7n1  default  idle             flat    quad  4:59:44
30       c0-0c0s7n2  default  idle             flat    quad  4:59:44
31       c0-0c0s7n3  default  idle             flat    quad  4:59:44
32       c0-0c0s8n0  default  idle             flat    quad  4:59:44
33       c0-0c0s8n1  default  idle             flat    quad  4:59:44
34       c0-0c0s8n2  default  idle             flat    quad  4:59:44

Job Settings

Environment Variables


The following environment variables are set in the Cobalt script job environment for KNL nodes:

COBALT_PARTNAME - physical nodes assigned by cobalt (e.g., "340-343" from a 4-node run) 
COBALT_PARTSIZE - on KNL nodes, identical to COBALT_JOBSIZE 
COBALT_JOBSIZE - number of nodes requested

The following environment variables are set in the Cobalt script environment for GPU nodes:

COBALT_NODEFILE – pathname for file containing list of hostnames assigned to the job

The following environment variables are set in the Cobalt script environment, as well as in the compute node environment on both KNL and GPU nodes:

COBALT_JOBID - the job ID assigned by cobalt (e.g., 130850)


To pass environment variables to your job at submission time, use the --env argument to qsub. Pass a single variable:

qsub -t 30 -n 128 --env VAR=value

Pass more than one environment variable using colon separation:

qsub -t 30 -n 128 --env VAR1=value1:VAR2=value2:VAR3=value3

Note that multiple --env arguments are additive:

qsub -t 30 -n 128 --env VAR1=value1 --env VAR2=value2

Remember to place this argument and the other Cobalt arguments before your executable name.
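When the variables are assembled in a wrapper script, the colon-separated form can be built from a list of NAME=value pairs. The following is a minimal sketch; join_env is a hypothetical helper, and it assumes none of the values contain a colon. It prints the qsub line rather than submitting it.

```shell
# Join NAME=value pairs with ':' (runs in a subshell so IFS is not changed globally)
join_env() (
    IFS=':'
    printf '%s\n' "$*"
)

env_arg=$(join_env VAR1=value1 VAR2=value2 VAR3=value3)

# Print the submission line only
echo "qsub -t 30 -n 128 --env $env_arg"
```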

Within a script mode job, use the -e argument to aprun, as shown in the following example:

# Pass a single variable
aprun -n 64 -N 64 -e VAR=value myprogram.exe
# Pass more than one variable using multiple -e arguments
aprun -n 64 -N 64 -e VAR1=value1 -e VAR2=value2 myprogram.exe

Another way to set environment variables is by setting them in the submission script (using bash as an example):

export VAR=value
aprun -n 64 -N 64 myprogram.exe

Script Environment

The script job will receive your non-interactive login shell environment as it is set at the time the script job is launched. Any changes needed from your default login environment must be placed in the script. Note that any changes made in the configuration files of the default environment after job submission but prior to job execution will show up in the executed job.

Program and Argument Length Limit

The total length of the executable and all arguments to your program may not be longer than 4,096 characters (this is a system limitation). The executable must be no more than 512 characters.
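A script that generates long command lines could check these limits before launching. The sketch below is one way to do that; check_cmd_length is a hypothetical helper, and it assumes the total is counted with a single space between arguments, which the text above does not specify.

```shell
# Return non-zero if the executable or the full command exceeds the stated limits.
check_cmd_length() (
    exe=$1
    full="$*"    # executable and arguments joined by single spaces (assumption)
    if [ "${#exe}" -gt 512 ]; then
        echo "executable is ${#exe} characters (limit 512)" >&2
        exit 1
    fi
    if [ "${#full}" -gt 4096 ]; then
        echo "command is ${#full} characters (limit 4096)" >&2
        exit 1
    fi
)

check_cmd_length myprogram.exe arg1 arg2 && echo "within limits"
```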

Job Dependencies

Cobalt’s job dependency feature can be used to declare that the submitted job will not run until all jobs listed after "--dependencies" in the qsub command finish running and exit with a status of 0. In the following example, the job will not begin to run until jobs with COBALT_JOBIDs 305998 and 305999 have exited with a status of 0.

qsub -q default -n 128 -t 30 -A MyProject --dependencies 305998:305999

If a job terminates abnormally, any jobs depending on that one will not be released and will remain in the dep_hold state. To clear this hold:

qalter --dependencies none

System Sizes and Limitations

On Theta, job sizes from a single node to all nodes are supported, barring queue limitations (See Job Scheduling Policy for Theta for information about queue limitations). A job must have sufficient nodes available and idle in its queue to run. Within a Cobalt job, there is a limitation of 1,000 simultaneous aprun invocations permitted due to a system limitation in Cray’s ALPS software stack to prevent resource starvation on the launch nodes. When running many small jobs, it is highly advised to mitigate startup costs and resource startup time by bundling them together into a script. This provides more efficient use of node-hours, as well as making the job’s score accrual more favorable due to the larger overall job size.

Requesting Local SSD Requirements

Theta's compute nodes are equipped with SSDs that are available to projects that request the usage of them.  You may indicate that your job requires the usage of local SSD storage during your job, and indicate the amount of free space on the SSDs that your job requires. SSD storage is only for use during your job, and all data written to the local SSDs will be deleted when your Cobalt job terminates.  There is no automated backup of any data on the local node SSDs.  If your project has requested the use of local SSDs, the storage is located at /local/scratch.

If your project has been granted access to the local SSDs, you can request the use of local SSD storage by adding ssds=required to the --attrs argument of your qsub command. You may indicate the minimum amount of free space required on the local SSDs by adding ssd_size=N to your --attrs argument, where N is the required size in GB. Any job with these settings will be run on nodes that have enabled SSDs. If there are insufficient SSD-enabled nodes available on the system for a job's node count, the job will not run. Currently the maximum size of SSD available on Theta is 128 GB.

For example:

--attrs ssds=required:ssd_size=128

Requesting Specific Memory or Clustering Modes on the Compute Nodes

The Intel Xeon Phi compute nodes on Theta can be booted into different memory or cluster modes. The different memory and clustering modes can have an effect on performance. Booting into a specific memory/clustering mode can be selected during job submission using the --attrs option to qsub. See KNL Memory Modes for more information and examples.

Requesting Ability to SSH into the Compute Nodes on KNL Nodes

To be able to ssh from the MOM/launch nodes on Theta to the compute nodes, pass enable_ssh=1 as part of the --attrs argument (see the example below). Once the job begins on the MOM/launch node, you can ssh (or scp, etc.) from the MOM node to the compute nodes. The compute node name can be found by reading the $COBALT_PARTNAME number, and prepending "nid" with the appropriate number of 0s to reach 5 digits.

For example, for an interactive job:

n@thetalogin4:~/> qsub -I -n 1 -t 20 --attrs enable_ssh=1 -A project -q debug-cache-quad
Connecting to thetamom1 for interactive qsub...
Job routed to queue "debug-cache-quad".
Memory mode set to cache quad for queue debug-cache-quad
Wait for job 266815 to start...
Opening interactive session to 3835
n@thetamom1:/gpfs/mira-home/user> echo $COBALT_PARTNAME
n@thetamom1:/gpfs/mira-home/user> ssh nid03835
n@nid03835:~> hostname
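The naming rule described above (prefix "nid" and zero-pad to 5 digits) can be sketched with printf. This assumes a single-node job, where COBALT_PARTNAME holds one plain number; a multi-node range such as "340-343" is not handled. nid_name is a hypothetical helper, and the fallback value 3835 is taken from the session above.

```shell
# Build the compute node hostname from a numeric node ID
nid_name() {
    printf 'nid%05d\n' "$1"
}

# Fall back to the ID from the example session when run outside a job
node_id="${COBALT_PARTNAME:-3835}"

# Print the ssh command rather than running it
echo "ssh $(nid_name "$node_id")"
```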

Specifying Filesystems

On Theta and other systems running Cobalt at the ALCF, your job submission should specify which filesystems your job will be using. In the event that a filesystem becomes unavailable, this information is used to preserve jobs that would use that filesystem while allowing other jobs that are not using an affected filesystem to run normally.

You may specify your filesystems by adding filesystems=<list of filesystems> to the --attrs argument of qsub in Cobalt. Valid filesystems are home, eagle, grand, and theta-fs0. The list is comma-delimited. For example, to request the home and eagle filesystems for your job, add filesystems=home,eagle to your qsub command.

If filesystems are not specified, a warning will be printed and the job will be tagged as requesting all filesystems, so it may be held unnecessarily if any filesystem is not currently available. The warnings are written to stderr of the qsub and qalter commands that change the value of the --attrs flag. Scripts that parse stderr from these utilities may encounter errors from the additional warnings if filesystems are not specified in these commands.

If a job is submitted while a filesystem it requested is marked down, the job will automatically be placed into a user_hold and a warning message will be printed, but the job will be otherwise queued. The job is also placed into admin_hold by a sysadmin script. Once the affected filesystem has been returned to normal operation, the admin_hold is released. You are responsible for releasing the user_hold once you receive the message that the affected filesystem has been returned to normal operation. The job cannot run until both the holds are released.

If a job requesting a filesystem that is marked down is already in the queue, it will be placed on admin_hold and will be released once the filesystem is operational.

An example of a job requesting filesystems:

qsub -n 128 -t 30 -q default --attrs filesystems=home,grand -A Project ./

To update the filesystems list for your job, use qalter. Note that qalter --attrs is a replace and not an update operation. This means that you should once again specify all the attributes that you had in the original qsub command.

qalter --attrs filesystems=home,eagle:mcdram=cache:numa=quad <jobid>
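Because the attribute list is replaced as a whole, one way to avoid dropping attributes is to rebuild the full string from the original one and change only the filesystems entry. The sketch below does this with a hypothetical helper, set_filesystems, and prints the qalter line rather than running it. It assumes attributes are separated by colons as in the examples above.

```shell
# Rebuild an attrs string with a new filesystems list, keeping all other attributes.
# $1 = original attrs string, $2 = new comma-delimited filesystems list
set_filesystems() (
    result="filesystems=$2"
    IFS=':'
    for attr in $1; do
        case $attr in
            filesystems=*) ;;                  # dropped: replaced by the new list
            *) result="$result:$attr" ;;       # every other attribute is kept
        esac
    done
    printf '%s\n' "$result"
)

old_attrs="filesystems=home,grand:mcdram=cache:numa=quad"
new_attrs=$(set_filesystems "$old_attrs" "home,eagle")
echo "qalter --attrs $new_attrs <jobid>"
```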

To release a user hold:

qrls <jobid>