The batch scheduler used on Theta is Cobalt. Below is a basic introduction to submitting jobs using Cobalt. More details on Cobalt options and how to query, alter, and delete submitted jobs can be found in the section on Theta Cobalt Job Control (Cray XC40). For information on queues on Theta, see Theta queues. For information on priority and scheduling, see Theta priority and scheduling.
NOTE: Everywhere in this document that aprun is referenced, that applies to the KNL nodes.
For the GPU nodes, replace aprun with mpirun (or mpiexec) and everything else should be the same.
To request an allocation on ThetaGPU, fill out this form: Allocation request. ThetaGPU is listed under Theta on the form.
There are two main types of jobs one can submit: script and interactive.
- In a script job, a script is given to Cobalt and, when scheduled, the script is run. The script calls "aprun <executable>" (on KNL nodes) or "mpirun <executable>" (on GPU nodes) to run executables on the compute nodes.
- In an interactive job, a "-I" is passed to Cobalt and when scheduled, you are given a shell on a service node (KNL) or the head node (GPU). You can then execute aprun (KNL) or mpirun (GPU) to launch jobs to the compute nodes directly. This is useful for rapid debugging.
Overview of Where Jobs Run on Theta
To understand how jobs run, it is useful to keep in mind the different types of nodes (login, service, and compute) that make up Theta, and on which nodes the jobs run. When a user ssh's into Theta, they are given a shell on a login node. The login nodes are typically where a user compiles code and submits jobs to the batch scheduler. To run a job on Theta, the user submits a script (or interactive job) to the batch scheduler from the login node.
The rest of the process depends on whether you are running on the KNL nodes or the GPU nodes:
KNL Nodes: The script (or shell in an interactive job) does not run directly on the compute nodes; it runs on a service node. Like the login nodes, service nodes are not part of the compute nodes that make up the main computational resources of the machine. They are intermediate nodes from which the submission script launches executables onto the compute nodes with the aprun command.
GPU Nodes: The script (or shell in an interactive job) is executed on the “head node” of the allocated resources. The head node is the first node listed in the nodefile and the location of the nodefile can be obtained from the environment variable COBALT_NODEFILE. The file is formatted so that it can be passed to mpiexec via the -f option.
- "Service nodes" is a general term for non-compute nodes. The service nodes which launch jobs are more specifically called “MOM” or “launch” nodes on Theta, and both terms are used below.
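An overview of the process is:
- KNL nodes: login node (compile, qsub), then a service (MOM/launch) node that runs the script or interactive shell, then the compute nodes, where aprun launches the executables.
- GPU nodes: login node (compile, qsub), then the head node of the allocation, which runs the script or interactive shell, then the allocated nodes, where mpirun (or mpiexec) launches the executables.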
Submitting Script Jobs
General information about submitting script jobs:
To run a script job, first write a batch submission script containing Cobalt scheduler options (optional) and an aprun (KNL) or mpiexec (GPU) command, which launches a given executable onto the compute nodes. Then send the script to the batch scheduler with qsub:
qsub -t <mins> -n <nodes> -q <queue> -A <project_name> --mode script myscript.sh
This is equivalent to
qsub -t <mins> -n <nodes> -q <queue> -A <project_name> myscript.sh
- -t <mins> denotes the maximum time to run the job
- -n <nodes> denotes the number of compute nodes to reserve
- -q <queue> denotes the queue to run on
- -A <project_name> charges the job to a particular project
For information on the available queues, time, and node limits of each queue please refer to Theta Queues.
To see which projects you are a member of, type "sbank" on the command line. The output shows the projects you are a member of in the "Project" column.
You can use the environment variable “COBALT_PROJ” to set your default project. Setting qsub -A at submission time will override the COBALT_PROJ environment variable.
Cobalt requires that the script file has the execute bit set. If you try submitting a script and see the following:
command myscript.sh is not executable
then set the executable bit with the following, and resubmit:
chmod +x myscript.sh
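The execute-bit check can also be done before submission rather than after a failed qsub. The following is a minimal sketch; `ensure_executable` is a hypothetical helper name, not a Cobalt command, and the qsub line is shown only as a comment.

```shell
#!/bin/bash
# Hypothetical helper: make sure a job script has the execute bit set.
ensure_executable() {
    local script="$1"
    if [ ! -f "$script" ]; then
        echo "no such file: $script" >&2
        return 1
    fi
    # -x is true when the current user may execute the file
    if [ ! -x "$script" ]; then
        chmod +x "$script"
    fi
}

# Possible use before submitting (script name taken from the text above):
# ensure_executable myscript.sh && qsub -q default -n 128 -t 30 -A MyProject myscript.sh
```

The helper leaves an already executable script untouched and returns a non-zero status if the file does not exist.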
Examples of submitting script jobs:
To run myscript.sh with 128 nodes for a maximum of 30 minutes in the default queue and charge the job to MyProject:
qsub -q default -n 128 -t 30 -A MyProject myscript.sh
To run myscript.sh with 2 nodes for a maximum of 30 minutes in the debug queue with flat memory mode and quad NUMA mode, and charge the job to MyProject:
qsub -q debug-flat-quad -n 2 -t 30 -A MyProject myscript.sh
General information about writing submission scripts:
A very general example submission script is shown below, and specific examples are shown below that. Note that the script must include the aprun command to run the executable on the compute nodes:
#!/bin/bash
#COBALT -t 30
#COBALT -n 128
#COBALT -q default
#COBALT --attrs mcdram=cache:numa=quad
#COBALT -A Catalyst
echo "Starting Cobalt job script on 128 nodes with 64 ranks on each node"
echo "with 4 OpenMP threads per rank, for a total of 8192 ranks and 32768 OpenMP threads"
aprun -n 8192 -N 64 --env OMP_NUM_THREADS=4 -cc depth -d 4 -j 4 myprogram.exe
For details on how to display how the threads and ranks map to physical cores on the KNL nodes (the affinity), see Affinity on KNL nodes.
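The rank and thread totals quoted in the example script (8192 ranks, 32768 threads) follow from the node count, the ranks per node, and the threads per rank. The sketch below derives them instead of hard-coding them. The helper names `total_ranks` and `total_threads` are illustrative, the fallback of 128 nodes is an assumption for running the sketch outside a job, and the aprun line is printed rather than executed.

```shell
#!/bin/bash
# Hypothetical helpers: total MPI ranks and total OpenMP threads for a job.
total_ranks()   { echo $(( $1 * $2 )); }        # nodes * ranks_per_node
total_threads() { echo $(( $1 * $2 * $3 )); }   # nodes * ranks_per_node * threads_per_rank

# COBALT_JOBSIZE is set by Cobalt in a KNL script job; 128 is only a fallback.
NODES=${COBALT_JOBSIZE:-128}
RANKS_PER_NODE=64      # value passed to aprun -N
THREADS_PER_RANK=4     # value for OMP_NUM_THREADS and aprun -d

RANKS=$(total_ranks "$NODES" "$RANKS_PER_NODE")
THREADS=$(total_threads "$NODES" "$RANKS_PER_NODE" "$THREADS_PER_RANK")

echo "Ranks: ${RANKS}, OpenMP threads: ${THREADS}"
# Printed, not run, so the sketch can be tried away from Theta.
echo "aprun -n ${RANKS} -N ${RANKS_PER_NODE} --env OMP_NUM_THREADS=${THREADS_PER_RANK} -cc depth -d ${THREADS_PER_RANK} -j 4 myprogram.exe"
```

Deriving `-n` from the node count keeps the aprun request consistent with the qsub request if the job size changes.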
Submitting Interactive Jobs
To interactively run aprun and launch executables onto the compute nodes, the -I flag can be passed to qsub (and the executable script omitted). These jobs will provide you with a shell on the launch node where you can run aprun to launch jobs to the compute nodes. You can call aprun with up to the number of resources requested in the initial qsub. When you are finished with your interactive runs, you may end your job by exiting the shell that Cobalt spawned on your behalf. The shell will not terminate at the end of the allocation time, although all currently running apruns will be terminated and other apruns from that session will fail. This allows you to take whatever actions are needed at the end of your session.
This is useful if you have many small debugging runs, and don't want to submit each to the batch system.
qsub -A <project_name> -t <mins> -q <queue> -n <nodes> -I
When Cobalt starts your job, you will get a command prompt on a launch node, from which you may issue aprun (KNL) or mpiexec (GPU) commands:
frodo@thetalogin6:~ $ qsub -I -n 1 -t 10 -q <debug-cache-quad|debug-flat-quad> -A Project
Connecting to thetamom1 for interactive qsub...
Job routed to queue "<debug-cache-quad|debug-flat-quad>".
Wait for job 97442 to start...
Opening interactive session to 3208
frodo@thetamom1:/gpfs/theta-fs1/home/frodo $ aprun -n 1 -N 1 myprogram.exe <arguments>
Bundling Multiple Runs into a Script Job
There are several ways to bundle many jobs together in a single script, and then submit that script to the batch scheduler. The advantage of this process is that you wait in the queue only once.
1. Running many jobs one after another
The simplest way of bundling many apruns in a script is to list them one after another. The apruns will run one at a time, sequentially. Each aprun can use up to the number of nodes that were requested in the initial qsub. See the previous section about basic Cobalt script submission and its restrictions. The following code is an example of completing multiple runs within a script, where each aprun requests the same number of nodes:
#!/bin/bash
echo "Starting Cobalt job script"
aprun -n 128 -N 64 myprogram.exe arg1
aprun -n 128 -N 64 myprogram.exe arg2
aprun -n 128 -N 64 myprogram.exe arg3
The aprun command blocks until task completion, at which point it exits, providing a convenient way to run multiple short jobs together. In addition, if a subset of nodes is requested, aprun will place jobs on nodes in the script’s reservation until the pool of inactive nodes is exhausted. If the number of nodes requested by an aprun exceeds the number of nodes reserved by the batch scheduler for the job (through the qsub command), that aprun will fail to execute and an error will be returned.
2. Running many jobs at the same time
Multiple simultaneous apruns can be launched by backgrounding the aprun commands in the script and then waiting for completion. A short sleep between apruns is recommended to avoid a potential race condition during a large number of aprun starts. As an example, the following script will launch 3 simultaneous apruns, which execute on the compute nodes at the same time. The first aprun listed runs on 3 nodes (192/64), the second on 4 nodes (256/64), and the last on 1 node (64/64). Since the apruns are backgrounded (as denoted by the &), the script must have a "wait" command at the end so that it does not exit before the apruns complete.
#!/bin/bash
echo "Starting Cobalt job script"
aprun -n 192 -N 64 myprogram.exe arg1 &
sleep 1
aprun -n 256 -N 64 myprogram.exe arg1 &
sleep 1
aprun -n 64 -N 64 myprogram.exe arg1 &
wait
Submit the job using qsub:
qsub -A <project_name> -q <queue> -t 60 -n 8 myjob.sh
Since the three apruns in the above example run simultaneously, each on a separate set of nodes, 8 total nodes (3 + 4 + 1) are requested in the qsub command.
- Each of the apruns will run on a separate set of nodes. It's currently not possible to run multiple apruns on the same node at the same time.
- There is a system limitation of 1,000 simultaneous aprun invocations in a job script. If this limit is hit, you will see the error:
apsched: no more claims allowed for this reservation (max 1000)
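Because simultaneous apruns cannot share a node, the node count for qsub is the sum of the per-aprun node counts, each rounded up to a whole node. The sketch below does that arithmetic; `nodes_for_ranks` and `total_nodes` are hypothetical helper names introduced here for illustration.

```shell
#!/bin/bash
# Hypothetical helper: nodes needed by one aprun, i.e. ceil(ranks / ranks_per_node).
nodes_for_ranks() {
    local ranks="$1" per_node="$2"
    echo $(( (ranks + per_node - 1) / per_node ))
}

# Hypothetical helper: total nodes for several simultaneous apruns.
# Usage: total_nodes <ranks_per_node> <ranks_1> [<ranks_2> ...]
total_nodes() {
    local per_node="$1"
    shift
    local total=0 ranks
    for ranks in "$@"; do
        total=$(( total + $(nodes_for_ranks "$ranks" "$per_node") ))
    done
    echo "$total"
}

# The three apruns from the example above: 192, 256, and 64 ranks at 64 ranks per node.
echo "Nodes to request with qsub -n: $(total_nodes 64 192 256 64)"
```

The same count of arguments also gives the number of simultaneous aprun invocations, which can be compared against the limit of 1,000 described above.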
Depending on the stage of a job’s lifetime, qdel may not complete immediately, especially if the delete is issued during startup on a job that is changing memory modes and rebooting a node. If the job does not ultimately terminate, contact firstname.lastname@example.org with the jobid so that an administrator can take appropriate cleanup actions and administratively terminate the job.
Querying Partition Availability
To determine which partitions are currently available to the scheduler, use the nodelist command. This command provides a list of node ids, names, queue, and state as well as any backfill windows. For example:
% nodelist
Node_id  Name         Queues   Status           MCDRAM  NUMA  Backfill
================================================================================
[...]
20       c0-0c0s5n0   default  cleanup-pending  flat    quad  4:59:44
21       c0-0c0s5n1   default  cleanup-pending  flat    quad  4:59:44
22       c0-0c0s5n2   default  busy             flat    quad  4:59:44
24       c0-0c0s6n0   default  busy             flat    quad  4:59:44
25       c0-0c0s6n1   default  busy             flat    quad  4:59:44
26       c0-0c0s6n2   default  busy             flat    quad  4:59:44
27       c0-0c0s6n3   default  busy             flat    quad  4:59:44
28       c0-0c0s7n0   default  idle             flat    quad  4:59:44
29       c0-0c0s7n1   default  idle             flat    quad  4:59:44
30       c0-0c0s7n2   default  idle             flat    quad  4:59:44
31       c0-0c0s7n3   default  idle             flat    quad  4:59:44
32       c0-0c0s8n0   default  idle             flat    quad  4:59:44
33       c0-0c0s8n1   default  idle             flat    quad  4:59:44
34       c0-0c0s8n2   default  idle             flat    quad  4:59:44
[...]
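The nodelist output can be summarized with standard text tools. The sketch below counts idle nodes, assuming the column layout shown above, where Status is the fourth whitespace-separated field; `count_idle_nodes` is a hypothetical helper name, and the layout on a live system may differ.

```shell
#!/bin/bash
# Hypothetical helper: count rows whose Status column (field 4) is "idle".
# Reads nodelist-style output on stdin; header and "[...]" lines never match.
count_idle_nodes() {
    awk '$4 == "idle" { n++ } END { print n + 0 }'
}

# Possible use on Theta:
# nodelist | count_idle_nodes
```

The `n + 0` in the END block makes the helper print 0 rather than an empty line when no node is idle.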
The following environment variables are set in the Cobalt script job environment for KNL nodes:
COBALT_PARTNAME - physical nodes assigned by cobalt (e.g., "340-343" from a 4-node run)
COBALT_PARTSIZE - on KNL nodes, identical to COBALT_JOBSIZE
COBALT_JOBSIZE - number of nodes requested
The following environment variables are set in the Cobalt script environment for GPU nodes:
COBALT_NODEFILE – pathname for file containing list of hostnames assigned to the job
The following environment variables are set in the Cobalt script environment, as well as in the compute node environment on both KNL and GPU nodes:
COBALT_JOBID - the job ID assigned by cobalt (e.g., 130850)
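On the GPU nodes, a job script can read these variables to find its head node and node count. The following is a minimal sketch that assumes the nodefile holds one hostname per line; `head_node` and `node_count` are hypothetical helper names, and the mpiexec line is shown only as a comment with placeholders.

```shell
#!/bin/bash
# Hypothetical helpers operating on a nodefile (assumed: one hostname per line).
head_node()  { head -n 1 "$1"; }    # first node listed is the head node
node_count() { grep -c . "$1"; }    # number of non-empty lines

# Only report when running inside a job that provides a readable nodefile.
if [ -n "${COBALT_NODEFILE:-}" ] && [ -r "$COBALT_NODEFILE" ]; then
    echo "Job ${COBALT_JOBID:-unknown}: $(node_count "$COBALT_NODEFILE") node(s), head node $(head_node "$COBALT_NODEFILE")"
    # The nodefile can then be passed to mpiexec with -f, for example:
    # mpiexec -f "$COBALT_NODEFILE" -n <ranks> <executable>
fi
```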
Environment variables can be passed to a job at submission time with the --env argument to qsub. Pass a single variable:
qsub -t 30 -n 128 --env VAR=value myjob.sh
Pass more than one environment variable using colon separation:
qsub -t 30 -n 128 --env VAR1=value1:VAR2=value2:VAR3=value3 myjob.sh
Note that multiple --env arguments are additive:
qsub -t 30 -n 128 --env VAR1=value1 --env VAR2=value2 myjob.sh
Remember to place this argument and the other Cobalt arguments before your executable name.
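When the variables come from a list, the colon-separated form can be assembled in the script that calls qsub. This is a sketch only; `join_env` is a hypothetical helper, it does not handle values that themselves contain a colon, and the qsub line is printed rather than executed.

```shell
#!/bin/bash
# Hypothetical helper: join NAME=value pairs with ':' for a single --env argument.
join_env() {
    local IFS=':'
    echo "$*"    # "$*" joins the arguments with the first character of IFS
}

ENV_ARG=$(join_env VAR1=value1 VAR2=value2 VAR3=value3)
# Cobalt arguments, including --env, go before the script name.
echo "qsub -t 30 -n 128 --env ${ENV_ARG} myjob.sh"
```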
Within a script mode job, use the -e argument to aprun, as shown in the following example:
# Pass a single variable
aprun -n 64 -N 64 -e VAR=value myprogram.exe
# Pass more than one variable using multiple -e arguments
aprun -n 64 -N 64 -e VAR1=value1 -e VAR2=value2 myprogram.exe
Another way to set environment variables is by setting them in the submission script (using bash as an example):
#!/bin/bash
export VAR=value
aprun -n 64 -N 64 myprogram.exe
The script job will receive your non-interactive login shell environment as it is set at the time the script job is launched. Any changes needed from your default login environment must be placed in the script. Note that any changes made in the configuration files of the default environment after job submission but prior to job execution will show up in the executed job.
Program and Argument Length Limit
The total length of the executable and all arguments to your program may not be longer than 4,096 characters (this is a system limitation). The executable must be no more than 512 characters.
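A script that generates long command lines can check these limits before launching. The sketch below is one way to do it; `check_command_length` is a hypothetical helper, and it assumes that the separating spaces count toward the 4,096-character total, which the text above does not specify.

```shell
#!/bin/bash
# Hypothetical helper: check an executable and its arguments against the limits above.
# Assumption: the total is measured with single spaces between the words.
check_command_length() {
    local exe="$1"
    local full="$*"    # executable and arguments joined by single spaces
    if [ "${#exe}" -gt 512 ]; then
        echo "executable is ${#exe} characters (limit 512)" >&2
        return 1
    fi
    if [ "${#full}" -gt 4096 ]; then
        echo "command is ${#full} characters (limit 4096)" >&2
        return 1
    fi
    return 0
}

# Possible use in a job script:
# check_command_length myprogram.exe arg1 arg2 && aprun -n 64 -N 64 myprogram.exe arg1 arg2
```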
Cobalt’s job dependency feature can be used to declare that the submitted job will not run until all jobs listed after "--dependencies" in the qsub command finish running and exit with a status of 0. In the following example, the job will not begin to run until jobs with COBALT_JOBIDs 305998 and 305999 have exited with a status of 0.
qsub -q default -n 128 -t 30 -A MyProject --dependencies 305998:305999 myscript.sh
If a job terminates abnormally, any jobs depending on that one will not be released and will remain in the dep_hold state. To clear this hold:
qalter --dependencies none <jobid>
System Sizes and Limitations
On Theta, job sizes from a single node to all nodes are supported, barring queue limitations (See Job Scheduling Policy for Theta for information about queue limitations). A job must have sufficient nodes available and idle in its queue to run. Within a Cobalt job, there is a limitation of 1,000 simultaneous aprun invocations permitted due to a system limitation in Cray’s ALPS software stack to prevent resource starvation on the launch nodes. When running many small jobs, it is highly advised to mitigate startup costs and resource startup time by bundling them together into a script. This provides more efficient use of node-hours, as well as making the job’s score accrual more favorable due to the larger overall job size.
Requesting Local SSD Requirements
Theta's compute nodes are equipped with SSDs that are available to projects that request the usage of them. You may indicate that your job requires the usage of local SSD storage during your job, and indicate the amount of free space on the SSDs that your job requires. SSD storage is only for use during your job, and all data written to the local SSDs will be deleted when your Cobalt job terminates. There is no automated backup of any data on the local node SSDs. If your project has requested the use of local SSDs, the storage is located at /local/scratch.
If your project has been granted access to the local SSDs, you can request the use of local SSD storage by adding ssds=required to the --attrs argument of your qsub command. You may indicate the minimum amount of free space required on the local SSDs by adding ssd_size=N to your --attrs argument, where N is the required size in GB. Any job with these settings will be run on nodes that have enabled SSDs. If there are insufficient SSD-enabled nodes available on the system for a job's node count, the job will not run. Currently, the maximum size of SSD available on Theta is 128 GB.
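The two SSD attributes can be combined into one --attrs value, using the same colon separator that other --attrs examples in this document use. The sketch below builds that value; `ssd_attrs` is a hypothetical helper, the 64 GB size, node count, and project name are placeholders, and the qsub line is printed rather than executed.

```shell
#!/bin/bash
# Hypothetical helper: build the SSD part of an --attrs value.
# Rejects sizes above the 128 GB maximum stated above.
ssd_attrs() {
    local size_gb="$1"
    if [ "$size_gb" -gt 128 ]; then
        echo "requested ${size_gb} GB exceeds the 128 GB maximum" >&2
        return 1
    fi
    echo "ssds=required:ssd_size=${size_gb}"
}

ATTRS=$(ssd_attrs 64)
echo "qsub -n 8 -t 30 -q default -A MyProject --attrs ${ATTRS} myscript.sh"
# Inside the job, the SSD storage is located at /local/scratch.
```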
Requesting Specific Memory or Clustering Modes on the Compute Nodes
The Intel Xeon Phi compute nodes on Theta can be booted into different memory or cluster modes. The different memory and clustering modes can have an effect on performance. Booting into a specific memory/clustering mode can be selected during job submission using the --attrs option to qsub. See KNL Memory Modes for more information and examples.
Requesting Ability to SSH into the Compute Nodes on KNL Nodes
To be able to ssh from the MOM/launch nodes on Theta to the compute nodes, pass enable_ssh=1 as part of the --attrs argument (see the example below). Once the job begins on the MOM/launch node, you can ssh (or scp, etc.) from the MOM node to the compute nodes. To find the compute node name, read the node number from $COBALT_PARTNAME, pad it with leading zeros to 5 digits, and prepend "nid" (for example, 3835 becomes nid03835).
For example, for an interactive job:
n@thetalogin4:~/> qsub -I -n 1 -t 20 --attrs enable_ssh=1 -A project -q debug-cache-quad
Connecting to thetamom1 for interactive qsub...
Job routed to queue "debug-cache-quad".
Memory mode set to cache quad for queue debug-cache-quad
Wait for job 266815 to start...
Opening interactive session to 3835
n@thetamom1:/gpfs/mira-home/user> echo $COBALT_PARTNAME
3835
n@thetamom1:/gpfs/mira-home/user> ssh nid03835
n@nid03835:~> hostname
nid03835
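For multi-node jobs, the conversion from COBALT_PARTNAME to node names can be scripted. The sketch below handles a single number such as "3835" and a range such as "340-343", the two forms shown in this document; treating commas as separators between several such items is an assumption. `partname_to_nids` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: strip leading zeros so the value is not read as octal.
strip_zeros() { printf '%s\n' "$1" | sed 's/^0*\([0-9]\)/\1/'; }

# Hypothetical helper: expand a COBALT_PARTNAME value into nidNNNNN names, one per line.
# Assumption: items are single numbers or "first-last" ranges, optionally comma-separated.
partname_to_nids() {
    [ -n "$1" ] || return 1
    local item first last n
    local IFS=','
    for item in $1; do
        first=$(strip_zeros "${item%%-*}")
        last=$(strip_zeros "${item##*-}")
        n=$first
        while [ "$n" -le "$last" ]; do
            printf 'nid%05d\n' "$n"    # zero-pad to 5 digits and prepend "nid"
            n=$(( n + 1 ))
        done
    done
}

# Possible use on a MOM/launch node in a job submitted with enable_ssh=1:
# for host in $(partname_to_nids "$COBALT_PARTNAME"); do ssh "$host" hostname; done
```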
On Theta and other systems running Cobalt at the ALCF, your job submission should specify which filesystems your job will be using. In the event that a filesystem becomes unavailable, this information is used to preserve jobs that would use that filesystem while allowing other jobs that are not using an affected filesystem to proceed to run normally. You may specify your filesystem by adding filesystems=<list of filesystems> to the --attrs argument of qsub in Cobalt. Valid filesystems are home, eagle, grand, and theta-fs0. The list is comma-delimited. For example, to request the home and eagle filesystems for your job you would add filesystems=home,eagle to your qsub command. If this is not specified a warning will be printed and then the job will be tagged as requesting all filesystems and may be held unnecessarily if a filesystem is not currently available. The warnings are written to stderr of qsub and qalter commands that change the value of the --attrs flag. Scripts that are parsing stderr from these utilities may encounter errors from the additional warnings if filesystems are not specified in these commands.
If a job is submitted while a filesystem it requested is marked down, the job will automatically be placed into a user_hold and a warning message will be printed, but the job will be otherwise queued. The job is also placed into admin_hold by a sysadmin script. Once the affected filesystem has been returned to normal operation, the admin_hold is released. You are responsible for releasing the user_hold once you receive the message that the affected filesystem has been returned to normal operation. The job cannot run until both the holds are released.
If a job requesting a filesystem that is marked down is already in the queue, it will be placed on admin_hold and will be released once the filesystem is operational.
An example of a job requesting filesystems:
qsub -n 128 -t 30 -q default --attrs filesystems=home,grand -A Project ./my_job.sh
To update the filesystems list for your job, use qalter. Note that qalter --attrs is a replace and not an update operation. This means that you should once again specify all the attributes that you had in the original qsub command.
qalter --attrs filesystems=home,eagle:mcdram=cache:numa=quad <jobid>
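Because qalter replaces the whole attribute list, it can help to rebuild the full string from the one used at submission and change only the key of interest. The sketch below does this for colon-separated key=value attributes; `update_attrs` is a hypothetical helper, and it assumes you have kept the original --attrs string (Cobalt is not queried here).

```shell
#!/bin/bash
# Hypothetical helper: return attrs with key set to value, keeping all other attributes.
# Usage: update_attrs "<key1=val1:key2=val2>" <key> <value>
update_attrs() {
    local attrs="$1" key="$2" value="$3"
    local out="" item found=0
    local IFS=':'
    for item in $attrs; do
        if [ "${item%%=*}" = "$key" ]; then
            item="${key}=${value}"    # replace the existing entry
            found=1
        fi
        out="${out:+${out}:}${item}"
    done
    if [ "$found" -eq 0 ]; then
        out="${out:+${out}:}${key}=${value}"    # key was absent: append it
    fi
    echo "$out"
}

ORIGINAL_ATTRS="filesystems=home,grand:mcdram=cache:numa=quad"
NEW_ATTRS=$(update_attrs "$ORIGINAL_ATTRS" filesystems home,eagle)
# Printed, not run; <jobid> is a placeholder.
echo "qalter --attrs ${NEW_ATTRS} <jobid>"
```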
To release user hold: