Running Jobs on BG/Q


Job Submission

The batch scheduler used on Mira is Cobalt.  Below is a basic introduction to submitting jobs using Cobalt. More details on Cobalt options and how to query, alter, and delete submitted jobs can be found in the section on Cobalt Job Control.

There are two main types of jobs one can submit: executable and script. In an executable job, the executable is given to Cobalt, which runs it directly on the compute nodes. In a script job, a script is given to Cobalt; when the job is scheduled, the script runs on one of the service nodes and can then launch executables on the compute nodes. This method requires a bit more setup, but can be much more flexible.

Here’s an example for submitting an executable (not a script job):

qsub -A <project_name> -t <mins> -n <nodes> --proccount <total_number_of_ranks> \
  --mode c<1|2|4|8|16|32|64> <executable> <args…>
  • A node is a 16-core chip and its memory. One rack of BG/Q has 1024 nodes.
  • Think of c<1|2|4|8|16|32|64> as denoting the number of MPI ranks per node.
  • To run on the compute nodes, the executable must be cross-compiled for BG/Q.

Example: To run 8192 MPI processes on 512 nodes for 30 minutes, you could submit the following:

qsub -A <project_name> -t 30 -n 512 --proccount 8192 --mode c16 <executable> <args…>

NOTE: In this case the “--proccount 8192” option could have been omitted, since it defaults to the maximum available, which is the number of nodes (512) times the ranks per node (16). It needs to be specified only when running with fewer than the maximum number of ranks; in that case, some of the cores will be left idle.
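
For instance, to run only 4096 ranks across the same 512 nodes, leaving half the cores on each node idle (an illustrative variation on the example above):

qsub -A <project_name> -t 30 -n 512 --proccount 4096 --mode c16 <executable> <args…>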

Submitting a Script Job

If an executable program is invoked from within a script, use the special script mode when invoking qsub:

qsub -A <project_name> -t <mins> -n <nodes> --mode script <script_name>

The script must include the runjob command to run the executable on the compute nodes:

#!/bin/sh
echo "Starting Cobalt job script"
runjob --np 8192 -p 16 --block $COBALT_PARTNAME --verbose=INFO : <executable> <args…>

(NOTE: To run a job with fewer than 128 nodes, read about sub-block job scripts (below).)

Here, --np is the total number of MPI ranks in the job, and -p is the number of ranks per node (analogous to specifying "--mode c##" when running an executable job). So, --np 8192 -p 16 implies 16 ranks per node and fits on 512 nodes. Enter "runjob -h" or "man runjob" at the command prompt for more information about its arguments.
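
Because Cobalt exports the requested node count to the script (see Environment Variables below), a script can compute --np instead of hard-coding it. A minimal sketch, assuming 16 ranks per node:

#!/bin/sh
# Derive the total rank count from the node count Cobalt provides,
# rather than hard-coding it (assumes 16 ranks per node).
RANKS_PER_NODE=16
NP=$((COBALT_JOBSIZE * RANKS_PER_NODE))
runjob --np $NP -p $RANKS_PER_NODE --block $COBALT_PARTNAME --verbose=INFO : <executable> <args…>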

To submit this script, you might use the following:

qsub -A <project_name> -t 30 -n 512 --mode script <script_name>

Additional details:

  • The arguments to runjob differ from those of qsub. Commonly used options include: --np, -p, --envs, --cwd, --verbose, --label [NOTE: on the login node, type "runjob -h" or "man runjob" for a complete list of arguments.] Environment variables are specified with the "--envs" flag; multiple --envs flags can be used to pass multiple variables. Several of these options appear together in the sketch after this list.
  • The runjob --np (or -n) argument specifies the total number of ranks. The number of nodes allocated to the job is determined by the qsub option "-n"; runjob may specify any values for "--np" and "-p" that fit within that number of nodes.
  • The --cwd argument may be used to set the working directory for the runjob to something other than the directory it was invoked from. Note that the argument to --cwd must be an absolute path (i.e., starting with "/"); relative paths are not currently supported.
  • Verbose level INFO (4) is recommended (Refer to the basic script example above).
  • The job script will be executed on a dedicated login node: a 12-core, 3.7 GHz PPC64 with 64 GB of RAM. All script jobs share this node, so take its capabilities into consideration when deciding what to run in the script.
  • The entire time a job is running, a compute-node partition is reserved exclusively for the user, regardless of whether runjob is executing. Important: job charges cover the entire script job duration, not just the portion actually spent in runjob.
  • The environment within which the user script runs may not be the same as the default shell environment. Refer to the section below on Script environment.
  • Redirection of stdin, stdout, or stderr (e.g., ">") on the runjob command should behave as expected. However, only one rank will receive stdin. The rank that receives stdin can be changed with the --stdinrank option.
  • The exit status of the script will determine whether Cobalt considers the job script execution successful or not. This is important if the user is using --dependencies (see Job dependencies). Normally, a script's exit status is the status of the last command executed, unless there is an explicit "exit" command.
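
The following sketch combines several of the options above; the project path and environment variable are illustrative, not prescriptive:

#!/bin/sh
# Illustrative sketch: run from a specific directory, pass an
# environment variable, and label output lines by rank.
runjob --np 8192 -p 16 --block $COBALT_PARTNAME \
       --cwd /projects/<project_name>/run1 \
       --envs OMP_NUM_THREADS=4 \
       --label --verbose=INFO : <executable> <args…>
# Propagate runjob's exit status so Cobalt (and any jobs waiting
# via --dependencies) can tell whether the run succeeded.
exit $?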

Sub-block Script Jobs

Cobalt is not currently configured to schedule jobs at the sub-block level.  However, within a job script you may manually run multiple runjob invocations, each targeted at a sub-block of the job's partition.  Please see the additional documentation at https://www.alcf.anl.gov/presentations/ensemble-job-submission-blue-gene...

Multiple Consecutive Runs within a Script

If all runs require the same partition size, the user can submit a single Cobalt job script and perform multiple runs within it. The advantage of this approach is that the user waits in the queue only once. Refer to the previous sections on basic Cobalt script submission and their restrictions. The following is an example of multiple runs within a script:

-----myjob.sh------
#!/bin/sh
echo "Starting Cobalt job script"
runjob --np 8192 -p 16 --block $COBALT_PARTNAME --verbose=INFO : run1.exe arg1
runjob --np 8192 -p 16 --block $COBALT_PARTNAME --verbose=INFO : run2.exe arg2
runjob --np 8192 -p 16 --block $COBALT_PARTNAME --verbose=INFO : run3.exe arg3
-----end myjob.sh---

Submit the job using script mode:

qsub -A <project_name> -t 60 -n 512 --mode script myjob.sh
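
If the runs differ only in their inputs, a shell loop keeps the script compact (a sketch; the executable and argument names are illustrative):

#!/bin/sh
echo "Starting Cobalt job script"
# Run the same-sized job three times with different inputs.
for i in 1 2 3; do
  runjob --np 8192 -p 16 --block $COBALT_PARTNAME --verbose=INFO : run$i.exe arg$i
done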

Job Settings

Environment Variables

Pre-defined

The following environment variables are set in the Cobalt script job environment:

COBALT_PARTNAME - physical partition assigned by cobalt (e.g. MIR-XXXXX-YYYYY-512)
COBALT_PARTSIZE - size of the partition assigned by cobalt (e.g. 512)
COBALT_JOBSIZE - number of nodes requested (e.g. 512, can be less than COBALT_PARTSIZE)

The following environment variables are set in the Cobalt script environment as well as in the compute node environment:

COBALT_JOBID - the job ID assigned by cobalt (e.g. 130850)
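
For example, a script might log these values before launching anything (a minimal sketch):

#!/bin/sh
# Record where and how large this job is, for later debugging.
echo "Job $COBALT_JOBID: $COBALT_JOBSIZE nodes on partition $COBALT_PARTNAME (size $COBALT_PARTSIZE)"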

User-defined

For an executable job, pass environment variables to your program with the qsub --env flag:

# Pass a single variable
qsub -t 30 -n 64 --env VAR=value myprogram.exe

# Pass more than one environment variable using colon separation
qsub -t 30 -n 64 --env VAR1=value1:VAR2=value2:VAR3=value3 myprogram.exe

# Do not use more than one --env argument (only the last one will take effect)
qsub -t 30 -n 64 --env VAR1=value1 --env VAR2=value2 myprogram.exe  # THIS WILL NOT WORK

Remember to place this argument and the other Cobalt arguments before your executable name.
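
For example (a hedged illustration of the ordering rule):

# Correct: Cobalt options come before the executable name
qsub -t 30 -n 64 --env VAR=value myprogram.exe
# Incorrect: anything after the executable is passed to it as
# program arguments, so --env here never reaches qsub
qsub -t 30 -n 64 myprogram.exe --env VAR=value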

Within a script mode job, use the --envs argument, as shown in the following example:

# Pass a single variable
runjob --np 64 --envs VAR=value : myprogram.exe

# Pass more than one variable using multiple --envs arguments or space separation
runjob --np 64 --envs VAR1=value1 --envs VAR2=value2 : myprogram.exe
runjob --np 64 --envs VAR1=value1 VAR2=value2 : myprogram.exe

Additional details: Environment variables set in the shell from which qsub is invoked will not be passed to the job. In the following example, when myprogram.exe runs, it will not have VAR set in its environment:

export VAR=value                        # incorrect
qsub -t 10 -n 64 myprogram.exe

Likewise, the environment variables set in a script job's shell will not be passed into myprogram.exe:

#!/bin/sh
export VAR=value                        # incorrect
runjob --np 64 : myprogram.exe

The maximum size of all environment data is 8191 characters. The back-end job will fail to start if it is larger. This is a Blue Gene system software limitation.

Script Environment

The script job will receive your non-interactive login shell environment as it is set at the time the script job is launched. Any changes needed from your default login environment must be placed in the script. Note that any changes made in the configuration files of the default environment after job submission but prior to job execution will show up in the executed job.
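
A minimal sketch of such a script (the path is illustrative); note that variables exported here affect only commands run by the script itself, while --envs is still required to reach the compute nodes:

#!/bin/sh
# Exports here change the script's own environment (e.g., which
# tools it finds), not the compute-node environment.
export PATH=/soft/mytools/bin:$PATH      # illustrative path
# Use --envs to pass variables to the executable on the compute nodes.
runjob --np 8192 -p 16 --block $COBALT_PARTNAME --envs VAR=value : <executable>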

Program and Argument Length Limit

The total length of the executable and all arguments to your program may not be longer than 4096 characters (this is a system limitation). The executable must be no more than 512 characters.

Job Dependencies

Cobalt’s job dependency feature is described in the qsub manpage:

--dependencies <jobid1>:<jobid2>
    Set the dependencies for the job being submitted. This job
    won't run until jobs having ids jobid1 and jobid2 have
    finished successfully.

If a job terminates abnormally, any jobs depending on that one will not be released and will remain in the dep_hold state. To clear this hold:

qalter --dependencies none <jobid>
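
For example (the job ID shown is illustrative):

# Submit a first job; suppose Cobalt prints job id 130850
qsub -A <project_name> -t 30 -n 512 --mode script job1.sh
# Submit a second job that runs only after 130850 succeeds
qsub -A <project_name> -t 30 -n 512 --dependencies 130850 --mode script job2.sh
# If 130850 ends abnormally, release the held job manually:
qalter --dependencies none <jobid_of_job2>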

Thread Stack Size

For threaded code, the child threads' stacks are fixed-size and allocated in the heap; there is no overflow detection for them. For code built with the XL compilers, the child stack size is set by the runtime environment variable XLSMPOPTS=stack=NNN (where NNN is the size in bytes of a single child thread's stack). For example, when running in c1 mode with 16 threads in a process, the initial thread continues to use the main stack, while the 15 child threads consume a total of 15*NNN bytes from the heap.

The default value for the thread stack size is 8 MB.
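
For example, to give each child thread a 16 MB stack, pass XLSMPOPTS through to the compute nodes (the value shown is illustrative):

qsub -A <project_name> -t 30 -n 512 --mode c1 --env XLSMPOPTS=stack=16777216 <executable> <args…>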

Verbose Setting for Runjob

When submitting an executable to Cobalt directly, you can set the verbose level for the underlying runjob via the environment variable RUNJOB_VERBOSE:

qsub -q prod -t 10 -n 64 --env RUNJOB_VERBOSE=4 a.out

When using a Cobalt script, the setting is given directly to runjob:

#!/bin/bash
runjob --np 64 --verbose=4 : a.out

In most cases, level 4 will provide enough diagnostic output.

How do I get each line of the output labeled with the MPI rank that it came from?

Include --env RUNJOB_LABEL=short in your qsub command.
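
In a script job, runjob's --label option provides the same labeling:

runjob --np 64 --label : a.out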

Mapping of MPI Tasks to Cores

How do I change the physical layout (mapping) of MPI tasks in my job?

The Blue Gene/Q has a five-dimensional torus network topology (for partitions smaller than 512 nodes, the torus degenerates to a mesh, but the concepts discussed here remain the same). This network topology can be visualized as a five-dimensional hypercube with nodes distributed evenly throughout and labeled along dimensions A, B, C, D, and E. Each node is connected to its ten nearest neighbors (+/-A, +/-B, +/-C, +/-D, +/-E) by a network link. Communication with a nearest neighbor traverses just one link, while communicating with more distant nodes requires traversing multiple links. A node's location in the torus can be described using five coordinates <A,B,C,D,E>. In addition, each node contains 16 cores, each with four hardware threads, for up to 64 hardware threads per node. Thus, the location of a process can be described within the torus network and hardware threads using six coordinates <A,B,C,D,E,T>, where T represents the hardware thread ID.

When a job is started, each MPI process is placed at a particular <A,B,C,D,E,T> coordinate within the torus. Depending on the communication pattern of the application, different mappings of MPI ranks across the torus can produce different performance results. A default mapping and several other pre-defined mappings exist, and an arbitrary mapping may be specified by the user.

Pre-defined mappings are described by the order in which the coordinates are incremented as the MPI ranks are placed onto nodes. For all pre-defined mappings, the first MPI rank is placed at coordinate <0,0,0,0,0,0>. For the pre-defined mapping ABCDET, the T coordinate is then incremented and the next rank placed; once the maximum T value has been reached, the E coordinate is incremented and the T coordinate is reset to zero. Pre-defined mappings are arbitrary permutations of ABCDET.

The mapping to be used is specified by setting the environment variable RUNJOB_MAPPING; if not set, the default mapping is used.  The default assignment of MPI tasks to processors is ABCDET.
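
For example, to use a different pre-defined mapping, set RUNJOB_MAPPING to the desired permutation. Here TABCDE is shown; since the last coordinate is incremented fastest, it places consecutive ranks on neighboring nodes along E rather than filling each node's hardware threads first:

qsub -n 512 -t 30 --mode c16 --env RUNJOB_MAPPING=TABCDE <executable> <args…>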

The use of a custom mapfile provides more flexibility. The mapfile is a text file that contains one line for each MPI rank (in rank order), with each line having six integers separated by spaces. The six integers specify the A B C D E T coordinates, in that order, for the corresponding rank. For example, following the ABCDET mapping order, the mapfile for a four-node job in c2 mode might look like this:

0 0 0 0 0 0
0 0 0 0 0 1
0 0 0 0 1 0
0 0 0 0 1 1
0 0 0 1 0 0
0 0 0 1 0 1
0 0 0 1 1 0
0 0 0 1 1 1

In this case, since it simply follows the ABCDET order, a mapfile isn't necessary; but with a mapfile the coordinates can be listed in any order, allowing arbitrary mappings of MPI ranks.

The name of the file is specified by the RUNJOB_MAPPING environment variable. For example:

qsub -n 10 -t 15 --mode c2 --env RUNJOB_MAPPING=my_mapping_file.txt exe1

What are the sizes and dimensions of the partitions on the system?

The smallest partition size generally available on Mira is 512 nodes; the largest possible partition on Mira is 49152 nodes. On Cetus and Vesta, the job size can go down to a single node, though the smallest partition is 128 nodes (Cetus) or 32 nodes (Vesta). Jobs smaller than the minimum partition size can either run exclusively within a partition or run as a sub-block job within a partition that may be shared with other users' sub-block jobs. The table below lists the partition sizes generally available, though the exact number and sizes of the partitions on the systems may vary according to how the system is currently configured. Information about the current configuration of the partitions can be found using the partlist command, which lists the partitions that are online at any given time, usually with the last number in the partition name indicating its size.

For a given partition, the network used for point-to-point communication between nodes on the Blue Gene/Q appears as a 5-D mesh or torus. The table below lists the dimensions of the 5-D mesh or torus for each partition size and indicates in which dimensions partitions of that size are capable of forming a torus. Partitions of a size capable of producing a torus will generally provide a torus network, but in some instances may be limited to a mesh. Note that there are two distinct partitions available at 1024, 4096, 16384, and 32768 nodes.

Nodes in Partition   A    B    C    D    E    Torus
128                  2    2    4    4    2    CDE
256                  4    2    4    4    2    ACDE
512                  4    4    4    4    2    All
1024                 4    4    4    8    2    ABCE
1024                 4    4    4    8    2    All
2048                 4    4    4    16   2    All
4096                 8    4    4    16   2    All
4096                 4    4    8    16   2    All
8192                 4    4    16   16   2    All
12288                8    4    12   16   2    All
16384                4    8    16   16   2    All
16384                8    4    16   16   2    All
24576                4    12   16   16   2    All
32768                8    8    16   16   2    All
32768                8    8    16   16   2    ACDE
49152                8    12   16   16   2    All