Machine Partitions on BG/Q

Mira

In the prod-capability queue, partition sizes of 8192, 12288, 16384, 24576, 32768, and 49152 nodes are available. All partitions have a full torus network except for the 32768-node partition, which in the default queue is a mesh in one dimension. A full torus for the 32768-node partition is available via the special queue prod-32768-torus. The maximum runtime for jobs submitted to the prod-capability and prod-32768-torus queues is 24 hours.

The prod-short and prod-long queues support partition sizes of 512, 1024, 2048, and 4096 nodes. All partitions have a full torus network except for the 1024-node partition, which in the default queue is a mesh in one dimension. A full torus for the 1024-node partition is available via the special queue prod-1024-torus. The maximum runtime for jobs submitted to prod-short is 6 hours, compared to 12 hours for jobs submitted to the prod-long and prod-1024-torus queues.
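
For example, a capability job on Mira might be submitted as in the following sketch, where the executable myapp.exe and project MyProject are placeholders and -t gives the walltime in minutes (720 minutes is well under the 24-hour limit):

qsub -q prod-capability -n 16384 -t 720 --mode c16 -A MyProject myapp.exe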

Vesta

Vesta provides users with an environment for testing code on the BG/Q platform. It offers three queues, each with partition sizes of 32, 64, 128, 256, 512, and 1024 nodes (note that the 2048-node partition is not active). These queues are named ‘default,’ ‘low,’ and ‘singles.’ As its name implies, the default queue is used when no queue is specified; it has a maximum runtime of two hours and allows at most eight jobs to run simultaneously. The low queue has the same maximum runtime but allows up to 12 simultaneous jobs. Score for jobs submitted to the low queue accrues at a slower rate, so they spend more time waiting in the queue before running; this makes the low queue well suited to users who wish to run many jobs without monopolizing the machine. Finally, the singles queue allows only one job to run at a time and is ideal for users with I/O-intensive applications.
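
A small test job on Vesta's low queue might look like the following sketch (myapp.exe is a placeholder); the 120-minute walltime is the queue's two-hour maximum:

qsub -q low -n 32 -t 120 --mode c16 myapp.exe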

Cetus

Cetus is provided to help users test and scale applications intended for larger runs on Mira. Partition sizes of 128, 256, 512, and 1024 nodes are available from the default and low queues, which are similar in function to the queues on Vesta. The maximum runtime is one hour for both queues, and each allows eight simultaneous runs.

Viewing Partitions

You can see the partitions in the output of partlist, along with whether they are idle, busy, or blocked by other partitions:

[vestalac1 ~]$ partlist
Name                  Queue                  State                          Backfill  Geometry
=================================================================================================
VST-00000-33771-2048  off                    blocked (VST-00440-31771-256)  -         4x4x8x8x2
VST-00000-33371-1024  default:singles:low    idle                           -         4x4x4x8x2
VST-00000-33731-1024  default:singles        idle                           -         4x4x8x4x2 
VST-00040-33771-1024  default:singles        idle                           -         4x4x4x8x2 
VST-00400-33771-1024  default:singles        blocked (VST-00440-31771-256)  -         4x4x4x8x2
VST-00000-33331-512   default:singles:low    idle                           -         4x4x4x4x2
VST-00040-33371-512   default:singles:low    idle                           -         4x4x4x4x2
VST-00400-33731-512   default:singles:low    idle                           -         4x4x4x4x2
VST-00440-31771-256   default:singles:low    busy                           -         4x2x4x4x2

Some of the partitions are nested. For instance, VST-00000-33771-2048 contains VST-00440-31771-256. Therefore, when a job is running on VST-00440-31771-256, VST-00000-33771-2048 is blocked.

Be sure to request enough walltime to cover partition boot, which counts against your job's requested run time. Boot time increases substantially with partition size:

Nodes      Boot time (seconds)
<= 2048    60
4096       140
8192       240
16K        240
32K        300
48K        360
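
For example, an 8192-node job that needs 60 minutes of compute time should request at least 64 minutes of walltime to leave room for the roughly 240-second boot (myapp.exe is a placeholder):

qsub -q prod-capability -n 8192 -t 64 --mode c16 myapp.exe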

Number of Nodes, Cores, and MPI Ranks

In your qsub command, -n specifies the number of nodes you are requesting. Each node has 16 cores, and each core supports four hardware threads. The --mode parameter of qsub specifies how many MPI ranks run per node; values are given as cN, where N is the number of ranks per node. In the default c1 mode, each node runs a single rank, which has all 16 cores and 16GB (minus overhead) of memory available to it. To run one MPI rank per core (each with only ~1GB of memory), specify --mode c16; the c32 and c64 modes place two and four ranks per core, respectively, using the hardware threads. An alternate way to control the number of ranks is the --proccount flag, whose value must be <= nodecount * (ranks per node).
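
As a sketch (myapp.exe is a placeholder), the following requests 512 nodes with 16 ranks per node but launches only 8000 of the 512 * 16 = 8192 possible ranks:

qsub -n 512 -t 60 --mode c16 --proccount 8000 myapp.exe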

Nodes  Cores available  Mode  Tasks available  Memory per task
1      16               c1    1                ~16GB
1      16               c2    2                ~8GB
1      16               c4    4                ~4GB
1      16               c8    8                ~2GB
1      16               c16   16               ~1GB
1      16               c32   32               ~512MB
1      16               c64   64               ~256MB

If you are running in c8 mode or below, the other cores will not be used, but you will still be charged for their core-hours because they are not usable by anyone else on the system. Also, you will get the smallest partition that accommodates the number of nodes you requested. Therefore, a request for 512 nodes will be satisfied by a 512-node partition, while a request for 513 nodes will be satisfied by a 1024-node partition.

Network Topology of Partitions

The Blue Gene/Q has a five-dimensional torus network topology. One dimension (labeled ‘E’) is confined to the individual nodeboards and can be ignored for practical purposes on Mira. The four remaining dimensions, labeled A, B, C, and D, can be visualized as a pair of three-dimensional rectangular prisms. Each node is connected by a network link to its eight nearest neighbors in the A, B, C, and D dimensions (+1 A, -1 A, +1 B, -1 B, and so on), plus two more in the nodeboard-confined E dimension. Communication with a nearest neighbor traverses just one link, while communication with more distant nodes traverses multiple links. A midplane’s location in the torus can be described using only the first four coordinates, while the full set <A,B,C,D,E,T> (with T being the processor location within the node) describes the position of an individual processor or rank.

When a job is started, each MPI process is placed at a particular coordinate within the torus. Depending on the communication pattern of the application, different mappings of MPI ranks across the torus can produce different performance results. A default mapping and several predefined mappings exist, and an arbitrary mapping may also be specified by the user. When placing ranks under any mapping, the rightmost dimension of length greater than one is incremented first, followed by the second rightmost, and so on.

For example, using the default mapping of ABCDET and specifying one process per node, the first three ranks will be placed at <0,0,0,0,0,0>, <0,0,0,0,1,0>, and <0,0,0,1,0,0>. For all predefined mappings, the first MPI rank is placed at coordinate <0,0,0,0,0,0>. In the above example, the E coordinate is incremented first and the next rank placed; once the maximum E value has been reached, the D coordinate is incremented and the E coordinate reset to zero. All predefined mappings are permutations of ABCDET with T placed in either the leftmost or rightmost position (CEADBT, TADECB, and so on).
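
To make the placement rule concrete, here is a sketch assuming --mode c2 (two ranks per node) and the default ABCDET mapping: the T dimension now has length two, so it is incremented before E, and the first four ranks would be placed at:

    <0,0,0,0,0,0>
    <0,0,0,0,0,1>
    <0,0,0,0,1,0>
    <0,0,0,0,1,1>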

The mapping to be used can be specified either by setting the environment variable RUNJOB_MAPPING (e.g., qsub ... --env RUNJOB_MAPPING=TEDCBA) or by using the --mapping flag with runjob (script-mode jobs only). If neither is set, the default mapping ABCDET is used.

The use of custom maps provides more flexibility. The map is a text file that contains one line for each MPI rank (in rank order), with each line having six integers separated by spaces:

    A B C D E T

where A, B, etc., are the torus coordinates (origin at <0,0,0,0,0>) and T is the processor number on the node, ranging from 0 up to one less than the number of ranks per node. If you’re running in c1 mode (the default), all of your T values would be 0; in c2 mode, 0-1; and so on.
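
For illustration, a hypothetical mapping file for a two-node job run in c2 mode (four ranks total) that places consecutive ranks on alternating nodes, rather than filling one node before the next, might contain:

    0 0 0 0 0 0
    0 0 0 0 1 0
    0 0 0 0 0 1
    0 0 0 0 1 1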

The name of the file is specified by the RUNJOB_MAPPING environment variable. For example:

qsub -n 10 -t 15 --mode c1 --env RUNJOB_MAPPING=my_mapping_file.txt exe1