Torus Network on a BG/Q System

Network Topology of Partitions

The Blue Gene/Q has a five-dimensional torus network topology, with direct links between nearest neighbors in the ±A, ±B, ±C, ±D, and ±E directions. For partitions smaller than 512 nodes, the torus degenerates to a mesh in one or more dimensions, but the concepts discussed here remain the same. Note that the E dimension always has size 2, so the +E and -E links lead to the same node. Each node is therefore directly connected to nine distinct neighbors.
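The neighbor count can be checked with a short Python sketch. The function name `torus_neighbors` and the 4x4x4x4x2 shape are hypothetical, chosen only for illustration, and the sketch assumes wraparound links in every dimension (a full torus rather than a mesh).

```python
def torus_neighbors(coord, shape):
    """Return the set of distinct nearest neighbors of coord in a full torus.

    Assumes wraparound in every dimension; a mesh partition would have
    fewer neighbors for nodes on an edge.
    """
    neighbors = set()
    for dim, size in enumerate(shape):
        for step in (-1, 1):
            candidate = list(coord)
            candidate[dim] = (coord[dim] + step) % size
            candidate = tuple(candidate)
            if candidate != tuple(coord):  # a dimension of size 1 has no neighbor
                neighbors.add(candidate)
    return neighbors


# Hypothetical partition shape <A,B,C,D,E>; E has size 2.
shape = (4, 4, 4, 4, 2)
print(len(torus_neighbors((0, 0, 0, 0, 0), shape)))  # prints 9
```

Because E has size 2, stepping +1 or -1 in E reaches the same node, so the ten links yield nine distinct neighbors.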

Communication with a nearest neighbor involves traversing just one link, while communication with more distant nodes requires traversing multiple links. A node location in the torus can be described using all five coordinates <A,B,C,D,E>. In addition, each node contains 16 cores and can run multiple processes per node, in powers of 2: mode c1 denotes 1 MPI process per node, mode c2 denotes 2 MPI processes per node, and so on up to a maximum of 64 MPI processes per node. Thus the location of a process can be described within the torus network and the multi-core node using six coordinates <A,B,C,D,E,T>, where T identifies the process within its node.
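To make the notion of "traversing multiple links" concrete, the following Python sketch counts the minimum number of links between two nodes. It is only an illustration: the name `torus_hops` and the `wrap` flags are hypothetical, and it assumes messages take a shortest path, using a wraparound link only in dimensions where one exists.

```python
def torus_hops(a, b, shape, wrap=None):
    """Minimum number of links between nodes a and b.

    wrap[i] is True if dimension i has wraparound links (torus) and
    False if it is a mesh. Defaults to a full torus.
    """
    if wrap is None:
        wrap = [True] * len(shape)
    hops = 0
    for x, y, size, is_torus in zip(a, b, shape, wrap):
        direct = abs(x - y)
        # In a torus dimension the message may go "around the back".
        hops += min(direct, size - direct) if is_torus else direct
    return hops


shape = (2, 2, 4, 4, 2)                    # <A,B,C,D,E> sizes
wrap = (False, False, True, True, True)    # assumed: A and B are mesh dimensions
print(torus_hops((0, 0, 0, 0, 0), (1, 0, 3, 2, 1), shape, wrap))  # prints 5
print(torus_hops((0, 0, 0, 0, 0), (0, 0, 1, 0, 0), shape, wrap))  # prints 1
```

In the first call the C coordinates differ by 3, but the wraparound link reduces that to a single hop.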

When a job is started, each MPI process is placed at a particular <A,B,C,D,E,T> coordinate within the torus. Depending on the communication pattern of the application, different mappings of MPI ranks across the torus can produce different performance results. A default mapping exists, along with several predefined mappings, and the user can also specify an arbitrary mapping.

Predefined mappings are described by the order in which the coordinates are incremented as the MPI ranks are placed on the nodes. For all predefined mappings, the first MPI rank is placed at the coordinate <0,0,0,0,0,0>. For the default predefined mapping, ABCDET, the rightmost dimension is incremented first until it reaches its maximum value. The ranks are therefore placed first on the same node (incrementing the T dimension), then along E, then D, then C, then B, and finally A. The predefined mappings are all the possible permutations of ABCDET.
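The placement rule for predefined mappings amounts to treating the rank as a mixed-radix number whose digits follow the mapping string, rightmost digit varying fastest. The Python sketch below illustrates this; the function name `rank_to_coord` is hypothetical, and the partition shape <2,2,4,4,2> with two ranks per node is taken from the sample output later in this section.

```python
def rank_to_coord(rank, shape, ranks_per_node, mapping="ABCDET"):
    """Return the (A, B, C, D, E, T) coordinate of an MPI rank.

    mapping is a permutation of "ABCDET"; its rightmost letter is the
    dimension that increments first.
    """
    if sorted(mapping) != sorted("ABCDET"):
        raise ValueError("mapping must be a permutation of ABCDET")
    sizes = dict(zip("ABCDE", shape))
    sizes["T"] = ranks_per_node
    coord = {}
    # Peel off one "digit" per dimension, fastest-varying dimension first.
    for letter in reversed(mapping):
        rank, coord[letter] = divmod(rank, sizes[letter])
    return tuple(coord[letter] for letter in "ABCDET")


shape = (2, 2, 4, 4, 2)
print(rank_to_coord(1, shape, 2, "ABCDET"))   # (0, 0, 0, 0, 0, 1)
print(rank_to_coord(1, shape, 2, "TABCDE"))   # (0, 0, 0, 0, 1, 0)
print(rank_to_coord(16, shape, 2, "TABCDE"))  # (0, 0, 2, 0, 0, 0)
```

With ABCDET, rank 1 shares a node with rank 0; with TABCDE it lands on the neighboring node in the E direction.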

The mapping to be used is specified by setting the environment variable RUNJOB_MAPPING. If it is not set, the default mapping is used.

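As mentioned above, the default assignment of MPI tasks to processors is ABCDET, with the rightmost dimension incrementing first. In the sample output below, consecutive pairs of ranks share a node (cores 0 and 8) before the E coordinate advances: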

Torus connectivity  = <0,0,1,1,1>
Summary from each MPI rank:
  rank     A     B     C     D     E    core       comm(s) elapsed(s)      mem(MB)
     0     0     0     0     0     0     0           0.000 0.000      124.469
     1     0     0     0     0     0     8           0.000 0.000      125.410
     2     0     0     0     0     1     0           0.000 0.000      124.406
     3     0     0     0     0     1     8           0.000 0.000      125.410
     4     0     0     0     1     0     0           0.000 0.000      124.406
     5     0     0     0     1     0     8           0.000 0.000      125.410
     6     0     0     0     1     1     0           0.000 0.000      124.406
     7     0     0     0     1     1     8           0.000 0.000      125.410
     8     0     0     0     2     0     0           0.000 0.000      124.406
     9     0     0     0     2     0     8           0.000 0.000      125.410
    10     0     0     0     2     1     0           0.000 0.000      124.406
    11     0     0     0     2     1     8           0.000 0.000      125.410
    12     0     0     0     3     0     0           0.000 0.000      124.406
    13     0     0     0     3     0     8           0.000 0.000      125.410
    14     0     0     0     3     1     0           0.000 0.000      124.406
    15     0     0     0     3     1     8           0.000 0.000      125.410
    16     0     0     1     0     0     0           0.000 0.000      124.406

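To request a TABCDE mapping, or any other allowed permutation of the dimensions, specify the environment variable RUNJOB_MAPPING in the job submission command. In the output that follows the command, E increments first and T last, so consecutive ranks are placed on different nodes: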

  qsub -n 128 -t 15 --mode c2 --env RUNJOB_MAPPING=TABCDE exe1

Data for MPI rank 0 of 256 
BGQ Partition shape = <2,2,4,4,2> 
Torus connectivity  = <0,0,1,1,1> 
Summary from each MPI rank:
  rank     A     B     C     D     E    core       comm(s) elapsed(s)      mem(MB) 
     0     0     0     0     0     0     0           0.000 0.000      124.469 
     1     0     0     0     0     1     0           0.000 0.000      124.406 
     2     0     0     0     1     0     0           0.000 0.000      124.406 
     3     0     0     0     1     1     0           0.000 0.000      124.406 
     4     0     0     0     2     0     0           0.000 0.000      124.406 
     5     0     0     0     2     1     0           0.000 0.000      124.406 
     6     0     0     0     3     0     0           0.000 0.000      124.406 
     7     0     0     0     3     1     0           0.000 0.000      124.406 
     8     0     0     1     0     0     0           0.000 0.000      124.406 
     9     0     0     1     0     1     0           0.000 0.000      124.406 
    10     0     0     1     1     0     0           0.000 0.000      124.406 
    11     0     0     1     1     1     0           0.000 0.000      124.406 
    12     0     0     1     2     0     0           0.000 0.000      124.406 
    13     0     0     1     2     1     0           0.000 0.000      124.406 
    14     0     0     1     3     0     0           0.000 0.000      124.406 
    15     0     0     1     3     1     0           0.000 0.000      124.406 
    16     0     0     2     0     0     0           0.000 0.000      124.406   

The use of custom maps provides more flexibility. The map is a text file that contains one line for each MPI rank (in rank order), with each line having six integers separated by spaces:

A B C D E T

where A, B, C, D, and E are the torus coordinates (origin at 0,0,0,0,0) and T is the index of the process on its node (a value from 0 to N-1, where N is the number of ranks per node set by the mode). If you are running in c1 mode (the default), all of your T values are 0.

The name of the file is specified by the RUNJOB_MAPPING environment variable. For example:

  qsub -n 10 -t 15 --mode c2 --env RUNJOB_MAPPING=my_mapping_file.txt exe1
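A map file in the format described above could be generated with a short script. The Python sketch below is only an illustration: the helper names `build_map_lines`, `check_map_lines`, and `write_map_file` are hypothetical, and the partition shape <2,2,4,4,2> with two ranks per node is assumed. The ordering it produces happens to match the predefined TABCDE mapping; a real custom map would replace the loop with an application-specific placement.

```python
from itertools import product


def build_map_lines(shape, ranks_per_node):
    """Return 'A B C D E T' lines in rank order, with T varying slowest."""
    lines = []
    for t in range(ranks_per_node):
        # product() advances its rightmost range fastest, so E changes first.
        for a, b, c, d, e in product(*(range(n) for n in shape)):
            lines.append(f"{a} {b} {c} {d} {e} {t}")
    return lines


def check_map_lines(lines, shape, ranks_per_node):
    """Raise ValueError if a line is malformed, out of range, or duplicated."""
    limits = tuple(shape) + (ranks_per_node,)
    seen = set()
    for number, line in enumerate(lines):
        fields = tuple(int(field) for field in line.split())
        if len(fields) != 6:
            raise ValueError(f"rank {number}: expected six integers")
        if any(not 0 <= value < limit for value, limit in zip(fields, limits)):
            raise ValueError(f"rank {number}: coordinate out of range")
        if fields in seen:
            raise ValueError(f"rank {number}: coordinate already assigned")
        seen.add(fields)


def write_map_file(path, lines):
    """Write one map line per MPI rank to a text file."""
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")


shape = (2, 2, 4, 4, 2)  # assumed partition shape <A,B,C,D,E>
lines = build_map_lines(shape, ranks_per_node=2)
check_map_lines(lines, shape, ranks_per_node=2)
print(len(lines))   # prints 256
print(lines[:3])    # prints ['0 0 0 0 0 0', '0 0 0 0 1 0', '0 0 0 1 0 0']
```

Calling `write_map_file("my_mapping_file.txt", lines)` would then produce a file that can be passed through RUNJOB_MAPPING as in the qsub example above. Checking for duplicate coordinates before submitting is a cheap way to catch a map that assigns two ranks to the same slot.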