Threading on BG/Q


Performance Considerations

Like Blue Gene/P, Blue Gene/Q is a true SMP (symmetric multiprocessor), as opposed to a NUMA (non-uniform memory access) machine; NUMA is common in multisocket or multidie Intel or AMD (Advanced Micro Devices, Inc.) systems. The practical consequence is that thread scaling is possible across the entire node, whereas on a NUMA system thread scaling beyond a single NUMA domain is usually challenging. Because there is no NUMA, careful memory allocation and placement through Linux's first-touch policy is not necessary: all hardware threads see all of main memory equally, so there is no performance penalty when one thread initializes memory that other threads later compute with.
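
For readers coming from NUMA systems, the following is a minimal sketch (not BG/Q-specific; the function name is hypothetical) of the parallel first-touch initialization that the preceding paragraph says is unnecessary on Blue Gene/Q.

/* Hypothetical illustration (not needed on Blue Gene/Q): on a NUMA system,
 * each thread first touches the pages it will later use, so those pages are
 * allocated in that thread's local memory.  On Blue Gene/Q a serial
 * initialization performs just as well. */
void numa_first_touch_init(double *a, long n)
{
    /* initialize with the same static schedule as the later compute loop */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;

    /* compute loop: each thread works mostly on the pages it touched above */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 2.0 * (double)i;
}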

Blue Gene/Q has a special form of four-way simultaneous multithreading (SMT) in which the four hardware threads of a core share a single 16 KB L1 data cache. For example, if you are running 16 MPI ranks per node with four OpenMP threads each, the four threads sharing each core should keep their combined working set small enough that they do not thrash L1. This cache sharing among the four hardware threads of a core has a performance impact, but it does not affect how threads see main memory, so it is still proper to call Blue Gene/Q an SMP.
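
As a rough illustration (the tile size below is only a back-of-the-envelope assumption, not a tuned value), with four threads sharing a 16 KB L1 data cache each thread has roughly 4 KB, or about 512 doubles, to itself. The hypothetical tiled transpose below picks its tile so that one input tile plus one output tile of doubles stay near that budget: 2 * 16 * 16 * 8 bytes = 4 KB.

#include <stddef.h>

#define T 16   /* tile edge: 2 * T * T * sizeof(double) = 4 KB per thread */

/* Hypothetical sketch of keeping each thread's working set within its
 * share of the shared L1 data cache by tiling the loops. */
void transpose_tiled(double *restrict b, const double *restrict a, size_t n)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t ii = 0; ii < n; ii += T)
        for (size_t jj = 0; jj < n; jj += T)
            for (size_t i = ii; i < ii + T && i < n; i++)
                for (size_t j = jj; j < jj + T && j < n; j++)
                    b[j * n + i] = a[i * n + j];
}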

Thread Placement

This information is from the Blue Gene/Q Application Development Manual.

Breadth-first is the default thread layout algorithm. This algorithm corresponds to setting the environment variable BG_THREADLAYOUT=1. With breadth-first assignment, the hardware thread-selection algorithm progresses across the cores that are defined within the process before selecting additional threads within a given core. Depth-first placement can be prescribed using BG_THREADLAYOUT=2. With depth-first assignment, the hardware thread-selection algorithm progresses within each core before moving to another core defined within the process.

Unsaturated Thread Placement

If you request fewer threads than the total number of hardware threads available, the placement options described above may be inadequate. For example, one process per node (ppn) with 48 OpenMP threads (denoted 1x48), as well as 2x24, 4x12, and 8x6, do not permit a placement in which the threads are both spread evenly across the cores and numbered consecutively within each core. Breadth-first placement leads to an even distribution but with non-contiguous numbering within a core, while depth-first placement, at least in the previously noted examples, leaves 25% of the cores idle.

To be specific, in the case of 48 OpenMP threads per node, the placements for the two settings are listed below, where core = C:T indicates the T-th hardware thread of core C. The SPI (system programming interface) calls required to determine hardware thread placement are demonstrated below in bgq_threadid.c.
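
The listings below were produced by a program along the lines of the following minimal sketch; the full program, including the MPI and Pthread pieces, appears under Example Code at the end of this page, and the SPI header path is the one used there.

#include <stdio.h>
#include <omp.h>
#include </bgsys/drivers/ppcfloor/spi/include/kernel/location.h>

int main(void)
{
    /* each OpenMP thread reports the core and hardware thread it runs on */
    #pragma omp parallel
    {
        int my_omp  = omp_get_thread_num();
        int num_omp = omp_get_num_threads();
        int core    = Kernel_ProcessorCoreID();    /* core number C */
        int hwth    = Kernel_ProcessorThreadID();  /* hardware thread T, 0-3 */
        printf("OpenMP thread = %2d of %2d core = %2d:%1d\n",
               my_omp, num_omp, core, hwth);
    }
    return 0;
}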

BG_THREADLAYOUT =  1
OMP_MAX_NUM_THREADS = 48
MPI rank =  0 OpenMP thread =  0 of 48 core =  0:0 
MPI rank =  0 OpenMP thread =  1 of 48 core =  1:0 
MPI rank =  0 OpenMP thread =  2 of 48 core =  2:0 
MPI rank =  0 OpenMP thread =  3 of 48 core =  3:0 
MPI rank =  0 OpenMP thread =  4 of 48 core =  4:0 
MPI rank =  0 OpenMP thread =  5 of 48 core =  5:0 
MPI rank =  0 OpenMP thread =  6 of 48 core =  6:0 
MPI rank =  0 OpenMP thread =  7 of 48 core =  7:0 
MPI rank =  0 OpenMP thread =  8 of 48 core =  8:0 
MPI rank =  0 OpenMP thread =  9 of 48 core =  9:0 
MPI rank =  0 OpenMP thread = 10 of 48 core = 10:0 
MPI rank =  0 OpenMP thread = 11 of 48 core = 11:0 
MPI rank =  0 OpenMP thread = 12 of 48 core = 12:0 
MPI rank =  0 OpenMP thread = 13 of 48 core = 13:0 
MPI rank =  0 OpenMP thread = 14 of 48 core = 14:0 
MPI rank =  0 OpenMP thread = 15 of 48 core = 15:0 
MPI rank =  0 OpenMP thread = 16 of 48 core =  0:2 
MPI rank =  0 OpenMP thread = 17 of 48 core =  1:2 
MPI rank =  0 OpenMP thread = 18 of 48 core =  2:2 
MPI rank =  0 OpenMP thread = 19 of 48 core =  3:2 
MPI rank =  0 OpenMP thread = 20 of 48 core =  4:2 
MPI rank =  0 OpenMP thread = 21 of 48 core =  5:2 
MPI rank =  0 OpenMP thread = 22 of 48 core =  6:2 
MPI rank =  0 OpenMP thread = 23 of 48 core =  7:2 
MPI rank =  0 OpenMP thread = 24 of 48 core =  8:2 
MPI rank =  0 OpenMP thread = 25 of 48 core =  9:2 
MPI rank =  0 OpenMP thread = 26 of 48 core = 10:2 
MPI rank =  0 OpenMP thread = 27 of 48 core = 11:2 
MPI rank =  0 OpenMP thread = 28 of 48 core = 12:2 
MPI rank =  0 OpenMP thread = 29 of 48 core = 13:2 
MPI rank =  0 OpenMP thread = 30 of 48 core = 14:2 
MPI rank =  0 OpenMP thread = 31 of 48 core = 15:2 
MPI rank =  0 OpenMP thread = 32 of 48 core =  0:1 
MPI rank =  0 OpenMP thread = 33 of 48 core =  1:1 
MPI rank =  0 OpenMP thread = 34 of 48 core =  2:1 
MPI rank =  0 OpenMP thread = 35 of 48 core =  3:1 
MPI rank =  0 OpenMP thread = 36 of 48 core =  4:1 
MPI rank =  0 OpenMP thread = 37 of 48 core =  5:1 
MPI rank =  0 OpenMP thread = 38 of 48 core =  6:1 
MPI rank =  0 OpenMP thread = 39 of 48 core =  7:1 
MPI rank =  0 OpenMP thread = 40 of 48 core =  8:1 
MPI rank =  0 OpenMP thread = 41 of 48 core =  9:1 
MPI rank =  0 OpenMP thread = 42 of 48 core = 10:1 
MPI rank =  0 OpenMP thread = 43 of 48 core = 11:1 
MPI rank =  0 OpenMP thread = 44 of 48 core = 12:1 
MPI rank =  0 OpenMP thread = 45 of 48 core = 13:1 
MPI rank =  0 OpenMP thread = 46 of 48 core = 14:1 
MPI rank =  0 OpenMP thread = 47 of 48 core = 15:1 

BG_THREADLAYOUT =  2
OMP_MAX_NUM_THREADS = 48
MPI rank =  0 OpenMP thread =  0 of 48 core =  0:0 
MPI rank =  0 OpenMP thread =  1 of 48 core =  0:1 
MPI rank =  0 OpenMP thread =  2 of 48 core =  0:2 
MPI rank =  0 OpenMP thread =  3 of 48 core =  0:3 
MPI rank =  0 OpenMP thread =  4 of 48 core =  1:0 
MPI rank =  0 OpenMP thread =  5 of 48 core =  1:1 
MPI rank =  0 OpenMP thread =  6 of 48 core =  1:2 
MPI rank =  0 OpenMP thread =  7 of 48 core =  1:3 
MPI rank =  0 OpenMP thread =  8 of 48 core =  2:0 
MPI rank =  0 OpenMP thread =  9 of 48 core =  2:1 
MPI rank =  0 OpenMP thread = 10 of 48 core =  2:2 
MPI rank =  0 OpenMP thread = 11 of 48 core =  2:3 
MPI rank =  0 OpenMP thread = 12 of 48 core =  3:0 
MPI rank =  0 OpenMP thread = 13 of 48 core =  3:1 
MPI rank =  0 OpenMP thread = 14 of 48 core =  3:2 
MPI rank =  0 OpenMP thread = 15 of 48 core =  3:3 
MPI rank =  0 OpenMP thread = 16 of 48 core =  4:0 
MPI rank =  0 OpenMP thread = 17 of 48 core =  4:1 
MPI rank =  0 OpenMP thread = 18 of 48 core =  4:2 
MPI rank =  0 OpenMP thread = 19 of 48 core =  4:3 
MPI rank =  0 OpenMP thread = 20 of 48 core =  5:0 
MPI rank =  0 OpenMP thread = 21 of 48 core =  5:1 
MPI rank =  0 OpenMP thread = 22 of 48 core =  5:2 
MPI rank =  0 OpenMP thread = 23 of 48 core =  5:3 
MPI rank =  0 OpenMP thread = 24 of 48 core =  6:0 
MPI rank =  0 OpenMP thread = 25 of 48 core =  6:1 
MPI rank =  0 OpenMP thread = 26 of 48 core =  6:2 
MPI rank =  0 OpenMP thread = 27 of 48 core =  6:3 
MPI rank =  0 OpenMP thread = 28 of 48 core =  7:0 
MPI rank =  0 OpenMP thread = 29 of 48 core =  7:1 
MPI rank =  0 OpenMP thread = 30 of 48 core =  7:2 
MPI rank =  0 OpenMP thread = 31 of 48 core =  7:3 
MPI rank =  0 OpenMP thread = 32 of 48 core =  8:0 
MPI rank =  0 OpenMP thread = 33 of 48 core =  8:1 
MPI rank =  0 OpenMP thread = 34 of 48 core =  8:2 
MPI rank =  0 OpenMP thread = 35 of 48 core =  8:3 
MPI rank =  0 OpenMP thread = 36 of 48 core =  9:0 
MPI rank =  0 OpenMP thread = 37 of 48 core =  9:1 
MPI rank =  0 OpenMP thread = 38 of 48 core =  9:2 
MPI rank =  0 OpenMP thread = 39 of 48 core =  9:3 
MPI rank =  0 OpenMP thread = 40 of 48 core = 10:0 
MPI rank =  0 OpenMP thread = 41 of 48 core = 10:1 
MPI rank =  0 OpenMP thread = 42 of 48 core = 10:2 
MPI rank =  0 OpenMP thread = 43 of 48 core = 10:3 
MPI rank =  0 OpenMP thread = 44 of 48 core = 11:0 
MPI rank =  0 OpenMP thread = 45 of 48 core = 11:1 
MPI rank =  0 OpenMP thread = 46 of 48 core = 11:2 
MPI rank =  0 OpenMP thread = 47 of 48 core = 11:3 

Thread Scheduling and Affinity

Section 3.10 of the Blue Gene/Q Application Development Manual provides additional information. Because most users use OpenMP and do not oversubscribe the 64 hardware threads, this information is relevant only to the limited set of users who program directly with Pthreads.
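
For that limited audience, the following is a generic POSIX sketch (not BG/Q-specific) of creating a thread with an explicit attribute object, which is where properties such as the stack size are set before pthread_create(); consult the manual for the BG/Q-specific scheduling behavior.

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;                     /* unused */
    printf("worker running\n");
    return NULL;
}

int main(void)
{
    pthread_t      tid;
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 4 * 1024 * 1024);   /* request a 4 MB stack */

    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}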

OpenMP

Specification

See the OpenMP Homepage for the latest specification.

On Blue Gene/Q, the IBM XL compilers provide support for OpenMP v3.1. The GCC version 4.4.4 compilers that are currently installed provide support for OpenMP v3.0. LLVM does not yet support OpenMP on any platform.

More information about OpenMP support in GCC can be found at http://gcc.gnu.org/wiki/openmp and http://gcc.gnu.org/projects/gomp/. For more information about OpenMP functionality in those compilers, see the XL compiler documentation.

Documentation

To learn how to use OpenMP, refer to the LLNL OpenMP page.

Information about OpenMP on Blue Gene/Q can be found in the Blue Gene/Q Application Development Manual and the XL Compiler Documentation. Relevant information for OpenMP can be found by searching these PDFs.

Transactional Memory (TM)

The following code is derived from the example in the XL compiler documentation. The performance of TM in this case is not good, in part because the conflict granularity is smaller than a cache line, so it is only meant to illustrate the TM pragma syntax.

You can compile this code with mpixlc_r -g -O3 -qtm -qsmp=omp:speculative page380.c -o page380.x.

#include <stdio.h>
#include <stdint.h>                        /* uint64_t */
#include <omp.h>
#include <hwi/include/bqc/A2_inlines.h>    /* GetTimeBase() cycle counter */

#define SIZE 400

int main(void)
{
    int v, w, z;
    int a[SIZE], b[SIZE];

    printf("omp_get_max_threads() = %d \n",  omp_get_max_threads() );

    int r;
    for (r=0; r<10; r++)
    {
        uint64_t t0, t1;

        for (v=0; v<SIZE; v++)
        {
            a[v] = v;
            b[v] = -v;
        }

        t0 = GetTimeBase();
        #pragma omp parallel for private(v,w,z)
        for (v=0; v<SIZE; v++)
          for (w=0; w<SIZE; w++)
            for (z=0; z<SIZE; z++)
            {
              #pragma tm_atomic
              {
                a[v] = a[w] + b[z];
              }
            }

        t1 = GetTimeBase();
        printf("tm_atomic    ran in %llu cycles \n", t1-t0);

        for (v=0; v<SIZE; v++)
        {
            a[v] = v;
            b[v] = -v;
        }

        t0 = GetTimeBase();
        #pragma omp parallel for private(v,w,z)
        for (v=0; v<SIZE; v++)
          for (w=0; w<SIZE; w++)
            for (z=0; z<SIZE; z++)
            {
              #pragma omp critical
              {
                a[v] = a[w] + b[z];
              }
            }

        t1 = GetTimeBase();
        printf("omp critical ran in %llu cycles \n", t1-t0);
    }

    printf("page380 completed successfully \n");

    return 0;
}

Pthreads

To learn how to use Pthreads, refer to the LLNL Pthreads page.

See the Pthreads man pages and related entries for detailed information about the behavior of specific Pthreads API calls. Note, however, that the Linux-specific extensions to POSIX denoted with the _np suffix are not necessarily provided on Blue Gene/Q.
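
As an illustration of how code can guard against a missing _np extension, the following sketch compiles the call only on non-BG/Q Linux systems; the use of pthread_setaffinity_np here is just an example of a glibc _np extension and is not an assertion about what Blue Gene/Q provides.

#define _GNU_SOURCE
#include <pthread.h>

/* pin the calling thread to a CPU where the GNU extension is available;
 * elsewhere (including, conservatively, Blue Gene/Q) do nothing */
static int pin_self_to_cpu(int cpu)
{
#if defined(__linux__) && !defined(__bgq__)
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
#else
    (void)cpu;    /* leave placement to the kernel / runtime */
    return 0;
#endif
}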

Thread Building Blocks

Intel Threading Building Blocks (TBB) is not supported on ALCF systems. However, an unofficial and unsupported implementation of TBB for Blue Gene/P and Blue Gene/Q exists and can be obtained as described below for experimental purposes.

While ALCF was involved in porting TBB to PowerPC systems and has performed some testing on Blue Gene/P, Blue Gene/Q, and POWER7 processors, we cannot certify that the implementation executes without errors, since it is impossible to test every line of code in any software package. Users must perform their own verification and validation against the official Intel version of TBB executing on x86 processors before using the code in any context where accuracy is important.

Documentation

See the TBB homepage for more information. There is a TBB book on Amazon that describes the design and usage in detail.

Source Code

Users can download TBB from https://repo.anl-external.org/repos/BlueTBB/. Many patches to Intel's open-source implementation of TBB have been upstreamed.

Mixing Thread Models

Some users may wish to use more than one threading model at the same time. Both vertical and horizontal composition of OpenMP and Pthreads have been verified to work. Horizontal composition simply means that OpenMP and POSIX threads coexist, e.g., OpenMP threads are used in a compute kernel while a Pthread handles communication. Vertical composition means that Pthreads spawn OpenMP threads. This is much more challenging from an implementation perspective, and ALCF worked with IBM to ensure that the XL OpenMP runtime supports this model in the same manner as GCC's GOMP, which allows each Pthread to spawn its own OpenMP thread pool.

Because the vertically composed model may not be portable to all compilers, users are encouraged to consider OpenMP nested parallelism or other advanced OpenMP features as an alternative; a sketch of nested parallelism follows. The horizontally composed model, on the other hand, should be quite portable, and there is no need to fold the OpenMP and Pthread usage into a single model in that case.
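
The following is a minimal sketch of the OpenMP nested parallelism suggested above; the thread counts are arbitrary and only illustrate that each outer thread can own its own inner team.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_nested(1);   /* enable nested parallel regions */

    /* outer team: e.g. one thread for communication, one for compute */
    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();

        /* each outer thread creates its own inner team */
        #pragma omp parallel num_threads(4)
        {
            printf("outer thread %d, inner thread %d of %d (level %d)\n",
                   outer, omp_get_thread_num(), omp_get_num_threads(),
                   omp_get_level());
        }
    }
    return 0;
}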

ALCF has not performed any testing of TBB with OpenMP or with direct use of Pthreads.

Example Code

This is a simple example that prints out information about MPI, Pthreads and OpenMP used together.

Users should submit this job with different values of the environment variables POSIX_NUM_THREADS and OMP_NUM_THREADS. On Blue Gene/Q, the test reports where each thread is executing. On other systems, the hardware location fields are reported as -1.

Makefile

CC      = mpicc
COPT    = -g -O2 -std=gnu99 -fopenmp
LABEL   = gnu

LD      = $(CC)
CFLAGS  = $(COPT)
LDFLAGS = $(COPT) bgq_threadid.o -lm -lpthread

all: mpi_omp_pthreads.$(LABEL).x

%.$(LABEL).x: %.o bgq_threadid.o
	$(LD) $(LDFLAGS) $< -o $@

%.o: %.c
	$(CC) $(CFLAGS) -c $< -o $@

clean:
	$(RM) $(RMFLAGS) *.o *.lst 

realclean: clean
	$(RM) $(RMFLAGS) *.x

bgq_threadid.c

#ifdef __bgq__
#include </bgsys/drivers/ppcfloor/spi/include/kernel/location.h>
#endif

/*=======================================*/
/* routine to return the BGQ core number */
/*=======================================*/
int get_bgq_core(void)
{
#ifdef __bgq__
    int core = Kernel_ProcessorCoreID();
    return core;
#else
    return -1;
#endif
}

/*==========================================*/
/* routine to return the BGQ hwthread (0-3) */
/*==========================================*/
int get_bgq_hwthread(void)
{
#ifdef __bgq__
    int hwthread = Kernel_ProcessorThreadID();
    return hwthread;
#else
    return -1;
#endif
}

/*======================================================*/
/* routine to return the BGQ virtual core number (0-67) */
/*======================================================*/
int get_bgq_vcore(void)
{
#ifdef __bgq__
    int hwthread = Kernel_ProcessorID();
    return hwthread;
#else
    return -1;
#endif
}

mpi_omp_pthreads.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <unistd.h>
#include <pthread.h>
#include <omp.h>
#include <mpi.h>

/* hardware-location helpers provided by bgq_threadid.c */
int get_bgq_core(void);
int get_bgq_hwthread(void);

/* this is to ensure that the threads overlap in time */
#define NAPTIME 3

#define MAX_POSIX_THREADS 64

static pthread_t thread_pool[MAX_POSIX_THREADS];

static int mpi_size, mpi_rank;
static int num_posix_threads;

void* foo(void* dummy)
{
    int i, my_pth = -1;
    pthread_t my_pthread = pthread_self();

    for (i=0 ; i<num_posix_threads ; i++)
        if (my_pthread==thread_pool[i]) my_pth = i;
    
    sleep(NAPTIME);

    int my_core = -1, my_hwth = -1;
    int my_omp, num_omp;
    #pragma omp parallel private(my_core,my_hwth,my_omp,num_omp) shared(my_pth)
    {
        sleep(NAPTIME);

        my_core = get_bgq_core();
        my_hwth = get_bgq_hwthread();

        my_omp  = omp_get_thread_num();
        num_omp = omp_get_num_threads();
        fprintf(stdout,"MPI rank = %2d Pthread = %2d OpenMP thread = %2d of %2d core = %2d:%1d \n",
                       mpi_rank, my_pth, my_omp, num_omp, my_core, my_hwth);
        fflush(stdout);

        sleep(NAPTIME);
    }

    sleep(NAPTIME);

    pthread_exit(0);
}

void bar()
{
    sleep(NAPTIME);

    int my_core = -1, my_hwth = -1;
    int my_omp, num_omp;
    #pragma omp parallel private(my_core,my_hwth,my_omp,num_omp)
    {
        sleep(NAPTIME);

        my_core = get_bgq_core();
        my_hwth = get_bgq_hwthread();

        my_omp  = omp_get_thread_num();
        num_omp = omp_get_num_threads();
        fprintf(stdout,"MPI rank = %2d OpenMP thread = %2d of %2d core = %2d:%1d \n",
                       mpi_rank, my_omp, num_omp, my_core, my_hwth);
        fflush(stdout);

        sleep(NAPTIME);
    }
    sleep(NAPTIME);
}

int main(int argc, char *argv[])
{
    int i, rc;
    int provided;
 
    MPI_Init_thread(&argc,&argv,MPI_THREAD_MULTIPLE,&provided);
    if ( provided != MPI_THREAD_MULTIPLE ) exit(1);
 
    MPI_Comm_size(MPI_COMM_WORLD,&mpi_size);
    MPI_Comm_rank(MPI_COMM_WORLD,&mpi_rank);
 
    MPI_Barrier(MPI_COMM_WORLD);
 
    sleep(NAPTIME);

#ifdef __bgq__
    const char *layout_env = getenv("BG_THREADLAYOUT");
    int bg_threadlayout = (layout_env != NULL) ? atoi(layout_env) : 1;  /* 1 = breadth-first default */
    if (mpi_rank==0) fprintf(stdout,"BG_THREADLAYOUT = %2d\n", bg_threadlayout);
#endif

    const char *pth_env = getenv("POSIX_NUM_THREADS");
    num_posix_threads = (pth_env != NULL) ? atoi(pth_env) : 0;
    if (num_posix_threads<0)                 num_posix_threads = 0;
    if (num_posix_threads>MAX_POSIX_THREADS) num_posix_threads = MAX_POSIX_THREADS;

    if (mpi_rank==0) fprintf(stdout,"POSIX_NUM_THREADS = %2d\n", num_posix_threads);
    if (mpi_rank==0) fprintf(stdout,"OMP_MAX_NUM_THREADS = %2d\n", omp_get_max_threads());
    fflush(stdout);

    if ( num_posix_threads > 0 ) {
        //fprintf(stdout,"MPI rank %2d creating %2d POSIX threads\n", mpi_rank, num_posix_threads); fflush(stdout);
        for (i=0 ; i<num_posix_threads ; i++){
            rc = pthread_create(&thread_pool[i], NULL, foo, NULL);
            assert(rc==0);
        }

        MPI_Barrier(MPI_COMM_WORLD);

        sleep(NAPTIME);

        for (i=0 ; i<num_posix_threads ; i++){
            rc = pthread_join(thread_pool[i],NULL);
            assert(rc==0);
        }
        //fprintf(stdout,"MPI rank %2d joined %2d POSIX threads\n", mpi_rank, num_posix_threads); fflush(stdout);
    } else {
        bar();
    }

    MPI_Barrier(MPI_COMM_WORLD);

    sleep(NAPTIME);

    MPI_Finalize();

    return 0;
}