#### Performance Optimization I: Single Core/Node Vectorization, Memory - Overview and BG/Q



#### Hal Finkel

hfinkel@anl.gov

Leadership Computing Facility

ALCF Computational Performance Workshop - May 2, 2017



You want to know how to make me compute quickly...



- Some trends in HPC architectures
- How you can optimize your code for these architectures (specifically the IBM BG/Q (Mira) and the Intel Xeon Phi (Theta)







Argonne



Argonne 🛆

# Compiling

When compiling your programs, please use our MPI wrappers (these are the softenv keys)...

(generally best performance)

- +mpiwrapper-xl.legacy
- +mpiwrapper-xl
- +mpiwrapper-bgclang.legacy
- +mpiwrapper-bgclang
- +mpiwrapper-gcc.legacy
- +mpiwrapper-gcc

The "legacy" MPI gives the best performance unless you're using MPI\_THREAD\_MULTIPLE

bgclang has better C++ support than xI and gcc, but has no Fortran support (yet)

(generally worst performance)



# Compiling

Basic optimization flags...

- -O3 Generally aggressive optimizations (try this first: it is typically the best tested of all compiler optimization levels)
- -g Always include debugging symbols (really, always! when your run crashes at scale after running for hours, you want the core file to be useful)
- -qsmp=omp (xl) -fopenmp (bgclang and gcc) Enable OpenMP (the pragmas will be ignored without this)
- -qnostrict (xl) -ffast-math (bgclang and gcc) Enable "fast" math optimizations (most people don't need strict IEEE floating-point semantics). xl enables this by default at -O3 and above and you need to pass -qstrict to turn it off.

If you don't use -O<n> to turn on some optimizations, most of the previous material is irrelevant!



# What programs do...





- Read data from memory
- Compute using that data
- Write results back to memory
- Communicate with other nodes and the outside world



How fast can you go...

The speed at which you can compute is bounded by:

(the clock rate of the cores) x (the amount of parallelism you can exploit)

BG/Q: Fixed 1.66 GHz KNL: 1.30 GHz (dynamically scaled)

> Kepler: 0.8 GHZ Pascal: 1.30 GHz

Your hard work goes here...



#### There is only one sock



Image source: https://computing.llnl.gov/tutorials/linux\_clusters/

Commodity HPC node with four sockets

Has nonuniform memory access (NUMA): each core has DRAM to which it is closer (running multiple MPI ranks per node, one per socket, is probably best)







#### There is only one socket



A BG/Q node has only one "socket" with one CPU

#### All memory is equally close: **No** NUMA (running one MPI rank per node works well)



Image source: https://computing.llnl.gov/tutorials/linux\_clusters/





A BG/Q Node has:

- I PowerPC A2Q CPU
- 16 GB DDR3 DRAM



#### There are 16 cores per node



Image source: https://computing.llnl.gov/tutorials/linux\_clusters/















#### There are four hardware threads per core





# Vectorization: The Quad-Processing eXtension (QPX)

- On the BG/Q, only QPX vector instructions are supported!
- Only <4 x double>, <4 x float> and <4 x bool> operations are provided.
- The only advantage of single precision over double precision is decreased memory bandwidth/footprint.





gvfxmadd:  $QRT0 \leftarrow [(QRA0) \times (QRC0)] + (QRB0)$  $QRT1 \leftarrow [(QRA0) \times (QRC1)] + (QRB1)$  $QRT2 \leftarrow [(QRA2) \times (QRC2)] + (QRB2)$  $QRT3 \leftarrow [(QRA2) \times (QRC3)] + (QRB3)$ 

qvfxxnpmadd:  $ORT0 \leftarrow -([(ORA1)\times(QRC1)] - (QRB0))$  $QRT1 \leftarrow [(QRA0) \times (QRC1)] + (QRB1)$ QRT2 ← - ([(QRA3)×(QRC3)] - (QRB2) )  $QRT3 \leftarrow [(QRA2) \times (QRC3)] + (QRB3)$ 

And a few like these with built-in permutations:





There are some FP (vector) instructions that combine both a multiply and an add/subtract into one instruction! Many variants like these:

# Fused Multiply Add Instructions (FMA)

#### Putting it all together...

You must vectorize to achieve The peak FLOP rate (on future machines, this factor will be even larger) You can only achieve the peak FLOP rate using FMAs (usually true on commodity hardware too)

Peak FLOPS: (1.66 GHz) x (16 cores) x (4 vector lanes) x (2 operations per FMA) = 212.48 GFLOPS/node.

Remember you must use at least two hardware threads (or processes) or else you won't be able to saturate the floating-point pipeline in practice Note: this is an order of magnitude (on future machines, it will be nearly two orders of magnitude)



#### Memory

L1 cache and L1P internal buffer (per core)

Commodity HPC cores often also have an L3 cache; we don't. However, they have an L2 cache that is only hundreds of KB.





DDR3 DRAM

(2 controllers)



# Types of parallelism

- Parallelism across nodes (using MPI, etc.)
- Parallelism across sockets within a node [Not applicable to the BG/Q, KNL, etc.]
- Parallelism across cores within each socket
- Parallelism across pipelines within each core (i.e. instruction-level parallelism)
- Parallelism across vector lanes within each pipeline (i.e. SIMD)
- Using instructions that perform multiple operations simultaneously (e.g. FMA)



Hardware threads tie in here too!



#### **Computer Architecture**

# Traditional computers are built to:

- Move data
- Make decisions
- Compute polynomials (of relatively-low order)

$$f(x) = a_0 + a_1x + a_2x^2 + a_3x^3 + a_4x^4$$





# **Computer Architecture**







# **Computer Architecture**

But this is not good...

t0 = fma(a4, x, a3)Waiting... Waiting... Waiting... Waiting... Waiting... t1 = fma(t0, x, a2). . . t2 = fma(t1, x, a1)

```
t3 = fma(t2, x, a0)
```

```
return t3
```

A lot of computer architecture revolves around this question:





# Hardware Threads

One way is to use hardware threads...

t0 = fma(a4, x, a3) [thread 0] t0 = fma(a4, x, a3) [thread 1] t0 = fma(a4, x, a3) [thread 2] t0 = fma(a4, x, a3) [thread 3] t0 = fma(a4, x, a3) [thread 4] t0 = fma(a4, x, a3) [thread 5] t1 = fma(t0, x, a2)

These can be OpenMP threads, pthreads, or, on a CPU, different processes.

```
t2 = fma(t1, x, a1)
...
t3 = fma(t2, x, a0)
```

return t3

How many threads do we need? How much latency do we need to hide?



# Time Scales in Computing

Latency Comparison Numbers

| L1 cache reference                 | 0.5         | 5 ns |         |    |        |              |
|------------------------------------|-------------|------|---------|----|--------|--------------|
| Branch mispredict                  | 5           | ns   |         |    |        |              |
| L2 cache reference                 | 7           | ns   |         |    |        |              |
| Mutex lock/unlock                  | 25          | ns   |         |    |        |              |
| Main memory reference              | 100         | ns   |         |    |        |              |
| Compress 1K bytes with Zippy       | 3,000       | ns   | 3       | us |        |              |
| Send 1K bytes over 1 Gbps network  | 10,000      | ns   | 10      | us |        |              |
| Read 4K randomly from SSD*         | 150,000     | ns   | 150     | us |        | ~1GB/sec SSD |
| Read 1 MB sequentially from memory | 250,000     | ns   | 250     | us |        |              |
| Round trip within same datacenter  | 500,000     | ns   | 500     | us |        |              |
| Read 1 MB sequentially from SSD*   | 1,000,000   | ns   | 1,000   | us | 1 ms   | ~1GB/sec SSD |
| Disk seek                          | 10,000,000  | ns   | 10,000  | us | 10 ms  |              |
| Read 1 MB sequentially from disk   | 20,000,000  | ns   | 20,000  | us | 20 ms  | 80x memory   |
| Send packet CA->Netherlands->CA    | 150,000,000 | ns   | 150,000 | us | 150 ms |              |

Latency Numbers Every Programmer Should Know: https://gist.github.com/jboner/2841832

Argonne 🕰

# The IBM BG/Q network is fast...

- Each A/B/C/D/E link bandwidth: 4 GB/s
- Bisection bandwidth (32 racks): 13.1 TB/s
- HW latency
  - Best: 80 ns (nearest neighbor)
  - Worst: 3 µs (96-rack 20 PF system, 31 ho
- $\, \prime \,$  MPI latency (zero-length, nearest-neighbor): 2.2  $\mu s$



MPI does add overhead which is generally minimal. If you're sensitive to it, you can use PAMI (or the SPI interface) directly



# Loop Unrolling

CPUs have a fixed register file per thread, and the compiler can use that to hide latency...

If you need to tune this yourself, most compilers have a '#pragma unroll' feature.



# **CPU Registers**

You can't unroll enough to completely hide anything but "on core" latencies (e.g. L1 cache hits and from FP pipeline) – you just don't have enough registers!

- x86\_64 has 16 general-purpose registers (GPRs) for scalar integer data, pointers, etc. – and 16 floatingpoint/vector registers
- With AVX-512 (e.g. with Knights Landing) there are 32 floating-point/vector registers
- AVX-512 also adds 8 operation mask registers
- PowerPC has 32 GPRs, 32 scalar floating-point registers and 32 vector registers (modern cores with VSX effectively combine these into 64 floating-point/vector registers)



# OOO Execution and Loops

- CPUs, including Intel's Knights Landing, use out-of-order (OOO) execution to hide latency
- So to say that there are only 16 GPRs, for example, isn't the whole story: there are just 16 GPRs that the compiler can name

Processor can predict this will be true, and can start issuing instructions for multiple iterations at a time!



# OOO Execution

- Importing to exploiting instruction-level parallelism (ILP) – each core's multiple pipelines
- Combined with branch prediction, can effectively provide a kind of dynamic loop unrolling
- Limited by the number of "rename buffer entries" (72 on Knights Landing)
- Limited by the number of "reorder buffer entries" (72 on Knights Landing)
- Mispredicted branches can lead to wasted work!





# **KNL** Pipeline



http://ieeexplore.ieee.org/xpls/abs\_all.jsp?arnumber=7453080





Argonne

#### SIMD: What does it mean?





# Vectors Have Many Types

- A 512-bit vector can hold 8 double-precision numbers, 16 single-precision numbers, etc.
- Different assembly instructions have different assumptions about the data types
- Except on the IBM BG/Q (where only FP is supported), both integer and FP types are supported
   The same vector register

|               |               |               |               |   | can be divided in different |
|---------------|---------------|---------------|---------------|---|-----------------------------|
| SD/UD/        | MD/DP 0       | SD/UD/        | MD/DP 1       |   | can be unded in unerent     |
|               |               |               |               |   | MONC                        |
| SW/UW/MW/SP 0 | SW/UW/MW/SP 1 | SW/UW/MW/SP 2 | SW/UW/MW/SP 3 |   | ways                        |
| 0             | 32            | 64            | 96 127        | , |                             |

(This diagram is from the IBM POWER ISA manual, showing the 128-bit VSX registers)

|         |           | Vector Length |                   |     |  |  |  |
|---------|-----------|---------------|-------------------|-----|--|--|--|
|         |           | 128           | 2 <mark>56</mark> | 512 |  |  |  |
|         | Byte      | 16            | 32                | 64  |  |  |  |
|         | Word      | 8             | 16                | 32  |  |  |  |
| element | Dw ord/SP | 4             | 8                 | 16  |  |  |  |
| size    | Qw ord/DP | 2             | 4                 | 8   |  |  |  |





#### MKL, cuBLAS, ESSL, etc.

Vendors provide optimized math libraries for each system (BLAS for linear algebra, FFTs, and more).

- MKL on Intel systems, ESSL on IBM systems, cuBLAS (and others) for NVIDIA GPUs
- For FFTs, there is often an optional FFTW-compatible interface.





#### ESSL

IBM provides ESSL: A library of optimized math functions (BLAS for linear algebra, FFTs, and more). For FFTs, there is an optional FFTW-compatible interface.

- ESSL is installed in /soft/libraries/essl/current
- You can choose either -lesslbg or -lesslsmpbg (the 'smp' version uses OpenMP internally to take advantage of multiple threads)

ESSL is on IBM PowerPC systems what MKL is on Intel systems.



# Memory partitioning

Using threads vs. multiple MPI ranks per node: it's about...

- Memory
  - Sending data between ranks on the same node often involves "unnecessary" copying (unless using MPI-3 shared memory windows)
  - Similarly, your application may need to manage "unnecessary" ghost regions
  - MPI (and underlying components) have data structures that grow linearly (at best) with the total number of ranks
- And Memory
  - When threads can work together they can share resources instead of competing (cache, memory bandwidth, etc.)
  - Each process only gets a modest amount of memory per core
- And parallelism
  - You'll likely see the best overall results from the scheme that exposes the most parallelism



#### Avoid central coordinators



A scheme like this is highly unlikely to scale!



# Load Balancing

- Keep "work units" being distributed between ranks as large as possible, but try hard to keep everything load balanced.
- Think about load balancing early in your application design: it is the largest impediment to scaling on large systems.



This is not good; rank 0 has much more work.



# schedule(dynamic) can be your friend...



(a) Unbalanced assignment of tasks to threads

(b) Balanced assignment of tasks to threads

https://software.intel.com/en-us/articles/load-balance-and-parallel-performance



# #pragma omp simd

Starting with OpenMP 4.0, OpenMP also supports explicit vectorization...



https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40



# Some final advice...

Don't guess! Profile! (We'll have several talks about how to do that.) Your performance bottlenecks on the BG/Q might be very different from those on other systems.

And don't be afraid to ask questions...



