## Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit [www.intel.com/benchmarks](https://www.intel.com/benchmarks).

Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804





# PERFORMANCE OPTIMIZATION VTUNE & ADVISOR

Paulius Velesko

**Application Engineer** 

paulius.velesko@intel.com

## Tuning at Multiple Hardware Levels

Exploiting all features of modern processors requires good use of the available resources

- Core
  - Vectorization is critical with 512-bit FMA vector units (32 DP ops/cycle)
  - Cache use needed to feed vector units
- Socket
  - Using all cores in a processor requires parallelization (MPI, OMP, ...)
  - Using coherent, shared socket caches
- Node
  - Minimize remote memory access (control memory affinity)
  - Minimize resource sharing (tune local memory access, disk IO and network traffic)



## Intel® Software Development Tools for Tuning

- Compiler Optimization Reports: key to identifying issues preventing automated optimization
- Intel® VTune™ Application Performance Snapshot: overall performance
- Intel® Advisor: core and socket performance (vectorization and threading)
- Intel® VTune™ Amplifier: node-level performance (memory and more)
- Intel® Trace Analyzer and Collector: cluster-level performance (network)

#### Get the tools

Intel profiling tools are now FREE:

https://software.intel.com/en-us/vtune/choose-download

https://software.intel.com/en-us/advisor/choose-download



# **NBODY DEMONSTRATION**

The naïve code that could

## Nbody gravity simulation

forked from https://github.com/fbaru-dev/nbody-demo (Dr. Fabio Baruffa)

Let's consider a distribution of point masses located at r\_1,...,r\_n with masses m\_1,...,m\_n.

We want to calculate the positions of the particles after a certain time interval using Newton's law of gravity.
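For reference, the standard Newtonian expression for the acceleration of particle $i$ due to all other particles (not spelled out on the slide) is:

$$\mathbf{a}_i = G \sum_{j \neq i} m_j \, \frac{\mathbf{r}_j - \mathbf{r}_i}{\lvert \mathbf{r}_j - \mathbf{r}_i \rvert^{3}}$$

The demo code shown later adds a softening term to the squared distance, so the denominator never reaches zero.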

```
struct Particle
{
  public:
    Particle() { init();}
    void init()
    {
       pos[0] = 0.; pos[1] = 0.; pos[2] = 0.;
       vel[0] = 0.; vel[1] = 0.; vel[2] = 0.;
       acc[0] = 0.; acc[1] = 0.; acc[2] = 0.;
       mass = 0.;
    }
    real_type pos[3];
    real_type vel[3];
    real_type acc[3];
    real_type mass;
};
```
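A minimal sketch of the naive O(n²) acceleration update over this Array-of-Structures layout. The function name `compute_acceleration`, the `float` typedef, and the `G` and `softening` parameters are assumptions for illustration; the demo's actual routine may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef float real_type;  // assumption: single precision

struct Particle {
    real_type pos[3];
    real_type vel[3];
    real_type acc[3];
    real_type mass;
};

// For every particle i, sum the gravitational pull of every particle j.
// 'softening' is added to the squared distance, so the j == i term
// contributes exactly zero instead of dividing by zero.
void compute_acceleration(std::vector<Particle>& p, real_type G, real_type softening)
{
    assert(softening > 0);
    const std::size_t n = p.size();
    for (std::size_t i = 0; i < n; ++i) {
        real_type a[3] = {0, 0, 0};
        for (std::size_t j = 0; j < n; ++j) {
            real_type d[3];
            real_type dist2 = softening;
            for (int k = 0; k < 3; ++k) {
                d[k] = p[j].pos[k] - p[i].pos[k];
                dist2 += d[k] * d[k];
            }
            const real_type inv = real_type(1) / std::sqrt(dist2);
            const real_type s = G * p[j].mass * inv * inv * inv;
            for (int k = 0; k < 3; ++k)
                a[k] += s * d[k];
        }
        for (int k = 0; k < 3; ++k)
            p[i].acc[k] = a[k];
    }
}
```

Note that the inner loop reads `p[j].pos` and `p[j].mass` with a stride of one whole `Particle`, which is the access pattern the Memory Access Pattern analysis later flags.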

# INTEL® COMPILER REPORTS

FREE\* performance metrics

#### Compile with -qopt-report=5

- Which loops were vectorized
  - Vector Length
  - Estimated Gain
  - Alignment
  - Scatter/Gather

- Prefetching
- Issues preventing vectorization
- Inline reports
- Interprocedural optimizations
- Register Spills/Fills

```
LOOP BEGIN at ../src/timestep.F(4835,13)
   remark #15389: vectorization support: reference nbd_(i) has unaligned access [ ../src/timestep.F(4836,16) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
   remark #15329: vectorization support: irregularly indexed store was emulated for the variable <coefd (nbd (i))>, part of index is read from memory
   remark #15305: vectorization support: vector length 2
   remark #15399: vectorization support: unroll factor set to 4
   remark #15309: vectorization support: normalized vectorization overhead 0.139
   remark #15450: unmasked unaligned unit stride loads: 1
   remark #15463: unmasked indexed (or scatter) stores: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 4
   remark #15477: vector cost: 4.500
   remark #15478: estimated potential speedup: 0.880
   remark #15488: --- end vector cost summary ---
   remark #25439: unrolled with remainder by 2
LOOP END
```
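The report above comes from a Fortran source. As a rough illustration of the pattern it describes, here is a hypothetical C++ analog: `nbd` is read with unit stride, while the store target `coefd[nbd[i]]` depends on a value loaded from memory. The function name and signature are assumptions; only the array names are taken from the report.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indexed ("scatter") store: the destination index comes from nbd[i],
// so consecutive iterations may write to non-adjacent elements of coefd.
void indexed_store(const std::vector<int>& nbd, std::vector<double>& coefd, double value)
{
    for (std::size_t i = 0; i < nbd.size(); ++i) {
        assert(nbd[i] >= 0 && static_cast<std::size_t>(nbd[i]) < coefd.size());
        coefd[nbd[i]] = value;
    }
}
```

A compiler cannot turn such a store into a single contiguous vector write, which is consistent with the report estimating no gain from vectorizing the loop.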



#### The Basic Tuning Cycle



Infinite cycle only broken by external constraints (time, papers, releases ... )

Procedures for measuring performance and validating results are critical

**Automation** and **environment** control are key for **consistency** 

Where do I start?

- /soft/perftools/intel/advisor/advixe.qsub
- /soft/perftools/intel/vtune/amplxe.qsub

#### amplxe.qsub Script

- Copy and customize the script from /soft/perftools/intel/vtune/amplxe.qsub
- All-in-one script for profiling
  - Job size: ranks, threads, hyperthreads, affinity
  - Attach to a single, multiple or all ranks
  - Binary as arg#1, input as arg#2
    - qsub amplxe.qsub ./your\_exe ./inputs/inp
  - Binary and source search directory locations
  - Timestamp + binary name + input name as result directory
  - Save cobalt job files to result directory





## **Version Optimizations**

- Ver0
  - Initial implementation
- Ver1
  - -xMIC-AVX512
- Ver2
  - Use only floats
- Ver3
  - AoS -> SoA + SIMD Reduce



# INTEL® ADVISOR

Vectorization and Static Analysis

https://www.alcf.anl.gov/user-guides/advisor-xc40

## Intel® Advisor - Vectorization Optimization

#### Faster Vectorization Optimization:

- Vectorize where it will pay off most
- Quickly ID what is blocking vectorization
- Tips for effective vectorization
- Safely force compiler vectorization
- Optimize memory stride

#### Roofline model analysis:

- Automatically generate roofline model
- Evaluate current performance
- Identify boundedness





http://intel.ly/advisor-xe

Add Parallelism with Less Effort, Less Risk and More Impact

#### Cache-Aware Roofline


**Next Steps** 

#### If under or near a memory roof...

- Try a MAP analysis. Make any appropriate cache optimizations.
- If cache optimization is impossible, try reworking the algorithm to have a higher AI (arithmetic intensity).

#### If Under the Vector Add Peak

Check "Traits" in the Survey to see if FMAs are used. If not, try altering your code or compiler flags to induce FMA usage.



#### If just above the Scalar Add Peak

Check **vectorization efficiency** in the Survey. Follow the recommendations to improve it if it's low.

#### If under the Scalar Add Peak...

Check the Survey Report to see if the loop vectorized. If not, try to get it to vectorize if possible. This may involve running Dependencies to see if it's safe to force it.
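The reasoning behind these next steps is the roofline bound: attainable performance is the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. A minimal sketch, with hypothetical function names and no values taken from Advisor:

```cpp
#include <algorithm>
#include <cassert>

// Attainable GFLOPS under the roofline model:
// min(compute peak, arithmetic intensity * memory bandwidth).
double roofline_bound(double peak_gflops, double bandwidth_gbs, double ai_flops_per_byte)
{
    assert(peak_gflops > 0 && bandwidth_gbs > 0 && ai_flops_per_byte >= 0);
    return std::min(peak_gflops, bandwidth_gbs * ai_flops_per_byte);
}

// A loop sits under a memory roof when the bandwidth term is the smaller one.
bool is_memory_bound(double peak_gflops, double bandwidth_gbs, double ai_flops_per_byte)
{
    return bandwidth_gbs * ai_flops_per_byte < peak_gflops;
}
```

Raising the arithmetic intensity moves a loop to the right on the chart until the compute roof, rather than a memory roof, becomes the limit.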

## Typical Vectorization Optimization Workflow

There is no need to recompile or relink the application, but the use of **-g** is recommended.

Note: if you're using Theta, run out of /projects rather than /home

1. Collect survey (overhead ~5%): `advixe-cl -c survey`
   - Basic info (static analysis): ISA, time spent, etc.
2. Collect Tripcounts and Flops (overhead 1-10x): `advixe-cl -c tripcounts -flop`
   - Investigate application place within roofline model
   - Determine vectorization efficiency and opportunities for improvement
3. Collect dependencies (overhead 5-1000x): `advixe-cl -c dependencies`
   - Differentiate between real and assumed issues blocking vectorization
4. Collect Memory Access Patterns: `advixe-cl -c map`


#### Use -h Option!

#### advixe-cl -h collect

#### Generate Advisor Command Lines from the GUI



#### Collect survey and tripcounts (roofline)

```
$ module load advisor
$ cd /projects/intel/pvelesko/nody-demo/ver0
$ make
$ cp /soft/perftools/intel/advisor/advixe.qsub ./
$ qsub ./advixe.qsub ./nbody.x 2000 500
```

#### View Result

X-forwarding is not recommended.

Tar the result along with sources (if you want to be able to view them)

or

Generate a snapshot:

\$ advixe-cl --snapshot --pack --cache-sources --cache-binaries

then scp to your local machine

## **Summary Report**



Summary provides overall performance characteristics

Top time consuming loops are listed individually

Vectorization efficiency is based on used ISA

#### Survey Report (Source Tab)



#### Notice the following:

- Higher ISA available
- Type conversion
- Use of square root

All of these elements may affect performance

#### Survey Report (Code Analytics Tab)



#### Analytics tab contains a wealth of information

- Instruction set
- Instruction mix
- Traits (sqrt, type conversions, unpacks)
- Vector efficiency
- Floating point statistics

And explanations on how they are measured or calculated - expand the box or hover over the question marks.

## CARM (Cache-aware roofline model) Analysis



Using single threaded roof

Code vectorized, but performance on par with scalar add peak?

- Irregular memory access patterns force gather operations.
- Overhead of setting up vector operations reduces efficiency.

Next step is clear: perform a Memory Access Pattern analysis



## Memory Access Pattern Analysis (Refinement)

Modify advixe.qsub to collect "survey" followed by "map", then submit: `qsub ./advixe.qsub ./nbody.x 2000 500`



Storage of particles is in an Array Of Structures (AOS) style

This leads to regular, but non-unit strides in memory access

- 33% unit
- 33% uniform, non-unit
- 33% non-uniform

Re-structuring the code into a Structure Of Arrays (SOA) may lead to unit stride access and more effective vectorization
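A minimal sketch of what the AoS to SoA restructuring could look like. The field names `pos_x`, `acc_x`, `mass` and so on follow the SoA code shown later in the deck; the `vel_*` names, the use of `std::vector`, and the `to_soa` helper are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef float real_type;  // assumption: single precision

// Array of Structures: one particle's fields are adjacent in memory.
struct ParticleAoS {
    real_type pos[3];
    real_type vel[3];
    real_type acc[3];
    real_type mass;
};

// Structure of Arrays: the same field of all particles is adjacent in memory,
// so a loop over particles reads each array with unit stride.
struct ParticleSoA {
    std::vector<real_type> pos_x, pos_y, pos_z;
    std::vector<real_type> vel_x, vel_y, vel_z;
    std::vector<real_type> acc_x, acc_y, acc_z;
    std::vector<real_type> mass;
};

ParticleSoA to_soa(const std::vector<ParticleAoS>& in)
{
    ParticleSoA out;
    const std::size_t n = in.size();
    out.pos_x.resize(n); out.pos_y.resize(n); out.pos_z.resize(n);
    out.vel_x.resize(n); out.vel_y.resize(n); out.vel_z.resize(n);
    out.acc_x.resize(n); out.acc_y.resize(n); out.acc_z.resize(n);
    out.mass.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.pos_x[i] = in[i].pos[0]; out.pos_y[i] = in[i].pos[1]; out.pos_z[i] = in[i].pos[2];
        out.vel_x[i] = in[i].vel[0]; out.vel_y[i] = in[i].vel[1]; out.vel_z[i] = in[i].vel[2];
        out.acc_x[i] = in[i].acc[0]; out.acc_y[i] = in[i].acc[1]; out.acc_z[i] = in[i].acc[2];
        out.mass[i]  = in[i].mass;
    }
    return out;
}
```

In the AoS layout, consecutive `pos[0]` values are `sizeof(ParticleAoS)` bytes apart; in the SoA layout, consecutive `pos_x` values are `sizeof(real_type)` bytes apart.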

## Vectorization: gather/scatter operation

The compiler might generate gather/scatter instructions for loops automatically vectorized where memory locations are not contiguous

```
struct Particle
{
  public:
    ...
    real_type pos[3];
    real_type vel[3];
    real_type acc[3];
    real_type mass;
};
```



## Memory access pattern analysis

How should I access data?



#### Unit stride accesses are faster

`for (i=0; i<N; i++) A[i] = B[i]*d`

For B, 1 cache line load computes 4 DP

#### Constant stride accesses are more complex

`for (i=0; i<N; i+=2) A[i] = B[i]*d`

For B, 2 cache line loads compute 4 DP with reconstructions

#### Non-predictable accesses are usually bad

`for (i=0; i<N; i++) A[i] = B[C[i]]*d`

For B, 4 cache line loads compute 4 DP with reconstructions, prefetching might not work
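The cache line counts quoted for the three patterns can be checked with a small sketch. It assumes, to match the slide's illustration, a line that holds 4 doubles (32 bytes); actual KNL cache lines are 64 bytes. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Count the distinct cache lines touched when loading B at the given
// element indices, for a chosen element size and cache line size.
std::size_t cache_lines_touched(const std::vector<std::size_t>& indices,
                                std::size_t elem_bytes,
                                std::size_t line_bytes)
{
    assert(elem_bytes > 0 && line_bytes >= elem_bytes);
    std::set<std::size_t> lines;
    for (std::size_t idx : indices)
        lines.insert(idx * elem_bytes / line_bytes);
    return lines.size();
}
```

With 8-byte elements and 32-byte lines, indices `{0, 1, 2, 3}` touch one line, `{0, 2, 4, 6}` touch two, and scattered indices such as `{0, 9, 17, 42}` touch four.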

## Performance After Data Structure Change

In this new version (version 3 in GitHub sample) we introduce the following change:

 Change particle data structures from AOS to SOA

#### Note changes in report:

- Performance is lower
- Main loop is no longer vectorized
- Assumed vector dependence prevents automatic vectorization



Next step is clear: perform a Dependencies analysis

## Dependencies Analysis (Refinement)

Modify advixe.qsub to collect "survey" followed by "dependencies", then submit: `qsub advixe.qsub ./ver3/nbody.x`



Dependencies analysis has high overhead: run on a reduced workload.

#### **Advisor Findings:**

- RAW dependency
- Multiple reduction-type dependencies

#### Recommendations


#### Recommendation: Resolve dependency

The Dependencies analysis shows there is a real (proven) dependency in the loop. To fix: Do one of the following:

• If there is an anti-dependency, enable vectorization using the directive #pragma omp simd safelen(length), where length is smaller than the distance between dependent iterations in anti-dependency. For example:

```
#pragma omp simd safelen(4)
for (i = 0; i < n - 4; i += 4)
   a[i + 4] = a[i] * c;
```

#### ISSUE: PROVEN (REAL) DEPENDENCY PRESENT

The compiler assumed there is an anti-dependency (Write after read - WAR) or true dependency (Read after write - RAW) in the loop. Improve performance by investigating the assumption and handling accordingly.



Resolve dependency

• If there is a reduction pattern dependency in the loop, enable vectorization using the directive #pragma omp simd reduction(operator:list). For example:

```
#pragma omp simd reduction (+:sumx)
for (k = 0; k < size2; k++)
    sumx += x[k]*b[k];
```

#### Performance After Resolved Dependencies



New memory access pattern plus vectorization produces much improved performance! What's next?

\*Other names and brands may be claimed as the property of others.

#### Advisor Roofline – How much further can we go?

$$FMA\ Ratio = \frac{3}{29} \approx 10\%$$

$$Peak = SP\ Vector\ ADD \times (1 + FMA\ Ratio) = 40 \times (1 + 0.1) = 44\ \text{GFLOPS}$$



## **Vectorization Efficiency?**





#### **Complex Operations?**



#### Poor Cache Utilization?



# INTEL® VTUNE™ AMPLIFIER

Core-level hardware metrics

https://www.alcf.anl.gov/user-guides/vtune-xc40

### Intel® VTune™ Amplifier

#### VTune Amplifier is a full system profiler

- Accurate
- Low overhead
- Comprehensive (microarchitecture, memory, IO, threading, ... )
- Highly customizable interface
- Direct access to source code and assembly
- User-mode driverless sampling
- Event-based sampling

Analyzing code access to shared resources is critical to achieve good performance on multicore and manycore systems

### **Predefined Collections**

#### Many available analysis types:

- uarch-exploration: General microarchitecture exploration
- hpc-performance: HPC Performance Characterization
- memory-access: Memory Access
- disk-io: Disk Input and Output
- concurrency: Concurrency
- gpu-hotspots: GPU Hotspots
- gpu-profiling: GPU In-kernel Profiling
- hotspots: Basic Hotspots
- locksandwaits: Locks and Waits
- memory-consumption: Memory Consumption
- system-overview: System Overview
- ...

**Python Support** 



### Collect uarch-exploration

```
cd /projects/intel/pvelesko/nody-demo/ver7
module load vtune
vim Makefile # edit to add -dynamic
cp /soft/perftools/intel/vtune/amplxe.qsub ./
vim amplxe.qsub # edit collection to "uarch-exploration"
qsub ./amplxe.qsub ./nbody.x 2000 500
```

scp result back to your local machine



### Hotspots analysis for nbody demo (ver7: threaded)

qsub amplxe.qsub ./your\_exe ./inputs/inp





Lots of spin time indicates issues with load balance and synchronization

Given the short OpenMP region duration, it is likely we do not have sufficient work per thread

Let's look at the timeline for each thread to understand things better...

### Bottom-up Hotspots view



There is not enough work per thread in this particular example.

Double click on line to access source and assembly.

Notice the filtering options at the bottom, which allow customization of this view.

Next steps would include additional analysis to continue the optimization process.

















### Viewing the result

- Text file reports:
  - `amplxe-cl -help report`: How do I create a text report?
  - `amplxe-cl -help report hotspots`: What can I change?
  - `amplxe-cl -R hotspots -r ./res_dir -column=?`: Which columns are available?
  - Ex: Report top 5% of loops, Total time and L2 Cache hit rates
    - `amplxe-cl -R hotspots -loops-only -limit=5 -column="L2_CACHE_HIT, Time Self (%)"`
- Vtune GUI
  - `unset LD_PRELOAD; amplxe-gui`



### Poor Cache Utilization?



### Using Vtune to

Microarchitecture Exploration summary for result `/gpfs/jlse-fs0/users/pvelesko/nbody-demo/ver5/amplxe_knl_nodiv_60k`:

- Elapsed Time: 280.549s
- Clockticks: 405,093,000,000
- Instructions Retired: 342,199,000,000
- CPI Rate: 1.184
- MUX Reliability: 0.992
- Total Thread Count: 1
- Front-End Bound: 1.5% of Pipeline Slots
- Bad Speculation: 0.2% of Pipeline Slots
- Back-End Bound: 56.2% of Pipeline Slots
- Retiring: 42.1% of Pipeline Slots
- L1 Hit Rate: 60.2%
- L2 Hit Rate: 98.8%
- L2 Hit Bound: 100.0% of Clockticks
- L2 Miss Bound: 36.2% of Clockticks
- UTLB Overhead: 4.0% of Clockticks
- Page Walk: 4.9% of Clockticks
- SIMD Compute-to-L1 Access Ratio: 1.490
- SIMD Compute-to-L2 Access Ratio: 4.003
- VPU Utilization: 99.9% of Clockticks
- Divider: 0.0% of Clockticks

VTune flags Back-End Bound, L2 Hit Bound and L2 Miss Bound as issues. Its suggested remedies: reduce the data working set size, improve data access locality, block the working set so that it fits into the L1 or L2, or better exploit hardware prefetchers. Software prefetches are an option, but they can interfere with normal loads and increase pressure on the memory system.

### Memory Performance

```
for (i = 0; i < n; i++) // update acceleration
{
#ifdef ASALIGN
    __assume_aligned(particles->pos_x, alignment);
    __assume_aligned(particles->pos_y, alignment);
    __assume_aligned(particles->pos_z, alignment);
    __assume_aligned(particles->acc_x, alignment);
    __assume_aligned(particles->acc_y, alignment);
    __assume_aligned(particles->acc_z, alignment);
#endif
    // accumulate into scalars so the inner loop is a plain reduction
    real_type ax_i = particles->acc_x[i];
    real_type ay_i = particles->acc_y[i];
    real_type az_i = particles->acc_z[i];
#pragma omp simd simdlen(16) reduction(+:ax_i, ay_i, az_i)
    for (j = 0; j < n; j++)
    {
        real_type dx, dy, dz;
        real_type distanceSqr = 0.0f;
        real_type distanceInv = 0.0f;

        dx = particles->pos_x[j] - particles->pos_x[i];
        dy = particles->pos_y[j] - particles->pos_y[i];
        dz = particles->pos_z[j] - particles->pos_z[i];

        distanceSqr = dx*dx + dy*dy + dz*dz + softeningSquared;
        distanceInv = 1.0f / sqrtf(distanceSqr);

        ax_i += dx * G * particles->mass[j] * distanceInv * distanceInv * distanceInv; //6flops
        ay_i += dy * G * particles->mass[j] * distanceInv * distanceInv * distanceInv; //6flops
        az_i += dz * G * particles->mass[j] * distanceInv * distanceInv * distanceInv; //6flops
    }
    particles->acc_x[i] = ax_i;
    particles->acc_y[i] = ay_i;
    particles->acc_z[i] = az_i;
}
```

Maximum N before we lose caching?

- KNL: L1 32 kB, L2 1 MB (1 tile / 2 cores)
- L1: 32k/(4\*4) = 2k
- L2: 1MB/(7\*4) = 35.7k
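The estimate above is a cache size divided by the bytes needed per particle. A minimal sketch of that arithmetic follows. Reading the 4 as the four arrays streamed by the inner loop (`pos_x`, `pos_y`, `pos_z`, `mass`) and the 7 as all position, acceleration and mass arrays is an interpretation rather than something the slide states, and the function name is hypothetical. Each value is assumed to be a 4-byte float.

```cpp
#include <cassert>
#include <cstddef>

// Largest particle count whose working set fits in a cache of cache_bytes,
// when each particle contributes arrays_per_particle values of elem_bytes each.
std::size_t max_particles_in_cache(std::size_t cache_bytes,
                                   std::size_t arrays_per_particle,
                                   std::size_t elem_bytes)
{
    assert(arrays_per_particle > 0 && elem_bytes > 0);
    return cache_bytes / (arrays_per_particle * elem_bytes);
}
```

`max_particles_in_cache(32 * 1024, 4, 4)` gives 2048, the 2k quoted for L1. The 35.7k quoted for L2 corresponds to 1 MB taken as 10^6 bytes (`max_particles_in_cache(1000000, 7, 4)` gives 35714); with 2^20 bytes the result is 37449.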

#### **GFLOPs** vs N



### Microarchitecture Exploration - Caches

| N                | 2k   | 2.5k  | 30k   | 35k   | 50k   | 60k   |
|------------------|------|-------|-------|-------|-------|-------|
| L1 Hit %         | 100% | 63.9% | 62.4% | 48.5% | 57.5% | 60.2% |
| L2 Hit %         | 0%   | 100%  | 100%  | 100%  | 99.2% | 98.8% |
| L2 Hit Bound %   | 0%   | 100%  | 100%  | 100%  | 100%  | 100%  |
| L2 Miss Bound %  | 0%   | 0%    | 0%    | 0%    | 28.6% | 36.2% |





# PROFILING PYTHON & ML APPLICATIONS

### **Python**

Profiling Python is straightforward in VTune™ Amplifier, as long as one does the following:

- The "application" should be the full path to the python interpreter used
- The python code should be passed as "arguments" to the "application"

On Theta this would look like this:

### Simple Python Example on Theta

```
aprun -n 1 -N 1 amplxe-cl -c hotspots -r vt_pytest \
-- /usr/bin/python ./cov.py naive 100 1000
```



Naïve implementation of the calculation of a covariance matrix

#### Summary shows:

- Single thread execution
- Top function is "naive"

Click on the top function to go to the Bottom-up view

### Bottom-up View and Source Code



Note that for mixed Python/C code a Top-Down view can often be helpful to drill down into the C kernels



We could use numpy to improve on this

# INTEL® VTUNE™ APPLICATION PERFORMANCE SNAPSHOT

Performance overview at your fingertips

### VTune™ Amplifier's Application Performance Snapshot

High-level overview of application performance

- Identify primary optimization areas
- Recommend next steps in analysis
- Extremely easy to use
- Informative, actionable data in clean HTML report
- Detailed reports available via command line
- Low overhead, high scalability



### Usage on Theta

Launch all profiling jobs from /projects rather than /home

No module available, so set up the environment manually:

```
$ module load vtune
```

Launch your job in interactive or batch mode:

```
$ aprun -N <ppn> -n <totRanks> [affinity opts] aps ./exe
```

Produce text and html reports:

```
$ aps --report=./aps_result_ ....
```



### **APS HTML Report**



#### **Application Performance Snapshot**

- Report creation date: 2017-08-01 12:08:48
- Number of ranks: 144
- Ranks per node: 18
- OpenMP threads per rank: 2
- HW Platform: Intel(R) Xeon(R) Processor code named Broadwell-EP
- Logical Core Count per node: 72

Elapsed Time: 121.39s

50.98 0.68

SP FLOPS (MAX 0.81, MIN 0.65)

#### **MPI Time**

53.74% of Elapsed Time (65.23s)

MPI Imbalance: 11.03% of Elapsed Time (13.39s)

| Top 5 MPI Functions | %    |
|---------------------|------|
| Waitall             | 37.3 |
| Isend               | 6.48 |
| Barrier             | 5.52 |
| Irecv               | 3.70 |
| Scatterv            | 0.00 |

#### I/O Bound

0.00% (AVG 0.00, PEAK 0.00)

OpenMP Imbalance 0.43% of Elapsed Time (0.52s)

#### Memory Footprint

Resident:

- Per node: Peak 786.96 MB, Average 687.49 MB
- Per rank: Peak 127.62 MB, Average 38.19 MB

Virtual:

- Per node: Peak 9173.34 MB, Average 9064.92 MB
- Per rank: Peak 566.52 MB, Average 503.61 MB

#### Your application is MPI bound.

This may be caused by high busy wait time inside the library (imbalance), nonoptimal communication schema or MPI library settings. Use MPI profiling tools like Intel® Trace Analyzer and Collector to explore performance bottlenecks.

|                  | Current run | Target | Delta |
|------------------|-------------|--------|-------|
| MPI Time         | 53.74%      | <10%   |       |
| OpenMP Imbalance | 0.43%       | <10%   |       |
| Memory Stalls    | 14.70%      | <20%   |       |
| FPU Utilization  | 0.30%       | >50%   |       |
| I/O Bound        | 0.00%       | <10%   |       |

#### **Memory Stalls**

14.70% of pipeline slots

Cache Stalls 12.84% of cycles

DRAM Stalls 0.18% of cycles

NUMA 31.79% of remote accesses

#### **FPU Utilization**

SP FLOPs per Cycle 0.08 Out of 32.00

Vector Capacity Usage 25.84%

FP Instruction Mix

- % of Packed FP Instr.: 3.54%
  - % of 128-bit: 3.54%
  - % of 256-bit: 0.00%
- % of Scalar FP Instr.: 96.46%

FP Arith/Mem Rd Instr. Ratio

FP Arith/Mem Wr Instr. Ratio 0.30



# **COMMON ISSUES**

### **Fixes**

No call stack information/unknown stack frame

- Check finalization log
  - Make sure Vtune finds your binary along with libraries that you call

Incompatible database schema when trying to open result in GUI

- Make sure your local Vtune is the same version or newer

Vtune sampling driver... using perf, or errors mentioning PMU resources

- Notify support@alcf.anl.gov or me pvelesko@anl.gov



## TIPS AND TRICKS

### Speeding up finalization

Advisor

- Add `--no-auto-finalize` to the collection command run under aprun. Running `advixe-cl -R survey ...` afterwards <u>without aprun</u> finalizes the result on the momnode rather than on KNL.
- You can also finalize on thetalogin:

cd your\_src\_dir;

export SRCDIR=`pwd | xargs realpath`

advixe-cl -R survey --search-dir src:=\${SRCDIR} ..

Vtune

- Add `--finalization-mode=none` to the collection command run under aprun. Running `amplxe-cl -R hotspots ...` afterwards <u>without aprun</u> finalizes the result on the momnode rather than on KNL.
- You can also finalize on thetalogin:

cd your\_src\_dir;

export SRCDIR=`pwd | xargs realpath`

amplxe-cl -R hotspots --search-dir src:=\${SRCDIR} ..



### Managing overheads

Advisor Dependencies and MAP analyses can have huge overheads

If you can, run on a reduced problem size. Advisor just needs to figure out the execution flow.

Only analyze loops/functions of interest:

https://software.intel.com/en-us/advisor-user-guide-mark-up-loops

### When do I use Vtune vs Advisor?

#### Vtune

- What's my cache hit ratio?
- Which loop/function is consuming most time overall? (bottom-up)
- Am I stalling often? IPC?
- Am I keeping all the threads busy?
- Am I hitting remote NUMA?
- When do I maximize my BW?

#### Advisor

- Which vector ISA am I using?
- Flow of execution (callstacks)
- What is my vectorization efficiency?
- Can I safely force vectorization?
- Inlining? Data type conversions?
- Roofline

# **BACKUP**

### **VTune Cheat Sheet**

Compile with -g -dynamic

```
amplxe-cl -c hpc-performance -flags -- ./executable
```

- --result-dir=./vtune\_output\_dir
- --search-dir src:=../src --search-dir bin:=./
- -knob enable-stack-collection=true -knob collect-memorybandwidth=false
- -knob analyze-openmp=true
- -finalization-mode=deferred if finalization is taking too long on KNL
- -data-limit=125 ← in MB
- -trace-mpi for MPI metrics on Theta
- amplxe-cl -help collect survey



### **Advisor Cheat Sheet**

Compile with -g -dynamic

advixe-cl -c roofline/dependencies/map -flags -- ./executable

- --project-dir=./advixe\_output\_dir
- --search-dir src:=../src --search-dir bin:=./
- -no-auto-finalize if finalization is taking too long on KNL
- --interval 1 (sample at 1ms interval, helps for profiling short runs)
- -data-limit=125 ← in MB
- advixe-cl -help



