CrayPat-lite

Introduction

The Cray Performance Analysis Tool (CrayPat) is used to evaluate program behavior. This article focuses on CrayPat-lite, the simplified version of CrayPat. CrayPat-lite supports both MPI and OpenMP and can be used to identify time-consuming routines and load imbalances.

Using CrayPat-lite

1. Environment Setup

To use CrayPat-lite, the appropriate module should be loaded, and darshan should be unloaded.

$ module unload darshan

$ module load perftools-lite

Note:

  • Darshan needs to be unloaded since it uses environment variables which CrayPat-lite also uses.
  • The perftools-base module should already appear in the output of “module list”. If it does not, load it with “module load perftools-base”. This module provides access to the man pages and help system and does not instrument code. (It should be loaded by default.)
  • “module load perftools-lite” loads the performance instrumentation module, which instruments programs when they are compiled.

Tip:

  • To verify that the correct modules are loaded, run "module list" and check that perftools-lite and perftools-base are loaded and that darshan is not present.
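
The check described in this tip can be scripted. The following is a minimal sketch, assuming a bash shell; check_modules is a hypothetical helper name, not part of perftools. It reads a module listing from stdin so that it works regardless of how the listing is produced.

```shell
# Sketch: check a module listing (read from stdin) for the CrayPat-lite setup.
# check_modules is a hypothetical helper, not something perftools provides.
check_modules() {
    local listing required status=0
    listing=$(cat)
    # Both perftools modules must appear in the listing.
    for required in perftools-base perftools-lite; do
        if ! printf '%s\n' "$listing" | grep -q "$required"; then
            echo "missing: $required"
            status=1
        fi
    done
    # darshan must not appear in the listing.
    if printf '%s\n' "$listing" | grep -q "darshan"; then
        echo "still loaded: darshan"
        status=1
    fi
    return $status
}

# Usage (on many systems "module list" writes to stderr, hence 2>&1):
#   module list 2>&1 | check_modules && echo "environment OK"
```

The function prints nothing and returns 0 when the environment matches the setup described above.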

2. Compile the Code to Use CrayPat-lite

Rebuild the code as usual. The build output should include a line like the INFO message below, indicating that CrayPat-lite instrumented the executable.

$ make clean
$ make

ftn -O3 -qopt-report=5 -g  -align array64byte test.f90 -o test

INFO: creating the CrayPat-instrumented executable 'test' (sample_profile) ...OK

$

Tip:

  • To verify that the executable was instrumented with CrayPat-lite, use the following command to search the executable for CrayPat strings:
$ strings test | grep "CrayPat/X"

CrayPat/X:  Version 7.0.0 Revision 5c29ce2  12/11/17 15:26:24
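
When several executables need to be checked, the same search can be wrapped in a small function. This is a sketch only: is_instrumented is a hypothetical helper name, and it uses grep -a (which scans a binary file as text) in place of the strings and grep pipeline shown above.

```shell
# Sketch: report whether a binary carries the CrayPat version string.
# is_instrumented is a hypothetical helper; grep -a scans the binary as text,
# standing in for the "strings | grep" pipeline.
is_instrumented() {
    if grep -a -q "CrayPat/X" "$1"; then
        echo "$1: instrumented"
    else
        echo "$1: not instrumented"
        return 1
    fi
}

# Usage: is_instrumented ./test
```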

3. Run the Code

Run the code as usual.

$ qsub -n 8 -t 30 -A project jobscript.sh

$ cat jobscript.sh

#!/bin/bash

aprun -n 512 -N 64 test

4. Output

After the code finishes executing, CrayPat-lite prints its report to stdout (likely at the end of the jobid.output file). Additional output is saved in .rpt and .ap2 files, which may be placed in a new directory created inside the directory where the run took place.
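
Because the report files may land in a newly created subdirectory, a recursive search is a convenient way to locate them. The sketch below assumes only the .rpt and .ap2 extensions mentioned above; list_craypat_reports is a hypothetical helper name.

```shell
# Sketch: list CrayPat-lite report files at or below a run directory.
# list_craypat_reports is a hypothetical helper name; it defaults to the
# current directory when no argument is given.
list_craypat_reports() {
    find "${1:-.}" -type f \( -name '*.rpt' -o -name '*.ap2' \) | sort
}

# Usage: list_craypat_reports /path/to/run/directory
```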

An example of output is shown below. Note that it is truncated after Table 1, since this is just for illustration purposes.

 

$ cat 309226.output

…

Normal program output

…

#################################################################
#                                                               #
#            CrayPat-lite Performance Statistics                #
#                                                               #
#################################################################

 
CrayPat/X:  Version 7.0.0 Revision 5c29ce2  12/11/17 15:26:24
Experiment:                  lite  lite/sample_profile
Number of PEs (MPI ranks):    512
Numbers of PEs per Node:       64  PEs on each of  8  Nodes
Numbers of Threads per PE:      1
Number of Cores per Socket:    64
Execution start time:  Thu Mar 22 18:09:16 2018
System name and speed:  nid00690  1.301 GHz (nominal)
Intel Knights Landing CPU  Family:  6  Model: 87  Stepping:  1
MCDRAM: 7.2 GHz, 16 GiB available as quad, cache (100% cache)

 

Avg Process Time:           74.06 secs
High Memory:             21,298.6 MBytes     41.6 MBytes per PE
Instr per Cycle:             1.16
Observed CPU cycle rate:     1.39 GHz
I/O Read Rate:           1.899058 MBytes/sec
I/O Write Rate:          0.447113 MBytes/sec

Table 1:  Profile by Function
 

  Samp% |    Samp |  Imb. |  Imb. | Group
        |         |  Samp | Samp% |  Function=[MAX10]
        |         |       |       |   PE=HIDE
 100.0% | 7,355.9 |    -- |    -- | Total
|-------------------------------------------------------------
|  67.0% | 4,925.5 |    -- |    -- | USER
||------------------------------------------------------------
||  48.7% | 3,582.1 | 338.9 |  8.7% | genral_
||   9.9% |   726.7 |  85.3 | 10.5% | xyzint_
||   6.5% |   481.6 |  80.4 | 14.3% | rt123_
||============================================================
|  21.5% | 1,580.0 |    -- |    -- | BLAS
||------------------------------------------------------------
||   8.6% |   634.0 |  70.0 | 10.0% | gotoblas_dgetrf_single_knl
||   5.7% |   417.1 |  64.9 | 13.5% | gotoblas_dgemm_kernel_knl
||   2.3% |   168.5 |  34.5 | 17.0% | gotoblas_dlaswp_plus_knl
||   2.0% |   143.8 |  34.2 | 19.2% | gotoblas_dgemv_n_knl
||============================================================
|   8.7% |   637.3 |    -- |    -- | MPI
||------------------------------------------------------------
||   8.3% |   610.0 | 443.0 | 42.2% | MPI_ALLREDUCE
||============================================================
|   2.8% |   206.5 |    -- |    -- | ETC
||------------------------------------------------------------
||   2.6% |   193.8 |  60.2 | 23.7% | __svml_exp8_mask_b3
|=============================================================

...

By default, CrayPat-lite samples 100 times per second (once every 10,000 microseconds). To verify the sampling rate, check the runtime environment variable PAT_RT_SAMPLING_INTERVAL, which is given in microseconds.
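
The relationship between the interval and the rate is simple arithmetic: one second is 1,000,000 microseconds, so a 10,000 microsecond interval gives 100 samples per second. A minimal sketch follows; sampling_rate is a hypothetical helper, and the 10000 fallback is an assumption that mirrors the default described here.

```shell
# Sketch: convert a sampling interval in microseconds to samples per second.
# sampling_rate is a hypothetical helper. The 10000 fallback is an assumption
# used only when no argument is given and PAT_RT_SAMPLING_INTERVAL is unset.
sampling_rate() {
    local interval_us=${1:-${PAT_RT_SAMPLING_INTERVAL:-10000}}
    echo $(( 1000000 / interval_us ))
}

# Usage: sampling_rate          # uses PAT_RT_SAMPLING_INTERVAL if set
#        sampling_rate 10000    # explicit interval in microseconds
```

As a rough cross-check against the example report, 74.06 seconds of average process time at 100 samples per second gives about 7,406 samples, close to the 7,355.9 total shown in Table 1.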

Samp% is the percent of total samples taken which occurred in the given routine, averaged over all processes.

Samp is the number of samples which occurred in the given routine, averaged over all processes.

Imb. Samp is (maximum number of samples taken in given routine by one process - number of samples taken in given routine, averaged over all processes).

Imb. Samp% is (Imb. Samp) / (maximum number of samples taken in given routine by one process) * (number of processes / (number of processes - 1)) * 100%
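
The two imbalance formulas can be checked against Table 1. The sketch below is illustrative only: imbalance is a hypothetical helper, and the per-routine maximum is not printed in the table, so it is reconstructed here as Samp + Imb. Samp (for genral_, 3,582.1 + 338.9 = 3,921.0).

```shell
# Sketch: compute Imb. Samp and Imb. Samp% from the definitions above.
# imbalance is a hypothetical helper. Arguments:
#   $1 = average samples in the routine over all processes (Samp)
#   $2 = maximum samples in the routine on any one process
#   $3 = number of processes
imbalance() {
    awk -v avg="$1" -v max="$2" -v n="$3" 'BEGIN {
        imb = max - avg                        # Imb. Samp
        pct = imb / max * (n / (n - 1)) * 100  # Imb. Samp%
        printf "%.1f %.1f%%\n", imb, pct
    }'
}

# Usage with the genral_ row (512 processes):
#   imbalance 3582.1 3921.0 512
```

With the genral_ values this reproduces the 338.9 and 8.7% shown in the table, and with the MPI_ALLREDUCE values (average 610.0, maximum 1,053.0) it reproduces 443.0 and 42.2%.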

Based on the above, 48.7% of total samples occurred in the genral_ routine, averaged over all processes, and the largest number of samples taken in genral_ by any one process exceeded the average by 338.9 samples. (If the routine were perfectly load balanced, Imb. Samp would be 0.) Additionally, 21.5% of samples occurred in BLAS routines, averaged over all processes.

We can also see that this was run on 8 KNL nodes with 64 MPI ranks/node, for a total of 512 MPI processes.

More details on the output, and on how to get more information, can be found in the References and the man pages.

References

Manual pages (after module is loaded):

General information about perftools-lite:

$ man perftools-lite

Detailed information about the output report:

$ man pat_report