VTune on XC40

VTune is an advanced profiling tool that helps you optimize your code for the KNL architecture. It lets you track how well your code is threaded and vectorized to take advantage of multiple CPUs/FPUs, and how well it uses the non-uniform memory architecture and caches.

Step-by-step guide

1. Build your target application with all optimizations enabled (e.g. -O3 -xMIC-AVX512) and add the -g flag to include debugging information.
2. Submit your job using the sample batch script below:

#!/bin/bash
#COBALT -n 2
#COBALT -t 3
#COBALT -q debug-cache-quad
#COBALT -A Intel
# -- Set working directory
cd /home/username/working-dir
# -- Source the version of VTune you want
source /opt/intel/vtune_amplifier_xe_2017.3.0.510739/amplxe-vars.sh
# -- Run analysis using aprun
aprun -e LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/vtune_amplifier_xe_2017.3.0.510739/lib64 \
    -e PATH=$PATH:/opt/intel/vtune_amplifier_xe_2017.3.0.510739/bin64 \
    -e PMI_NO_FORK=1 \
    -n 8 \
    -N 4 \
    -cc depth \
    -d 1 \
    -j 1 \
    amplxe-cl -collect advanced-hotspots -r xyz ./a.out


  • amplxe-cl offers several collection types, including advanced hotspots ('amplxe-cl -collect advanced-hotspots'), general exploration ('amplxe-cl -collect general-exploration') and memory access ('amplxe-cl -collect memory-access'). Results are written to the directory named with the -r option (xyz in the script above), referred to below as <vtune-result-dir>.
  • For collections that produce large results, it is recommended to add the -no-auto-finalize flag. Finalization is compute intensive and runs serially, so it may take a long time on the KNL. It can instead be done on another machine after copying the results off the KNL.
  • The data collected may be very large for longer runs with many active threads. The default data limit is 500 MB. If you find that you are reaching it, use the flag -data-limit=<integer>, where the integer specifies the size in MB. Use -data-limit=0 for no limit.
  • Sourcing amplxe-vars.sh sets up all the environment variables required to use VTune. The script shown above does this explicitly.
  • For jobs with a large number of MPI ranks, you may not want to profile every rank, but only selected ones. To do this, make the following modifications to the job submission script shown above:
    • Replace the entire aprun command with: aprun -n 8 -N 4 -cc depth -d 1 -j 1 ./runscript.sh
    • Then create a second script called runscript.sh.
    • A sample runscript.sh (to profile a single rank) can be structured as follows:
#!/bin/bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/vtune_amplifier_xe_2017.3.0.510739/lib64
export PATH=$PATH:/opt/intel/vtune_amplifier_xe_2017.3.0.510739/bin64
export PMI_NO_FORK=1
if [ "$PE_RANK" == "0" ]; then
    amplxe-cl -collect advanced-hotspots -analyze-system \
        -finalization-mode=none \
        -r xyz ./a.out
else
    # All other ranks run the application without profiling
    ./a.out
fi
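
To profile more than one rank, the same idea can be generalized. The following is a minimal sketch, not part of the original scripts: the variable PROFILE_RANKS, the function names and the per-rank result directory name xyz_rank<N> are all hypothetical. The functions only print the command a rank should run, so the logic can be checked on a login node before it is used inside runscript.sh.

```shell
#!/bin/bash
# Sketch of rank-selective profiling (all names here are illustrative).
# PROFILE_RANKS: assumed user-defined, space-separated list of ranks to profile.
PROFILE_RANKS="${PROFILE_RANKS:-0}"

# Return success if the given rank appears in PROFILE_RANKS.
should_profile() {
    local rank="$1" r
    for r in $PROFILE_RANKS; do
        if [ "$r" = "$rank" ]; then
            return 0
        fi
    done
    return 1
}

# Print the command line a given rank should execute.
# Profiled ranks get an amplxe-cl prefix and their own result directory;
# all other ranks get the plain application command.
rank_command() {
    local rank="$1"
    shift
    if should_profile "$rank"; then
        echo "amplxe-cl -collect advanced-hotspots -finalization-mode=none -r xyz_rank${rank} -- $*"
    else
        echo "$*"
    fi
}
```

Inside runscript.sh, the printed command could then be executed with something like eval "$(rank_command "$PE_RANK" ./a.out)", assuming the rank variable is set as in the sample script above.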

3. After step 2 has completed, i.e. a results directory has been created, you can finalize the results as follows:
$> amplxe-cl -finalize -r <vtune-result-dir> -search-dir ./
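
If a run left several result directories behind (for example, one per profiled rank or node), the finalize command can be repeated for each of them. The sketch below assumes the directories share a common name prefix; the function name finalize_commands is hypothetical. It prints one finalize command per matching directory rather than running them, so the list can be reviewed first.

```shell
#!/bin/bash
# Print one finalize command per result directory matching a prefix.
# Usage: finalize_commands <result-dir-prefix> [search-dir]
finalize_commands() {
    local prefix="$1" search_dir="${2:-./}" dir
    for dir in "$prefix"*; do
        # Skip anything that is not a directory (including an unmatched glob).
        [ -d "$dir" ] || continue
        echo "amplxe-cl -finalize -r $dir -search-dir $search_dir"
    done
}
```

The output can be piped to a shell (finalize_commands xyz | bash) once the VTune environment has been sourced.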


4. The finalized results can be examined in either the GUI or the command line interface.
  • To examine the results using the GUI, do the following:
    • Copy the results directory to a machine of your choice (on which you have already installed the VTune GUI)
    • Launch the GUI
    • Click on the “Open Result” link.
  • While the GUI is very convenient, the command line interface provides a quick way to generate reports directly on Theta. To examine the results with the command line, do the following:
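
As an illustration of command line reporting, the sketch below prints two typical report commands for a finalized result directory, using the amplxe-cl -report action with the summary and hotspots report types. The function name report_commands is hypothetical, and the choice of reports is an assumption about what is useful as a first look, not a prescribed workflow.

```shell
#!/bin/bash
# Print report commands for a finalized result directory.
# Usage: report_commands <result-dir>
report_commands() {
    local result_dir="$1"
    if [ -z "$result_dir" ]; then
        echo "usage: report_commands <result-dir>" >&2
        return 1
    fi
    # Overall summary of the collection.
    echo "amplxe-cl -report summary -r $result_dir"
    # Breakdown of where time was spent.
    echo "amplxe-cl -report hotspots -r $result_dir"
}
```

Run 'amplxe-cl -help report' to see which report types and options your installed version supports.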


Command line options and Help system

After setting up the VTune environment variables, you can view the available actions and options with 'amplxe-cl -h'.


Type 'amplxe-cl -help <action>' for help on a specific action.

Analysis types in VTune

Intel® VTune Amplifier supports two types of profiling: time-based profiling and event-based profiling. Time-based profiling uses a system clock and reports how time is spent in various parts of a program. This is the “traditional” method of profiling. Event-based profiling uses hardware counters to count the number of events generated by various parts of a program. Events one may want to track are, for example, cache hits and cache misses at various levels of cache.

VTune organizes its analysis types into templates. The templates named “Hotspots”, “Concurrency” and “Locks and Waits” are all time-based analyses. The rest of the templates are event-based analyses. The simplest event-based analysis one can run is called “Advanced Hotspots”. This analysis tracks only the “Clockticks” and “Instructions Retired” events. Users new to event-based analysis should start with this template. A more complex, and more complete, event-based analysis type is called “General Exploration”. A short description of “Advanced Hotspots” and “General Exploration” is provided below. For more details, consult the documentation system built into the tool. All VTune documentation is also available online.
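
Switching between analysis types only changes the argument to -collect. The sketch below builds the collection command for the three command line analysis types named in this guide and rejects anything else. The function name collect_command and the result directory name in the usage note are hypothetical. On Theta, the printed command would take the place of the amplxe-cl part of the aprun line in the batch script.

```shell
#!/bin/bash
# Print an amplxe-cl collection command for one of the analysis types
# named in this guide.
# Usage: collect_command <analysis-type> <result-dir> <executable> [args...]
collect_command() {
    if [ "$#" -lt 3 ]; then
        echo "usage: collect_command <analysis-type> <result-dir> <executable> [args...]" >&2
        return 1
    fi
    local analysis="$1" result_dir="$2"
    shift 2
    case "$analysis" in
        advanced-hotspots|general-exploration|memory-access) ;;
        *)
            echo "unknown analysis type: $analysis" >&2
            return 1
            ;;
    esac
    echo "amplxe-cl -collect $analysis -r $result_dir -- $*"
}
```

For example, collect_command general-exploration ge_result ./a.out prints the command for a General Exploration run that writes to ge_result.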

Advanced hotspots analysis

The Advanced Hotspots Analysis will show where your application is spending its time, including information related to OpenMP parallelism. Ensure that the OpenMP runtime library used in the application (e.g. libiomp5.so) is available on the system doing the analysis. This is required to accurately analyze OpenMP overhead.

In the GUI, use the Bottom-up view to see time spent at various granularities, for example by Function or by Module. The granularity can be changed in the Grouping drop-down menu. Focus tuning efforts on the hot portions of your application.

General Exploration

KNL supports 512-bit vector instructions. To optimize for KNL, an application should take advantage of these large vector units with heavily vectorized code. Look at the VPU Utilization metric to identify areas of high and low vectorization. The metric is also available in the Bottom-up view of the General Exploration viewpoint. Locate hotspots with low VPU Utilization and try to improve their use of the AVX-512 capabilities.