gprof Profiling Tools


Introduction

Profiling is a technique for determining how a program uses processor resources. Principally, it is employed to determine how much of an application’s runtime is attributable to the various sections of the code, how often a routine is called, and the identities of its calling routines.

This page presents a discussion of the GNU-based profiling framework available on the Blue Gene/Q. This framework, referred to here as the gprof framework, consists of

  • An instrumentation library for collecting and recording data;
  • A compiler (either XL or GNU) that inserts calls to the instrumentation library into an application code; and
  • A post-processing tool called gprof for viewing the collected profile data.

When used with an application, the gprof framework provides the following information:

  • The approximate time spent in each routine.
  • The number of times a routine was invoked.
  • A list of the parent routines that invoke a given routine.
  • A list of the child routines a given routine invokes.
  • An estimate of the cumulative time spent in the child routines invoked by a given routine.

When using the gprof framework and interpreting its output, it is helpful to understand the concepts behind the framework. The section "Profiling Concepts" discusses these concepts; the "Profiling on the Blue Gene" section discusses how to use the framework; the "Issues in Interpreting Profile Data" section provides important caveats to keep in mind when interpreting the profile information; and the "References" section provides references for those seeking more information.

Profiling on the Blue Gene

Enabling Profiling

Using the gprof profiling framework simply requires linking, and optionally compiling, with the '-pg' compiler option for both XL and GNU compilers.

If an application is linked but the individual routines not compiled with the '-pg' option instruction, profiling information will be collected, but invocation profiling information will not be collected. This means only estimated routine times will be reported; no call graph information will be available. Following is an example of compiling this way:

   mpixlc -c main.c
   .
   .
   .
   mpixlc -c routine3.c
   mpixlc -pg -o gprof_test main.o routine1.o routine2.o routine3.o

Alternatively, compiling and linking with the 'pg' option enables collecting both types of profile information. Following is an example.

   mpixlc -pg -c main.c
   .
   .
   .
   mpixlc -pg -c routine3.c
   mpixlc -pg -o gprof_test main.o routine1.o routine2.o routine3.o

or

mpixlc -pg -o gprof_test main.c routine1.c routine2.c routine3.c

Collecting Profile Information

Profile information is collected by running a program that has been compiled so as to enable profiling. Files named gmon.out.<id> will be produced after the program is executed. The gmon.out files contain the raw profile information gathered by the profiling framework during the run. The gmon.out files are written to the current working directory for the program at the time that the program exits. If the program does not exit normally, no gmon.out files will be produced. On many systems multiple gmon.out files will be produced for an MPI program—typically one for each MPI rank with the suffix of the file denoting the rank. On the Blue Gene/Q the output has been modified such that gmon.out files are produced for only the first 32 ranks by default. The BG_GMON_RANK_SUBSET variable can be used to alter and control which ranks produce output as follows:

  • BG_GMON_RANK_SUBSET=N  - Only generate the gmon.out file for rank N
  • BG_GMON_RANK_SUBSET=N:M - Generate gmon.out files for all ranks from N to M.
  • BG_GMON_RANK_SUBSET=N:M:S  - Generate gmon.out files for all ranks from N to M,  skipping S

Additionally, generation of gmon..out files from individual ranks may be disabled by calling the function mondisable() from that rank.

Profiling Threaded Applications

The standard version of gprof does not support profiling of threaded code, however the BG/Q version has been modified to provide support for profilng threaded code. Thread profiling is not enabled by default and must be enabled by one of the following:

  • Setting the BG_GMON_START_THREAD_TIMERS environment variable. Setting this variable to "all" will enable profiling of all threads and setting it to "nocomm" will enable profiling of all threads except system communication threads.
  • Calling gmon_start_all_thread_timers() from the main thread of a process before the additional threads to be profiled are created. The prototype for this functions is in sys/gmon.h.
  • Calling gmon_thread_timer(int start) from the thread to be profiled. To start profiling pass an argument of 1, to stop profiling pass 0. The prototype for this functions is in sys/gmon.h.

Using gprof

The gprof utility is used to analyze and view the profile data contained in the gmon.out files. Two versions of gprof are available on the Blue Gene/Q system: the standard gprof (in /usr/bin) and powerpc64-bgq-linux-gprof (in /bgsys/drivers/ppcfloor/gnu-linux/bin). The latter version provides some additional functionality over the standard version, but both provide the same functionality for standard operations. Basic usage of gprof is:

   gprof <executable-file> gmon.out.<id> {gmon.out.<id> ...}

    or

   powerpc64-bgq-linux-gprof <executable-file> gmon.out.<id> {gmon.out.<id> ...}

When multiple gmon.out.<id> files are specified, the profile data is summed across all gmon.out.<id> files and then displayed. The most commonly used options for gprof are:

   -p : display flat profile
   -P : suppress display of flat profile
   -q : display call graph information
   -Q : suppress display of call graph information
   -l : displays instruction profile at source line level instead of function level (requires debug info)
   -C : display routine names and the execution counts obtained by invocation profiling
   -A : print annotated source code, only annotates with function call count (requires debug info)
   -s : sums all the profile data for the specified input files and writes the results to the file gmon.sum
   -I : search path for source files, used mostly for annotating source (can also use GPROF_PATH environment variable)
   -b : prevents gprof from printing the verbose descriptions of the output
   -d : shows gprof debugging information, can be used to access raw data in gmon files
   -S : A Blue Gene specific option that allows large numbers of gmon.out files to be combined. Sequentially sums
        the gmon.out files in a directory, starting with gmon.out.0 until the next in the sequence is not found.
        The result are placed in gmon.sum.

If invoked without any options, the default gprof behavior is the same as specifying the options '-p –q,' meaning a flat profile and call graph information will be displayed.

Five basic output forms can be produced by gprof:

  1. A routine level flat profile;
  2. A line level flat profile;
  3. Call graph information;,
  4. A routine execution count list; and
  5. Source code annotated with routine call counts.

These output forms are described below and share several common aspects and, in many cases, present the same information in different forms. As described earlier, time estimates are based on the number of executing instruction observed at an interrupt interval that belongs to a routine. By default, the sampling interval is 0.01 seconds, so each observation of an instruction represents 0.01 seconds of execution time in that routine. Therefore, all times reported extend to one-hundredth of a second. The number of observed instructions for a routine is the time reported for the routine divided by the sampling interval of 0.01 seconds

Some routines may execute with an execution time low enough that no executing instructions were observed. These routines will be estimated to have an execution time of zero seconds. If invocation profiling was not enabled for these routines or if the invocation count is zero, these routines will not be listed in the gprof output. If, however, the invocation count is non-zero, the routine will be listed even if the estimated execution time is zero.

Due to the sampling methodology employed, execution times are estimates with large relative error for small execution counts. Conversely, routine invocation counts are exact and will not vary between identical runs of a program. The calculated routine child times are subject to error due to error in the routine time estimates and due to the assumption that all invocations of a routine take the same amount of time.

Routine Level Flat Profile

The flat profile, generated with the '-p' option, presents a list of routines that were estimated to consume time while the program was running or were invoked at least once, sorted by the amount of time the routine consumes. The flat profile provides an easy way to see which routines consume the largest percentage of the program run time and are therefore good candidates for optimization. An example of a flat profile (for the example described in the Invocation Profiling section above) is shown below.

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 50.23      9.98     9.98       10     1.00     1.00  routine2
 24.91     14.93     4.95        1     4.95    13.91  routine1
 14.95     17.90     2.97        2     1.49     1.49  routine3
  9.91     19.87     1.97        1     1.97    19.87  main

The third column in the output above, labeled “self seconds,” shows the time estimated to be spent in the routine. The time reported may be estimated to be zero if no instructions for that routine were observed to be executing. The first column shows the self time as a percentage of the total run time. The second column, labeled “cumulative seconds,” shows an accumulation of the self time in the rows above; therefore, for the last entry it represents the total estimated run time of the program. The fourth column shows the number of times a routine was called, as recorded by the framework’s invocation profiling mechanism. If a routine was compiled without invocation profiling enabled (with '-pg'), this information is not recorded for that routine and the call count will be blank.

The fifth column, “self s/call,“ is simply the self seconds divided by the number of call to the routine and represents the average spent in the routine time per call. The sixth column, “total s/call,” is the self time plus the estimated child time divided by the number of calls to the routine. It represents the total time used per call to a routine, including the time in the routine and in the routines that it calls.

Line Level Flat Profile

The line level profile, generated with the '-l -p' option, presents the same information as the routine level flat profile but with times broken out at the source code line level instead of at the routine level. Generating this profile requires that the routine be compiled with the '-g' option so as to include the debugging information. Generating this profile for large programs can take significant time.

  %   cumulative   self             self     total           
 time   seconds   seconds   calls  Ts/call  Ts/call  name    
 36.49      7.25     7.25                            routine2 (routine2.c:5 @ 1001604)
 16.26     10.48     3.23                            routine1 (routine1.c:9 @ 10014e4)
 13.74     13.21     2.73                            routine2 (routine2.c:6 @ 10015f8)
 10.57     15.31     2.10                            routine3 (routine3.c:5 @ 10016c4)
  8.66     17.03     1.72                            routine1 (routine1.c:10 @ 10014d8)
  7.00     18.42     1.39                            main (gprof_test.c:13 @ 10013bc)
  4.38     19.29     0.87                            routine3 (routine3.c:6 @ 10016b8)
  2.92     19.87     0.58                            main (gprof_test.c:14 @ 10013b0)
  0.00     19.87     0.00      10     0.00     0.00  routine2 (routine2.c:1 @ 10015a0)
  0.00     19.87     0.00       2     0.00     0.00  routine3 (routine3.c:1 @ 1001660)
  0.00     19.87     0.00       1     0.00     0.00  main (gprof_test.c:9 @ 1001360)
  0.00     19.87     0.00       1     0.00     0.00  routine1 (routine1.c:5 @ 1001480)

Call Graph Analysis

The call graph analysis, generated with the -q option, displays information about a routine, its child routines, and its parent routines. Each section of the call graph analysis provides information pertaining to a specific routine.

The routine of interest is indicated by its index number in the first column and indentation from the lists of parent and child routines. The percentage of time listed in the second column indicates the percentage of program runtime accounted for by the routine and its children. The self column indicates times with different meanings depending on whether it is for the primary routine for that section or the routine's parent or child routines. For the primary routine, the time represents the time spent in that routine during the execution of the program. For a parent routine, it represents the amount of the primary routine’s self time being propagated to that parent, based on the percentage of calls to the primary routine made by that parent. For a child routine, it represents the amount of that child routine’s self time being propagated to the primary routine based on percentage of calls made to that routine by the primary routine.

Similarly for the “children” column, the times reported vary depending on the routine context. For the primary routine, it is the total amount of time from child routines propagated to that routine; for child routines, it is the amount of that routine’s child time being propagated to the primary routine; and for parents, it is the amount of the primary routine’s child time being propagated to that parent.

The ”called” columns indicate the number of times the primary routine was called, and the fraction of call made from parents and to children. Allocations of time from child routines to parent routines are based on the fraction of calls to the child routine made by the parent, using the (sometimes incorrect) assumption that each call to a routine takes the same amount of time. If a routine was not compiled with the '-pg' option to enable invocation profiling, its parent routines will be unknown and indicated as <spontaneous>. If no routines are compiled with invocation profiling enabled, no call graph analysis information will be available (call graph analysis output may need to be suppressed with the -Q option in order for gprof to run).

On the Blue Gene, the system libraries and other libraries (such as MPI) have not been compiled with invocation profiling enabled, so parent functions for those routines cannot be determined and the routines parent will be listed as <spontaneous>. The line-by-line option, '-l', may be used with the call graph analysis option, '-q', to present the call graph analysis information broken down on a source line basis.

index % time    self  children    called     name
    1.97   17.90       1/1           generic_start_main [2]
[1]    100.0    1.97   17.90       1         main [1]
                4.95    8.96       1/1           routine1 [3]
                3.99    0.00       4/10          routine2 [4]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00   19.87                 generic_start_main [2]
                1.97   17.90       1/1           main [1]
-----------------------------------------------
                4.95    8.96       1/1           main [1]
[3]     70.0    4.95    8.96       1         routine1 [3]
                5.99    0.00       6/10          routine2 [4]
                2.97    0.00       2/2           routine3 [5]
-----------------------------------------------
                3.99    0.00       4/10          main [1]
                5.99    0.00       6/10          routine1 [3]
[4]     50.2    9.98    0.00      10         routine2 [4]
-----------------------------------------------
                2.97    0.00       2/2           routine1 [3]
[5]     14.9    2.97    0.00       2         routine3 [5]

Routine Execution Count List

The routine execution count list, generated with the -C option, displays a simple list of routines and the execution counts for the routines as recorded by the invocation profiling mechanism. If a routine was not compiled with invocation profiling enabled, it will not appear in the list.

   /gpfs/gprof_tests/src/gprof_test.c:9: (main:0x1001360) 1 executions
   /gpfs/gprof_tests/src/routine1.c:5: (routine1:0x1001480) 1 executions
   /gpfs/gprof_tests/src/routine2.c:1: (routine2:0x10015a0) 10 executions
   /gpfs/gprof_tests/src/routine3.c:1: (routine3:0x1001660) 2 executions

Annotated Source Listing

The annotated source listing, generated with the -A option, displays the program source code annotated with the number of calls made to the routines. This output can only be produced if the program was compiled with the debugging information option, '-g.'

   *** File /gpfs/gprof_tests/src/routine2.c:
          10 -> void routine2(int *x) {
                  *x= *x+2; 
                
                  int a;
                  for(int j=0; j < 60600000; j++) {
                    a++;
                  }
                
                  return;
                  }

Issues in Interpreting Profile Data

Several considerations must be keep in mind when interpreting the information reported by gprof:

  • The times reported by gprof are estimates based on the instructions observed to be executing when the program is interrupted at the end of the sampling interval. A consequence of this sampling methodology is that time estimates may vary from run to run and contain high relative error when based on a small number of observations.
  • The expected error in a time estimate is the square root of the number of observations upon which it is based. For example, for an estimate based on 9 observations, the expected error is approximately 3 and the relative error is 33.3%. For an estimate based on 400 observations, the expected error is 20, or about 5%.
  • Sampling with gprof imposes some overhead on the execution of the program. This overhead is generally small for instruction profling (only linking with -pg) and is generally small, but can in some instances be signficant, for invocation profiling (compiling and linking with -pg). If a significant amount of time is reported in the mcount() (or _mcount or __mcount) routine turning off invocation profiling can reduce the overhead.
  • The time spent in all functions, including all functions called within a library, is reported by gprof. This can make the output hard to interpret as many of the reported functions may be low level library functions that are not otherwise visible user of the library. For example a call to MPI_Bcast() may spend relatively little time in the MPI_Bcast() routine itself and most of its time in supporting function (on the BG/Q these are often PAMI functions). As a result gprof may show time spent in numerous PAMI functions, some of which may have also been called from other MPI routines, and no time directly in MPI_Bcast(). It is therefore hard to determine the total time spent in specific library routines, such as MPI_Bcast(), that are called directly from user code.
  • Routines that execute quickly may register no observations, and therefore have a time estimate of 0.00 seconds, despite having an actual non-zero runtime.
  • Routines compiled without invocation profiling enabled (without the '-pg' option) will not have call counts reported and will not have identified parent routines. Parents will be labeled as <spontaneous>.
  • Routines with no observations and for which invocation profiling was not enabled (compiled without '-pg') will not be reported by gprof despite being called and having a (very small) runtime.
  • When attributing time from a child routine to a parent routine, it is assumed that all calls to that child routine, from the parent routine of interest or from any other caller of that child routine, take the same amount of time. If this assumption is incorrect, significant errors in child time estimates may occur.
  • gprof show only time spent in user code, not in system calls. Therefore, time spent performing IO and other tasks involving the operating system will not be reported. For codes spending significant time system call the total runtime reported by gprof will be noticeably lower than the actual runtime of the code.
  • Compiler optimization can produce errors in time estimates and call counts.
  • Recursive routines require special consideration in how their call graph analysis is generated. See the gprof documentation in the References section for details.

Profiling Concepts

The information presented by the gprof framework is derived from the two basic types of information

it can collect: 1) a record of which portion of the program was observed to executing when the program is interrupted at regular intervals, and 2) a count of how many times a routine is invoked along with a record of the location in the program from which that routine was invoked.

Programs in Memory

To understand the basic information the gprof framework collects, it is helpful to understand how a program looks after it is compiled and loaded into memory. Consider the routine:

   void test_routine() {
     int a=0;

     for(int j=0; j < 100; j++) {
       a++;
     }

     return;
     }

After it is compiled, linked, and loaded into memory, the routine becomes a series of machine instructions generated by the compiler from the source lines above that exist at a series of memory locations. An example of the routine on Blue Gene is shown below:

   00001000 <test_routine>:
     1000:   stwu    r1,-32(r1)
     1004:   li      r0,0  
     1008:   stw     r0,16(r1)
     100c:   li      r0,0  
     1010:   stw     r0,24(r1)
     1014:   cmpwi   r0,100
     1018:   bge-    3c <test_routine+0x3c>
     101c:   lwz     r3,16(r1)
     1020:   addi    r0,r3,1
     1024:   stw     r0,16(r1)
     1028:   lwz     r3,24(r1)
     102c:   addi    r0,r3,1
     1030:   stw     r0,24(r1)
     1034:   cmpwi   r0,100
     1038:   blt+    1c <test_routine+0x1c>
     103c:   b       40 <test_routine+0x40>
     1040:   addi    r1,r1,32
     1044:   blr   

The first line shows that the symbol <test_routine> is associated with the address 00001000. The following lines all contain three columns: 1) the address of the instruction, 2) the machine instruction, and 3) the operands of the instruction. In the simplest cases, when there is no significant re-ordering due to compiler optimization, all of the code in a routine becomes a sequential block of machine instructions that follow the symbol (here <test_routine>) that comes from the name of the routine. Each source code line generally maps to a sequence of machine instructions and each machine instruction maps to a source code line, usually with multiple instructions mapping to the same line of source code. Instruction Profiling

Instruction Profiling

The first set of information is collected when a program, which is linked so as to include the gprof framework, is executed. When the program begins, the framework starts the timer which interrupts the execution of the program at a fixed interval. Typically, this interval or sampling frequency is every 10 milli-seconds, or every 0.01 seconds. At each interrupt, the address at which the program is executing is noted and recorded in an array in memory. At the end of the program execution, this array contains a count of the number of times the various instruction addresses were observed executing at the time of the interrupt, and the array is written to a file named gmon.out.

Using this instruction count information, it is possible to estimate the amount of time a program spent in a given routine. From the example above, any instructions in the address range 1000 to 1044 observed to be executing would be attributable to the routine test_routine(). If a program containing the routine test_routine() ran for 10 seconds, approximately 1000 interrupts would occur during its execution and at each interrupt an execution address is recorded. If 200 of the 1000 observed addresses were attributable to the routine test_routine(), it can be estimated that 20% of the program runtime, or 2 seconds, were spent in the test_routine() routine. This instruction address sampling method has the advantage of limited impact upon the program’s runtime, and minimal side effects.

Conversely, it has the disadvantage of being inexact: the instruction addresses observed depend on the exact timing of the interrupt and two different runs of the same program can produce different observed address counts. Following the earlier example, the first run of the program may show 200 observed instruction addresses attributable to test_routine(), a second execution of the program may show 195 observations in test_routine(), and a third execution may show 205. This variation is inherent in the sampling methodology; the error in the estimated time is inversely proportional to the number of observations. For large observation counts, the relative error will be small. There is only a 2.5% difference between 200 and 205 observations. For small observation counts, relative error can be quite significant: there is a 50% difference between 2 and 3 observations for a routine.

Invocation Profiling

The second set of information, routine invocation counts and invoking locations, is collected when a program is executed. In addition to being linked to the gprof framework, routine invocation counts and invoking locations have routines compiled with compiler flags enabling routine invocation profiling. When a routine is compiled with routine invocation profiling enabled, the compiler inserts a call to a profiling library routine at the start of the routine. On the Blue Gene this routine is named _mcount(). An example of the earlier routine test_routine(), compiled with the routine invocation profiling enabled, the result is shown below:

00001000 <test_routine>:
  1000:   mflr    r0
  1004:   stw     r0,4(r1)
  1008:   lis     r12,0
  100c:   addi    r0,r12,0
  1010:   bl      <_mcount>          <- call to _mcount()
  1014:   stwu    r1,-32(r1)
  1018:   li      r0,0
  101c:   stw     r0,16(r1)
  1020:   li      r0,0
  1024:   stw     r0,24(r1)
  1028:   cmpwi   r0,100
  102c:   bge-    50 <test_routine+0x50>
  1030:   lwz     r3,16(r1)
  1034:   addi    r0,r3,1
  1038:   stw     r0,16(r1)
  103c:   lwz     r3,24(r1)
  1040:   addi    r0,r3,1
  1044:   stw     r0,24(r1)
  1048:   cmpwi   r0,100
  104c:   blt+    30 <test_routine+0x30>
  1050:   b       54 <test_routine+0x54>
  1054:   lwz     r0,36(r1)
  1058:   mtlr    r0
  105c:   addi    r1,r1,32
  1060:   blr

This result can be compared with the earlier version of the routine where invocation profiling was not used. The major difference is the instruction at the beginning is to set up for the call to _mcount() and the instruction "bl <_mcount>," which invokes the _mcount() routine (two additional instructions are also present at the end of the routine as a side effect of the call to _mcount()).

When the profiling library routine _mcount() is called, it unwinds the call stack and records the address from which it was called (in the above case, the address would be 1010). It also records the address from which the calling routine was invoked (the location from which test_routine() was called). The record of the two addresses is kept in memory. If the same two addresses are seen for multiple invocations of _mcount(), a counter for that pair of addresses is incremented. Unlike the counts obtained from instruction profiling, the counts from invocation profiling are not subject to error and will not vary from one execution of the same program to the next. The address pair count information is written to the gmon.out file at the end of the program execution.

As was the case with instruction profiling, the addresses recorded can be associated with program routines. The address pairs then resolves to pairs of routines that have a caller/callee relationship and represent links in a call graph. A call graph is a graph in which routines are nodes and a call between two routines is represented as a link between the nodes. An example call graph is shown below, where main() calls routine1() and routine2(), routine1() calls routine3() and also calls routine2().

             main()
             |  |
             |  |
             |  -> routine1()
             |       |  |
            \|/      |  |-> routine3()
        routine2() <-|

The collected address pairs and count information provides information on which links were active during the execution of a program and how many times a link was traversed. Using this information, the gprof framework can provide an estimate of the time spent not just in a routine, but also the cumulative time spent in all of the child routines invoked by that routine. For the call graph above, consider the example where the time spent in routines main(), routine1(), routine2(), and routine3() is estimated to be 2, 5, 10, and 3 seconds, respectively, from the instruction profile information.

Further consider that main() invokes routine1() 1 time and routine2() 4 times, and routine1() invokes routine2() six times and routine3() two times. The cumulative time spent in the child routines of routine1() is the sum of all of the time spent in routine3(), since it is only invoked by routine1(), and some fraction of the total time consumed by routine2() since routine2() is called by both main() and routine1().

The time attributable to routine1() is estimated based on the total fraction of calls to routine2() that came from routine1(). In this example, 6 of the 10 calls to routine2() originated in routine1() and therefore 6 of the 10 seconds consumed by routine2() are attributed to calls from routine1(). The child time for routine1() is then the 6 seconds from routine2() plus the 3 seconds from routine3() for a total of 9 seconds of child time. The self time for routine1() is the 5 seconds the routine itself consumes. The total time associated with the call to routine1() is then the self time plus the child time, or a total of 14 seconds.

Conclusion

Profiling with the gprof framework provides a low-impact method for gathering information about the execution of a program. Estimates of the time spent in each program routine are provided by instruction profiling. Invocation profiling collects information that can be used to determine the number of times a routine was invoked, which links in the call graph were active, how many times each link was activated, and an estimate of the time spent in child routines attributable to the parent routine. In order to have minimal impact, basic execution address counts and call address pair counts are collected during program execution and written to the gmon.out file at the end of program execution. Since the collected information consists entirely of counts of a finite set of items (instructions addresses and call graph links), the amount of profile information does not increase with the runtime of the program. All analysis, such as mapping addresses to routines and assembly of the call graph, is done after program execution using the post-processing utility gprof.