bgq_stack on BG/Q

Location

The bgq_stack script is located at:

   /soft/debuggers/scripts/bin/bgq_stack

Alternatively, the SoftEnv key +utility_paths (included in your default environment when @default is in your ~/.soft file) lets you invoke it directly:

   bgq_stack

List the possible options using -h:

   > bgq_stack -h

Using bgq_stack on BG/Q to decode core files

When a Blue Gene/Q program terminates abnormally, the system generates multiple core files. These are plain text files that can be viewed with a text editor such as vi. Most of the detailed information in a core file is not of immediate use for determining why a program failed, but the file does contain a function call stack record that can help identify which line of which routine was executing when the error occurred. The call stack record is at the end of the file, bracketed by +++STACK and ---STACK:

   +++STACK
   Frame Address     Saved Link Reg
   0000001dbfff8c00  000000000138cd18
   0000001dbfff8dc0  0000000001057528
   0000001dbfffa780  000000000101cc1c
   0000001dbfffb860  0000000001004670
   0000001dbfffb980  0000000001000460
   0000001dbfffbaa0  00000000013844a8
   0000001dbfffbd80  00000000013847a4
   0000001dbfffbe40  0000000000000000
   ---STACK 
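As an illustration of the record layout shown above, the following is a minimal Python sketch that pulls the Saved Link Reg column out of a core file's text. The function name saved_link_registers and the regular expression are hypothetical, not part of any ALCF tool. The sketch assumes each stack line holds exactly two 16-digit hexadecimal fields, and it stops at the first ---STACK marker it finds.

```python
import re

# One stack line: two 16-digit hexadecimal fields (frame address, saved link register).
# This layout is an assumption based on the record shown above.
FRAME_RE = re.compile(r"^\s*([0-9a-fA-F]{16})\s+([0-9a-fA-F]{16})\s*$")


def saved_link_registers(core_text):
    """Return the Saved Link Reg values found between +++STACK and ---STACK."""
    addresses = []
    in_stack = False
    for line in core_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("+++STACK"):
            in_stack = True
            continue
        if stripped.startswith("---STACK"):
            break  # only the first stack record is read
        if in_stack:
            match = FRAME_RE.match(line)
            if match:  # skips the "Frame Address  Saved Link Reg" header line
                addresses.append(match.group(2))
    return addresses
```

The returned strings are the raw instruction addresses in the order they appear in the record. They still need to be translated to source locations.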

The call stack contains a list of instruction addresses. For these to be useful, the addresses need to be translated back to a source file and line. This may be done with the bgq_stack utility:

   > bgq_stack [corefile] 

Because the lightweight core files produced by the runtime system do not contain symbolic information in the backtrace (that is, the information needed to map instruction addresses to source files and line numbers), debugging information from the executable is also needed. The name of the executable is read from the core file. If the executable was compiled with -g (to include symbolic debugging information), bgq_stack can provide the most detailed information about where the program failed. Note that if the executable is recompiled or altered in any way after the core file was produced, the output of bgq_stack may be incorrect.

For each instruction address, bgq_stack displays the name of the subroutine or function that contains it, followed by the source file and line number. Here is an example of the output:

   > bgq_stack core.0 
   ------------------------------------------------------------------------
   Program   : /gpfs/vesta-home/<username>/a.out
   ------------------------------------------------------------------------
   +++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN 

   000000000142c398
   abort
   /bgsys/drivers/V1R2M0/ppc64/toolchain/gnu/glibc-2.12.2/stdlib/abort.c:77

   0000000001033c80
   MPID_Abort
   /bgsys/source/srcV1R2M0.26630/comm/lib/dev/mpich2/src/mpid/pamid/src/misc/mpid_abort.c:98

   000000000101f540
   MPIR_Err_return_comm
   /bgsys/source/srcV1R2M0.26630/comm/lib/dev/mpich2/src/mpi/errhan/errutil.c:486

   0000000001003e0c
   PMPI_Bcast
   /bgsys/source/srcV1R2M0.26630/comm/lib/dev/mpich2/src/mpi/coll/bcast.c:1530

   0000000001000460
   main
   /gpfs/vesta-home/<username>/example.c:19

   0000000001423b28
   generic_start_main
   /bgsys/drivers/V1R2M0/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226

   0000000001423e24
   __libc_start_main
   /bgsys/drivers/V1R2M0/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194

   0000000000000000
   ??
   ??:0
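When the output is long, it can help to pick out the first frame that belongs to your own source tree. The following is a rough Python sketch for doing that. It assumes the output always follows the three-line pattern shown above (address, function name, file:line). The names Frame, parse_frames, and first_frame_under are hypothetical helpers introduced only for this illustration.

```python
import re
from typing import List, NamedTuple, Optional

# A frame starts with a line holding a single 16-digit hexadecimal address
# (assumed from the sample output above).
ADDRESS_RE = re.compile(r"^[0-9a-fA-F]{16}$")


class Frame(NamedTuple):
    address: str
    function: str
    location: str  # "file:line" exactly as printed


def parse_frames(output: str) -> List[Frame]:
    """Collect (address, function, location) triples from bgq_stack output."""
    lines = [line.strip() for line in output.splitlines()]
    frames = []
    i = 0
    while i + 2 < len(lines):
        if ADDRESS_RE.match(lines[i]) and lines[i + 1] and lines[i + 2]:
            frames.append(Frame(lines[i], lines[i + 1], lines[i + 2]))
            i += 3
        else:
            i += 1  # header, separator, or blank line
    return frames


def first_frame_under(frames: List[Frame], source_prefix: str) -> Optional[Frame]:
    """Return the first frame whose source path starts with source_prefix."""
    for frame in frames:
        if frame.location.startswith(source_prefix):
            return frame
    return None
```

For the sample output above, searching with the prefix of the user's home directory would select the main frame at example.c:19, the call site in user code that led to the abort.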

It is also possible to look up the information provided by bgq_stack using the standard Linux utility addr2line. For a large set of core files, consider using the Coreprocessor tool.
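As a sketch of the addr2line approach, the Python snippet below builds and runs an addr2line command for a list of addresses taken from a core file. The helper names addr2line_command and resolve are hypothetical. The sketch assumes that the addr2line found on your PATH can read the BG/Q executable; if it cannot, the toolchain's own addr2line may be needed. The -e option names the executable and -f prints the function name before each file:line pair. addr2line reads addresses as hexadecimal, so the values from the core file are passed unchanged.

```python
import subprocess


def addr2line_command(executable, addresses):
    """Build the addr2line argument list for hexadecimal address strings."""
    # -f: print the function name; -e: executable that holds the debug information.
    return ["addr2line", "-f", "-e", executable] + list(addresses)


def resolve(executable, addresses):
    """Run addr2line and return its text output (function and file:line per address)."""
    result = subprocess.run(
        addr2line_command(executable, addresses),
        capture_output=True,  # requires Python 3.7 or later
        text=True,
        check=True,  # raise CalledProcessError if addr2line fails
    )
    return result.stdout
```

For example, resolve("a.out", ["0000000001000460"]) runs addr2line on the address that the sample output above attributes to main. As with bgq_stack, the result is only meaningful if the executable is the same build that produced the core file and was compiled with -g.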