Coreprocessor on BG/Q

Coreprocessor is a basic parallel debugging tool that can be used to debug problems at all levels (hardware, kernel, and application). It is particularly useful when working with a large set of core files since it reveals where processors aborted, grouping them together automatically (for example, 9 died here, 500 were here, etc.). See the instructions below for using the Coreprocessor tool.

References

The Coreprocessor tool (IBM System Blue Gene Solution: Blue Gene/Q System Administration, Chapter 22)

Location

The coreprocessor.pl script is located at:

   /bgsys/drivers/ppcfloor/coreprocessor/bin/coreprocessor.pl

Alternatively, the SoftEnv key +utility_paths (included in your default environment when @default is in your ~/.soft file) lets you invoke the script directly:

   coreprocessor.pl

Using Coreprocessor on BG/Q

Coreprocessor has two basic modes of use. The first connects to a running job and collects basic debug output; the second analyzes the lightweight core files that the system generates when a job fails. See Core File Settings for how to control core file generation.

Connecting to a Running Job

The coreprocessor.pl command has a reasonable help summary that describes its capabilities. Take a look to see what you can do.

> coreprocessor.pl -h

In order to connect to a running job you will need the Blue Gene job id. This job id is different from the Cobalt (scheduler) job id you normally work with. This Blue Gene job id can be found in the standard error log.

> grep "ibm.runjob.client.Job: job" <name_of_error_file>.error
2013-11-21 22:07:10.133 (INFO ) [0x40000a3b8f0] VST-22460-33771-64:667070:ibm.runjob.client.Job: job 667070 started

The number listed between 'job' and 'started' is what you want, 667070 in this case.
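The lookup above can also be scripted. The following is a minimal sketch in Python, assuming the error log contains a line in the same format as the sample shown above; the function name find_bg_job_ids is hypothetical and not part of Coreprocessor or Cobalt.

```python
import re

# Pattern modeled on the sample log line above: the Blue Gene job id is the
# number between "job" and "started" in the ibm.runjob.client.Job message.
JOB_STARTED = re.compile(r"ibm\.runjob\.client\.Job: job (\d+) started")

def find_bg_job_ids(log_text):
    """Return every Blue Gene job id announced as started in the log text."""
    return [int(m.group(1)) for m in JOB_STARTED.finditer(log_text)]

# The sample line from the standard error log shown above.
sample = ("2013-11-21 22:07:10.133 (INFO ) [0x40000a3b8f0] "
          "VST-22460-33771-64:667070:ibm.runjob.client.Job: job 667070 started")
print(find_bg_job_ids(sample))
```

To use it on a real log, read the contents of your <name_of_error_file>.error file and pass the text to the function.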

The following is an example of how to check a running job to see where some of the MPI ranks are during execution. This is useful if you suspect the job is hanging.

> coreprocessor.pl -b=test.x -nogui -mode=Survey -j=667070

The '-b=<binary name>' argument specifies the binary that is running. If you specify it, the stack trace shows function names rather than addresses. The '-nogui' option prints the output to the screen rather than opening an X session. The '-mode=Survey' option samples a subset of the MPI ranks and prints the stack trace information in groups, each annotated with the number of ranks at that position. The '-j=<bg_jobid>' argument takes the Blue Gene job id of the running job to connect to.
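If you script this check, the argument list can be assembled as follows. This is only a sketch in Python: build_survey_command is a hypothetical helper, it uses only the options described above, and it assumes coreprocessor.pl is on your path.

```python
def build_survey_command(binary, bg_job_id, script="coreprocessor.pl"):
    """Assemble the argument list for a non-GUI Survey of a running job."""
    return [
        script,
        f"-b={binary}",      # resolve addresses to function names
        "-nogui",            # print to the screen instead of opening an X session
        "-mode=Survey",      # group ranks by stack trace position
        f"-j={bg_job_id}",   # Blue Gene job id, not the Cobalt job id
    ]

cmd = build_survey_command("test.x", 667070)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run on a login node; the sketch itself only builds the command and does not launch anything.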

Sample output

> coreprocessor.pl -b=test.x -nogui -mode=Survey -j=667076
driver path: /bgsys/drivers/V1R2M1/ppc64
command: /bgsys/drivers/V1R2M1/ppc64/coreprocessor/bin/sdebug --id=667076 --tool=/bgsys/drivers/V1R2M1/ppc64/coreprocessor/bin/sdebug_proxy 2>&1
line: Launching tool on job 667076
line: 2013-11-21 22:15:11.967 (INFO ) [0xfffa02f8ae0] ibm.runjob.AbstractOptions: using properties file /bgsys/local/etc/bg.properties
line: 2013-11-21 22:15:11.967 (INFO ) [0xfffa02f8ae0] ibm.runjob.AbstractOptions: max open file descriptors: 65536
line: 2013-11-21 22:15:11.968 (INFO ) [0xfffa02f8ae0] ibm.runjob.AbstractOptions: core file limit: 18446744073709551615
line: 2013-11-21 22:15:11.968 (INFO ) [0xfffa02f8ae0] ibm.runjob.commands.Options: BG/Q dump_proctable V1R2M1 (revision 60705) Jul 17 2013 18:03:32 pid 45350
line: 2013-11-21 22:15:11.968 (INFO ) [0xfffa02f8ae0] ibm.runjob.commands.Options: vestalac1
line: Number of IO nodes = 3
line: 2013-11-21 22:15:12.071 (INFO ) [0xfff823d9090] ibm.runjob.AbstractOptions: using properties file /bgsys/local/etc/bg.properties
line: 2013-11-21 22:15:12.072 (INFO ) [0xfff823d9090] ibm.runjob.AbstractOptions: max open file descriptors: 65536
line: 2013-11-21 22:15:12.072 (INFO ) [0xfff823d9090] ibm.runjob.AbstractOptions: core file limit: 18446744073709551615
line: tool 4 started on 3 I/O nodes for 1024 ranks.
line: Connecting to ionode 172.25.71.1
line: Connecting to ionode 172.25.72.1
line: Connecting to ionode 172.25.73.1
line: COMMAND:
Reading disassembly for test.x
touchlocation
setstate(halt)
command: halt 0
touchiar
command: iar 0
touchlocation
setstate(halt)
command: halt 0
touchstackdata
command: stacks 0
0 :Node (1024)
1 :    0000000000000000 (1024)
2 :        .__libc_start_main (1024)
3 :            .generic_start_main (1024)
4 :                .main (1024)
5 :                    .__sleep (1024)
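The grouped stack listing at the end of the output is regular enough to post-process. The sketch below, in Python, assumes each frame line has the form shown above (an index, " :", indentation of four spaces per level, the frame name, and the rank count in parentheses). That layout is inferred from this one sample, and parse_stack_groups is a hypothetical helper, not part of Coreprocessor.

```python
import re

# One grouped frame: "<index> :<indent><frame name> (<rank count>)".
FRAME = re.compile(r"^(\d+) :(\s*)(\S.*?) \((\d+)\)$")

def parse_stack_groups(output, indent_width=4):
    """Return (depth, frame name, rank count) for each grouped frame line."""
    frames = []
    for line in output.splitlines():
        match = FRAME.match(line.rstrip())
        if match:  # log and command lines do not match and are skipped
            depth = len(match.group(2)) // indent_width
            frames.append((depth, match.group(3), int(match.group(4))))
    return frames

# The tail of the sample output shown above.
sample = "\n".join([
    "command: stacks 0",
    "0 :Node (1024)",
    "1 :    0000000000000000 (1024)",
    "2 :        .__libc_start_main (1024)",
    "3 :            .generic_start_main (1024)",
    "4 :                .main (1024)",
    "5 :                    .__sleep (1024)",
])

for depth, frame, ranks in parse_stack_groups(sample):
    print(depth, frame, ranks)
```

In this sample the deepest frame is .__sleep with all 1024 ranks, which is the kind of pattern to look for when a job appears to hang.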

Analyzing Core Files

  1. Make sure you have an X server running on your laptop or workstation and your DISPLAY set. 
  2. Connect to an ALCF login node via ssh -X to tunnel your DISPLAY.
  3. Go to the directory with the core files.
  4. Run coreprocessor.pl to initiate the session:  
     
       > coreprocessor.pl

    Note: Coreprocessor does not require parameters to start, but you can list the available options with -h:

       > coreprocessor.pl -h
  5. Once you have the Coreprocessor GUI started in a session, select: File-> Load Core
     
  6. A new window opens. Provide the following information, e.g.:

       Path to Cores: .

       CNK Binary: a.out

    Then, click Load Cores

  7. The two previous steps could also be completed with Coreprocessor options as follows:  
     
       > coreprocessor.pl -c=. -b=a.out
    
  8. Back at the main window, click Select Grouping Mode.
    Then, choose one of the stack traceback options, e.g., Stack Traceback (detailed).
     
  9. The main window shows a stack listing with a branch for each group of processors and the number of processors at that point. To obtain the most detailed information, the executable needs to have been compiled with the debug option -g. If it was, clicking any line shows the source line number information and the list of core files at that point.
     
  10. If your Internet connection is slow, consider displaying the Coreprocessor X Window client in a Virtual Network Computing (VNC) session. For more information, read Using VNC with a debugger.
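Steps 1 and 7 above can be combined into a small pre-flight check before launching the GUI. The following Python sketch is illustrative only: build_core_analysis_command is a hypothetical helper, the DISPLAY value in the example is made up, and the only Coreprocessor options used are -c and -b from step 7.

```python
import os

def build_core_analysis_command(core_dir, binary, environ=None):
    """Check that DISPLAY is set, then assemble the core-file analysis command."""
    environ = os.environ if environ is None else environ
    if not environ.get("DISPLAY"):
        # The GUI needs an X display; see steps 1 and 2.
        raise RuntimeError("DISPLAY is not set; connect with ssh -X first")
    return ["coreprocessor.pl", f"-c={core_dir}", f"-b={binary}"]

# Example with an explicit, made-up environment so the sketch runs anywhere.
cmd = build_core_analysis_command(".", "a.out", environ={"DISPLAY": "localhost:10.0"})
print(" ".join(cmd))
```

Called without the environ argument, the helper reads the real environment, so it fails early when the X tunnel is missing instead of after the script has started.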