6. Debugging and Tracing

If the difficulty is that the application fails to start in the distributed memory model (using topcc --mpi), then read 5.4.2 If Slaves Fail to Start, for some debugging techniques. Note also that TOP-C ignores SIGPIPE. This is because TOP-C employs the SO_KEEPALIVE option, and the master process would otherwise die if a slave process were to die. SO_KEEPALIVE is needed for robustness when slave processes execute long tasks without communicating with the master process. The rest of this section assumes that the application starts up correctly.
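The combination described above (keep-alive probes on the socket, with SIGPIPE ignored) can be sketched in plain POSIX C. This is only an illustration of the two system facilities, not the TOP-C source; the helper names ignore_sigpipe, enable_keepalive and keepalive_enabled are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Ignore SIGPIPE so that writing to a dead peer fails with an error
 * return instead of terminating the process. Returns 0 on success. */
int ignore_sigpipe(void)
{
    return signal(SIGPIPE, SIG_IGN) == SIG_ERR ? -1 : 0;
}

/* Turn on SO_KEEPALIVE for an existing socket descriptor.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
int enable_keepalive(int fd)
{
    int on = 1;
    assert(fd >= 0);
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}

/* Report whether SO_KEEPALIVE is set: 1 if set, 0 if not, -1 on error. */
int keepalive_enabled(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);
    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &value, &len) != 0)
        return -1;
    return value != 0;
}
```

A process that calls ignore_sigpipe() once at startup and enable_keepalive() on each connection learns about a dead peer through a failed read or write rather than through a fatal signal.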

6.1 Debugging by Limiting the Parallelism  
6.2 Debugging with `--TOPC-safety'  
6.3 TOP-C and POSIX signals  
6.4 Tracing Messages  
6.5 Stepping Through a Slave Process with `gdb'  
6.6 Segmentation faults and other memory problems  


6.1 Debugging by Limiting the Parallelism

First, compile and link your code using topcc --seq --TOPC-safety=20 -g -O0, and make sure that your application works correctly sequentially. Only after you have confidence in the correctness of the sequential code should you begin to debug the parallel version.

If the application works correctly in sequential mode, one should next debug in the context of a single slave. In this case, the code is "almost" sequential. It is convenient to declare the remote slave to be localhost in the `procgroup' file, in order to minimize network delays and so as not to disturb users of other machines. An easy way to run with a single slave is:
 
  topcc --mpi --TOPC-num-slaves=1 -g -O0

Next, one should test on two slaves, and finally all possible slaves.


6.2 Debugging with `--TOPC-safety'

The command-line option `--TOPC-safety=val' provides assistance for debugging parallel programs. At higher values of val, optimizations that do not change the correctness of the program are converted to safer equivalents. A good strategy is to test whether `--TOPC-safety=20' causes the bug to go away, and if so, to progressively lower val toward zero until the bug reappears. The value at which the bug reappears indicates which `TOP-C' optimization feature is not being used correctly. If the bug still exists at `--TOPC-safety=20', one should next try compiling with the `--seq' flag and use a debugger to debug the sequential code.

The effects due to different safety levels are subject to change. To see the current effects, invoke any TOP-C application as follows
 
  ./a.out --TOPC-help --TOPC-verbose
and you will see something like:
 
  safety: >=0: all; >=2: no TOP-C memory mgr (uses malloc/free);
  >=4: no TOPC_MSG_PTR; >=8: no aggreg.;
  >=12: no TOPC_abort_tasks; >=14: no receive thread on slave;
  >=16: default atomic read/write for DoTask, UpdateSharedData;
   =19: only 2 slaves; >=20: only 1 slave
  (AGGREGATION NOT YET IMPLEMENTED)

Values of 2 or higher (the threshold shown in the sample output above) imply that `TOP-C' will use malloc instead of doing its own memory allocation (which is optimized for `TOP-C' memory patterns). Values of 4 or higher cause TOPC_MSG_PTR() to act as if TOPC_MSG() had been called instead. Values of 12 or higher cause TOPC_abort_tasks() to have no effect. Values of 14 or higher imply that a single thread in the slave process must both receive messages and execute DoTask(). Normally, `TOP-C' arranges to overlap communication and computation on the slave by setting up a separate thread to receive and store messages from the master. Values of 16 or higher imply that all of DoTask acts as if a read lock were placed around it, and all of UpdateSharedData as if a write lock were placed around it. (This has an effect only in the shared memory model where calls to TOPC_ATOMIC_READ/WRITE are ignored.) At values of 19 and 20, the number of slaves is reduced to 2 and to 1, respectively, regardless of the setting of `--TOPC-num-slaves' and the specification in a `procgroup' file.
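The strategy of lowering the safety value until the bug reappears amounts to a table lookup. The sketch below encodes the thresholds from the sample output above, which are subject to change between `TOP-C' versions, and reports which restriction was lifted between a level where the bug appears and a higher level where it does not. The type safety_rule and the function suspect_rule are hypothetical names, not part of `TOP-C', and the slave-count levels 19 and 20 are left out.

```c
#include <assert.h>
#include <stddef.h>

/* One line of the sample --TOPC-help output: the restriction listed
 * there takes effect when the safety value is >= threshold. */
struct safety_rule {
    int threshold;
    const char *effect;
};

/* Thresholds copied from the sample output; they may differ in your version. */
static const struct safety_rule SAFETY_RULES[] = {
    {  2, "no TOP-C memory mgr (uses malloc/free)" },
    {  4, "no TOPC_MSG_PTR" },
    {  8, "no aggreg." },
    { 12, "no TOPC_abort_tasks" },
    { 14, "no receive thread on slave" },
    { 16, "default atomic read/write for DoTask, UpdateSharedData" },
};

/* Given a level at which the bug appears and a higher level at which it
 * does not, return the first rule whose threshold lies in the interval
 * (bug_level, clean_level], or NULL if no listed rule falls there. */
const struct safety_rule *suspect_rule(int bug_level, int clean_level)
{
    size_t i;
    size_t n = sizeof(SAFETY_RULES) / sizeof(SAFETY_RULES[0]);

    assert(bug_level >= 0 && bug_level < clean_level);
    for (i = 0; i < n; i++)
        if (SAFETY_RULES[i].threshold > bug_level &&
            SAFETY_RULES[i].threshold <= clean_level)
            return &SAFETY_RULES[i];
    return NULL;
}
```

For instance, if the bug is absent at `--TOPC-safety=4' but present at `--TOPC-safety=3', then suspect_rule(3, 4) points at the TOPC_MSG_PTR entry, suggesting that the use of TOPC_MSG_PTR() deserves a closer look.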


6.3 TOP-C and POSIX signals

If an application handles its own signals, this can create a clash with TOP-C. In the distributed memory model (--mpi), `TOP-C' will create its own signal handler for SIGALRM. This is used in conjunction with alarm() to eventually kill runaway slave processes. In addition, if using `MPINU', the built-in MPI subset, `TOP-C' will create its own handler for SIGPIPE. This allows the master process to detect dead sockets, which indicate dead slaves. Finally, for short periods, `MPINU' will disable the use of SIGINT around calls to select(). Nevertheless, if a SIGINT is sent during this period, TOP-C will pass the signal on to the original SIGINT handler of the application.

`TOP-C' does not modify signal handlers in the sequential (--seq) or shared memory (--pthread) models. Furthermore, if a different MPI (other than MPINU) is used with TOP-C, TOP-C will only handle SIGALRM. However, the other MPI may handle signals itself. See section C. Using a Different `MPI' with TOP-C.
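The idea of a library handler that passes a signal on to the application's original handler can be sketched with sigaction(). This is a generic POSIX illustration under the assumption of simple (non-SA_SIGINFO) handlers, not the `TOP-C' or `MPINU' implementation; all function and variable names below are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Disposition that was in effect before install_chained_handler() ran. */
static struct sigaction previous_action;

/* Flags set by the handlers so that the caller can observe what ran. */
volatile sig_atomic_t library_seen = 0;
volatile sig_atomic_t app_seen = 0;

/* Stand-in for the application's own handler. */
void app_handler(int sig)
{
    (void)sig;
    app_seen = 1;
}

/* Stand-in for a library handler: note the signal, then forward it to
 * the handler that was installed earlier, if that was a real function. */
static void library_handler(int sig)
{
    library_seen = 1;
    if (!(previous_action.sa_flags & SA_SIGINFO) &&
        previous_action.sa_handler != SIG_DFL &&
        previous_action.sa_handler != SIG_IGN)
        previous_action.sa_handler(sig);
}

/* Install fn for sig, optionally saving the old disposition in *old. */
static int set_handler(int sig, void (*fn)(int), struct sigaction *old)
{
    struct sigaction sa;
    assert(sig > 0);
    sa.sa_handler = fn;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(sig, &sa, old);
}

int install_app_handler(int sig)
{
    return set_handler(sig, app_handler, NULL);
}

int install_chained_handler(int sig)
{
    return set_handler(sig, library_handler, &previous_action);
}

/* Put back whatever was installed before install_chained_handler(). */
int restore_previous_handler(int sig)
{
    return sigaction(sig, &previous_action, NULL);
}
```

An application that installs its own handlers can use the same sigaction(sig, NULL, &old) query at any point to check whether a handler other than its own is currently in place.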


6.4 Tracing Messages

If a bug appears as one moves to greater parallelism, one should trace messages between master and slaves (for any number of slaves). Tracing is the default, and it can also be requested explicitly on the command line with:
 
  ./a.out --TOPC-trace=2 args
The variable TOPC_OPT_trace can be set in the code to dynamically turn tracing on (1 or 2) and off (0) during a single run. A trace value of 2 causes `TOP-C' to invoke the application-defined trace functions pointed to by TOPC_OPT_trace_input/result. If the application has not defined trace functions, or if TOPC_OPT_trace is 1, then the `TOP-C' default trace functions are invoked. All message traces are displayed by the master at the time that the master sends or receives the corresponding message.

Variable: void (*)(void *input) TOPC_OPT_trace_input
Variable: void (*)(void *input, void *output) TOPC_OPT_trace_result
Global pointer (default is NULL) to a function returning void. The user can set it to his or her own trace function to print out data-specific tracing information in addition to the generic message tracing of TOPC_trace.
 
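EXAMPLE:  if you pass integers via TOPC_MSG(), define a trace
  function and assign it to TOPC_OPT_trace_input:
         #include <stdio.h>
         #include "topc.h"
         /* Signature must match void (*)(void *input) */
         void mytrace_input( void *input ) {
           printf("%d", *(int *)input);
         }
         /* Inside a function body, e.g. in main(): */
         TOPC_OPT_trace_input = mytrace_input;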

Note that the term `result' in TOPC_OPT_trace_result refers to an `(input, output)' pair.
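The `(input, output)' pairing can be illustrated with a sketch of a result trace function. It assumes that both the task input and the task output are single ints. The names mytrace_result and format_result are hypothetical and are not part of `TOP-C'.

```c
#include <assert.h>
#include <stdio.h>

/* Format an (input, output) pair of ints as "in -> out" into buf.
 * Returns what snprintf returns: the length of the formatted text. */
int format_result(char *buf, size_t size, const void *input, const void *output)
{
    assert(buf != NULL && input != NULL && output != NULL);
    return snprintf(buf, size, "%d -> %d",
                    *(const int *)input, *(const int *)output);
}

/* Trace function with the signature void (*)(void *input, void *output),
 * matching the declared type of TOPC_OPT_trace_result. */
void mytrace_result(void *input, void *output)
{
    char buf[64];
    format_result(buf, sizeof(buf), input, output);
    fputs(buf, stdout);
}
```

In a `TOP-C' application, one would presumably register it from inside a function with TOPC_OPT_trace_result = mytrace_result; in the same way as for TOPC_OPT_trace_input.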


6.5 Stepping Through a Slave Process with `gdb'

If you find the master hanging, waiting for a slave message, then the probable cause is that DoTask() is doing something bad (hanging, infinite loop, bus/segmentation error, etc.). First try to isolate the bug using a symbolic debugger (e.g. `gdb') and the sequential memory model. If your intended application is the shared memory model, you can also use `gdb' to set a breakpoint in your `DoTask' routine or at the `TOP-C' invocation, do_task_wrapper.

If the bug only appears in the distributed memory model, you can still symbolically debug DoTask() using `gdb' (the GNU C debugger) and its attach command (see section `Attach' in The GNU debugger), which allows you to attach to and debug a separate running process. This lets you debug a running slave, if it is running on the same processor. For this strategy, you will want the slave to delay executing to give you time to execute gdb and attach on the remote host or remote thread. The command line option `--TOPC-slave-wait=30' will force the slave to wait 30 seconds before processing.

In applying this debugging strategy to an application `./a.out', one might see:
 
  [ Execute ./a.out in one window for master process ]
  gdb ./a.out
  (gdb) run --TOPC-trace=1 --TOPC-safety=19 --TOPC-slave-wait=30 args

  [ In a second window for a slave process on a different host, now type: ]
  ps
    ...
    1492  p4 S    0:00 a.out args localhost 6262 -p4amslave
  gdb a.out
  ...
  (gdb) break do_task_wrapper
    Breakpoint 1 at 0x80492ab: file ...
    [ `break slave_loop' is also useful.  This calls do_task_wrapper ]
  (gdb) attach 1492
    Attaching to program `a.out', process 1492
    0x40075d88 in sigsuspend ()
  [ After 30 sec's, traced messages in master window appear, ]
  [ for slave, type: ]
  (gdb) continue
    Continuing.
    Breakpoint 1, DoTask (input=0x805dc50) at ...

  [ Continue stepping through master and slave processes in 2 windows ]

If you try to attach to a second slave process after attaching to a first slave process, `gdb' will offer to kill your first slave process. To avoid this situation, remember to execute detach before attaching to a second slave process.
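The delay that gives you time to attach can also be sketched at the application level. The helper below is a generic illustration of the idea behind `--TOPC-slave-wait', not how `TOP-C' implements that option; the name wait_for_debugger is hypothetical. It prints the process id, so that you know which process to give to the attach command, and then pauses.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Print this process's id and pause for the given number of seconds so
 * that a debugger can be attached. Returns the process id. */
pid_t wait_for_debugger(unsigned int seconds)
{
    pid_t pid = getpid();
    unsigned int left = seconds;

    assert(pid > 0);
    fprintf(stderr, "pid %ld: waiting %u seconds for debugger attach\n",
            (long)pid, seconds);
    fflush(stderr);

    /* sleep() returns the unslept time if a signal handler interrupts
     * it, so loop until the full delay has elapsed. */
    while (left > 0)
        left = sleep(left);
    return pid;
}
```

Calling wait_for_debugger(30) at the start of the code path you want to inspect leaves a 30 second window in which to run gdb and issue attach with the printed process id.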


6.6 Segmentation faults and other memory problems

Memory bugs are among the most difficult to debug. You may have reason to suspect one if, for example, you are using TOPC_MSG_PTR. If you fail to free previously malloc'ed memory, that is a memory leak. If you access a buffer after freeing it, this may cause a segmentation error at a later stage in the program.
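The two failure modes named above, a leak and an access after free, both come down to buffer ownership. The following generic C sketch, which is independent of `TOP-C' and uses the hypothetical helper names copy_ints and sum_and_release, shows one disciplined pattern: take a private copy of the data, finish every read, and only then free the buffer exactly once.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a heap copy of n ints, or NULL if malloc fails.
 * The caller owns the result and must eventually free() it. */
int *copy_ints(const int *src, size_t n)
{
    int *dst;
    assert(src != NULL && n > 0);
    dst = malloc(n * sizeof *dst);
    if (dst != NULL)
        memcpy(dst, src, n * sizeof *dst);
    return dst;
}

/* Sum a heap buffer of n ints and release it. Takes ownership of buf. */
long sum_and_release(int *buf, size_t n)
{
    long total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        total += buf[i];   /* all reads happen before free() */
    free(buf);             /* freed exactly once, so no leak */
    return total;          /* buf must not be used after this point */
}
```

A memory debugger is most useful when code does not follow such a pattern: dropping the free() call shows up as a leak report, and reading buf after the free() call shows up as an invalid access.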

If you suspect such a bug (and maybe you should if nothing else worked), it is helpful to use a malloc or memory debugger. An excellent recent memory debugger is `valgrind'(2). `valgrind' can be directly applied to an application binary, without recompilation or relinking.

An older debugger is `efence'(3). topcc provides direct support for `efence': `TOP-C' will link with the efence library if --efence is passed to topcc or topc++.
 
  topcc --efence ...
This causes all calls to malloc and free to be intercepted by the `efence' version. Modify the line LIBMALLOC= in topcc or topc++ if you use a different library.


This document was generated by Gene Cooperman on October, 6 2004 using texi2html