

Debugging and Tracing

If the difficulty is that the application fails to start in the distributed memory model (using topcc --mpi), then read section Invoking a TOP-C Application in Distributed Memory, for some debugging techniques. The rest of this section assumes that the application starts up correctly.

Debugging by Incrementally Increasing the Parallelism

First, compile and link your code using topcc --seq -g, and make sure that your application works correctly sequentially. Only after you are confident that the sequential code is correct should you begin to debug the parallel version.

If the application works correctly in sequential mode, one should debug in the context of a single slave. It is convenient to declare the remote slave to be localhost, in order to minimize network delays and so as not to disturb users of other machines. In this case, the code is "almost" sequential.

Next, one should test on two slaves, and finally all possible slaves.

Debugging with --TOPC_safety

The command-line option --TOPC_safety=val provides assistance for debugging parallel programs. At higher values of val, optimizations that do not change the correctness of the program are converted to safer equivalents. A good strategy is to test whether --TOPC_safety=20 causes the bug to go away, and if so, progressively lower val toward zero until the bug reappears. The value at which the bug reappears indicates which TOP-C optimization feature is not being used correctly. If the bug still exists at --TOPC_safety=20, one should next try compiling with the --seq flag and use a debugger to debug the sequential code.
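The search described above can be sketched as a simple downward loop. This is only an illustration of the strategy, not part of TOP-C: the callback type bug_check_fn, the function find_reappearance(), and the sample predicate example_bug_below_12() are all hypothetical names, and in practice each "check" is a manual run of the application with the given --TOPC_safety value.

```c
#include <stdio.h>

/* Hypothetical callback: returns nonzero if the bug shows up when the
   application is run with --TOPC_safety=val.  In practice this is a
   manual test run; it is a function here only to make the loop concrete. */
typedef int (*bug_check_fn)(int val);

/* Start at val=20 and lower val toward 0.  Returns the first (highest)
   value at which the bug appears, or -1 if it never appears.
   A return value of 20 means the bug persists even at maximum safety,
   so the next step is sequential debugging with --seq. */
int find_reappearance(bug_check_fn bug_appears)
{
    int val;
    for (val = 20; val >= 0; val--) {
        if (bug_appears(val))
            return val;
    }
    return -1;
}

/* Hypothetical example: a bug that only shows up once val drops
   below 12. */
int example_bug_below_12(int val)
{
    return val < 12;
}
```

With the sample predicate, the search stops at 11, the highest value below the threshold of 12. Per the table of values in this section, that would point at the feature switched off at 12.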

Currently, the values of val recognized by `TOP-C' include:

  safety: >=0: all; >=4: no TOPC_MSG_PTR; >=8: no aggreg.;
  >=12: no TOPC_abort_tasks; >=14: no receive thread on slave;
  ...;
   =19: only 2 slaves; >=20: only 1 slave
  (NOT YET FULLY IMPLEMENTED)

These values are subject to change. Use --TOPC_verbose alongside --TOPC_safety=val to see a display similar to that above. TOPC_MSG_PTR and aggregation are not yet implemented. Values of 12 or higher cause TOPC_abort_tasks() to have no effect. Values of 14 or higher imply that a single thread in the slave process must both receive messages and execute DoTask(). Normally, `TOP-C' arranges to overlap communication and computation on the slave by setting up a separate thread to receive and store messages from the master. At values of 19 and 20, the number of slaves is reduced to 2 and to 1, respectively, regardless of the setting of --TOPC_num_slaves and the specification in a `procgroup' file.
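To make the thresholds easier to read, the following sketch restates the display above as a lookup table. It is not TOP-C code: the names safety_level, levels and describe_safety() are hypothetical, the thresholds are copied from the display (which is itself subject to change), and the values that the display elides with "..." are deliberately left out.

```c
#include <stdio.h>
#include <string.h>

/* One row of the display: an effect and the value that triggers it.
   exact != 0 means the effect applies only at that exact value. */
struct safety_level {
    int threshold;
    int exact;
    const char *effect;
};

/* Copied from the display above; the elided ("...") rows are unknown
   here and therefore omitted. */
static const struct safety_level levels[] = {
    {  4, 0, "no TOPC_MSG_PTR" },
    {  8, 0, "no aggregation" },
    { 12, 0, "no TOPC_abort_tasks" },
    { 14, 0, "no receive thread on slave" },
    { 19, 1, "only 2 slaves" },
    { 20, 0, "only 1 slave" },
};

/* Writes a "; "-separated list of the effects active at val into buf
   and returns how many effects were written.  Stops early if buf is
   too small for the next entry. */
int describe_safety(int val, char *buf, size_t buflen)
{
    size_t used = 0;
    size_t i;
    int count = 0;

    if (buflen == 0)
        return 0;
    buf[0] = '\0';
    for (i = 0; i < sizeof levels / sizeof levels[0]; i++) {
        const struct safety_level *lv = &levels[i];
        int active = lv->exact ? (val == lv->threshold)
                               : (val >= lv->threshold);
        int n;

        if (!active)
            continue;
        n = snprintf(buf + used, buflen - used, "%s%s",
                     count ? "; " : "", lv->effect);
        if (n < 0 || (size_t)n >= buflen - used)
            break;              /* entry did not fit; stop here */
        used += (size_t)n;
        count++;
    }
    return count;
}
```

For instance, describe_safety(12, buf, sizeof buf) lists the three effects with thresholds 4, 8 and 12.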

Tracing Messages

If a bug appears as one moves to greater parallelism, one should trace messages between master and slaves (for any number of slaves). Tracing is on by default, and it can also be requested explicitly on the command line with:

  ./a.out --TOPC_trace=2 args

The variable TOPC_OPT_trace can be set in the code to dynamically turn tracing on and off during a single run.

You can substitute your own function to display the messages being traced between master and slave.

Variable: void (*)(void *input) TOPC_OPT_trace_input
Variable: void (*)(void *input, void *output) TOPC_OPT_trace_result
void (*TOPC_OPT_trace_input)(void *input);
void (*TOPC_OPT_trace_result)(void *input, void *output);
         Global pointers to functions (default is NULL).  User can
         set them to his or her own trace functions to print out
         data-specific tracing information in addition to the generic
         message tracing of --TOPC_trace.  For example, if you pass
         integers, define a function for TOPC_OPT_trace_input as:

         void mytrace_input( void *input ) {
           printf("%d", *(int *)input);  /* input points to an int */
         }
         /* inside a function, such as main() */
         TOPC_OPT_trace_input = mytrace_input;
Note that the term `result' refers to an `(input, output)' pair.
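As a slightly fuller illustration, the sketch below sets both trace hooks for a task whose input and output are each a single int. It is meant to compile without the TOP-C headers, so it declares stand-in copies of the two global pointers using the declarations shown above; in a real application those globals come from `TOP-C' and must not be redeclared. The names mytrace_input(), mytrace_result() and install_int_tracing() are hypothetical.

```c
#include <stdio.h>

/* Stand-ins for the TOP-C globals documented above, present only so
   that this sketch is self-contained.  Remove these two lines when
   compiling against the real TOP-C headers. */
void (*TOPC_OPT_trace_input)(void *input) = NULL;
void (*TOPC_OPT_trace_result)(void *input, void *output) = NULL;

/* Assumes the task input is a single int. */
void mytrace_input(void *input)
{
    printf("%d", *(int *)input);
}

/* A `result' is an (input, output) pair; both are assumed to be ints. */
void mytrace_result(void *input, void *output)
{
    printf("%d -> %d", *(int *)input, *(int *)output);
}

/* Install both hooks; call this from inside a function such as main(). */
void install_int_tracing(void)
{
    TOPC_OPT_trace_input = mytrace_input;
    TOPC_OPT_trace_result = mytrace_result;
}
```

The trace functions take void * to match the declared pointer types and cast inside the function body, which avoids a function-pointer type mismatch at the assignment.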

If you set --TOPC_trace=1, or if TOPC_OPT_trace_input, etc., have not been set, `TOP-C' uses its own default trace functions.

/* NOT IMPLEMENTED IN THIS VERSION */
void master_slave_stats();
        Prints cumulative statistics from all invocations of
        master_slave();

Note that tracing takes place entirely on the master. So, any print statements produced by a slave may be asynchronous with the trace printing and other printing on the master.

Stepping Through a Slave Process with `gdb'

If you find the master hanging, waiting for a slave message, then the probable cause is that DoTask() is doing something bad (hanging, infinite loop, bus/segmentation error, etc.).

If you are really desperate, note that gdb (the GNU C debugger) includes an attach command (see section `Attach' in The GNU debugger), which allows you to attach to and debug a separate running process. This lets you debug a running slave, if it is running on the same processor. For this strategy, you will want the slave to delay executing, to give you time to execute gdb and attach on the remote host or remote thread. The command-line option --TOPC_slave_wait=30 will force the slave to wait 30 seconds before processing.

In applying this debugging strategy to an application `./a.out', one might see:

  [ Execute ./a.out in one window for master process ]
  gdb ./a.out
  (gdb) run --TOPC_trace=1 --TOPC_slave_wait=30 args

  [ In a second window for a slave process, now type: ]
  ps
    ...
    1492  p4 S    0:00 a.out args localhost 6262 -p4amslave
  gdb a.out
  ...
  (gdb) break DoTask
    Breakpoint 1 at 0x80492ab: file ...
    [ `break slave_loop' is also useful.  This function calls DoTask ]
  (gdb) attach 1492
    Attaching to program `a.out', process 1492
    0x40075d88 in sigsuspend ()
  [ After 30 sec's, traced messages in master window appear, for slave, type: ]
  (gdb) continue
    Continuing.
    Breakpoint 1, DoTask (input=0x805dc50) at ...

  [ Now, continue stepping through master and slave processes in 2 windows ]

If you try to attach to a second slave process after attaching to a first slave process, `gdb' will offer to kill your first slave process. To avoid this situation, remember to execute detach before attaching to a second slave process.
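The delay-then-attach idea is a general technique, and it can also be hand-rolled inside an application when a command-line option is not convenient. The sketch below is an assumption-laden illustration for POSIX systems only, and it is not how `TOP-C' implements --TOPC_slave_wait: the function wait_for_debugger() is a hypothetical helper that prints its process id and then sleeps, so that the process can be found without ps.

```c
#include <stdio.h>
#include <unistd.h>   /* getpid(), sleep(): POSIX only */

/* Hypothetical helper: announce this process's pid on stderr and wait
   the given number of seconds so that a debugger can be attached.
   Returns the pid that was printed. */
long wait_for_debugger(unsigned int seconds)
{
    long pid = (long)getpid();

    fprintf(stderr, "process %ld waiting %u seconds; in gdb, type: attach %ld\n",
            pid, seconds, pid);
    fflush(stderr);

    /* sleep() returns the unslept time if a signal interrupts it,
       so keep sleeping until the whole delay has been used up. */
    while (seconds > 0)
        seconds = sleep(seconds);

    return pid;
}
```

A call such as wait_for_debugger(30) placed at the start of the code to be examined gives the same 30-second window as in the session above.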

Segmentation faults and other memory problems

In `TOP-C', after an application calls TOPC_MSG(buf,buf_size), it is the responsibility of the application to free any memory that might have been malloc'ed for the sake of buf (see section Task Input and Task Output Buffers). Failure to free previously malloc'ed memory is a memory leak, which is often a difficult problem to debug. A common symptom of such memory problems is a segmentation error.

If you suspect such a bug (and maybe you should if nothing else worked), it is helpful to use a malloc debugger. We illustrate with `efence', which can be found at http://sources.isc.org/devel/memleak/efence. `TOP-C' will include the efence library if --efence is passed to topcc or topc++.

  topcc --efence ...

This causes all calls to malloc and free to be intercepted by the `efence' version. Modify the line libefence= in topcc or topc++ if you use a different library.

