If the difficulty is that the application fails to start in the distributed
memory model (using topcc --mpi), then read the section
`If Slaves Fail to Start' for some debugging techniques. The rest of this
section assumes that the application starts up correctly.
First, compile and link your code using topcc --seq --TOPC-safety=20 -g,
and make sure that your application works correctly sequentially. Only after
you have confidence in the correctness of the sequential code should you
begin to debug the parallel version.
If the application works correctly in sequential mode, one should next debug
in the context of a single slave. It is convenient to declare the
remote slave to be localhost in the `procgroup' file, in order
to minimize network delays and so as not to disturb users of
other machines. In this case, the code is "almost"
sequential. An easy way to do this is:
topcc --mpi --TOPC-num-slaves=1 -g
Next, one should test on two slaves, and finally on all possible slaves.
The command-line option --TOPC-safety=val provides assistance
for debugging parallel programs. At higher values of val,
optimizations that do not change the correctness of the program are
converted to safer equivalents. A good strategy is to
test whether --TOPC-safety=20 causes the bug to go away, and if so,
progressively lower val toward zero until the bug reappears.
The value at which the bug reappears indicates which `TOP-C'
optimization feature is not being used correctly. If the bug still
exists at --TOPC-safety=20, one should next try compiling
with the --seq flag and use a debugger to debug the sequential code.
The effects of the different safety levels are subject to change. To see the current effects, invoke any TOP-C application as follows:
./a.out --TOPC-help --TOPC-verbose
and you will see something like:
safety: >=0: all; >=4: no TOPC_MSG_PTR; >=8: no aggreg.; >=12: no TOPC_abort_tasks; >=14: no receive thread on slave; >=15: no TOP-C memory mgr (uses malloc/free); >=16: default atomic read/write for DoTask, UpdateSharedData; =19: only 2 slaves; >=20: only 1 slave (AGGREGATION NOT YET IMPLEMENTED)
Values of 4 or higher cause TOPC_MSG_PTR() to act as
if TOPC_MSG() had been called instead.
Values of 12 or higher cause TOPC_abort_tasks() to have no effect.
Values of 14 or higher imply that a single thread in the slave process
must both receive messages and execute DoTask(). Normally,
`TOP-C' arranges to overlap communication and computation on the
slave by setting up a separate thread to receive and store messages from
the master.
Values of 15 or higher imply that `TOP-C' will use malloc and free instead
of its own memory allocation (which is optimized for `TOP-C' memory
patterns).
Values of 16 or higher imply that all of DoTask acts as
if a read lock were placed around it, and all of UpdateSharedData acts as
if a write lock were placed around it. (This has an effect only
in the shared memory model where calls to TOPC_ATOMIC_READ/WRITE
are ignored.)
At values of 19 and 20, the number of slaves is reduced to
2 and to 1, respectively, regardless of the setting of --TOPC-num-slaves
and the specification in a `procgroup' file.
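The strategy of lowering the safety value until the bug reappears can be
paired with a small lookup of the thresholds. The following is a minimal
sketch in C, assuming the thresholds shown in the sample help output above
(they may differ in your version of `TOP-C'). The names safety_steps and
suspect_effect are hypothetical and are not part of `TOP-C'.

```c
#include <stddef.h>

/* One row per threshold in the sample --TOPC-help output quoted above.
   ASSUMPTION: these values match your TOP-C version; check the help
   output of your own application before relying on them. */
struct safety_step {
    int threshold;        /* effect applies when safety >= threshold */
    const char *effect;   /* text as shown in the help output */
};

/* Rows must stay sorted by ascending threshold. */
static const struct safety_step safety_steps[] = {
    {  4, "no TOPC_MSG_PTR" },
    {  8, "no aggregation" },
    { 12, "no TOPC_abort_tasks" },
    { 14, "no receive thread on slave" },
    { 15, "no TOP-C memory manager (uses malloc/free)" },
    { 16, "default atomic read/write for DoTask, UpdateSharedData" },
    { 19, "only 2 slaves" },
    { 20, "only 1 slave" },
};

/* Hypothetical helper, not part of TOP-C.  `reappear` is the highest
   safety value at which the bug is seen again while lowering the value
   from 20.  Returns the effect with the smallest threshold above
   `reappear`, i.e. the safeguard that was just switched off, or NULL
   if no threshold lies above it. */
const char *suspect_effect(int reappear)
{
    size_t n = sizeof safety_steps / sizeof safety_steps[0];
    for (size_t i = 0; i < n; i++) {
        if (safety_steps[i].threshold > reappear)
            return safety_steps[i].effect;
    }
    return NULL;
}
```

For example, if the bug is absent at 4 and reappears at 3, the helper
points at the TOPC_MSG_PTR row, which suggests looking at how the
application uses TOPC_MSG_PTR().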
If a bug appears as one moves to greater parallelism, one should trace
messages between master and slaves (for any number of slaves). Tracing is
the default, and it can also be requested explicitly on the command line with:
./a.out --TOPC-trace=2 args
The variable TOPC_OPT_trace can be set in the code to
dynamically turn tracing on (1 or 2) and off (0) during a single run.
A trace value of 2 causes `TOP-C' to invoke the application-defined
trace functions pointed to by TOPC_OPT_trace_input/result. If the
application has not defined trace functions, or if TOPC_OPT_trace is 1, then
the `TOP-C' default trace functions are invoked.
All message traces are displayed by the master at the time that the
master sends or receives the corresponding message.
TOPC_OPT_trace_input and TOPC_OPT_trace_result are global pointers
(default is NULL) to functions returning void. The user can
set each to his or her own trace function to print out
data-specific tracing information in addition to the generic
message tracing of TOPC_trace.
EXAMPLE: if you pass integers via TOPC_MSG(), define TOPC_trace_input() as:
  void mytrace_input( int *input ) { printf("%d",*input); }
  TOPC_OPT_trace_input = mytrace_input;
Note that the term `result' in TOPC_OPT_trace_result refers to an
`(input, output)' pair.
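A trace function for results can follow the same pattern as the input
example. The following is a minimal sketch in C, assuming that both the
task input and the task output are single integers and that the hook
receives them as two void pointers. That signature is an assumption, so
check it against your `TOP-C' header. The variable trace_result_hook is a
hypothetical stand-in so that the sketch compiles without `TOP-C'; in a real
application one would assign to TOPC_OPT_trace_result instead.

```c
#include <stdio.h>

/* Format an (input, output) pair of ints into buf.  Returns the number
   of characters that snprintf reports for the formatted text. */
int format_int_pair(char *buf, size_t size,
                    const int *input, const int *output)
{
    return snprintf(buf, size, "%d -> %d", *input, *output);
}

/* Application-defined trace function for results.
   ASSUMPTION: the hook is called with the task input and the task
   output as void pointers; here both are taken to point to an int. */
void mytrace_result(void *input, void *output)
{
    char buf[64];
    format_int_pair(buf, sizeof buf,
                    (const int *)input, (const int *)output);
    fputs(buf, stdout);
}

/* Hypothetical stand-in for the TOP-C global TOPC_OPT_trace_result,
   so that this sketch compiles without the TOP-C headers. */
static void (*trace_result_hook)(void *, void *) = NULL;

/* Install the application-defined trace function. */
void install_trace(void)
{
    trace_result_hook = mytrace_result;
}
```

Taking void pointers and casting inside the function avoids assigning a
function of one pointer type to a hook of another, which newer C compilers
reject.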
If you find the master hanging, waiting for a slave message, then the
probable cause is that DoTask() is doing something bad (hanging,
infinite loop, bus/segmentation error, etc.). First try to isolate the
bug using a symbolic debugger (e.g. `gdb') and the sequential memory
model. If your intended application is the shared memory model, you can also
use `gdb' to set a breakpoint in your `DoTask' routine
or at the `TOP-C' invocation, do_task_wrapper.
If the bug only appears in the distributed memory model, you can still
symbolically debug DoTask() using `gdb' (the GNU debugger)
and its attach command (see section `Attach' in The GNU debugger),
which allows you to attach to and debug a separate running process. This
lets you debug a running slave, if it is running on the same processor.
For this strategy, you will want the slave to delay executing to give
you time to start gdb and attach on the remote host or remote thread.
The command line option --TOPC-slave-wait=30 will force
the slave to wait 30 seconds before processing.
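The idea behind such a delay can also be applied by hand at some other
point in a program. The following is a minimal sketch in C for a POSIX
system, in which a process prints its process id and then sleeps so that
a debugger has time to attach. The function wait_for_debugger is
hypothetical. It is not part of `TOP-C' and is not a description of how
`TOP-C' implements --TOPC-slave-wait.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of TOP-C: announce this process's id
   on stderr, then sleep for `seconds` so that a debugger can attach.
   Returns the process id that was printed. */
pid_t wait_for_debugger(unsigned int seconds)
{
    pid_t pid = getpid();
    unsigned int left = seconds;

    fprintf(stderr, "pid %ld: waiting %u s for a debugger to attach\n",
            (long)pid, seconds);
    fflush(stderr);

    /* sleep() returns the unslept time if a signal handler interrupts
       it, so keep sleeping until the full delay has elapsed. */
    while (left > 0)
        left = sleep(left);

    return pid;
}
```

The printed process id is the number one would pass to the attach command
of `gdb'.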
In applying this debugging strategy to an application `./a.out', one might see:
  [ Execute ./a.out in one window for master process ]
  gdb ./a.out
  (gdb) run --TOPC-trace=1 --TOPC-safety=19 --TOPC-slave-wait=30 args
  [ In a second window for a slave process, now type: ]
  ps
  ...
  1492 p4 S 0:00 a.out args localhost 6262 -p4amslave
  gdb a.out
  ...
  (gdb) break do_task_wrapper
  Breakpoint 1 at 0x80492ab: file ...
  [ `break slave_loop' is also useful. This calls do_task_wrapper ]
  (gdb) attach 1492
  Attaching to program `a.out', process 1492
  0x40075d88 in sigsuspend ()
  [ After 30 sec's, traced messages in master window appear, ]
  [ for slave, type: ]
  (gdb) continue
  Continuing.
  Breakpoint 1, DoTask (input=0x805dc50) at ...
  [ Continue stepping through master and slave processes in 2 windows ]
If you try to attach to a second slave process after attaching
to a first slave process, `gdb' will offer to kill your first
slave process. To avoid this situation, remember to execute detach
before attaching to a second slave process.
Memory bugs are among the most difficult to debug.
You may suspect such a bug if, for example, you are using TOPC_MSG_PTR().
If you fail to free previously malloc'ed memory, that is a memory leak.
If you access a buffer after freeing it, this may cause
a segmentation error at a later stage in the program.
If you suspect such a bug (and maybe you should if nothing else
worked), it is helpful to use a malloc debugger.
We illustrate with `efence', which can be found at
http://sources.isc.org/devel/memleak/efence.
`TOP-C' will include the efence library if --efence
is passed to topcc or topc++.
topcc --efence ...
This causes all calls to malloc and free to
be intercepted by the `efence' version.
Modify the line LIBMALLOC= in topcc or topc++
if you use a different library.
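The two memory bugs named in this section, a leak and an access after
free, can be made harder to commit with a pair of small helpers. The
following is a minimal sketch in C. The functions copy_ints and
release_ints are hypothetical and are not part of `TOP-C'.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helpers, not part of TOP-C.

   Patterns these helpers are meant to avoid:
     leak:              p = copy_ints(a, n); p = copy_ints(b, n);
                        (the first copy is never freed)
     access after free: free(p); x = p[0];
*/

/* Return a heap copy of `count` ints starting at `src`, or NULL if
   the allocation fails.  The caller owns the copy and must release it. */
int *copy_ints(const int *src, size_t count)
{
    int *copy = malloc(count * sizeof *copy);
    if (copy != NULL)
        memcpy(copy, src, count * sizeof *copy);
    return copy;
}

/* Free *bufp and reset it to NULL.  A later stale use through this
   pointer then dereferences NULL instead of reading freed memory. */
void release_ints(int **bufp)
{
    free(*bufp);
    *bufp = NULL;
}
```

Resetting the pointer protects only that one variable. Other copies of the
same address can still be used after the free, which is the kind of access
that a malloc debugger such as `efence' is designed to catch.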