

Performance and Long Jobs

Strategies for Greater Concurrency

Strategy 1: SEMI-INDEPENDENT TASKS:
Define tasks so that most task outputs do not require any update. This is always the case for trivial parallelism (when tasks are independent of each other), and it often holds for search and enumeration problems as well.
Strategy 2: CACHE PARTIAL RESULTS:
Inside DoTask() and UpdateSharedData(), save partial computations in global private variables. Then, in the event of a REDO action, `TOP-C' guarantees to invoke DoTask() again on the original slave process or slave thread. That slave may then use the previously computed partial results to shorten the required computation. Note that pointers on the slave into the input and output buffers of previous UPDATE actions, and into the original task input, will no longer be valid: the slave process must copy any data it wishes to cache into global variables. In the shared memory model, those global variables must be thread-private (see section Thread-Private Global Variables). Note also the existence of TOPC_is_REDO() for testing for a REDO action. (A sketch of this pattern appears after Strategy 3 below.)
Strategy 3: MERGE TASK OUTPUTS:
Inside CheckTaskResult(), the master may merge two or more task outputs in an application-dependent way. This may avoid the need for a REDO action, or it may reduce the number of required UPDATE actions.
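
The following sketch illustrates Strategies 1 and 2 together. It is only a sketch: it assumes the `TOP-C' C interface used elsewhere in this manual (`topc.h', TOPC_MSG(), TOPC_is_REDO(), and the actions NO_ACTION and UPDATE), the search routine and its data layout are hypothetical, and under the shared memory model the two globals would have to be thread-private.

  #include "topc.h"

  /* Hypothetical application routine (not part of `TOP-C'): performs the
     search for task_id, recording partial progress through *progress so
     that a later call can resume where it left off.                      */
  extern long expensive_search(long task_id, long *progress);

  /* Slave-private cache of partial work (Strategy 2).  Shown as plain
     globals; under the shared memory model they must be thread-private.  */
  static long cached_progress = 0;
  static long answer;

  TOPC_BUF DoTask(void *input) {
    long task_id = *(long *)input;
    if (!TOPC_is_REDO())
      cached_progress = 0;     /* fresh task: start from the beginning    */
    /* On a REDO, resume from cached_progress instead of recomputing from
       scratch.  Only plain data copied into these globals survives; any
       pointers saved into earlier message buffers are no longer valid.   */
    answer = expensive_search(task_id, &cached_progress);
    return TOPC_MSG(&answer, sizeof(answer));
  }

  TOPC_ACTION CheckTaskResult(void *input, void *output) {
    /* Strategy 1: arrange for most task outputs to need no update.       */
    if (*(long *)output == 0)
      return NO_ACTION;
    return UPDATE;             /* propagate this result to all processes  */
  }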

Improving Performance

If your application runs too slowly due to excessive communication time, consider running multiple slave processes on a single processor. This allows one process to continue computing while another is communicating, or is idle while waiting for the master to generate a new task.

If communication overhead or idle time is still too high, consider whether it is possible to increase the granularity of your tasks -- perhaps by amalgamating several consecutive tasks into a single larger task to be performed by a single process. You can do some of this automatically. For example, if the statement:

  TOPC_agglom_count=5;  [ EXPERIMENTAL VERSION, ONLY ]

is executed before TOPC_master_slave(), then `TOP-C' will transparently bundle five task inputs as a single network message, and similarly for the corresponding task outputs.
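
A minimal sketch of the placement follows. It assumes that the experimental version declares TOPC_agglom_count in `topc.h', the usual TOPC_init()/TOPC_master_slave()/TOPC_finalize() sequence, and the application's own four callbacks (defined elsewhere):

  #include "topc.h"

  /* The application's usual callbacks, defined elsewhere (signatures as
     assumed throughout this sketch).                                     */
  TOPC_BUF GenerateTaskInput(void);
  TOPC_BUF DoTask(void *input);
  TOPC_ACTION CheckTaskResult(void *input, void *output);
  void UpdateSharedData(void *input, void *output);

  int main(int argc, char *argv[]) {
    TOPC_init(&argc, &argv);
    TOPC_agglom_count = 5;   /* bundle five task inputs per network message
                                (experimental version only)               */
    TOPC_master_slave(GenerateTaskInput, DoTask, CheckTaskResult,
                      UpdateSharedData);
    TOPC_finalize();
    return 0;
  }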

Other useful techniques that may improve performance of certain applications are:

  1. set up multiple slaves on each processor (if slave processors are sometimes idle)
  2. rewrite the code to bundle a set of tasks into a single task (to improve the granularity of your parallelism)
PERFORMANCE ISSUE FOR MPI:
If you have a more efficient version of `MPI' (perhaps a vendor version tuned to your hardware), consider replacing LIBMPI in `.../top-c/Makefile' by your vendor's `libmpi.a' or `libmpi.so', and delete or modify the LIBMPI target in the `Makefile'.
PERFORMANCE ISSUE FOR SMP:
Finally, under `SMP', there is an important performance issue concerning the interaction of `TOP-C' with the operating system. First, the vendor-supplied compiler, cc, is recommended over gcc for `SMP', due to specialized vendor-specific architectural issues. Second, if a thread completes its work before using its full scheduling quantum, the operating system may yield that thread's CPU to another thread -- potentially one belonging to a different process. There are several ways to defend against this. One defense is to ensure that the time for a single task is significantly longer than one quantum. Another is to ask the operating system to give you at least as many "run slots" as you have threads (slaves plus master). Some operating systems provide pthread_setconcurrency() to allow an application to declare this information, and `TOP-C' invokes pthread_setconcurrency() where it is available. However, other operating systems may have alternative ways of tuning thread scheduling, and it is worthwhile to read the relevant manuals for your operating system.
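
For reference, a minimal sketch of declaring the desired concurrency level directly from the application appears below. It assumes a POSIX threads system that provides pthread_setconcurrency(); the slave count is hypothetical, and the call is only a scheduling hint.

  #include <pthread.h>
  #include <stdio.h>

  #define NUM_SLAVES 4          /* hypothetical number of slave threads   */

  /* Request one scheduling "run slot" per thread (master plus slaves).
     The call is only a hint, and some systems do not provide it at all.  */
  static void request_run_slots(void) {
    if (pthread_setconcurrency(NUM_SLAVES + 1) != 0)
      fprintf(stderr, "pthread_setconcurrency failed\n");
  }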

Long Jobs and Courtesy to Others

In the distributed memory model, infinite loops and broken socket connections tend to leave orphaned processes running. To guard against this, a `TOP-C' slave times out if a task lasts longer than a half hour or if the master does not reply within a half hour. This is implemented with the UNIX system call alarm().

A half hour (1800 seconds) is the default timeout period. The command-line option --TOPC-slave-timeout=num allows one to change this default. If num is 0, then there is no timeout and `TOP-C' makes no calls to alarm() (and hence never receives SIGALRM).

The application writer may also find some of the following UNIX system calls useful for allowing large jobs to coexist with other applications.

setpriority(PRIO_PROCESS,getpid(),prio)
#include <unistd.h>
#include <sys/resource.h>

--- prio = 10 still gives you some CPU time. prio = 19 means that any job of higher priority always runs before you. Place in main().
setrlimit(RLIMIT_RSS, &rlp)
#include <sys/resource.h>
struct rlimit rlp;
rlp.rlim_max = rlp.rlim_cur = SIZE;

--- SIZE is the RAM limit in bytes. If your system pages heavily, the operating system will then prefer to keep your process from growing beyond SIZE bytes of resident RAM. This is important even if you have lowered your priority as above; otherwise, during one of your infrequent quantum slices of CPU time, you may force much of someone else's job to be paged out in your favor. Place in main(). (Not all operating systems enforce this request.)
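
Combining the two calls, a minimal sketch of the corresponding code near the top of main() might read as follows; SIZE here is a hypothetical limit, and the usual `TOP-C' initialization would follow.

  #include <stdio.h>
  #include <sys/resource.h>
  #include <sys/time.h>
  #include <unistd.h>

  #define SIZE (256L * 1024 * 1024)   /* hypothetical RAM limit, in bytes */

  int main(int argc, char *argv[]) {
    struct rlimit rlp;

    /* prio 10 still leaves some CPU time; prio 19 yields to any job of
       higher priority.                                                   */
    if (setpriority(PRIO_PROCESS, getpid(), 10) != 0)
      perror("setpriority");

    /* Ask the system to keep the resident set below SIZE bytes.
       (Not all operating systems enforce this request.)                  */
    rlp.rlim_max = rlp.rlim_cur = SIZE;
    if (setrlimit(RLIMIT_RSS, &rlp) != 0)
      perror("setrlimit");

    /* ... the usual TOPC_init()/TOPC_master_slave() code follows ...     */
    return 0;
  }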

