Student Projects in Parallel Computing

Prof. Gene Cooperman

The following are suggested projects for CS G280 (Parallel Computing). You are welcome to suggest other projects if you like.

Most of the projects below have the potential to result in conference papers. Within each category, the projects are ordered roughly by importance, which correlates strongly with the likelihood of developing the project into a conference paper. Clicking on a project listing will take you to a fuller description. If a student has selected a project, clicking on the heading of its fuller description will take you to a directory where you can look at the results to date.

Throughout this web page, there are many references to TOP-C (Task Oriented Parallel C/C++). TOP-C is freely available, and is easily installed and used. It supports both shared memory and distributed memory parallelism with a single API (Application Programming Interface).

Also, many of these projects will benefit from having a test suite of applications. There are several possible test suites, but a particularly high-quality suite is the Stanford Parallel Applications for Shared Memory (SPLASH and SPLASH-2), available as a .tar.gz file. While the applications are written for shared memory, they can in most cases be adapted to distributed memory. Since distributed memory tends to be more scalable, this is an important consideration. The kernel programs are smaller than the apps programs, and should be examined first. The file splash2/codes/null_macros/c.m4.null provides macro expansions that turn a program into a single-processor program.

Detailed Listing

===============================

Applied Parallel Programming Languages

Applied Programming Languages is a subject in which one first determines, based on user experience, what features would be useful. One then looks for innovative ways to retrofit those features into an existing language. Retrofitting into an existing language often requires strong systems skills. In a pure approach, one would design a new language with new features (e.g., Scheme or Smalltalk), and then hope that a new mainstream language (such as Java or C#) will adopt some of those features. In an applied approach, one tries to retrofit those features directly into an existing language, gain user experience, and then use that experience to create a new mainstream language. C++ is an example of the latter (although one can argue that C++ was released too early, with an overly complex programmer's model, and would have benefited from more user experience).


New Parallel Abstractions

This section considers parallel abstractions that would modify the TOP-C programmer's model.


Parallelizing Octave (Matlab look-alike)
Extend TOP-C to work interactively in Octave. Octave is a Matlab-like language for interactive numerical computation (an engineering/science application), and is free software under the GPL. Something similar was previously done with ParGAP, the parallelization of the interactive language GAP using TOP-C. Matlab has become almost a standard in engineering. A parallelization of Octave that makes it easy for end users of Matlab to write parallel code (via the Octave work-alike) is likely to attract a lot of attention.


C++ binding to TOP-C callback functions
C++ was not designed for use with callback functions. If you don't believe this, try using a non-static member function of some class as a callback function. In C, one can write
int (*do_task)(void *input);
int myfnc(void *input) { return *(int *)input + 1; }
do_task = myfnc;
(*do_task)(input);
This does not work if myfnc is a non-static member function.

This problem was already faced by the designers of GTK+, a GUI project written in C that depends on callback functions. Its designers had to provide a way for C++ code to supply callback functions. Study their design decisions, and design an appropriate C++ binding for TOP-C that elegantly handles the callback functions.
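
The usual C++ idiom for this is a static "trampoline" member function that matches the C callback signature and forwards to a member function through an object pointer. A minimal sketch follows; the class and names are illustrative, not part of TOP-C:

  class MyTask {
  public:
    int do_task(void *input) { return *(int *)input + 1; }
    /* Static trampoline: has an ordinary function type, so it can be
       stored in a C function pointer; it forwards to the member function. */
    static int trampoline(void *obj, void *input) {
      return ((MyTask *)obj)->do_task(input);
    }
  };

  /* The C side stores the object pointer alongside the function pointer: */
  int (*do_task_cb)(void *obj, void *input) = MyTask::trampoline;

A TOP-C binding would have to decide where to keep the object pointer, since the existing callback signatures take only the task input; GTK+ solves the analogous problem by passing a user-data pointer with every callback.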


Scalability of data in parallel computation
In a cluster, the size of the data may grow so large that it no longer fits in the RAM of a single processor. Traditionally, when an applications person was faced with this problem, he or she would look for a shared memory computer. A 16-processor shared memory computer is likely to have 16 times as much memory as a standalone 1-processor computer. This is an inefficient solution since shared memory computers are expensive and he or she plans to use only one of the 16 processors. Design a cluster-based solution, in which, in addition to master and worker nodes, one also has data nodes. A data node may be a separate processor. Alternatively, a single processor may support two processes, one primarily for computation and one primarily for maintaining some portion of a distributed database (not necessarily a relational database). The job of a data node is to handle requests from other nodes.

For example, we may distribute a 1 GB hash array among 10 data nodes. Each data node is responsible for a contiguous sequence of hash indices. For network efficiency, we may have to structure a parallel computation so that a worker node saves up 1000 hash queries per data node, and sends all 1000 hash queries to the data node as a single message.
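
A minimal sketch of such batching, assuming the table is partitioned into contiguous ranges (the constants and the send_batch() routine are hypothetical):

  #include <vector>

  const int  NUM_DATA_NODES = 10;
  const int  BATCH_SIZE     = 1000;
  const long TABLE_SIZE     = 1L << 28;  /* number of hash entries; illustrative */

  void send_batch(int node, const std::vector<long> &queries);  /* hypothetical */

  std::vector<long> pending[NUM_DATA_NODES];

  /* Contiguous partition: hash index h belongs to node h*NUM/TABLE_SIZE. */
  int owner(long h) { return (int)(h * NUM_DATA_NODES / TABLE_SIZE); }

  void query(long h) {
    int node = owner(h);
    pending[node].push_back(h);
    if (pending[node].size() == (size_t)BATCH_SIZE) {
      send_batch(node, pending[node]);   /* one message per 1000 queries */
      pending[node].clear();
    }
  }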

In 2005, we are lucky enough to have an in-house application by Eric Robinson that would directly benefit from this model. This provides you with an excellent test case. Please talk to me and him about this project.


Annotated parallelization
OpenMP is a standard that provides for parallelism on top of POSIX threads. It does so by adding annotations and pragmas that are recognized by a front-end program. Following that philosophy, do the same thing for applications on top of TOP-C. Annotate the main TOP-C task in such a way that a preprocessor can translate the code into parallel code that can be compiled and linked with the TOP-C library. An ANSI C grammar suitable for yacc/bison and lex/flex is readily available for this purpose. Perhaps a better approach is to use the power of an existing language transformer package, such as CIL.
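
For example, the annotated source might look like the following; the pragma name and clause are hypothetical, and the preprocessor would expand the loop into TOP-C's generate-task, do-task and check-result callbacks:

  /* Hypothetical annotation; not an existing TOP-C or OpenMP pragma. */
  #pragma topc master_slave shared(results)
  for (i = 0; i < num_tasks; i++)
    results[i] = do_task(&inputs[i]);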


Bottlenecks due to slow RAM
As the CPU-memory gap continues to grow, many common programs are now dominated by the speed of RAM, and the speed of the CPU is almost irrelevant. For example, most integer sorting programs are in this class. This holds to such an extent that program runtimes can be estimated to within about 25% using models that count memory cycles. For such a simple program as
object Z[], Y[]; int X[];
for (int i = 0; i < N; i++) Z[i] = Y[X[i]];

there is a faster version that takes into account the properties of RAM; a sketch appears below. How does one translate such ideas into tasks such as speeding up an FFT (Fast Fourier Transform) or integer matrix multiplication? See also the next topic on slow RAM.
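
One such faster version is a two-pass bucketed gather: first bucket the indices by region of Y, then gather one region at a time, so that each pass touches only a cache-sized slice of Y. A sketch, with an illustrative choice of 64 regions:

  #include <utility>
  #include <vector>

  /* Z[i] = Y[X[i]], where each X[i] is in [0, M).  Tune B so that M/B
     elements of Y fit in cache. */
  template <class T>
  void bucketed_gather(T *Z, const T *Y, const int *X, long N, long M) {
    const int B = 64;
    std::vector<std::pair<long,int> > bucket[B];
    for (long i = 0; i < N; i++)          /* pass 1: sequential scan of X */
      bucket[(long)X[i] * B / M].push_back(std::make_pair(i, X[i]));
    for (int b = 0; b < B; b++)           /* pass 2: one Y-region at a time */
      for (size_t j = 0; j < bucket[b].size(); j++)
        Z[bucket[b][j].first] = Y[bucket[b][j].second];
  }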


Superlinear parallel speedup and slow RAM
The good news is that superlinear parallel speedup can be a reality. Superlinear parallel speedup is a phenomenon in which a sequential program is parallelized to run on 10 processors and then runs 20 times as fast. This can happen if the working set of the sequential program is larger than the cache of any one processor, but in the parallel version, the working set may be split up so that the working set on any single processor fits nicely within the available cache.

The bad news is that RAM can be up to 100 times slower when used in a random access pattern. In some shared memory applications, one can even observe a negative parallel speedup (speed-down?) when added processors compete for accessing slow RAM. In today's technology, the CPU processes 4-byte words at approximately 8 GB/s (i.e. 2 GHz = 8 GB/s); Gigabit Ethernet runs at approximately 128 MB/s (1 Gb/s/8 = 128 MB/s); and random access to DRAM runs at approximately 66.6 MB/s (4-byte word using 60 ns DRAM = 4/(60 ns) = 66.6 MB/s; 60 ns random access time for DRAM has been a standard for the last decade).

Can one make use of multiple data nodes (see Scalability of Data) with their multiple caches to partition the data in a distributed manner for faster access to RAM? What is a possible application that might usefully take advantage of such a scheme? If the bandwidth and latency of Ethernet are a bottleneck, can one implement this scheme successfully on a shared memory machine?


Recursion
It is non-trivial to provide recursion in a parallel system. One elegant solution is provided by Cilk. Cilk is primarily for shared memory, although a port to distributed memory does exist. Cilk uses the concept of work-stealing to implement recursion and also to provide good load balancing. Look at the MANTA project, which has its own solutions to recursion. Although TOP-C has a recursion example among its example files, the recursion is less elegant and more error-prone than Cilk's. Either use the spawn/sync calls of Cilk's work-stealing paradigm (see the sketch below), or design a more fool-proof way to do recursion in TOP-C (possibly with the use of source code annotations in comments?). One useful trick that makes this possible is to copy stacks and paste them together using the setjmp/longjmp library calls. If enough coverage is provided, this can be turned into a publishable paper (typically by investing as much time after the course as was invested during the course).
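
For reference, recursion in Cilk-5 syntax looks like this; spawn forks a child procedure that an idle worker may steal, and sync waits for all outstanding children:

  cilk int fib(int n) {
    if (n < 2) return n;
    else {
      int x, y;
      x = spawn fib(n - 1);   /* child may be stolen by an idle worker */
      y = spawn fib(n - 2);
      sync;                   /* wait for both children to complete */
      return x + y;
    }
  }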


Multiple pages to eliminate false sharing in TOP-C
A problem with the TOP-C model occurs when there are frequent updates to the shared data. There is a well-known phenomenon in shared memory called false sharing. When two threads (or two processes in a DSM system) repeatedly modify the same page of memory, the operating system must repeatedly migrate that page first to one thread and then to the other. If the two threads are modifying distinct addresses on that page, then the migration of the page is unnecessary. Yet the operating system has no way of detecting this.

TOP-C exhibits a form of false sharing. Its concept of shared data effectively creates a single page for all of the shared data. TOPC_is_up_to_date() tests if that single page has been modified since the current task was executed, and if so, the user program typically invokes a REDO action. This is expensive. Yet it can happen that the worker for the current task, and the worker that had since modified the shared data, in fact worked on distinct portions of that page. In that case, the REDO is unnecessary. The TOP-C model can be extended to multiple `pages' of shared data by modifying the action UPDATE(page_num) and the utility TOPC_is_up_to_date(page_num). Do so.
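
A sketch of how the extended interface might be used inside the check-result callback; the page-numbered variants are the proposed extension, not the current TOP-C API, and page_of_task() is a hypothetical helper:

  TOPC_ACTION check_task_result(void *input, void *output) {
    int page_num = page_of_task(input);   /* hypothetical */
    if (!TOPC_is_up_to_date(page_num))    /* proposed page-granularity test */
      return REDO;                        /* REDO only if *this* page changed */
    else
      return UPDATE(page_num);            /* broadcast updates for this page only */
  }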


Shared memory and LISP/GAP
GAP (Groups, Algorithms and Programming) is a general-purpose interactive language for discrete mathematics, and especially group theory. Modify GAP (an interpreted language) to support multithreading. ParGAP, a parallelization of GAP using TOP-C, already exists and is a refereed GAP package. Replace this implementation (which is distributed memory only) with a new implementation that allows for both distributed and shared memory operation. A major issue in this project is to stop all threads before any garbage collection. (Multithreaded garbage collectors exist, but porting one to GAP would probably be beyond the scope of a one-quarter course.)

For those who don't mind LISP syntax, the same project can be attempted on GCL (GNU Common LISP). GCL also has a distributed memory parallelization, ParGCL.


Retrofitting Mainstream Languages with New Features


TOP-C, the Grid, and generic communication layers
The Computational Grid represents an important paradigm for computation in the future. In this paradigm, 30 computer centers may "share their computer resources" on the Internet. Then, instead of one person using a cluster during 30 nights over a month, one person can use 30 clusters in a single night to complete the computation. This requires no new resources and achieves the same throughput, but the elapsed time from the beginning of a computation to the end is drastically reduced. The Globus protocols are an evolving standard to support the Grid.

TOP-C currently supports the Grid through the Ampic communication layer for TOP-C, and also through the use of MPICH with the MPI communication layer for TOP-C. Ampic is part of the AppLeS project of the Grid Research And Innovation Laboratory (GRAIL) at the San Diego Supercomputer Center. The integration of TOP-C with Ampic was joint work with Henri Casanova, Jim Hayes and Thomas Witzel. Thomas Witzel is at Northeastern University, and Henri Casanova and Jim Hayes are at the University of California, San Diego.

Is there a way to automatically write a generic communication layer? One should specify the primitives for spawning a new process, sending a message, and receiving a message. Given these primitives, one should be able to use the current comm-mpi.c as a template, and replace the MPI commands for spawning, sending, receiving, etc., by the commands given in the specification. (This is similar to the current generic marshaling package, joint work with Ning Ke.) A test of the success of this facility would be the ability to easily incorporate the Globus protocols. This is important for making the software more maintainable (compare aspects in programming languages) and easier to update when protocols change.
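
Concretely, the specification might be a table of function pointers that each communication layer fills in; the names below are illustrative, not the actual TOP-C interface:

  #include <stddef.h>

  struct comm_layer {
    int (*spawn_procs)(int num_procs, char **argv); /* start worker processes */
    int (*send_msg)(int dest, const void *buf, size_t size);
    int (*recv_msg)(int *src, void *buf, size_t max_size);
    int (*finalize)(void);
  };
  /* A generator could then emit comm-globus.c, comm-soap.c, etc.,
     using comm-mpi.c as the template. */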


Adapt checkpointing to interactive programs
Checkpointing programs are especially efficient for task-based master-slave computations, since they need only checkpoint the master process. Many parallel interactive programs rely on task-based master-slave architectures. Adapt an existing checkpointing package in order to checkpoint a parallel interactive program. There should be an emphasis on robustness, so that this works well in the "real world".


Marshaling
Marshaling of complex objects is inherent to any sophisticated application in distributed computing. For C (and for C++ when it uses "class" as a C "struct"), there are standard solutions. These solutions tend not to support the full feature set of C++. This was painfully observed in parallelizing Geant4. Rather than reinvent the wheel, one desires a package that enables marshaling through annotations in comments. Decisions must be made about where to do a deep copy, a shallow copy, an index into a permanent data array, etc. Templates, inheritance and other issues add real-world complications. Simplicity is important here. Many of these problems have been solved by Marshalgen (also see paper and earlier paper). How good is the coverage? How easy is it to use?


Scalability of the master-worker paradigm
The master-worker paradigm suffers from scalability problems as the number of workers grows (perhaps into the thousands). This becomes an issue as vendors move to quad-processor chips and beyond. Set up a tree of masters to pass tasks to the workers. Do this initially for only two levels, and then continue. Next, assume that the lowest layer is shared memory. (As a special case, each node at the lowest layer may have no sibling nodes, in which case the problem of shared memory goes away.) This problem can be solved by invoking the existing communication layers of TOP-C more than once. A new intermediate communication layer must be written for TOP-C, which considers itself a worker process when communicating with its parent node, but considers itself a master process when communicating with its child nodes.


Master-worker paradigm for combined distributed/shared memory computation
This is related to the previous issue of scalability. Here we emulate the future architecture of multiple quad-processor (or dual-processor) chips. To support this architecture, consider a three-level tree with leaf nodes using a shared memory model and intermediate nodes using a distributed memory model. The topmost, root node acts as the master. To implement this, one has to write an intermediate communication layer that links with both comm-mpi.c (distributed memory) and comm-pthread.c (shared memory), while avoiding name conflicts. The following pseudo-code does this:
comm-hybrid.c:
  #define COMM_ COMM_MPI_
  #include "comm-mpi.c"
  #undef  COMM_               /* COMM_ must be undefined before redefinition */
  #define COMM_ COMM_PTHR_
  #include "comm-pthread.c"
One then implements the standard COMM_ functions, and makes calls to both COMM_MPI_XXX and COMM_PTHR_XXX.


Large messages
Large messages cause inordinate delays in a parallel program. This is especially threatening if, due to network delays, it leads to unbalanced solutions in which the master must delay issuing tasks, or must issue tasks with incomplete information. Much of this can be solved by proper use of TCP/IP socket options for setting buffer sizes and other parameters (see "man getsockopt" and "man setsockopt"). Modify TOP-C to adapt to variable-size messages in an application that creates large message buffers.
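
For example, the kernel's per-socket send and receive buffers can be enlarged as follows (a sketch: sock is assumed to be an open TCP socket, the 1 MB figure is illustrative, and return values should be checked):

  #include <sys/types.h>
  #include <sys/socket.h>

  int bufsize = 1 << 20;   /* 1 MB; illustrative */
  setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
  setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));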


Memory allocator
The topc.c file goes part of the way toward implementing an efficient memory allocator for TOP-C. Much of the difficulty concerns the case of threads. Implement short messages allocated directly in the message header. Right now, there are three calls to the memory management layer: allocate memory, check memory, and free memory. Can the check-memory call be avoided in a new version? The underlying philosophy: we know that all message headers will always be the same size; malloc does not. Can we take advantage of that for a more efficient organization?
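
A minimal sketch of the fixed-size idea (sizes and names are illustrative): since every header is the same size, freed headers can be chained into a free list and reused in constant time, with none of malloc's per-block size bookkeeping.

  #include <stdlib.h>

  struct msg_header { struct msg_header *next_free; char data[60]; };
  static struct msg_header *free_list = NULL;

  struct msg_header *alloc_header(void) {
    if (free_list != NULL) {
      struct msg_header *h = free_list;   /* reuse a freed header */
      free_list = h->next_free;
      return h;
    }
    return (struct msg_header *)malloc(sizeof(struct msg_header));
  }

  void free_header(struct msg_header *h) {
    h->next_free = free_list;             /* push onto the free list */
    free_list = h;
  }

(Threaded use would need a lock or per-thread free lists, which is much of the difficulty mentioned above.)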


Reorganize topc.c and memory.c (more modular)
This is an exercise in software design. Can TOP-C be made more object-oriented in spirit, while retaining the C language as the greatest common denominator of the C family of languages? (Consider, as an example, the design of GTK+.) One should create a separate utility for linked lists. Objects might be linked lists of slave states, linked lists of tasks, etc. As the code grows, can it be documented better by referring to standard design patterns? The goal here is to make the code as readable as possible. (Also see Memory Allocator.)

===============================

Middleware


Integration of Condor with TOP-C
Condor is widely used software with special benefits for setting up "computing farms". It has strong support for checkpointing (see also Poor Man's Checkpointing) and dynamic creation of processes (see also Dead Workers and New Workers). How can this be integrated as a lower layer under TOP-C, in order to provide the TOP-C programming model with these benefits?


RPC: Create communication layer using RPC for TOP-C
This can be used to allow new workers to automatically become available. (See Dead Workers and New Workers.)


XML/SOAP: Create communication layer using SOAP for TOP-C
SOAP is becoming more of a standard, with the rise of XML. Note that the Globus project also wants to use SOAP more heavily. The advantage is that the prevalence of the Web technology ensures that there will be a lot of tools to support SOAP on many platforms.


Distributed Shared Memory (DSM)
Distributed Shared Memory was long thought to be a simple programming model, because it provides a shared memory abstraction (similar to POSIX threads) while being built on top of a distributed memory architecture, such as a cluster. Explore the strengths and limitations of this paradigm by implementing a toy example. The idea is to use mmap and mprotect to change the protection bits on pages in RAM. Access to a protected shared page then causes a page fault, from which a user-level handler can recover. (See, for example, TreadMarks.)
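
A toy sketch of the mechanism on Linux follows; error checking is omitted, and a real DSM would fetch the page contents from the owning node inside the handler:

  #include <signal.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  static void on_fault(int sig, siginfo_t *si, void *ctx) {
    long pagesize = sysconf(_SC_PAGESIZE);
    void *page = (void *)((long)si->si_addr & ~(pagesize - 1));
    /* A real DSM would fetch the latest copy of the page here. */
    mprotect(page, pagesize, PROT_READ | PROT_WRITE);
  }

  int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    char *p = (char *)mmap(NULL, sysconf(_SC_PAGESIZE), PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    p[0] = 42;   /* faults; the handler unprotects; the write then retries */
    return 0;
  }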

===============================

Applications

It has been said about engineers that any engineer can design a solution, but only a good engineer can design an economical solution. In computer science, high level languages and then still higher level languages (e.g. perl) have been used to provide economical solutions to difficult programming problems. It is part of the thesis of this course that message-based parallel solutions are relatively low level, difficult to write, and difficult to debug. The purpose of most of these application-based projects is to demonstrate a task-based parallel solution. Of course, task-based solutions usually use messages at a lower layer. But the higher task-based abstraction hides the details of message passing, thus providing a more powerful tool.

In the case of parallelizing a very large program, this means that a task-based tool may provide the first successful parallelization of the program (as was the case for the parallelization of Geant4). In the case of a smaller application, one has the luxury of writing a new program. In this case, a task-based approach gives one the flexibility to easily experiment with many algorithms and thereby find an efficient algorithm. Once one has found an efficient algorithm, a direct message-based approach may still be more efficient (just as, once one has found an efficient C-based implementation, an assembly implementation may still be more efficient). However, even though assembly is more efficient than C, and C is more efficient than Java, few people would want to give up C and Java for the sake of assembly language.


Geant4

Geant4 (GEometry ANd Tracking) is a simulation package for particle-matter interaction developed at CERN, the place where the Web was invented. Although initially developed to assist in collider design, its use has spread to many areas of physics, engineering and biomedicine. Including the associated toolkits, Geant4 contains approximately 1,000,000 lines of C++ code. The Geant4 software was developed by RD44, a world-wide collaboration of over 100 scientists participating in more than 10 collider experiments in Europe, Russia, Japan, Canada and the United States. Many specialized working groups are responsible for the various domains of the toolkit.

Geant4 has already been parallelized with event-level parallelism using TOP-C. The Geant4 user community has additional requirements for parallelism, which are the basis for these projects.


Parallelization of a Geant4 Application for Dark Matter Detection
Dark Matter experiments are currently being developed using a novel liquid-xenon-based technique for detection of WIMPs (Weakly Interacting Massive Particles). In order to calibrate the effectiveness of such detectors, neutrons are used in the laboratory to mimic the interactions of candidate Dark Matter particles. This experimental work has been carried out at Imperial College in London, England, and has produced some interesting and partially unexplained results.

To understand the observed signals, it is necessary to simulate the neutron scattering within the laboratory and predict the expected interactions within the test detector. Tracking neutrons in a detector-like manner is extremely processor-intensive. In 50 ns, a pulse of 100 million neutrons is produced, which, once scattered around the room, produces about 10,000 events within the detector over a time-scale of about 1 ms. A sequential simulation of this experiment processes fewer than 200 neutrons per hour. Therefore, successfully understanding and simulating this experiment will require parallel resources.

The project will entail taking the existing sequential Geant4 code for the experimental set-up and adapting it to a parallel regime. If time permits, the simulation can be extended to define and advise on the experimental location of the neutron source and suitable collimation to maximise the effectiveness of the measurement, which will be repeated in the immediate future.

The following tar files contain some sample macros and the source code to be run within Geant4.

There is the possibility of including this work within an experimental R&D journal paper on the development of this new type of Dark Matter detector. In addition, a separate publication could demonstrate the use of Geant4 in a novel regime (low energy and neutron tracking) with application to a novel experiment: Dark Matter searches using liquid xenon. There is also the intention to present this work at a summer conference within the context of Dark Matter detector development and simulation.

This is joint work with Alex Howard of the High Energy Physics group at Imperial College in London, England. The description is by him.


Geant4: shared memory parallelism
TOP-C allows the same source code to be easily converted from distributed memory parallelism into shared memory parallelism. However, Geant4 uses a navigator to determine geometry, and the navigator is a singleton class. In shared memory, many tasks would try to simultaneously use this singleton class, leading to disaster. The solution is to modify Geant4 to make the Navigator a non-singleton class. How can one do this with little or no change to the Geant4 source code, but primarily by adding new classes? A solution should make few assumptions about Geant4, so as to easily adapt to new versions.


Geant4: track-level parallelism
In Geant4, an (external) event creates a particle. Each particle creates a track as part of its simulation. A track may end in the decay of the particle or in an interaction with matter, giving rise to one or more secondary tracks. The secondaries may give rise to their own secondaries. There is an event work loop, which processes each event in turn. As a single event is processed, there is a track work loop, which processes a track and then processes the secondary tracks that arise from previous tracks. Event-level parallelism for Geant4 has been done. Track-level parallelism (parallelization of the track work loop) remains to be done.


Geant4: Interactivity and PostponeOneEvent()
In the current parallelization of Geant4, user commands are read from a script by each process (master and worker processes). A more elegant parallelization would take user commands directly from the terminal. This means that user commands must be passed from the master to each worker process. Further, if the user command is to abort the current event or session, then the user command must be passed to all workers. The TOP-C "soft abort" facility is useful here.

In addition, PostponeOneEvent() in Geant4 has not yet been parallelized. In event-level parallelism, one may wish to suspend the simulation of an event until a later time, when more information will be available from other worker processes. However, this implies the conversion of previously trivial parallelism into non-trivial parallelism. Hence, the TOP-C features, such as the REDO action and the UpdateSharedData() callback may be required to handle this more sophisticated situation.


Other Applications


SPLASH-2 applications
Parallelize as many of the SPLASH-2 applications as possible using TOP-C. There are important lessons to be learned here about the limitations of the TOP-C model. This is very important for developing an ideal future parallel language that is easy to use. If enough coverage is provided, this can be turned into a publishable paper (typically by investing as much time after the course as was invested during the course).


Search for Steiner 5-designs (combinatorics)
This can be thought of as a problem in recreational mathematics, although there are important applications for statistics and other fields if a Steiner 5-design is discovered. All definitions are elementary and are understandable by a high school student.

The search is reduced by the appropriate theory (see below) to the problem of solving a Diophantine system of linear equations. The matrix of coefficients has entries 0 or 1, and the inhomogeneous part on the right-hand side is the all-1 vector. Any solution vector then also has only entries 0 or 1.

Since most entries are 0, a data structure is chosen that stores only the non-zero entries. So, for each row there is a stack of all column numbers where this row has an entry 1. The solutions of the search problem are sets of selected column numbers such that in each row exactly one of the selected columns occurs.

A second representation is available where for each column all row numbers are listed where the column has an entry 1.
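
A minimal sketch of the two sparse representations and the solution test (illustrative only; the full details are linked below):

  #include <vector>

  /* rows[r] lists the columns c with A[r][c] == 1;
     cols[c] lists the rows r with A[r][c] == 1. */
  std::vector<std::vector<int> > rows, cols;

  /* A set of selected columns solves the system iff every row
     contains exactly one selected column. */
  bool is_solution(const std::vector<bool> &selected) {
    for (size_t r = 0; r < rows.size(); r++) {
      int count = 0;
      for (size_t j = 0; j < rows[r].size(); j++)
        if (selected[rows[r][j]]) count++;
      if (count != 1) return false;
    }
    return true;
  }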

Options for publication of a paper include:
Designs, Codes and Cryptography; J. Comb. Designs; Discrete Math; and other journals.

The full details are given here. This work will be joint with Prof. Laue and his working group, Lehrstuhl II, at the U. of Bayreuth in Germany.


Network simulator
Network simulators are notorious for their irregular parallelism. How does one efficiently parallelize such an irregular application?


Neural Net
In the future, one can conceivably wish to employ neural networks with millions of nodes in each layer. In such cases, the slow speed of RAM is likely to dominate the running time. How can one structure this application to avoid this? (See also the related topics Bottlenecks due to slow RAM and Superlinear parallel speedup and slow RAM.)


Parallel simulated annealing
Design a new parallel algorithm for simulated annealing. The simulations done so far are mostly based on message passing and data parallelism. Can one design an efficient task-based algorithm? There are similar, related questions for branch-and-bound, and other search-style problems.

Simulated annealing can be thought of as a noisy hill-climbing algorithm, in which one follows a given gradient. This is a challenge to parallelize efficiently, since distinct worker processes will tend to climb different hills, and it is difficult to share such information in a way that provides a general speedup. The `soft abort' feature of TOP-C may be useful in this context. Previous implementations of parallel simulated annealing exist. Some useful sites include a simple overview, a general implementation in C++, a survey of various approaches, a journal paper on one successful approach, another such journal paper, and slides from a talk.


Shortest solutions for Rubik's cube
There are several approaches to finding either shortest or simply "short" solutions for arbitrary initial states of Rubik's cube. However, such approaches require enormous amounts of CPU time. Hence, parallel methods are called for.


Search in chess endgames
Chess endgames cannot always be "solved" (to determine who wins) when there are six or more chess pieces on the board. Part of the problem is storing all of the positions in a large distributed database on disk. (See Scalability of data in parallel computation.) How can this be managed?


Random numbers
Sequential simulations frequently use a single random number generator. What must a parallel simulation do in this context? Some desirable properties are:
    1.  Each worker process must draw from a distinct random number sequence.
    2.  Repeatability between parallel runs when using the same random seed.
    3.  Repeatability in comparing a parallel run with n slaves and a
        parallel run with only 1 slave, when using the same random seed.
    4.  Repeatability between the parallelized simulation and the
        original (unmodified) sequential simulation.
Solutions satisfying properties 1, 2 and 3 are relatively easy, though still not trivial, since distinct parallel runs may re-order the individual tasks, depending on the speed of the worker processes. If task 5 executes before task 4, can one still maintain repeatability of the results? One solution sends a new seed or random state to each worker along with each task (a sketch follows). However, is there a theoretical basis to guarantee that the use of multiple random seeds preserves the pseudo-randomness properties of a single generator? Also, is there an efficient way to achieve goal number 4?
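
A common approach to properties 1-3 is to derive each task's seed deterministically from the global seed and the task number, so that a task's random stream does not depend on which worker runs it or in what order. A minimal sketch; the mixing function is splitmix64, and the surrounding usage is illustrative:

  #include <stdint.h>

  /* Derive a per-task 64-bit seed from the global seed and task number
     (splitmix64 mixing constants). */
  uint64_t task_seed(uint64_t global_seed, uint64_t task_num) {
    uint64_t z = global_seed + task_num * 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
  }
  /* Each worker seeds a private generator with task_seed(seed, task_num)
     at the start of each task. */

Whether a family of streams derived this way provably retains the pseudo-randomness of a single generator is exactly the theoretical question raised above.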


Small Java version of TOP-C
Write a small Java version of TOP-C. This would allow TOP-C to use the Java sandbox: a PC owner could then safely volunteer a new worker process, without fear of danger to the local computer. Such a version could also take advantage of Java RMI for messages. The TOP-C syntax could be extended to allow for this.


Numerical analysis
Consider your favorite numerical problem (Gaussian elimination, differential equation solving, etc.) and design a new, task-based algorithm for solving it in parallel.

===============================

Gene Cooperman
College of Computer Science, 336 WVH
Northeastern University
Boston, MA 02115
e-mail: gene@ccs.neu.edu
Phone: (617) 373-8686
Fax: (617) 373-5121