Research of Gene Cooperman

This page will always be obsolete at any given point in time, but sometimes more obsolete than other times. It is very obsolete right now. I'm happy to correspond. Here are my refereed publications and some examples of my software. I last updated this page in November, 2011.

A brief description of my research follows.

A current theme in my High Performance Computing Laboratory at Northeastern University is adaptation of data structures and low-level software access algorithms to quickly changing technology. In the 90s, computers became faster. Now, we simply have more of them, and with the growth of heterogeneous computing, we have more types of them. A partially related thrust is to treat checkpoints of running programs as first-class objects (checkpoint images).

These research directions are summarized here:

User Space Distributed, Multi-Threaded Checkpointing (DMTCP): The DMTCP checkpoint-restart package employs a pure user-space approach. This enables DMTCP to be bundled with other major applications for distribution. Checkpointing to disk, or restart, takes place in seconds or less. DMTCP requires no modification of the kernel or of the application binary. It has been demonstrated on OpenMPI, SciPy (IPython), SCIRun, Java, bash, GCL, MATLAB, and so on. DMTCP is the most widely used user-space checkpointing package. (Some other checkpointing packages, such as BLCR, require a kernel module. They are more commonly used for batch queues. It is difficult to compare usage among batch queues versus user-space settings.)
Reversible Debugger: A new experimental version of DMTCP can now checkpoint debugging sessions. Using this, we built URDB in 2009, a reversible debugger for single-threaded programs. It can reverse execute code (going backwards in time). While URDB is freely available (GPL), it is now obsolete. We plan to soon release a beta version of FReD (Fast Reversible Debugger). It is both more robust (capable of reversing MySQL, Firefox, and Apache) and capable of debugging multi-threaded programs. Following a principle of orthogonality, FReD (and the planned debugging tools on top of it) operates with multiple debuggers. Using this, we have built the first reversible debuggers for: MATLAB, Python (pdb module), Perl (perl -d), and OpenMPI using gdb. This also provides a gateway both to program-based introspection and to speculative program execution.
Disk-Based Parallel Computation (data-intensive computing): Commodity computing is now seeing many cores, but the RAM is not growing in proportion. Our solution is to use the disk as an extension of RAM. The bandwidth of 50 local disks in a cluster is approximately the same as a single RAM subsystem. While this may solve the bandwidth problem of disk, the latency problem remains. Over five years, we have developed a series of applications that overcome this barrier. Development of such disk-parallel code is highly labor intensive. We are now working on a mini-language, Roomy, that reduces the software development and debugging time from person-months to person-days. The end user need only make minimal changes to a sequential program, and then link in the Roomy run-time library. Just as the Linda programming language provides coordinated access to a common tuple space, the Roomy language provides coordinated access to the disk storage resources of a computer cluster or SAN through a sequential API. A particular emphasis of this research is on a broad variety of search algorithms, with an eye to applications in formal verification and elsewhere. The demonstration that Rubik's cube can be solved in 26 moves or fewer using 8 terabytes of disk storage was a byproduct of this work that attracted popular attention.
Converting Distributed Memory Parallelism to Thread-Parallelism: This is a newer project. As the move to many-core computing provides less RAM per core (and for other reasons), it becomes desirable to migrate MPI or other distributed memory code to thread-parallel code. In particular, with the advent of many-core CPUs, large sequential codes must be converted to thread-parallel code with data sharing in order to avoid thrashing between the CPU cache and RAM. Source code transformations are used to segregate thread-private read-write data. In combination with copy-on-write (UNIX fork system call), nearly linear speedup is achieved. This has been tested so far on 24-core machines. An interesting byproduct of segregating the read-write thread-private data is that even for the sequential case (single thread), we sometimes observe a speedup. The methodology has been developed in cooperation with the Geant4 developers. Geant4 consists of about 750,000 lines of C++ code developed at CERN for simulation of particle-matter interaction. One of its applications is analyzing data from the LHC particle collider (particle accelerator) at CERN, which is about 8.5 kilometers in diameter.
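
As a rough illustration of the reversible-debugging item above: one general way to "go backwards in time" with checkpoints is to restore the most recent checkpoint at or before the target point and then re-execute forward. The sketch below shows only that idea on a deterministic toy machine. It is not the URDB or FReD implementation; the names ToyMachine, ReversibleSession, and checkpoint_interval are hypothetical, and an in-memory deep copy stands in for a real checkpoint image.

```python
import copy


class ToyMachine:
    """Deterministic toy program: each instruction adds a delta to a variable."""

    def __init__(self, program):
        self.program = program  # list of (variable, delta) instructions
        self.pc = 0             # index of the next instruction to execute
        self.vars = {}

    def step(self):
        var, delta = self.program[self.pc]
        self.vars[var] = self.vars.get(var, 0) + delta
        self.pc += 1


class ReversibleSession:
    """Hypothetical wrapper: periodic checkpoints plus re-execution."""

    def __init__(self, machine, checkpoint_interval=4):
        self.machine = machine
        self.interval = checkpoint_interval
        # A deep copy stands in for a checkpoint image (assumption).
        self.checkpoints = {0: copy.deepcopy(machine)}

    def step(self):
        self.machine.step()
        if self.machine.pc % self.interval == 0:
            self.checkpoints[self.machine.pc] = copy.deepcopy(self.machine)

    def reverse_step(self):
        target = self.machine.pc - 1
        if target < 0:
            raise RuntimeError("already at the start of execution")
        # Restore the latest checkpoint taken at or before the target ...
        base = max(pc for pc in self.checkpoints if pc <= target)
        self.machine = copy.deepcopy(self.checkpoints[base])
        # ... then re-execute forward until the target is reached.
        while self.machine.pc < target:
            self.machine.step()
```

The checkpoint interval trades memory for re-execution time: frequent checkpoints make each reverse step cheap, while sparse checkpoints require replaying more instructions. The sketch relies on the toy program being deterministic, which real programs generally are not without extra support.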
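
The copy-on-write effect of the UNIX fork system call, mentioned in the last item above, can be sketched in a few lines. After fork, each child shares the parent's memory pages until it writes to them, so a large read-only data set is not duplicated per worker. This is only a minimal sketch of that one ingredient, assuming a POSIX system; it does not show the source code transformations or the thread-parallel conversion, and parallel_sum and num_workers are hypothetical names. In CPython, reference-count updates do dirty some pages, so a flat array buffer is used here to keep the bulk of the data untouched.

```python
import array
import os
import struct


def parallel_sum(data, num_workers=4):
    """Sum `data` using forked workers that read the parent's memory."""
    chunk = (len(data) + num_workers - 1) // num_workers
    children = []
    for w in range(num_workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: `data` is shared with the parent via copy-on-write;
            # only reading it means its pages are never copied.
            try:
                os.close(read_fd)
                partial = sum(data[w * chunk:(w + 1) * chunk])
                os.write(write_fd, struct.pack("q", partial))
            finally:
                os._exit(0)  # never fall through into the parent's code
        os.close(write_fd)
        children.append((pid, read_fd))

    total = 0
    for pid, read_fd in children:
        total += struct.unpack("q", os.read(read_fd, 8))[0]
        os.close(read_fd)
        os.waitpid(pid, 0)
    return total


if __name__ == "__main__":
    big = array.array("q", range(1_000_000))
    print(parallel_sum(big))
```

Only the small partial results cross process boundaries through pipes; the input array itself is never serialized or copied explicitly.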

History/Background: I have a background from the 80s and 90s in computational algebra (especially computational group theory). This has served me well as a testbed for parallel computing. This work led to the TOP-C (Task Oriented Parallel C/C++) model of parallel computing. In a nutshell, it was always designed for commodity computing, and it emphasizes a task-oriented model with lazy updates of globally shared memory. This allows for good latency tolerance, while providing an exceptionally easy model for end-users to implement a generalization of task-oriented parallelism allowing for non-trivial parallelism. Some outgrowths of that work are my support for parallel GAP (Groups, Algorithms and Programming), parallel GCL (GNU Common Lisp), and ParGeant4 (Geant4 is a million line program developed at CERN and elsewhere, which is used to design and simulate experiments on the LHC, the largest collider in the world). My software page describes this software further.
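
To give a flavor of a task-oriented model with lazy updates of shared data, here is a small sequential simulation in the spirit of that description. It is a sketch under stated assumptions, not the TOP-C API: the function names (do_task, check_result, orbit), the action strings, and the batch-based simulation of workers are all hypothetical. The example computes the orbit of a point under a set of generators, a typical computational group theory task. Workers read a possibly stale snapshot of the shared set, and only the master updates it.

```python
from collections import deque


def do_task(point, shared_snapshot, generators):
    """Worker side: read-only use of a possibly stale copy of shared data."""
    images = {g(point) for g in generators}
    return images - shared_snapshot


def check_result(result, shared):
    """Master side: filter against current data and choose an action."""
    fresh = result - shared
    return ("UPDATE", fresh) if fresh else ("NO_ACTION", fresh)


def orbit(start, generators, num_workers=3):
    shared = {start}          # the globally shared data
    pending = deque([start])  # task inputs not yet processed
    while pending:
        count = min(num_workers, len(pending))
        batch = [pending.popleft() for _ in range(count)]
        # Every simulated worker in the batch sees the same snapshot,
        # which becomes stale as soon as the master applies an update.
        snapshot = frozenset(shared)
        results = [do_task(p, snapshot, generators) for p in batch]
        for result in results:
            action, fresh = check_result(result, shared)
            if action == "UPDATE":
                shared |= fresh        # lazy update, applied by the master
                pending.extend(fresh)  # new points become new tasks
    return shared


if __name__ == "__main__":
    print(sorted(orbit(1, [lambda x: (2 * x) % 7])))
```

Because a stale snapshot can only cause a worker to report points that are already known, the master's check filters duplicates and no task ever needs to be redone in this particular example. That tolerance of stale data is what lets such a model hide communication latency.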


Gene Cooperman
Khoury College of Computer Sciences, 336 WVH
Northeastern University
Boston, MA 02115
e-mail: gene@ccs.neu.edu
Phone: (617) 373-8686
Fax: (617) 373-5121