High Performance Computing Lab, Northeastern University

The High Performance Computing Laboratory at Northeastern University is led by Gene Cooperman. The Lab is part of the College of Computer and Information Science and is located at 370 West Village H. It currently includes five Ph.D. students. The Laboratory is pursuing three inter-related topics: parallelization tools, scientific/engineering applications, and balancing architectural bottlenecks.

Professor Cooperman has over 70 refereed publications and has been awarded 15 grants from the National Science Foundation. In addition to heading the High Performance Computing Lab, he is the director of the Institute for Complex Scientific Software, an interdisciplinary collaboration across five departments at Northeastern University.

Three current research directions are:

  1. Disk-Based Parallel Computation
  2. User-Space Distributed, Multi-Threaded Checkpointing
  3. Converting Distributed Memory Parallelism to Thread-Parallelism
They are described in further detail below.

===============================
  1. Disk-Based Parallel Computation: Commodity computers now have many cores, but RAM is not growing in proportion. Our solution is to use disk as an extension of RAM. The aggregate bandwidth of 50 local disks in a cluster is approximately the same as that of a single RAM subsystem. While this may solve the bandwidth problem of disk, the latency problem remains. Over five years, we have developed a series of applications that overcome this latency barrier.

    We are now working on general tools that others can use to quickly design and implement disk-based computations. A demonstration of the power of the disk-based approach was our result that Rubik's cube can be solved in 26 moves or fewer. This was done in 2.5 days on a 32-node cluster using 8 terabytes of distributed disk. Developing such disk-parallel code by hand is highly labor intensive. For an example of what general tools make possible, you are welcome to read some source code written in the Roomy language (a library-based extension of C/C++). The Roomy-based code requires only 271 lines and was written in less than one day. Although the source code appears to the end user as a short sequential program, it invokes the Roomy run-time library, which then employs multiple threads, MPI, and access to multiple files per computation node on behalf of the user.

    Currently, we are employing Roomy toward more serious efforts, such as formal verification. Many problems in formal verification are known to suffer from the state explosion problem, in which the number of states to be examined outgrows the available RAM.
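
To make the idea of disk-based state enumeration concrete, here is a minimal single-machine sketch in Python. It is not Roomy code and does not use the Roomy API; it illustrates one well-known technique, breadth-first search with delayed duplicate detection, in which newly generated states are not checked one by one (which would incur disk latency) but are sorted and then filtered against the visited set in a single streaming pass. The names `disk_bfs`, `subtract_sorted`, `write_states`, and `read_states` are hypothetical, the toy move set (swapping adjacent letters) merely stands in for a real puzzle or verification model, and the in-memory `sorted` call is a simplification of what would be an external sort in a genuinely disk-based program.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def neighbors(state: str) -> Iterator[str]:
    """Toy move set (an assumption): swap any two adjacent characters."""
    for i in range(len(state) - 1):
        yield state[:i] + state[i + 1] + state[i] + state[i + 2:]


def write_states(path: str, states: Iterable[str]) -> None:
    """Write states to a file, one per line, in the order given."""
    with open(path, "w") as f:
        for s in states:
            f.write(s + "\n")


def read_states(path: str) -> Iterator[str]:
    """Stream states back from a file without loading it all into memory."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")


def subtract_sorted(candidates: Iterable[str], seen: Iterable[str]) -> Iterator[str]:
    """Yield items of sorted, duplicate-free `candidates` that are absent from sorted `seen`."""
    seen_iter = iter(seen)
    current = next(seen_iter, None)
    for c in candidates:
        while current is not None and current < c:
            current = next(seen_iter, None)
        if current != c:
            yield c


def disk_bfs(start: str, workdir: str) -> List[int]:
    """Return the number of states at each breadth-first level, keeping levels on disk."""
    seen_path = os.path.join(workdir, "seen.txt")
    frontier_path = os.path.join(workdir, "level0.txt")
    write_states(seen_path, [start])
    write_states(frontier_path, [start])
    level_sizes: List[int] = []
    depth = 0
    while os.path.getsize(frontier_path) > 0:
        count = 0
        candidates = set()
        # Expand the whole level first; duplicate checks are delayed.
        for s in read_states(frontier_path):
            count += 1
            candidates.update(neighbors(s))
        level_sizes.append(count)
        depth += 1
        next_path = os.path.join(workdir, f"level{depth}.txt")
        # One streaming pass removes every state that was already visited.
        # Simplification: sorted() runs in memory; a real system would sort externally.
        write_states(next_path, subtract_sorted(sorted(candidates), read_states(seen_path)))
        # Merge the new level into the sorted visited file, again by streaming.
        merged_path = seen_path + ".tmp"
        write_states(merged_path, heapq.merge(read_states(seen_path), read_states(next_path)))
        os.replace(merged_path, seen_path)
        frontier_path = next_path
    return level_sizes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(disk_bfs("abcd", workdir))
```

For the four-letter start state, the level sizes sum to 24, the number of permutations of four letters. The design point is that every file is read and written sequentially, so the cost is governed by disk bandwidth rather than disk latency.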

  2. User-Space Distributed, Multi-Threaded Checkpointing: The user-space approach allows us to bundle the checkpointing capability with the application or with the computational facility, as opposed to kernel-space solutions, which (at least in binary form) are bound to particular versions of the kernel, and therefore to the computational facility. As one would expect of a user-space approach, no modification of the kernel or of the application binary is required. We have demonstrated that it works with OpenMPI, with MPICH-2, with SciPy (iPython), with the Java JVM, and with a variety of other applications. Our latest version is DMTCP, which is available at SourceForge under the GPL. The chart of the number of downloads (reproduced below) shows DMTCP to be in active use.

    Graph of Downloads

  3. Converting Distributed Memory Parallelism to Thread-Parallelism: This is a newer project. Because the move to many-core computing provides less RAM per core (among other reasons), it becomes desirable to migrate MPI or other distributed memory code to thread-parallel code. In particular, with the advent of many-core CPUs, large sequential codes must be converted to thread-parallel code with data sharing in order to avoid thrashing between the CPU cache and RAM. We are investigating to what extent some of this can be done semi-automatically for properly structured code.
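
The structural difference between the two styles can be sketched in a few lines of Python. This is only an illustration of the data-sharing idea, not the Lab's conversion tool: the function names `message_passing_style` and `shared_memory_style` are hypothetical, and both versions use Python threads so that the sketch stays self-contained. In the first version each worker receives a private copy of its chunk, as it would after a distributed-memory send; in the second, all workers read one shared table and receive only index bounds. Because of Python's global interpreter lock, the sketch shows the memory layout of the two styles, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def partial_sum(table: Sequence[int], start: int, stop: int) -> int:
    """Sum table[start:stop] without creating a copy of the slice."""
    total = 0
    for i in range(start, stop):
        total += table[i]
    return total


def message_passing_style(table: Sequence[int], workers: int) -> int:
    """Each worker gets its own copy of a chunk, mimicking a distributed-memory send."""
    chunk = max(1, (len(table) + workers - 1) // workers)
    # list(...) makes an explicit private copy per worker.
    chunks: List[List[int]] = [list(table[i:i + chunk]) for i in range(0, len(table), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda c: partial_sum(c, 0, len(c)), chunks))


def shared_memory_style(table: Sequence[int], workers: int) -> int:
    """All workers read the same table; only index bounds are handed out."""
    chunk = max(1, (len(table) + workers - 1) // workers)
    bounds: List[Tuple[int, int]] = [
        (i, min(i + chunk, len(table))) for i in range(0, len(table), chunk)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda b: partial_sum(table, b[0], b[1]), bounds))


if __name__ == "__main__":
    data = list(range(1000))
    print(message_passing_style(data, 4), shared_memory_style(data, 4))
```

Both functions return the same result; the shared version holds one copy of the table regardless of the number of workers, which is the property that matters when RAM per core is scarce.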