Systems Seminar
Thursdays, 1:30 (Fall, 2012)
Room: 366 WVH when available; in case of conflict, we will be in 166 WVH

The systems seminar is held on Thursdays at 1:30 in 366 WVH (if available) or else in 166 WVH during the Fall 2012 semester.

Systems is the study of how to organize computations in the presence of limited resources (which typically requires _system_ calls). In contrast to software engineering, systems begins with a problem for which the machine abstraction is deficient (where "machine" means the machine presented by the hardware/OS/compiler/run-time library), and one must find novel ways to improve the abstraction at the machine level.

The figure of merit in a community is sometimes called the taste of that community. The best definition of taste in systems that I have found is in the SOSP announcement:

A good paper will demonstrate that the authors:

• are attacking a significant problem,
• have devised an interesting, compelling solution,
• have demonstrated the practicality and benefits of the solution,
• have drawn appropriate conclusions,
• have clearly described what they have done, and
• have clearly articulated the advances beyond previous work.

We have discussed subjects ranging from parallel computing, to overcoming system bottlenecks due to RAM/network/disk, to reversible debugging via checkpointing, as well as a variety of scientific applications. This area of research typically involves a fine interplay among algorithms, software, and applications.

## Currently Scheduled Talks (Fall, 2012 Semester)

Nov. 29, 2012
Title: The Recent Port of DMTCP to Android by 0xlab
Speaker: Gene Cooperman (work of Jim Huang and Kito Cheng)
About three weeks ago, the 0xlab in Taiwan announced the port of DMTCP from Linux to Android. The 0xlab has particular expertise in Android. Jim Huang and Kito Cheng performed the work. Recall that DMTCP (Distributed MultiThreaded CheckPointing) allows one to checkpoint a running process to a file, and to later restart the process from that file. This was announced at the Embedded Linux Conference Europe 2012, with their slides here. "Hearing" about this work via a Google alert, we wrote to them. The 0xlab team, similarly to the DMTCP team, is committed to open source software, and the two groups are now discussing a future collaboration. One of the barriers to the port to Android was that Android uses Bionic libc, with subtle differences from Linux's use of GNU libc. Impressively, the 0xlab team had analyzed some subtle DMTCP internals without assistance from the DMTCP team. Simultaneously with the work of 0xlab, Michael Coppola had developed a port of MTCP (the lower layer of DMTCP) from Linux to Android --- work which we hope to hear discussed in a future seminar. This talk describes the sociology as well as the technical details of how ideas from two communities (Android and checkpoint-restart) can lead to powerful synergies.

Nov. 15, 2012
Title: Extending DMTCP to support "attach" for Running Processes
Speaker: Jiajun Cao
This talk discusses progress in extending DMTCP to support an "attach" mode similar to GDB's "attach" command. At this time, we are close to fully supporting this behavior for MTCP (single processes, multi-threaded). Later, the work will be extended to supporting DMTCP (multiple processes). This work is joint with Gene Cooperman and Dick Fickling.

Nov. 8, 2012
Title: DMTCP Plugins and Aspect-Oriented Programming: A Comparison
Speaker: David Lorenz, Kapil Arya, and Gene Cooperman
The standard definition of an aspect is a cross-cutting concern. A standard example occurs in multi-threaded programs. A thread may create a new thread, employ a mutex to synchronize between the two threads, and then join with the new thread on exit. These three pieces of the program are scattered across the program. An aspect-oriented program will move these three excerpts of code into a single piece of aspect-oriented code, which interacts with the rest of the program. DMTCP plugins have many of the same features as aspects. Kapil Arya discussed DMTCP plugins in the talk of Oct. 18. David Lorenz is an expert on aspects; he is a professor at the Open University (Israel). The three speakers will together lead a discussion aimed at better understanding the relationship between aspects and DMTCP plugins.
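
The cross-cutting nature of these three concerns can be seen in a short sketch (Python's threading module is used here purely for illustration; the talk itself concerns C/C++ programs):

```python
import threading

counter = 0
lock = threading.Lock()        # synchronization concern declared here...

def worker():
    global counter
    with lock:                 # ...used here, inside the worker thread...
        counter += 1

t = threading.Thread(target=worker)   # thread-creation concern
t.start()
with lock:                     # ...and used again in the parent thread
    counter += 1
t.join()                       # join-on-exit concern, in yet another place
print(counter)  # → 2
```

An aspect (or, analogously, a DMTCP plugin) would gather this create/lock/join bookkeeping into one module instead of leaving it scattered across the program.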

Oct. 25, 2012
Title: A Seminar on Kernel Exploitation
Speaker: Michael Coppola
Root is the ultimate goal of hackers. And what better way to obtain it than by exploiting the kernel itself? In this presentation, we'll discuss common bug classes in the Linux kernel and learn how to exploit kernel vulnerabilities through live demonstrations, turning that $ into #.

Oct. 18, 2012
Title: Extensible User-space Process Virtualization
Speaker: Kapil Arya
This talk will present some of the problems with existing checkpoint-restart solutions when dealing with various real-world artifacts. DMTCP, one of the most widely used user-space checkpoint-restart solutions, was recently updated to provide plugin support. In the new architecture, the core checkpoint-restart model assumes that all processes and threads are completely independent of each other and of the outside world. Target applications are insulated from the outside world through process virtualization. Communication with the external world is handled through plugins. The checkpoint-restart system grows organically by defining a plugin for each new abstraction (pseudo-terminal devices, SysV shared memory, message queues, semaphores, etc.).

Oct. 11, 2012
Title: Introductory Session
Speaker: Gene Cooperman
In this continuation of the previous week, students will continue to give impromptu short talks.

Oct. 4, 2012
Title: Introductory Session
Speaker: Gene Cooperman
This first seminar of the Fall will review the various systems projects underway in our group. There will be a special emphasis on introducing those projects for anyone new to our seminar.

Apr. 19, 2012
Title: Overview of Adapting Cilk to FReD
Speaker: Samaneh Kazemi
Cilk is a multithreaded language that extends C/C++ by introducing two keywords, "spawn" and "sync". A typical bug in multithreaded programs is a data race, which is hard to debug. Using FReD lets one execute Cilk commands in the backward direction, making debugging easier and faster. However, Cilk uses inline assembly instructions to implement read/write memory barriers. The first issue is to reimplement these using pthread locks, so that one can use FReD wrappers for replaying the barriers. A second issue for Cilk concerns its use of the rdtsc assembly instruction to get a timestamp. FReD's use of DMTCP checkpoint-restart will not virtualize rdtsc properly: since this instruction returns the number of clock cycles since the machine was booted, its value will not be consistent across a restart. A second approach is to replace rdtsc with the gettimeofday() function. With Cilk's source code modified along these lines, reverse-step and reverse-next work successfully for some simple Cilk examples. Current work concentrates on improving FReD so that reverse-watch and other commands can be applied to more sophisticated programs.

Apr. 12, 2012
Title: Checkpointing a Virtual Machine and Its Implications
Speaker: Rohan Garg
Taking snapshots of the state of an operating system (OS) is one of the de-facto features of most available virtual machine monitors. In this talk, we present the ability to checkpoint and restore the state of a guest operating system running in a virtual machine monitor. This gives us the benefit of an almost "real-time" and transparent record-replay log for a complete OS and any processes within it. This can potentially have a significant impact on techniques in which virtual machines are employed for malware analysis and debugging. We extend DMTCP, a Linux-based tool for transparent checkpoint-restart, to enable checkpointing of Qemu running in user space. Evaluations with checkpointing of different Linux distributions and of Microsoft Windows XP, each at different states of operation, show promising results. Future work to extend this functionality for malware analysis is also presented.

Mar. 15, 2012
Title: Migrating Parallel Algorithms to Hybrid CPU-GPU Clusters
Speaker: Vlad Slavici and Raghu Varier
For abstract, see talk below from Feb. 9, 2012.

Mar. 7, 2012 (2:30 p.m., 366 WVH) (unusual time)
Title: Migrating Parallel Algorithms to Hybrid CPU-GPU Clusters
Speaker: Vlad Slavici and Raghu Varier
For abstract, see talk below from Feb. 9, 2012.

Mar. 1, 2012
Title: DMTCP, Checkpointing, and the ARM CPU
Speaker: Gene Cooperman
The ARM CPU is becoming widely used for general-purpose Linux-based computing. DMTCP, the widely used Linux package for transparent checkpoint-restart, has recently been ported from x86/x86-64 to ARM. The port was made easier because DMTCP stays close to the POSIX API. This explicit port provides insight into which few parts of DMTCP are CPU-architecture-specific.

Feb. 23, 2012
Title: User-Space Process Virtualization and DMTCP
Speaker: Kapil Arya
Virtualization in the context of virtual machines (VMs) is a well-known and proven technology. The virtual machine approach is full-featured, but it is also relatively heavyweight, requiring large storage and checkpoint times on the scale of a minute. For many applications, a lightweight approach to virtualization is desired for better performance and finer-grained control. In this talk we discuss lightweight process-level virtualization, along with the architecture needed to support this kind of virtualization and an implementation, DMTCP.

Feb. 16, 2012
Title: Data Quality and Query Cost in Pervasive Sensing Systems
Speaker: David Yates (Bentley University)
This research is motivated by large-scale pervasive sensing applications. We examine the benefits and costs of caching data for such applications. We propose and evaluate several approaches to querying for, and then caching, data in sensor field data servers. We show that for some application requirements (i.e., when delay drives data quality), policies that emulate cache hits by computing and returning approximate values for sensor data yield a simultaneous quality improvement and cost saving. This win-win arises because, when system delay is sufficiently important, the benefit to both query cost and data quality achieved by using approximate values outweighs the negative impact on quality due to the approximation. In contrast, when data accuracy drives quality, a linear trade-off between query cost and data quality emerges. We also identify caching and lookup policies for which the sensor field query rate is bounded when servicing an arbitrary workload of user queries. This upper bound is achieved by having multiple user queries share the cost of a single sensor field query. Finally, we demonstrate that our results are robust to the manner in which the monitored environment changes, using models for two different sensing systems.
Bio: David is an Associate Professor of Computer Information Systems at Bentley University. David's research areas include computer networking, data communications, sensor networks, embedded systems, operating systems, and computer architecture. With various colleagues, he holds several U.S. patents for processes and systems related to computer networking, content management, and mobile computing. David earned a PhD and MSc from the University of Massachusetts. He also holds a BSc from Tufts University.

Feb. 9, 2012
Title: Migrating Parallel Algorithms to Hybrid CPU-GPU Clusters
Speaker: Vlad Slavici
Hybrid CPU-GPU clusters are the fastest-growing technology in supercomputing today. It is expected that in a few months the Titan CPU-GPU hybrid at Oak Ridge National Laboratory (ORNL) will be the world's fastest supercomputer (~20 PetaFLOPS). The main reasons for the fast adoption of GPUs in HPC are the high 'performance to power' and 'performance to cost' ratios. We present a programming model that allows efficient migration of compute-intensive parallel algorithms from CPU clusters to hybrid CPU-GPU clusters by increasing the flexibility of data and control flow, and by allowing the runtime environment to adjust the flow depending on the target architecture (in this case GPUs). The programming model is demonstrated on MADNESS, a high-performance scientific framework. The challenges behind migrating legacy algorithms to the emerging CPU-GPU architecture will be described, together with current and proposed solutions. This work is joint with Raghu Varier.

Feb. 2, 2012
Title: No talk; see the systems talk by Tony Skjellum at a similar time, below:

Gibraltar: GPU-Accelerated Software RAID for HPC
Anthony Skjellum, University of Alabama at Birmingham
Date: Thursday, Feb. 2
Time: 2:30 PM
Location: 442 Dana, ECE Dept.
In this talk, we present the jointly developed UAB-Sandia Gibraltar library, which commoditizes RAID by first choosing software RAID, then accelerating that RAID in middleware using NVIDIA GPUs. The ability to compute parity stripes at full rate with 2, 3, 4, or more parity disks allows storage to remain survivable for long periods of time, and in particular can address triple-disk failures and batch-correlated failures. Gibraltar RAID is discussed, as are the error rates that motivate this work. We are continuing to develop and extend this technology, which was initially designed to serve blocks via iSCSI and iSER. Current and future plans, including local striping for multi-cloud-vendor storage, are mentioned. Current estimates are that practical petabytes of storage can be constructed for under $200,000, including a large amount of excess computational power, to enable "computation close to the storage." The general area of middleware acceleration through GP-GPU computing is also discussed.

Jan. 26, 2012
Title: State of the Lab
Place: Probably 370 WVH; 164 WVH if 370 WVH is occupied
Speaker: Gene Cooperman
This talk will present a state-of-the-lab description about where we are, and where we are going. As always, this is a public seminar, and everybody is welcome.

Dec. 6, 2011
Title: Source-Level Transformation of Legacy Sequential Program into Scalable Thread-Parallel Software
Speaker: Xin Dong
This dissertation will focus on code transformation for legacy sequential programs toward thread-level parallelism on many-core computers. It highlights a semi-automatic method: Scalable Task-Oriented Code Transformation (STOCT). STOCT introduces thread parallelism for sequential code via a series of code transformation steps. STOCT's goal is: Given a large C/C++ software program, transform the source code to replace several independent copies of a sequential process with an equivalent single process consisting of several threads. This takes advantage of many-core computers in a memory-efficient and scalable manner.

The STOCT methodology is implemented in the parallelization work for three widely varying applications: (i) a simulation toolkit Geant4; (ii) a linear system solver based on ILU(k) preconditioning; and (iii) a large-scale data analysis tool AliRoot. STOCT bridges the gap from sequential programming to concurrent programming by decomposing the parallelization into four subgoals: to exert more parallelism, to share more data, to guarantee correctness, and to achieve scalability.

STOCT has the following features: (i) it pursues thread safety through maximal parallelism, in which only a minimal shared data set is allowed; (ii) it reduces the memory footprint by sharing user-level data whose value no longer changes after initialization; (iii) it guarantees runtime correctness, a weaker guarantee that is nevertheless strong enough for production runs; and (iv) it achieves scalability by eliminating two serious performance bottlenecks specific to shared-memory computation: memory allocation/deallocation and cache coherence misses.
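
Feature (ii) can be illustrated with a toy sketch (an illustration only; the actual STOCT transformations target large C/C++ codes): a read-only table initialized once and shared by all threads, rather than duplicated in several independent processes:

```python
import threading

# Hypothetical read-only data: initialized once, never written afterwards,
# so a single copy can be shared by every worker thread.
TABLE = [i * i for i in range(100_000)]

results = [0] * 4

def worker(i):
    results[i] = TABLE[i]      # all threads read the same shared table

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # → [0, 1, 4, 9]
```

Running N such threads in one process keeps one copy of TABLE in memory, whereas N independent processes would each hold their own copy.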

Nov. 29, 2011
Title: Temporal Meta-Programming: Treating Time as a Spatial Dimension
Speaker: Ana-Maria Visan
Reversible debuggers have existed since the early 1970s. However, they are not widely used, with the possible exception of GDB, and even GDB's reverse execution is practical for only a few seconds of a program's run. We hypothesize that the lack of success of reversible debuggers is due to the fact that executing backwards in search of an underlying bug often requires the manual execution of thousands of reverse commands, and is therefore often impractical.

The challenges of building a practical reversible debugger will be discussed, first for the simpler case of a single-threaded application and then for multithreaded applications. Reversibly debugging multithreaded applications on today's multi-core hardware is a daunting task due to the lack of determinism. Two reversible debuggers, URDB and FReD, will demonstrate the feasibility of our approach.

We also propose a new generation of tools built on top of a reversible debugging platform. These tools will automate the temporal search for the earlier underlying bug. One tool, reverse expression watchpoint, will be emphasized as a particularly useful key to this automation.

Nov. 22, 2011
Title: A Programming Model for hybrid CPU-GPU Clusters
In the past few years, Graphics Processing Units (GPUs) have become increasingly present in the High-Performance Computing (HPC) community. Three of the Top 10 supercomputers revealed last week at Supercomputing (SC11) are hybrid CPU-GPU clusters (the highest-ranked at #2), whereas the top spot is retained by K (Japan), with a measured performance of ~10 PFLOPS on Linpack. It is expected that by next year the top spot will be held by a CPU-GPU hybrid (the upcoming Titan from Oak Ridge National Laboratory (ORNL), which will use NVIDIA Tesla GPUs of the Fermi generation, is expected to peak at 20 PFLOPS). The main reasons for the fast adoption of GPUs in HPC are the high 'performance to power' and 'performance to cost' ratios.

However, the migration to hybrid CPU-GPU clusters brings with it a significant challenge: both new and legacy software and applications have to be (re)written to comply with GPU programming models.

This talk will present a general programming model for hybrid CPU-GPU clusters that is currently being employed by the speaker and Raghu Varier (ECE, Northeastern U.) to efficiently adapt legacy code in MADNESS (a high-performance scientific library and framework) for CPU-GPU clusters.

Nov. 15, 2011 (10:30 a.m.: note unusual time)
Title: Deferred in favor of external speaker from Symantec
Speaker: Marc Dacier, Symantec (host: Engin Kirda)
There is a talk at 10:30 a.m., hosted by Engin Kirda. This talk is not part of the systems seminar series, but because its topic is of close interest to the systems seminar, there will be no separate systems seminar that week. The title and abstract are being forwarded to the systems seminar mailing list.

Nov. 8, 2011 (note: may start a little after 1:30)
Title: Green HPC: A Systems Approach to Energy Efficient Datacenters
Speaker: Kurt Keville (MIT)
Green HPC is the new standard for High Performance Computing (HPC). This has now become the primary interest among HPC hardware researchers because of a renewed emphasis on Total Cost of Ownership (TCO) and the pursuit of higher performance. In this seminar, we analyze the problem holistically and describe best practice fixes to solve the problems of energy-inefficient HPC.

Nov. 1, 2011
Title: Feedback on FReD
Speaker: Gene Cooperman
As a result of presenting FReD at Indiana University and Boston University, the speaker received a certain amount of feedback. This talk will include an abbreviated presentation of the FReD talk that was given, along with some examples of useful feedback and potential collaborations. As time permits, we will also jointly review the current status of the FReD project.

Oct. 25, 2011
Title: Research Exchange with Mathematics Department
Speaker: Gene Cooperman, Alex Suciu, Jerzy Weyman
Professors and students will present computational research projects from each of the two departments (Computer Science and Mathematics). Professors Suciu and Weyman and their students will be speaking from the Mathematics Department. Each of the students in the High Performance Computing group should also be prepared to give a short, informal presentation.

Oct. 11, 2011 (2:30 p.m. -- NOTE UNUSUAL TIME)
Title: URDB: A Universal Reversible Debugger Based on Decomposing Debugging Histories
Speaker: Ana-Maria Visan
Reversible debuggers have existed since the early 1970s. A novel approach, URDB (Universal Reversible DeBugger), is presented, based on checkpoint, restart, re-execution, and decomposition of debugging histories. URDB adds reversibility to a debugger while still placing the end user within the familiar environment of their preferred debugger. The URDB software layer currently includes modes that understand the syntax of four debuggers: GDB for C/C++/Java/Fortran, Python (pdb), MATLAB, and Perl (perl -d). It does so by adding a thin URDB software layer on top of the DMTCP checkpoint-restart package (including support for checkpointing processes under the Linux ptrace system call). URDB passes native debugging commands between the end user and the underlying debugging session. URDB models the four common debugging primitives that form the basis of most debuggers: step, next, continue, break. For example, given a debugging history of the form [step, next, step], URDB's reverse-step produces a new history, [step, next] (this is equivalent to an undo command). Further, subtle algorithms are described for the more interesting cases that are not undo commands (e.g. [continue, next], followed by a reverse-step).
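
The undo case of reverse-step can be modeled in a few lines (a toy model: the real URDB re-executes the program from a checkpoint, which this sketch does not attempt):

```python
def reverse_step(history):
    """Toy model of URDB's reverse-step over a debugging history.

    If the last primitive is a plain 'step' or 'next', reverse-step is
    just an undo: drop the last primitive and re-execute the shortened
    history from a checkpoint.  (The non-undo cases, e.g. a history
    ending in 'continue', require decomposing that primitive into
    finer-grained steps and are omitted here.)
    """
    if history and history[-1] in ("step", "next"):
        return history[:-1]
    raise NotImplementedError("non-undo case: decompose the last primitive")

print(reverse_step(["step", "next", "step"]))  # → ['step', 'next']
```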

Oct. 4, 2011
Title: State of the High Performance Computing Laboratory
Speaker: Gene Cooperman
This talk will be an overview of the activities of the High Performance Computing Laboratory. Items of interest will include:
• ongoing work with FReD (Fast Reversible Debugger);
• collaboration with Oak Ridge National Laboratory (Madness software);
• cooperation with NeuroDebian (easy checkpointing built into NeuroDebian for thousands of users);
• mature multi-threaded Geant4MT (plans for imminent distribution);
• checkpointing of Cilk using DMTCP (anyone interested?);
• checkpointing of Qemu using DMTCP (anyone interested?);
• issues in making Condor work with DMTCP and compiled Matlab;
• the status of the DMTCP module system (now robust, record-replay module, many new module events defined);
• creating the missing DMTCP wrappers (epoll, inotify, and eventfd);
• DMTCP and OpenMP (a more robust tid virtualization);
• DMTCP-based process migration between Linux kernel versions (subtle issues; strong interest at Cadence corporation (anyone interested?));
• other opportunities (DMTCP "attach", based on a new ptrace feature allowing injection of new threads; DMTCP and checkpointing OpenGL 3-D graphics);
• DMTCP and the distros (DMTCP in Debian testing; DMTCP in OpenSuseFactory; better packaging as candidate for Fedora);
• issues of DMTCP checkpointing and supercomputing (IBM Blue Gene, upcoming Cray Titan at ORNL).
Issues concerned with FReD include: making memory-accurate record-replay robust (now working); making FReD work with Firefox and Apache (in addition to MySQL); and extending the reverse-watch algorithm to cases where an auxiliary thread (not the primary thread) causes the main expression to change. Issues concerned with the Madness software include: a new abstraction for "Big Data" for Madness; checkpointing over Infiniband; and reversibility for real-world software (Madness).

Sept. 13, 2011
Title: Compression of Program State Checkpoints
Speaker: Artem Y. Polyakov
The complexity of modern distributed systems raises new issues, and fault tolerance is one of them. Checkpointing of distributed program state is one of the main approaches to achieving fault tolerance. The major drawback of this technique is the heavy I/O load that occurs during creation of a checkpoint. This work addresses that issue through compression, using both delta-compression and compression by traditional programs (e.g., gzip). Results and a comparison with existing approaches are given.
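
Delta-compression of checkpoint images can be sketched as follows (an illustration of the general technique only; the byte-wise XOR delta and zlib compression here are assumptions, not Polyakov's exact implementation):

```python
import zlib

def delta(prev: bytes, cur: bytes) -> bytes:
    # Byte-wise XOR against the previous checkpoint image: pages that did
    # not change become runs of zero bytes, which compress very well.
    return bytes(a ^ b for a, b in zip(prev, cur))

prev = b"A" * 4096 + b"B" * 4096          # previous checkpoint image
cur  = b"A" * 4096 + b"C" * 4096          # only the second "page" changed

compressed = zlib.compress(delta(prev, cur))

# Restoring: decompress the delta and XOR it back onto the previous image.
restored = bytes(a ^ b for a, b in zip(prev, zlib.decompress(compressed)))
print(restored == cur)  # → True
```

Only the compressed delta needs to be written to disk, reducing the I/O load of each incremental checkpoint.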

Apr. 21, 2011
Title: Computational Kernels for Space-Limited Applications: Computing with Parallel Secondary Storage

Apr. 14, 2011
Title: Solving the Double Swap Problem: How Multiple Virtual Machines Can Share the RAM Better
Speaker: Kapil Arya
In a virtual machine in which a guest OS is running applications, a well-known problem is that of double-swapping. Given ten virtual machines on a host, most customers will not provision the host with ten times as much physical memory. So, in order to accommodate more and more virtual machines on a single host, most modern hypervisors overcommit the available RAM. They virtualize the physical RAM to make each virtual machine believe that it has the full amount of physical RAM. Two standard techniques for accommodating this are page sharing and ballooning.

But when the memory runs really low, the hypervisor has to fall back on transparently swapping (a.k.a. paging) out to disk the virtual machines' memory. However, the guest OS, unaware of hypervisor swapping, may overcommit its own physical memory and resort to swapping out that same memory. Now the hypervisor needs to swap the memory back in to allow the guest OS to swap it out. This results in a double swap.

In this talk, I will present the work I did under the supervision of Alex Garthwaite at VMware over the last three summers: finding the first fully working solution to the double-swap problem that does not require para-virtualization.

Apr. 7, 2011
Title: So You Replayed your Threads and Now They're Hanging --- What Next?
Speaker: Xin Dong
The recent work on a reversible debugger ("FReD: A Fast Reversible Debugger", by Visan et al.) is able to record most system calls and synchronization operations during the execution of multi-threaded programs and to replay them while enforcing the original order of execution of those events. This leads to a style of weak determinism that suffices to facilitate debugging and testing for the majority of parallel programs. For example, the previous work demonstrated deterministic replay for 10 of the 12 programs of the SPLASH-2 test suite, and found that an assumption of weak determinism sufficed to replay 8 of the 12 programs. However, some parallel computations (such as the remaining 4 SPLASH-2 programs) include intentional data races, for which FReD may fail to reproduce the recorded execution.

This talk illustrates some of the benign synchronization patterns that nevertheless lead to replay failure for the remaining 4 programs. This work also discusses where a programmer can enforce replay determinism -- in some cases by adding additional locks to the source code. The modified code led to successful replay of 3-1/2 of the remaining 4 programs of the SPLASH-2 test suite.
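
The record side of such replay can be sketched by a wrapper that logs the order of lock acquisitions (a Python illustration of the idea; FReD's real wrappers interpose on the pthread calls of C/C++ programs):

```python
import threading

class RecordingLock:
    """Wrapper that records which thread acquired the lock, in order."""
    def __init__(self, log):
        self._lock = threading.Lock()
        self._log = log
    def __enter__(self):
        self._lock.acquire()
        self._log.append(threading.current_thread().name)
        return self
    def __exit__(self, *exc):
        self._lock.release()

log = []
lk = RecordingLock(log)

def worker():
    with lk:
        pass

t = threading.Thread(target=worker, name="t1")
t.start()
t.join()          # t1 acquires (and releases) before the main thread does
with lk:
    pass
print(log)  # → ['t1', 'MainThread']
```

On replay, the same wrapper would consult this log and block each thread until its recorded turn comes up, enforcing the original acquisition order.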

Mar. 10, 2011
Title: Trip Report: Portland State University
Speaker: Gene Cooperman
Portland State University is one of three major universities in Oregon. They have on their faculty a Turing award winner (Ivan Sutherland, technical head of the Asynchronous Research Center), and two ACM Fellows, David Maier and Bryant York. They also have a large group of programming language researchers interested in Haskell (many of whom came from the old Oregon Graduate Institute, OGI). I spoke there on FReD (Fast Reversible Debugger). This talk will review some of the discussions and insights from the trip.

Mar. 3, 2011
Title: Caracal: Dynamic Translation of Runtime Environments for GPUs
Speaker: Rodrigo Dominguez
Graphics Processing Units have become the platform of choice for accelerating a large range of data-parallel and task-parallel applications. Both AMD and NVIDIA have developed GPU implementations targeted at the high-performance computing market. The rapid adoption of GPU computing has been greatly aided by the introduction of high-level programming environments such as NVIDIA's CUDA C and Khronos's OpenCL. Given that CUDA C has been on the market for a number of years, a large number of applications have been developed in CUDA by developers in the HPC community.

In this presentation we describe Caracal, our implementation of a dynamic translation framework that allows CUDA C programs to run on alternative GPU and CPU platforms. In this presentation we target the AMD Evergreen family of GPUs. We discuss the challenges of compatibility and correctness faced by the translator using specific examples. We also compare the quality of the code generated by our framework with that produced by the AMD OpenCL library. Our dynamically translated code performs comparably to the OpenCL library and provides a vehicle for evaluating compiler optimizations in the future.

Feb. 17, 2011
Title: New Ideas for the Roomy-Based Binary Decision Diagram Package
The Roomy-based binary decision diagram (BDD) package has demonstrated that operations on very large BDDs can be efficient in a parallel external-memory environment. Operations like apply, SAT-count, and any-SAT are already implemented in the package. However, previous attempts at developing efficient programs for reachable sets or for dynamic variable reordering on very large BDDs resulted in algorithms that were too slow for typical use cases. This talk will briefly review these previous attempts and then describe new ideas for algorithms which could make these two operations efficient for parallel disk-based computing.

Feb. 10, 2011
Title: Checkpointing OpenMPI with MTCP
Speaker: Alex Brick
OpenMPI is a message-passing infrastructure that is used by many distributed and high-performance computations. However, in addition to communicating between processes and nodes in a computation, it provides other facilities, including the ability of computations to be checkpointed. This talk will discuss how OpenMPI handles checkpointing currently, and what has been done in order to allow OpenMPI to use MTCP transparently internally. MTCP is the single-process component of the larger DMTCP software package. The talk will also include a discussion of the problems that were encountered in interfacing with the OpenMPI system and some of the design questions that it has raised for the future of MTCP.

Feb. 2, 2011
Title: Measuring and Predicting SSD Performance
Speaker: Peter Desnoyers
The talk presents the architecture of current Flash SSDs (NAND), along with common algorithms for the Flash Translation Layer (FTL) and opportunities for improvements through an O/S-based Flash filesystem.

Jan. 27, 2011
Title: Trip Report: Oak Ridge National Laboratory
Speaker: Gene Cooperman
Oak Ridge National Laboratory is a national research laboratory that receives much of its funding from the U.S. Department of Energy. It is a center of high-end computational science in areas ranging from climate modelling to neutron science to bioinformatics. I recently gave a talk there on FReD (Fast Reversible Debugger), and also discussed the needs of HPC for checkpointing. Most of our work at Northeastern to date has concentrated on low-end and medium computational science. The needs of high-end computing for better debugging support are at least as large, but are reflected through different technologies. For example, at the extreme high-end, one may prefer application-level checkpointing of only the needed data, and tricks like narrowing (converting double precision to single precision) to reduce disk bandwidth needs. To what extent could nonvolatile semiconductor memory ease the checkpointing problem for exascale computing? In addition, GPU computing is growing fast. How does one reversibly debug (or even just checkpoint) a GPU computation? What are the high-end applications needing large intermediate storage (e.g., needing Roomy)?

Jan. 20, 2011
Title: Algorithms for Large Finite State Automata
Finite state automata (FSA) are ubiquitous in computer science. Network packet filtering, pattern matching, speech recognition and lexical analysis are just a few areas in which FSA are used. FSA are so widespread because they express regular expressions/regular languages. The fact that the class of deterministic finite state automata (DFA) and the class of non-deterministic finite state automata (NFA) are equivalent is part of the foundation of the theory of computation. Converting an NFA to a DFA (by the well-known subset construction) leads to a combinatorial explosion in the number of states. This talk will focus on the case when the combinatorial explosion of states cannot be kept in RAM. Roomy was used to store the states on the aggregate disks of a commodity cluster. The resulting DFA was then minimized by a forward refinement technique adapted for disk. The technique is based on RAM-based parallel algorithms from the 1990s. The parallel disk-based computations extended previous RAM-based results by four new cases, and provided strong experimental evidence for a new conjecture about a series of theoretical sorting problems expressed as token-passing networks.
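The subset construction mentioned above can be sketched in a few lines. This is an illustrative in-RAM toy (the talk's point is precisely that real instances do not fit in RAM and need Roomy's parallel disks); the NFA and all names below are invented for the example.

```python
from collections import deque

def nfa_to_dfa(nfa, start, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states.

    `nfa` maps (state, symbol) -> set of successor NFA states.
    In the worst case the DFA has 2^n states for an n-state NFA,
    which is the combinatorial explosion discussed above.
    """
    start_set = frozenset([start])
    dfa = {}                      # (frozenset, symbol) -> frozenset
    frontier = deque([start_set])
    seen = {start_set}
    while frontier:
        S = frontier.popleft()
        for a in alphabet:
            # union of the NFA successors of every state in S
            T = frozenset(q2 for q in S for q2 in nfa.get((q, a), ()))
            dfa[(S, a)] = T
            if T not in seen:
                seen.add(T)
                frontier.append(T)
    return dfa, seen

# Toy NFA over {0,1} accepting strings whose second-to-last symbol is 1:
nfa = {
    ('p', '0'): {'p'}, ('p', '1'): {'p', 'q'},
    ('q', '0'): {'r'}, ('q', '1'): {'r'},
}
dfa, states = nfa_to_dfa(nfa, 'p', '01')   # yields a 4-state DFA
```

On disk, `seen` and `frontier` become Roomy data structures rather than hash sets, and membership tests are batched.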

Jan. 13, 2011
Title: State of the High Performance Computing Lab
Speaker: Gene Cooperman
As we start a new calendar year, this seems to be a good time to take stock of where we are, and the roadmap going forward. I will lead a discussion of the status and future roadmap for DMTCP / Roomy / Geant4MT (multi-threaded), and my informal plan is for Peter Desnoyers to similarly lead a discussion for those projects that he is leading.

Dec. 9, 2010
Title: Roomy: A New Approach to Parallel Disk-based Computation
Speaker: Daniel Kunkle
Disk-based computation provides a major new application of disks in addition to the three traditional uses: file systems, databases, and virtual memory. Disk-based computation uses the many disks already available in a computational cluster as main memory. In doing so, disk is elevated to a level normally reserved for RAM, and applications are provided with several orders of magnitude more working space for the same price.

Unfortunately, developing efficient parallel disk-based applications is often a difficult task. To solve this problem, this talk introduces Roomy, a new programming language extension and open source library for writing parallel disk-based applications.

The core of this research is in developing a library of general algorithms and larger applications on top of Roomy. Applications considered include: a proof that 26 moves suffice for Rubik's Cube; general breadth-first search algorithms; and a package for very large binary decision diagrams.

In developing these applications, the two central questions addressed are:

1. What is the class of applications for which parallel disk-based computing is practical?
2. How can existing sequential algorithms and software be adapted to take advantage of parallel disk-based computing?

Dec. 2, 2010
Title: Using Disks-As-RAM Systems for Pattern Search?
Speaker: Mirek Riedewald
We will discuss the properties of a class of pattern search techniques, focusing on data size and general access patterns. These techniques first train data mining models from a given large data set, then go through multiple iterations of creation and ranking of huge collections of model summaries. The current implementation runs on the Hadoop MapReduce system and our goal is to explore if disks-as-RAM approaches like Roomy would be a good fit as well.

Nov. 4, 2010
Title: Functionality in the New DMTCP 1.2.0 Release
Speaker: Kapil Arya
DMTCP is Distributed MultiThreaded CheckPointing. This release is a semi-major release: from DMTCP 1.1.9 to 1.2.0. As part of the change from 1.1 to 1.2, we will include support for the well-known GNU 'screen', which is commonly used as a framework for all of the processes of a single programming session. It will continue to checkpoint such software as OpenMPI, Matlab, etc. In order to complete this work, the design of many of the subsystems of DMTCP had to be reviewed. As one outcome, we now envision for the 1.2.x series a new model of connection objects --- with a single unique connection id for each connection. Many wrappers around socket calls will be replaced by lazy creation of connection objects at the time of a fork. (Fork and dup are the primary places where a program can create shared file or socket descriptors.)

Oct. 27, 2010
Title: A Novel Approach to Prime Factorization (tentative)
Speaker: Bryant York
The originally scheduled talk by Kapil Arya was delayed one week due to the surprise visit of Prof. Bryant York of Portland State University. The title above is our own, but hopefully it fits the spirit of the talk.

Oct. 20, 2010
Title: Stable Parallel ILU(k) Preconditioning for Iterative Linear Solvers
Speaker: Xin Dong
Level-based incomplete LU factorization (ILU(k)) is a preconditioner used for many stable iterative linear solvers. Its parallelization is interesting for two reasons: domain-decomposition algorithms dominate even though they are unstable; and stable parallel ILU(k) preconditioners have not been well explored since their appearance in the 1990s. Current parallel implementations can succeed with a small number of CPU cores, and then fail when given more cores to work with. Our work targets this problem. Even though the problem seems trivial at first glance, the line of work continued off and on over five years. In this research, we take advantage of Task-Oriented Parallel C (TOP-C) and produce the first efficiently parallelized ILU(k) preconditioner that maintains the stability property.

The new algorithm is called TPILU(k). Normally, TPILU(k) returns exactly the same preconditioned matrix as the original sequential ILU(k) preconditioner would have returned, thus directly guaranteeing stability. As an optimization, TPILU(k) attempts to use a new incomplete inverse method as a fast approximation for the original ILU(k) preconditioned matrix. This allows TPILU(k)-based solvers to compete with fast, unstable parallel solvers, such as Euclid parallel ILU(k) and parallel Block Jacobi ILU(k). This work is joint with Gene Cooperman.

Oct. 14, 2010
Title: A Parallel Disk-based BDD Package
A Binary Decision Diagram (BDD) is a data structure which is widely used in formal verification for model checking. Real-world computations with BDDs often exhaust available RAM on one machine or even a cluster of computers. Extensive BDD-related research in the past 20 years has focused on minimizing the size of BDDs and/or making use of distributed RAM or a single local disk to increase the available storage space for computations. Our group has taken a new approach to developing a BDD package: using parallel disks as main memory instead of RAM for large computations with BDDs. The algorithms that sit at the foundation of this BDD package will be discussed. This is joint work with Daniel Kunkle and Gene Cooperman.

Oct. 7, 2010
Title: Status Update on DMTCP, and Reversible Debugging
Speaker: Kapil Arya and the rest of the team
There continue to be interesting new developments both in DMTCP and reversible debugging. This will be a shorter talk, providing a status update.

Sept. 30, 2010
Title: Hybrid main memory with DRAM and Flash
Speaker: Xiaohai Yu
Recent memory technology trends have enabled the use of hybrid DRAM / non-volatile (e.g. flash) main memory systems. Current operating systems lack support for hybrid memory, however, and will require new algorithms in order to make effective use of it. Our work focuses on software-level memory management support for such a hybrid model. We have implemented a simulation model in the current Linux kernel, and based on it we are now proposing new algorithms for the hybrid model.

Sept. 23, 2010
Title: The past, present and future of URDB (Universal Reversible Debugger)
Speaker: URDB team (Ana Visan, Kapil Arya, Gene Cooperman, Tyler Denniston, and Xin Dong)
URDB is a reversible debugger that will soon support deterministic replay on multi-core hardware. The main points of novelty of URDB are: (1) the ability to do this at near-native speeds on real-world software; (2) debugging compiled languages via gdb, as well as built-in debuggers for interpreted languages such as MATLAB, python, and perl; (3) a "reverse-watchpoint" capability that automatically uses binary search over the process lifetime to locate bugs; and (4) a novel implementation based on an extension of the transparent, user-level, fast checkpointing package DMTCP (Distributed MultiThreaded Checkpointing Package). To our knowledge, DMTCP is the first package to checkpoint ptrace-based debuggers such as gdb.

This talk will cover the history of the URDB project: how the project started, what were its original goals and how these goals evolved over time. Special emphasis will be given to the latest work on implementing determinism for multi-core architectures.

Sept. 16, 2010
NON-STANDARD PLACE; 166 West Village H
Title: Status and Overview of DMTCP
Speaker: Gene Cooperman
DMTCP (Distributed MultiThreaded CheckPointing) is a transparent checkpointing package. It saves the state of a running computation (which may include multi-threaded processes on distributed computers) to a checkpoint image file. It operates on such diverse applications as Matlab, Maple, MySQL, OpenMPI (for parallel programming), parallel Geant4 (used in CERN large hadron collider experiments), Gap (Groups, Algorithms and Programming), Sage (distributed algebraic computation), Gcl (GNU Common Lisp), python, perl, bash, emacs, vim and others (and very soon including screen). This talk will review the status, and provide a brief overview of DMTCP. There are about ten currently active DMTCP developers, including two other universities. Some of the current activities include:
1. the ongoing collaboration with the Condor project (high throughput batch facility for pools of computers);
2. collaboration with some Condor users to support Condor/DMTCP/Matlab (which is in demand, since Condor's built-in checkpointing does not support this configuration);
3. the use of the NMI Build and Test Lab to automate testing under a wide variety of different Linux platforms;
4. ongoing collaboration with the OpenMPI project to support DMTCP-based checkpointing (MPI is "Message Passing Interface", a standard for parallel computing);
5. porting of DMTCP to provide checkpointing of Infiniband programs (a newer interconnection network widely used for high performance computation);
6. development of checkpointing for X-Windows graphics programs, and possibly OpenGL in the future (This application is of special interest for the SCIRun project, a problem solving environment for modeling, simulation and visualization of scientific problems, and particularly medical imaging problems);
7. past collaborations for checkpointing with several user groups (a million CPU-hour application on a Cray XT supercomputer), the use of DMTCP for the iPlant Tree of Life project (see here for more information), the use of DMTCP in HOL Light (Higher Order Logic: mathematical theorem proving), and other users.

Apr. 22, 2010
Title: Transforming a Checkpoint Image into an ELF Executable
Speaker: Amruta Chougule
DMTCP is a checkpointing application. When a user executes a checkpoint command, a checkpoint thread associated with the process writes the process state to a file. The DMTCP restart routine is used to restart the process using the checkpoint file. This checkpoint image can be converted into an ELF binary, so that one can directly run the resulting binary as a self-extracting executable to resume the process later on. This work is joint with Shobhit Agarwal.

Apr. 8, 2010
Title: The Role of the Flash Translation Layer in NAND Flash
Speaker: Yerong Chen
NAND Flash has been enjoying wide popularity due to its beneficial characteristics, e.g. lighter weight, lower power consumption, shock resistance, faster access time and non-volatility. But unlike traditional block disks, data in flash can not be overwritten. Instead, a time-consuming erase operation on the used blocks must proceed before it can be written again. To hide internal characteristics and increase the performance and lifespan of flash, the Flash Translation Layer (FTL) is introduced. My talk will compare different existing FTLs according to the address mapping scheme they employ. I will analyze their performance for certain operations and see what we can learn from their design.
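The remapping idea behind an FTL can be illustrated with a minimal page-mapped sketch. All names and structure here are hypothetical, not any particular FTL from the talk: a write never overwrites in place; it goes to a fresh physical page, and the old copy is merely invalidated, deferring the expensive erase.

```python
class PageFTL:
    """Toy page-mapped FTL: every write is redirected to the next free
    physical page and the logical->physical map is updated; the old
    copy is marked invalid. Erase/garbage collection is omitted."""
    def __init__(self, num_pages):
        self.map = {}                 # logical page number -> physical page
        self.free = list(range(num_pages))
        self.invalid = set()          # physical pages awaiting erase

    def write(self, lpn, data_store, data):
        if lpn in self.map:
            self.invalid.add(self.map[lpn])   # old copy becomes garbage
        ppn = self.free.pop(0)                # fresh, already-erased page
        self.map[lpn] = ppn
        data_store[ppn] = data

    def read(self, lpn, data_store):
        return data_store[self.map[lpn]]

store = {}
ftl = PageFTL(8)
ftl.write(0, store, 'A')
ftl.write(0, store, 'B')   # rewrite: new physical page, old one invalidated
```

Real FTLs differ mainly in the granularity of this map (page-level, block-level, or hybrid), which is the comparison axis of the talk.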

Mar. 25, 2010
Title: A Parallel Disk-based Binary Decision Diagram Package
(and short talk on: "Converting a DMTCP Checkpoint Image to a Binary Executable")
Speaker: Dan Kunkle and Vlad Slavici
(and Amruta Chougule for short talk)
A Binary Decision Diagram (BDD) is a data structure which allows representing boolean formulas in a compact way. BDDs are mainly used in formal verification (model checking) as well as in other areas. Donald Knuth's upcoming Volume 4 of "The Art of Computer Programming" is mainly dedicated to BDDs (and their close relatives).

Even though more compact than other boolean formula representations, BDDs often grow to sizes too large for one RAM subsystem or even parallel RAM. Furthermore, BDD operations have proved difficult to parallelize.

We propose a parallel disk-based BDD package which aims to overcome the space limitations, as well as taking a different approach to parallelism compared to existing parallel BDD packages. The approach is based on BFS algorithms.

SHORT TALK: Amruta Chougule will present a brief overview of the steps in using a knowledge of ELF to convert a DMTCP checkpoint image into an ELF executable. Questions are welcome. This is preparatory to a longer talk later in the semester, which will provide further details. This work is joint with Shobhit Agarwal.

Mar. 11, 2010
Speaker: Dana Schaa (ECE at N.U.)
A major issue for multi-GPU development is gaining access to the desired GPU hardware. Factors such as available bus slots, power supply, and cost all limit GPU resources on a workstation or laptop. The current GPU programming paradigm needs to be considered as well, and has ramifications on alternatives related to distributed programming. To alleviate this problem, we are planning a virtualization framework to provide transparent access to remote GPUs. The framework will allow an OS running on any system (even a laptop) to believe that virtually limitless GPU resources are directly available to it, while being completely transparent to the programmer and programming model.

This is a new research project. So questions and suggestions would be highly appreciated.

Feb. 18, 2010
NON-STANDARD PLACE; 147 Ryder Hall
Title: Write Endurance in Flash Drives: Measurements and Analysis
Speaker: Marcela Simona Boboila
In recent years flash memory has entered widespread use, in embedded media players, photography, portable drives, and solid-state disks (SSDs) for traditional computing storage. Flash memory has become the first competitor to magnetic disk storage to gain significant commercial acceptance. However, it differs from disk in many characteristics; one which has particular importance for the design of storage systems, is its limited write endurance: bits in a flash chip fail after a limited number of writes, eventually rendering the device unusable.

In this work, we investigate internal characteristics of flash devices. The focus is not only measurement results, but the underlying factors at the level of chips, algorithms in use, and achievable algorithms which determine the endurance of a device. We reverse-engineered a few flash drives, and found a close correlation between measured whole-device endurance, and predictions based on chip endurance values and internal algorithms. We present methods based on analysis of operation latency which provide a non-intrusive mechanism for determining internal parameters. We discuss implications for storage systems, and in particular the necessity of new scheduling algorithms designed specifically for flash, to extract maximum performance.

Feb. 4, 2010
Title: URDB: Universal Reversible Debugger
Speaker: Ana Visan and Kapil Arya
This week's talk will consist of two shorter talks. In the first half, Ana will talk about URDB. In the second half, Kapil will describe the current set of DMTCP bugs (all presumed to be relatively small) prior to having DMTCP validated as the checkpointer for Condor. A full abstract follows for the first half.

URDB is a universal reversible debugger that adds reversibility to almost any debugger through a python-based wrapper. URDB is still experimental. URDB supports multi-threaded and multi-process applications, as well as distributed computations. It currently has four personality modules, which support: gdb; MATLAB; python (pdb); and perl (perl -d). URDB gains its reversibility through the use of DMTCP (Distributed MultiThreaded CheckPointing), a fast checkpoint-restart package.

URDB also supports reverse expression watchpoints, a form of temporal search within a process lifetime. For example, suppose the current value of a user-specified expression indicates a bug. The user asks URDB/gdb to go back in time to a statement where the expression is about to take on its current value. This uses binary search: if n statements have been executed, URDB finds the point in time using only log(n) evaluations of the expression.
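The binary search above can be sketched as follows. This is a toy model, not URDB's implementation: the hypothetical `expr_true_at(t)` stands in for "restart from a checkpoint, replay t statements deterministically, evaluate the expression".

```python
def reverse_watchpoint(expr_true_at, total_steps):
    """Find the first statement count t at which the watched expression
    becomes true, using O(log n) expression evaluations.

    The search assumes deterministic replay, so the expression flips
    from false to true at a single, repeatable point in time.
    """
    lo, hi = 0, total_steps
    evals = 0
    while lo < hi:
        mid = (lo + hi) // 2
        evals += 1
        if expr_true_at(mid):
            hi = mid          # already true at mid: flip is at or before mid
        else:
            lo = mid + 1      # still false at mid: flip is later
    return lo, evals

# Toy run: the expression first becomes true at statement 1,000,000.
t, evals = reverse_watchpoint(lambda t: t >= 1_000_000, 10_000_000)
```

Even for ten million statements, the flip point is located with about two dozen replays, which is what makes the feature practical given fast checkpoint-restart.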

The experimental software is available at: http://urdb.sourceforge.net/.

Jan. 28, 2010
Title: Fast Multiplication of Large Permutations for 2-level Memory Hierarchies
Permutation multiplication (or permutation composition) is an operation which allows for a very simple, traditional implementation:
for i in {0, ..., N-1} do Z[i] = Y[X[i]]
For 2-level memory hierarchies such as RAM/disk, RAM/flash or parallel RAM/parallel disk, this algorithm is infeasible for large permutations, because of the penalty incurred by a random access to an array stored on the lower-level memory. New algorithms for permutation multiplication that are feasible for the 2-level memory hierarchies mentioned above will be presented.
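One standard way to make the composition feasible on a 2-level hierarchy (a sketch of the general bucketing idea, not necessarily the exact algorithm presented in the talk) is to replace the random accesses Y[X[i]] with sequential passes:

```python
def compose_bucketed(X, Y, block):
    """Compute Z[i] = Y[X[i]] using only sequential passes over Y.

    Pass 1: bucket the requests (i, X[i]) by the block of Y they touch
    (on disk, each bucket would be a sequentially written file).
    Pass 2: load one block of Y at a time and answer its bucket;
    random access happens only inside the block held in fast memory.
    """
    n = len(X)
    nblocks = (n + block - 1) // block
    buckets = [[] for _ in range(nblocks)]
    for i, x in enumerate(X):            # sequential scan of X
        buckets[x // block].append((i, x))
    Z = [None] * n
    for b in range(nblocks):             # sequential scan of Y, block by block
        for i, x in buckets[b]:
            Z[i] = Y[x]
    return Z

X = [3, 1, 0, 2]
Y = [10, 11, 12, 13]
Z = compose_bucketed(X, Y, block=2)      # same result as the naive loop
```

The trade-off is a constant factor more data movement in exchange for eliminating the per-element random-access penalty, which is what dominates on disk or flash.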

Surprisingly, the traditional algorithm in its multi-threaded version is not the fastest one for large permutations (at least 128 MB), even when the 2-level memory hierarchy is cache/RAM on a many-core machine. A new multi-threaded algorithm will be presented, which outperforms the traditional algorithm when running on a machine with at least 8 cores. In a special regime, the new algorithm outperforms the traditional one for permutations as small as 16 MB. Strong evidence that the new algorithm will be faster than the traditional one on future commodity machines will also be presented.

(This is joint work with Xin Dong, Daniel Kunkle and Gene Cooperman.)

Jan. 21, 2010
Title: Organizational Meeting for Spring, 2010
Speaker: Gene Cooperman (and others)
This will be an organizational meeting in which we can each report what we are currently working on. Gene will lead the seminar.

I apologize that I am most familiar with the work of my own students. Hence the list of short topics below is missing the important status and contributions of Peter Desnoyer's students (e.g. congratulations on the FAST publication), and whoever else attends.

	Among the short topics in my own group are:
ISSAC work (symb. alg.), potential collaboration w/ Moreno Maza and Xie.
Future of project w/ Mueller (J_4)
Success of Geant4MT (multi-threaded project), and future
Plans for URDB/DMTCP
Roomy has now gone beta
Other??
Other groups
Other topics


Friday, Dec. 18, 2009 (3:00 p.m.: NON-STANDARD TIME); 366 WVH
Title: Brief Report on How to Obtain Full Scalability for Multi-Threaded Geant4; and Brief Report on Dagstuhl seminar
Speaker: Xin Dong and Gene Cooperman
This Friday at 3:00 p.m. (note unusual time) we will have two brief reports to mark the end of the semester for those who are still on campus. Xin Dong will report on his final solution to obtain full linear scalability for multi-threaded Geant4, and how he found it. Gene Cooperman will report on the Dagstuhl meeting, Search Graph Engineering.

Nov. 11, 2009
Title: Computing Cylindrical Algebraic Decomposition via Triangular Decomposition
Speaker: Marc Moreno Maza (U. Western Ontario) (joint work with Changbo Chen, Bican Xia and Lu Yang)
Cylindrical algebraic decomposition is one of the most important tools for computing with semi-algebraic sets, while triangular decomposition is among the most important approaches for manipulating constructible sets. In this talk, for an arbitrary finite set $F \subset {\R}[y_1, \ldots, y_n]$ we apply comprehensive triangular decomposition in order to obtain an $F$-invariant cylindrical decomposition of the $n$-dimensional complex space, from which we extract an $F$-invariant cylindrical algebraic decomposition of the $n$-dimensional real space. We report on an implementation of this new approach for constructing cylindrical algebraic decompositions.

Nov. 4, 2009
Title: Leveraging Social Networks in Information Systems
Alan Mislove
Recently, online social networking sites have exploded in popularity; numerous sites are dedicated to exchanging end-user generated content, such as photos on Flickr, videos on YouTube, and status updates on Twitter. These sites represent a new type of information system, where links between users (as opposed to links between content items) are the primary manner in which information is structured. But, the enabling of all users to become publishers is resulting in a deluge of data that only makes the fundamental problems of finding relevant and trustworthy content even harder. The social networks that these sites are based on, however, offer a potential solution, because the relationships among users in the social network may indicate the relevance and trustworthiness of content.

In this talk, I provide an overview of my research in this space. First, I detail my work on the measurement and analysis of real-world networks, focusing on techniques to infer hidden attributes of social network users. Second, I describe approaches leveraging social networks to solve some of the challenges facing information systems. I present Ostra, a novel system that prevents unwanted communication by leveraging the relationships among users in a social network. I conclude with a brief overview of ongoing research projects.

Oct. 29, 2009
Title: The URDB reversible debugger: Visions for the Future
Gene Cooperman
URDB (Universal Reversible Debugger) has now been released on sourceforge. That URL also has a pointer to a technical report at arxiv.org. While this pre-alpha release of URDB still has many bugs, it already opens up a vision for a future with temporal debugging, time-traveling program-base introspection, and program-controlled speculative execution. This talk will discuss these future possibilities.

Oct. 22, 2009
Title: Parallel Dense Polynomial Arithmetic on Multi-Cores
Speakers: Marc Moreno Maza (U. Western Ontario) and Yuzhen Xie (MIT)
Contact info (usernames): moreno and yxie
Contact info (hostnames): csd.uwo.ca and csail.mit.edu
We aim at multicore-enabling FFT-based dense polynomial arithmetic over finite fields. We show that balanced input data can maximize parallel speedup and minimize cache complexity for bivariate multiplication. We present effective techniques to reduce multivariate (and univariate) multiplication to balanced bivariate multiplication. This parallel multiplication also provides an efficient kernel for fast division and normal form computation, so that multiple levels of parallelism compose successfully. Our implementation in Cilk++ demonstrates good speedup on multi-cores.

Oct. 15, 2009
Title: Roomy, One Year Later: A Language for Using High-Latency Storage as Main Memory
Daniel Kunkle
A complete overview of Roomy was first presented in this seminar almost exactly a year ago, with an update almost exactly half a year ago. Roomy is now moving toward a formal beta release. This talk will present a self-contained overview, and then tell what's new in the evolution of Roomy. The previous abstract follows.

In our continuing effort to use "Disk as the New RAM", we are developing a new programming language: Roomy. Implemented as a library for C/C++, Roomy provides several highly scalable data structures, which can grow to many terabytes in size. These data structures can be stored on many disks, across a cluster of computers, while the complexities of parallelism and of efficient disk-based operations are hidden from the programmer.

This talk will cover: some of the underlying structure of Roomy; the high-level data structures and API; and an example application using Roomy.

Oct. 8, 2009
Title: Empirical Measurement of NAND Flash (followed by a separate short talk)
Peter Desnoyers; and short talk by Vlad Slavici, Xin Dong and Gene Cooperman
Everyone knows that flash memory wears out. But how quickly? The only people who know aren't talking, so we tested a bunch of chips and have some surprising results.

The separate short talk will be led by: Vlad Slavici, Xin Dong and Gene Cooperman. They will discuss their efforts to find a faster permutation multiplication algorithm, with striking successes on one machine, and a disappointing failure on another machine. The audience will be invited to help diagnose the performance bug on the second machine.

Oct. 1, 2009
TITLE: URDB: A Universal Reversible Debugger
Ana Visan
While there is a long history of reversible debuggers, such examples were always specific to some particular debugger. This talk presents URDB (A Universal Reversible Debugger), which produces a wrapper around almost any debugger that adds reversibility. Source code for the debugger is not required. Reversibility has been demonstrated for four widely used debuggers: MATLAB, python, perl, and gdb. Each new reversible debugger requires about 200 lines of code to specify the target debugger commands and output format. URDB is based on DMTCP (Distributed MultiThreaded CheckPointing). URDB was made practical by DMTCP's ability to checkpoint a target debugging session in about 1.5 seconds, and to restart in about 0.5 seconds. This talk will also cover our experience of working in a team of six people under deadline pressure.

Sept 24, 2009
TITLE: Permutation Multiplication and High-Performance Computing
The naive way to compose (multiply) two permutations is to represent the first one as an array X, the second one as an array Y and write this piece of code:
    for (i = 0; i < N; i++) {
        Z[i] = Y[X[i]];
    }


Depending on whether the permutations fit or don't fit in RAM, on the architecture (single-core, multi-core or cluster), on the length of the L2 cache line (and very likely other parameters which we are not going to discuss), the naive algorithm is sometimes outperformed by an algorithm that uses sequential access patterns and is aware of cache/RAM locality. Other times, the naive algorithm is more efficient.

We discuss the ways in which this simple exercise brings out important properties of current architectures, and how the Roomy library (created by Daniel Kunkle) provides a natural translation between the naive and advanced algorithms with no effort from the programmer.

Sept 17, 2009
TITLE: Organizational Meeting
Gene Cooperman
We will discuss the organization of the seminar for the Fall, and Gene will then lead a discussion of current activities, interests, and goals of the seminar members.

## Talks from Previous Semester (Spring, 2009) and Before

May 20, 2009
TITLE: DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop
Jason Ansel
This is a special seminar. Jason Ansel will give a presentation based on his talk to be given at IPDPS in Rome on Tuesday, May 26. This is joint work with Kapil Arya and Gene Cooperman.

Apr. 2, 2009
TITLE: Roomy enough for Data Mining?
Speaker: Mirek Riedewald
We want to explore if Roomy can be used to simplify implementation of scalable data mining algorithms. This will be a working seminar. We will start with a brief summary of existing sequence mining algorithms (Mirek) and a summary of Roomy's features (Dan, Gene). Then we will explore using Roomy for sequence mining and maybe other mining tasks.

Mar. 26, 2009
TITLE: Permanent denial of service attacks on flash: Forcing fast wear out
Speaker: Simona Boboila
Flash memory is a non-volatile, electrically erasable programmable read-only memory. Its main restriction is the limited number of write/erase cycles (typically 10,000 - 1,000,000) that it can sustain, after which it wears out, becoming unreliable. To cope with this limitation, most flash devices have an intermediate software layer, called the Flash Translation Layer, that transparently maps logical locations to empty physical locations. The mapping is used to distribute the erase operations among all physical sectors and thus increase the lifetime of each sector. We are investigating methods to carry out a permanent denial of service attack on a flash device, by bypassing the Flash Translation Layer and wearing out the device.

Mar. 19, 2009
TITLE: The Compressed Skycube
Speaker: Donghui Zhang
This talk presents the SIGMOD'06 paper on the Compressed Skycube (CSC). The CSC supports the skyline queries on an arbitrary set of attributes, e.g. considering (#points, #rebounds, #assists), who are the best NBA players? Here a player is one of the best NBA players if there does not exist another player who is better in all dimensions. Before CSC, one would either sacrifice space and update cost (pre-compute and maintain all skylines), or sacrifice query cost (compute skylines at query time). The CSC presents a solution that is efficient in both aspects.
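The dominance test behind skyline queries can be stated directly. This is a naive quadratic version for illustration only (the CSC's contribution is answering such queries efficiently, under updates and for arbitrary attribute subsets); the player data is invented.

```python
def skyline(points):
    """Return the points not dominated by any other point.
    p dominates q if p >= q in every dimension and p > q in at least
    one dimension (here, larger is better in all dimensions)."""
    def dominates(p, q):
        return (all(a >= b for a, b in zip(p, q))
                and any(a > b for a, b in zip(p, q)))
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (#points, #rebounds, #assists) for four hypothetical players:
players = [(30, 5, 8), (25, 12, 4), (28, 5, 8), (20, 4, 3)]
best = skyline(players)   # the "best NBA players" under this data
```

Here (28, 5, 8) is excluded because (30, 5, 8) is at least as good in every statistic and strictly better in one.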

Mar. 12, 2009
TITLE: Parallel Disk-Based Coset Enumeration: Problems and Possible Solutions
Coset enumeration is a problem related to group enumeration. The idea is to enumerate a group whose elements have no known unique representation. Instead, the input is given as a set of relations between generators which hold for all elements of the group (each relation describes a cycle in the final directed graph of the group). The output is a unique representation for the elements of the group (in this case a permutation representation).

The algorithm that we have now has some problems that make it inefficient. Generating new elements in breadth-first order requires a very large number of steps to reach a reasonable size (hundreds of gigabytes), since the real branching factor seems to be close to 1 (maybe 1.1). Further, each time a new level is generated by the BFS, a slow external sort is required.

We are currently investigating the rate at which duplicates are generated, and how many steps are necessary until all duplicates from one step are processed. Duplicate detection is layered, and each layer takes a long time to process.

There are some potential solutions which will be presented as well.
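The sorting-based duplicate detection described above can be sketched in miniature. This is an illustrative in-RAM model with an invented toy graph; the real computation streams each step from parallel disks, where `sorted` and the membership test become external merges.

```python
import bisect

def _member(sorted_list, x):
    """Binary search in a sorted list (stand-in for a streaming merge)."""
    i = bisect.bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x

def bfs_ddd(start, neighbors):
    """Breadth-first search with delayed duplicate detection: duplicates
    are removed once per level by sorting and merging against the
    visited set, rather than per element by hashing."""
    visited = [start]                 # kept sorted
    frontier = [start]
    while frontier:
        # generate the whole next level, then sort and deduplicate it
        candidates = sorted({n for s in frontier for n in neighbors(s)})
        # merge step: keep only candidates not already visited
        frontier = [c for c in candidates if not _member(visited, c)]
        visited = sorted(set(visited) | set(frontier))
    return visited

# Toy graph: states 0..9, edges i -> (i+2) % 10 and i -> (i+3) % 10
reach = bfs_ddd(0, lambda i: [(i + 2) % 10, (i + 3) % 10])
```

The branching-factor problem described above shows up here as `candidates` being barely larger than `frontier`, so each expensive sort-and-merge pass yields few genuinely new elements.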

Feb. 26, 2009
Title: TOPOFIT-DB: A Collaboration in Bioinformatics
Speaker: Xin Dong
This talk describes some of the real-world issues that one encounters in collaborating with researchers in bioinformatics. In particular, the application drives the research, and not computer science techniques. This is because the goal is scientific results, not new computer science. For example, missing some cases is fine, as long as the database collects a large sample of the interesting cases. Better to find only some interesting cases, but find them first --- rather than find all the interesting cases, but find them second. Or, as my collaborator in bioinformatics has said:
I have a fishing boat, and I don't want you to build a rocket on top of it. I just want something that will help me catch more fish.

Feb. 19, 2009
Title: Scalable techniques for data management and analysis
Mirek Riedewald
Most sciences already are producing an abundance of data, and analyzing this rapidly growing wealth of information has become a major challenge. This provides an exciting opportunity for developing novel approaches that will have an impact both in computer science as well as in the domain sciences. I am interested in developing techniques for distributed data analysis, for mining observational data, for making realistic scientific simulations feasible through data mining, and for real-time processing of massive data streams.

Feb. 12, 2009
Title: Results on checkpointing ptrace with applications to GDB
Speaker: Ana M. Visan
Abstract: We are now close to being able to checkpoint ptrace. We will discuss this architecture and the implications for checkpointing gdb.

Feb. 5, 2009
Title: Progress Report on DMTCP
Kapil Arya, Gene Cooperman, Dan Kunkle, Praveen Solanki, Ana Visan
This week will concentrate on an overview and discussion of the current status and progress on DMTCP (http://dmtcp.sourceforge.net). Some of the areas of interest are: checkpointing X-Windows (via TightVNC), checkpointing ptrace (which opens up checkpointing of gdb), checkpointing of the various shells (tcsh, bash, dash), pid virtualization (coming, at last), checkpointing of screen, testing of DMTCP in Condor, testing of DMTCP with SCIRun, and other topics as they come up.

Jan. 29, 2009
Title: Discussions with Marc Snir
Marc Snir
Marc Snir will give the Distinguished Speaker Series talk on the same day in the afternoon. In addition, he will be available for informal discussions from 10:30 - 12:30 in 366 WVH.

Jan. 15, 2009
Title: State of the Lab
Gene Cooperman, Peter Desnoyers, and others
We will review the ongoing research projects, recent successes, and future challenges.

Dec. 4, 2008
Title: Group-theoretic Algorithms for Matrix Multiplication (cont.)
Emanuele Viola and Gene Cooperman
This continuation of the previous lecture will be jointly led by Emanuele and Gene. It will be a working session, in which we try to gain intuition about what types of groups are most likely to be useful for a faster matrix multiplication algorithm. Recall that we wish to find groups G with sets S_1, S_2 and S_3 (indexing rows and columns in our matrices) satisfying the triple product property for correctness. We also require for efficiency that |S_1||S_2||S_3| >> \sum d_i^3. The sum of the cubes of the character degrees d_i of a group is given by the GAP function:
MySumOfCubes := group -> Sum(CharacterDegrees(group), pair->pair[2]*pair[1]^3);
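For readers without GAP at hand, the same computation can be sketched in Python (a hypothetical helper, assuming the character degrees are supplied as (degree, multiplicity) pairs, mirroring what GAP's CharacterDegrees returns):

```python
# Hypothetical Python analogue of the GAP one-liner above.
# GAP's CharacterDegrees(G) returns pairs [degree, multiplicity];
# the efficiency condition compares |S_1||S_2||S_3| against this sum.

def sum_of_cubes(character_degrees):
    """Sum of d^3 over all irreducible characters, given
    (degree, multiplicity) pairs."""
    return sum(mult * deg ** 3 for deg, mult in character_degrees)

# Example: the symmetric group S_3 has irreducible character degrees
# 1, 1, 2 -- i.e. the pairs (1, 2) and (2, 1).
s3_degrees = [(1, 2), (2, 1)]
print(sum_of_cubes(s3_degrees))  # 1 + 1 + 8 = 10
```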


Gene will add some group-theoretic background to the discussions. For those wishing to "read ahead", one can look at: Character theory, and Fourier Transforms on Finite Groups. Both topics will be covered during the seminar. One can also gain further background by reading the chapter on fast polynomial multiplication in Cormen, Leiserson, Rivest, and Stein.

Nov. 20, 2008
Title: Group-theoretic Algorithms for Matrix Multiplication
Emanuele Viola
We present the group-theoretic approach to matrix multiplication by Cohn and Umans, later joint with Kleinberg and B. Szegedy. This approach gives algorithms for matrix multiplication that run in time n^2.41, which is close to the current record of n^2.376 by Coppersmith-Winograd. Under certain group-theoretic conjectures, the approach gives algorithms that run in the optimal time n^2.

Nov. 13, 2008
Title: Multi-Threaded Geant4: A General Approach to Converting Sequential Processes to Multi-Threaded Processes
Xin Dong
Our latest experiments show that our newest implementation of multi-threaded Geant4 reduces per thread memory consumption greatly. Copy-on-write semantics (child processes forked from a single parent) uses 380MB total for one master and three worker processes. Multi-threaded fullCMS now consumes 240MB for one master and three worker threads. Copy-on-write uses 70MB per worker process and multi-threaded fullCMS uses 22MB per worker thread.

After a brief review of Geant4, we discuss our ideas for a general approach to converting many single-threaded processes into a single multi-threaded process with good data sharing. This is important for future many-core CPUs to avoid thrashing between CPU cache and RAM.

Nov. 6, 2008
Title: Using Parallel Disks for Coset Enumeration
Coset enumeration is a mathematical algorithm similar to the construction of certain finite automata. It is used to construct large permutation groups. As one scales the algorithm, the memory usage grows well beyond the available RAM. We discuss how to extend the algorithm by using parallel disks to construct very large permutation groups.

Oct. 30, 2008
Title: Roomy: A Language for Using High-Latency Storage as Main Memory
Daniel Kunkle
In our continuing effort to use "Disk as the New RAM", we are developing a new programming language: Roomy. Implemented as a library for C/C++, Roomy provides several highly scalable data structures, which can grow to many terabytes in size. These data structures can be stored on many disks across a cluster of computers, but the complexities of parallelism and of efficient disk-based operations are hidden from the programmer.

This talk will cover: some of the underlying structure of Roomy; the high-level data structures and API; and an example application using Roomy.

Because Roomy is still in the early stages of development, feedback and suggestions during the presentation will be valuable and encouraged.

Oct. 23, 2008
Title: An Architecture for Checkpointing GDB Using DMTCP (cont.)
Praveen Solanki and Ana Visan
This is a continuation from the previous week.

Oct. 16, 2008
Title: An Architecture for Checkpointing GDB Using DMTCP
Praveen Solanki and Ana Visan
This will be a working session in which we all try to look for advantages and pitfalls of the architecture that they propose. This will allow them to refine their blueprint for checkpointing gdb.

Oct. 9, 2008
Title: Checkpoint-Restart: Progress Report
Kapil Arya, Gene Cooperman, Praveen Solanki, and Ana Visan
The DMTCP checkpoint-restart package is now becoming more mature. It successfully checkpoints and restarts a set of 20 common interactive applications, as well as computation-intensive packages such as MPI.

As one example, while we cannot yet checkpoint gdb (see below), one can now attach restarted processes using gdb. This is important in cases where a bug appears only after many minutes, hours or days. One can checkpoint a long-running process every minute, and then restart under control of gdb from the last checkpoint before the bug. This speeds up the debug cycle during repeated gdb sessions.

As a second example, there is progress toward the important goal of being able to directly checkpoint ongoing gdb sessions. This will provide a platform for treating gdb sessions similarly to version control systems, including multiple snapshots of the state of a gdb debugging session, while maintaining different branches. It also provides what is colloquially called a "reversible debugger".

Oct. 2, 2008
Title: Flash: New Directions in Storage
Peter Desnoyers
This talk will highlight important research opportunities in flash storage.

Sept. 25, 2008
Title: Overview of ongoing projects (cont.)
Various speakers
Series of 5 or 10-minute overviews of ongoing projects.

Sept. 18, 2008
Title: Overview of ongoing projects
Various speakers
Series of 5 or 10-minute overviews of ongoing projects.

May 8, 2008
Title: Summing up for this academic year
Gene Cooperman
We will use this to sum up where we are as a working group, and how to organize best for the remaining opportunities.

May 1, 2008
Title: Graceful Degradation of Compact Hash Arrays
Peter Dillinger
I'm working on using lossless sets of hash values to over-approximate sets. I use these rather than inherently probabilistic structures so that I can rehash the sets to accommodate more elements in the same amount of space. Further, I want to be able to rehash within the same space itself (rather than allocating a new copy and copying all the elements).

A data structure by John G. Cleary[1] is clearly capable of this, but making the structure reasonably fast but also compact and flexible while supporting these kinds of rehashing presents some interesting challenges.

A data structure of mine, the Hash Pyramid, appears to be capable of this kind of rehashing except that (1) it would be much slower to rehash and (2) it's not clear whether it can rehash in place to a k=1 Bloom filter, as the Cleary structure can.

I plan to discuss these challenges and get some brainstorming going.

[1] John G. Cleary: Compact Hash Tables Using Bidirectional Linear Probing. IEEE Trans. Computers 33(9): 828-834 (1984) http://ieeexplore.ieee.org/iel5/12/35240/01676499.pdf

Apr. 23, 2008
Title: Report on Trip to CERN
Gene Cooperman
In the visit to CERN, Xin Dong and I proposed a software collaboration based on: (1) automatic multi-threaded parallelism; and (2) checkpointing. This was part of the Workshop on Multi-Core Computing and Virtualization in the LHC. Those two topics are WP8 and WP9 in the planned software efforts at CERN. We also received a presentation by the CMS group, where they describe a 2 million line program with about 1,000 packages, and 900 shared libraries (of which 400 are present in a given application). At CERN, they have discovered new limits on software scalability not typically seen elsewhere (although they anecdotally report that Firefox/Mozilla is having similar experiences). Two issues are: (1) the time to link so many shared libraries; and (2) a 1.5 MHz malloc rate due to use of STL/C++ code from non-C.S. specialists.

Apr. 17, 2008
Title: Dry Run of Thesis Defense
Eric Robinson
I'll be doing a dry run of my thesis defense at the systems seminar this week, April 17th, 1:30-2:30PM. I'd appreciate as much feedback as I can get.

Mar. 27, 2008
Title: Linear Algebra for Graph Algorithms: Betweenness Centrality
Eric Robinson
With the advances that have been made with modern mathematical software and specifically with sparse matrix representation and storage, representing a graph as an adjacency matrix is no longer space-inefficient. It is now possible to effectively compute over large graphs using software such as Matlab, Maple, Mathematica, etc.

To demonstrate this, the ease of implementing a few of the basic graph algorithms using an adjacency matrix representation is shown. After this, a more complicated algorithm, betweenness centrality, is examined. Betweenness centrality seeks to compute the importance of each node in a graph by determining the number of shortest paths that it lies on. The algorithm's implementation using an adjacency matrix graph representation and only basic matrix and vector operations, such as matrix multiplication and element by element operations, is presented.

The ability to represent complex graph algorithms such as this as linear algebra problems in Matlab, etc., provides reduced development time in many cases. The resulting code is significantly smaller. In addition, it provides access to "free" parallel versions of the algorithms with software such as pMatlab, starP, and parallel Matlab.

Mar. 20, 2008
Title: A New Approach to Direct Condensation using Parallel Disk as Primary Storage
Ana Maria Visan
Condensation is a method through which we can reduce the size of the object to be analyzed and that guarantees to retain enough information for usefulness. Condensation is a critical technique for the Modular Atlas project, which analyzes modular matrix representations for finite groups. For large simple groups, these representations are more efficient computationally than the next alternative: permutation representations. A new approach to direct condensation is presented that uses parallel disk as primary storage. The running time of the new approach is O(SN), where N is the size of the orbit being condensed and S the size of the generating set. Previous algorithms had a running time of O(LSN), where L is the landmark ratio, thus the new algorithm is an improvement over the previous algorithm. Using the new disk-based approach, we produce condensation matrices for the sporadic simple Janko Group J4 on an orbit of size 131,358,148,251. Each point of the orbit is a 4-dimensional subspace in a vector space of dimension 112 over GF(2). The new algorithm demonstrates a condensation of a group whose orbit is 10 times larger than the largest previous such group for which direct condensation was applied.

Mar. 12, 2008; 12 noon (unusual time, place and day)
212 West Village H
Endomorphism rings and their character tables
Jürgen Müller
Starting points are the notions of distance-transitivity and -regularity, as well as the Ramanujan property, in algebraic graph theory. These are closely connected to the action of the graph automorphism group. From that point of view, graphs appear as orbital graphs associated with a permutation action of a given finite group.

Given a permutation module, representation theory tells us to consider its endomorphism ring, which similar to the group case turns out to have a character table. The latter finally contains information about purely graph theoretic properties, like the ones mentioned above, of the associated orbital graphs.

Together with T. Breuer we have built a database containing the character tables of the endomorphism rings of the multiplicity-free permutation actions of the sporadic simple groups. We report on this, as well as on the computations necessary to compile the data. In particular, to tackle the large cases, together with M. Neunhoeffer and R. Wilson we have invented a new general technique to enumerate very large orbits.

Mar. 6, 2008
"Proving 25 Moves Suffice for Rubik's Cube; Disk-based Computation Wins Again (PART TWO)"
Daniel Kunkle
This is a continuation of last week's talk. Having heard some of the background, we will now hear about the algorithmic improvements needed to show that 25 moves suffice.

Feb. 28, 2008
"Proving 25 Moves Suffice for Rubik's Cube; Disk-based Computation Wins Again"
Daniel Kunkle
Last year, we proved that 26 moves suffice to solve Rubik's cube, beating the previous best upper bound on solution lengths of 27. This was accomplished with two separate computations: a disk-based breadth first search over 1.4 trillion states, which proved 29 moves sufficient; and a large trivially parallel RAM-only computation to refine the upper bound to 26. This presentation will describe our continuing work to further refine the upper bound to 25 (expected to complete in the next one to two months). To accomplish this, we needed to design a new disk-based version of the existing RAM-based refinement method. This example of transforming an existing RAM-based computation into a disk-based computation is representative of many such efforts in our research group, and will serve as a general guide for the process.

Feb. 21, 2008
"Introduction to Distributed Multi-Threaded Checkpointing (DMTCP)"
Kapil Arya
DMTCP (Distributed Multi-Threaded CheckPointing) is a user-level transparent checkpointing tool for distributed, multithreaded applications. DMTCP can also be used for debugging processes as well as for process migration. It requires no modification of the kernel or of application binaries, and it is based on TCP sockets, which gives it full generality. There are many subtleties to designing a package with broad coverage over such varied parallel libraries as OpenMPI, MPICH-2, and SciPy (Python), along with parallel applications such as ParGeant4 (a Monte Carlo simulation), etc. Some of the issues to be discussed are: the whole checkpointing procedure, including checkpointing of pseudo-terminals, sockets, and shared memory; and scalability.

Feb. 14, 2008
"Overview of the NVIDIA G8 Architecture and Design Issues"
Gene Cooperman
This talk will review the NVIDIA G8 architecture with an eye toward deducing the tradeoffs and design issues that were faced by the designers of the NVIDIA G8 architecture.

Jan. 30, 2008, (11:45 a.m., 366 WVH -- note unusual time)
"Multi-Core Computing"
Gene Cooperman
While the PhD admissions committee has our time slot, we continue to have a floating systems seminar. :-)

This Wednesday (tomorrow), we will hold a joint PL/Systems seminar at: 11:45 a.m., Wed., 366 WVH

I will be talking informally (white board) about multi-core computing. The goal is to find common issues affecting both systems and PL. Part of the talk will be review, and part of the talk will be speculation about the future.

Jan. 18, 2008 (1:45 p.m. -- note unusual time)
"Alternative Designs for Automatic Transformations to Implement Thread Parallelism in the Presence of C++ Inheritance"
Xin Dong
Xin Dong will speak on the issue of implementing automated thread parallelism in the presence of object inheritance in C++. Xin will briefly present the issues and then ask the seminar attendees to help make suggestions and debug design possibilities. Xin already has a simple transformation for the common case encountered in Geant4 code. In the presence of inheritance, his simple transformation becomes harder to implement (since it must take account of the base class), and Xin is looking for an audience to discuss alternative designs.

Jan. 10, 2008, 1:30 p.m.
"Experience in Implementing Thread Parallelism for Geant4"
Xin Dong
Xin Dong will present his experience in implementing thread parallelism for Geant4. Some of the key ideas were simple transformations for singleton classes, and the careful use of the relatively new __thread keyword (a compiler extension supported by GCC, rather than part of standard C99 or C++98).

Nov. 8, 2007 (1:30 p.m. -- note unusual time)
"Disk-Based Parallel Coset Enumeration" (cont.)
Vlad will continue the discussion by explaining his parallel, disk-based strategy for coset enumeration. If time permits, Gene will give the background for BDDs (Binary Decision Diagrams) --- another pointer-chasing problem. BDDs are used in verification, and a similar parallel, disk-based strategy may be available for this algorithm.

Nov. 1, 2007 (1:30 p.m. -- note unusual time)
"Disk-Based Parallel Coset Enumeration"
First, a short description of the general way to convert an NFA to a DFA. If the NFA has n states, the DFA will have at most 2^n states. All the states of the DFA can be represented by an n-bit string. Generation of a new state takes at most O(n) time using bit string operations. Pointer chasing is an issue because the next state may be owned by any node. Still, all the generated states are valid, so the disk space will be continually increasing until reaching the final configuration.
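The subset construction described above can be written down directly, with each DFA state stored as an n-bit mask of NFA states (a minimal in-RAM Python illustration with a made-up example NFA, not the disk-based version discussed in the talk):

```python
from collections import deque

def nfa_to_dfa(n_states, delta, start, alphabet):
    """Subset construction: each DFA state is an n-bit mask of NFA states.
    delta maps (nfa_state, symbol) -> iterable of successor NFA states."""
    start_set = 1 << start
    dfa = {}                        # bitmask -> {symbol: successor bitmask}
    queue = deque([start_set])
    while queue:
        S = queue.popleft()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in alphabet:
            T = 0                   # union of successors, O(n) bit operations
            for q in range(n_states):
                if S >> q & 1:
                    for r in delta.get((q, a), ()):
                        T |= 1 << r
            dfa[S][a] = T
            if T not in dfa:
                queue.append(T)
    return dfa

# Example NFA over {0,1} accepting strings whose second-to-last symbol is 1:
# state 0 loops on 0/1 and may move to 1 on a 1; state 1 moves to 2 on 0/1.
delta = {(0, '0'): [0], (0, '1'): [0, 1],
         (1, '0'): [2], (1, '1'): [2]}
dfa = nfa_to_dfa(3, delta, 0, '01')
print(len(dfa))  # 4 reachable DFA states, well below the 2^3 worst case
```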

Then a problem where pointer chasing visibly dominates the computation: Coset enumeration is the problem of converting a presentation representation of a group into a permutation representation. Many problems require the permutation representation of a group as input. In order to obtain the result - a complete coset table - all known algorithms define additional cosets, most of which turn out to be duplicates. Basically, there are two approaches to this problem:

HLT - which defines as many cosets as necessary to fill a relator loop (a circular sequence of cosets of length equal to the order of a generator) and then tries to collapse them

Felsch - which tries to collapse cosets as soon as it defines new ones

HLT is easier to implement, but since previous approaches only used RAM, Felsch was preferred for using less space. For HLT, the intermediate space necessary overflowed RAM, or even one disk.

Our approach is to use the HLT strategy, but using parallel disks, so that the space necessary for keeping the intermediate states would not be an issue any more. The strategy will be presented, as well as some rough estimates of the number of accesses to disk for a given input size.

Oct. 25, 2007 (3:00 p.m. -- note unusual time)
"Twenty-Six Moves Suffice for Rubik's Cube"
Daniel Kunkle
The minimum number of moves sufficient to solve any configuration of Rubik's Cube has been a matter of long-standing conjecture -- since the puzzle appeared over 25 years ago. This number is sometimes known as "God's Number". This talk will describe the methods we have used to show that God's Number is no more than 26.

To achieve this result, we used new techniques in two areas: group theoretic methods for efficiently representing and computing over the possible configurations of Rubik's Cube; and, parallel disk-based algorithms for enumerating large search spaces. This talk will primarily focus on the disk-based search and enumeration techniques we have introduced.

Oct. 18, 2007
No systems seminar
The systems seminar is preempted by a faculty meeting at the same time.

Oct. 11, 2007 (3:00 p.m. -- note unusual time)
"Distributed Disk-Based Search and Enumeration: Discovering the Shortest Path to a Thesis"
Eric Robinson
With the introduction of multi-core computing, Moore's Law continues to hold. Problems that were previously infeasible due to computation time are now being re-examined. For many of these problems, however, the issue at hand is no longer CPU time, but rather available storage. This is especially true in the case of search and enumeration problems, which are typically characterized as being on the order of O(N), but may require the storage of billions or trillions of values.

Large search and enumeration is examined, and the case for using distributed disk is made. Fundamentally, disk is on the order of 10 times slower than RAM when streamed. This slowdown can be compensated for through the use of a cluster of disks or a RAID system. In exchange, the application programmer has access to on the order of 100 times more storage. Many techniques for search and enumeration are presented, including one method developed in-house, and a "fair" method for comparing these techniques is given.

The talk shall also follow my research as it developed through my 5.13 years (currently) at Northeastern, and will demonstrate some of the goals and pitfalls of being a Ph.D. student. These will be presented through a look at the progress, or lack thereof, of my research during my time at Northeastern.

Oct. 4, 2007
No meeting this week.

Sept. 27, 2007
"Parallel algorithms in numerical algebraic geometry"
Anton Leykin,
Institute for Mathematics and its Applications (IMA),
U. Minnesota
Numerical algebraic geometry (NAG) is a young area encompassing numerical methods that deal with systems of polynomial equations. The main engine of NAG, the polynomial homotopy continuation procedure, is easily parallelizable. This makes NAG algorithms suited for parallel and distributed computing better than those based on classical symbolic techniques such as Gröbner bases.

This talk will start with an introduction to NAG and continue with sketches of two parallel algorithms. One is a core algorithm for parallel monodromy breakup, the other is an application to pure mathematics; it computes Galois groups of simple Schubert problems.

Time permitting we will discuss the recent algorithm for numerical primary decomposition, which enables us to solve the ideal membership problem numerically.

Sept. 20, 2007
"Structures for potentially probabilistic state enumeration in RAM"
Peter Dillinger
Part 1: Classical approaches to representing visited lists can require much more memory than is theoretically required to store subsets of some finite universe of elements. In particular, when the subset is neither particularly dense nor particularly sparse in the universe, discrete information theory dictates that classical approaches are far from optimal. For example, it is possible to represent 200,000 arbitrary 32-bit values (such as IP addresses) using less than 16 bits per value. My data structure, the "hash pyramid," shares enough structure between elements to use only 17.5 bits per value in this case. A nearly-forgotten structure by Cleary is similarly compact, but only when made quite slow.

Part 2: When an exact visited list cannot fit in RAM, it is useful in many applications to resort to an over-approximated visited list. Because the elements are hashed, there is a predictable probability of whether a new state will erroneously be considered visited. Once again, discrete information theory can tell us how much space is required to achieve certain probabilities. One underacknowledged structure, the "hash compact" table, is nearly optimal, but only if the number of reachable elements is known in advance. In contrast, the popular Bloom filter is more generally mediocre, meaning it is likely to perform relatively well when the number of reachable elements is not well known. Pete Manolios and I were the first to contrast these structures in this way.
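As a point of reference for the structures being compared, a minimal Bloom filter can be sketched as follows (a generic textbook version, not the speaker's implementation; deriving the k hash functions from salted SHA-256 is an arbitrary illustrative choice):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions derived from
    SHA-256 with different salts. Over-approximates a set: it has no false
    negatives, and a predictable false-positive probability."""
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _indices(self, item):
        # k bit positions per item, one per salt
        for i in range(self.k):
            h = hashlib.sha256(b"%d:%s" % (i, item.encode())).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        return all(self.bits[idx // 8] >> (idx % 8) & 1
                   for idx in self._indices(item))

bf = BloomFilter(m_bits=1 << 16, k_hashes=4)
for state in ["s0", "s1", "s2"]:
    bf.add(state)
print("s1" in bf)  # True -- an added element is never reported absent
```

Note the asymmetry the talk exploits: a "yes" answer may be a false positive, but a "no" answer is always correct.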

Part 3: Structures for exact sets can also be used as over-approximations, by exactly storing a set of hashes. The hash pyramid and Cleary structure are quite competitive when used in this way, and they have a property that neither the hash compact table nor the Bloom filter has: they can be rehashed to accommodate more entries. The Cleary structure in particular can be rehashed in place with linear scans.

Part 4: All of these schemes depend on hashing. Two hash functions in particular are quite useful in explicit-state verification; these have been implemented in our 3Spin and 3Murphi tools based on Spin and Murphi. The first is the Jenkins LOOKUP2 function, which is heuristically accurate and predictably fast. The second is H3, which has "universal hashing" accuracy guarantees, and can use a differential optimization to make it heuristically fast. Thus, these two hash functions are nice counterparts.

Sept. 13, 2007
"Organizational Session"
Gene Cooperman
We will have an organizational session in which Gene will give a short report on interesting applications he found on his recent travels, and others will give a five minute report on the current status of their research projects. As with all systems seminars, additional participants/listeners are welcome.

## Talks from Previous Semester (Fall, 2006)

Nov. 30, 2006
"Large-Space Computations: Using Disk as RAM"
Gene Cooperman
We have at least two examples in our working group demonstrating the solution of problems by using many terabytes of disk space spread among many distributed disks. This seminar will be interactive. We will use the wireless connection and Google to help us widen the set of applications in real time. Some sample places to look are very large Markov chains (with hundreds of gigabytes of training data?), password breaking using very large rainbow tables, large airline scheduling problems, the search for large codes, the travelling salesman problem (a perennial favorite), integer programming (and branch-and-bound style solutions), robust optimization in linear programming, and satisfiability (with potential applications to VLSI design).

Nov. 16, 2006
"Disk-Based Distributed Search: A New Approach"
Eric Robinson
When looking at previous methods for search-space enumeration, one notices an obvious partitioning. Some methods emphasize performing additional computation in order to reduce the overall space of the enumeration. While these methods seem advantageous at first, closer examination shows that streaming disk is actually faster than accessing memory randomly. The second class of methods takes advantage of this by requiring more space, accessed only in a streaming manner, while performing less computation.

Both of these techniques focus in on a particular level in the memory hierarchy. The first aims primarily at using memory during the computation, the second at using disk. We offer an alternative that takes advantage of both the memory and disk levels in the memory hierarchy. Through this, we provide an algorithm that improves upon the best known disk-based algorithm in many cases. We offer several examples of where our technique has been used.

Finally, we show how all of these search-space enumeration techniques fall into a hierarchy with a natural tradeoff between space and time. Our goal is to define under exactly what circumstances each method will be optimal, based on criteria such as the problem size and architectural parameters such as disk and memory speed, along with the number of machines used in the computation and the speed of the network.

Nov. 9, 2006
"Parallelization of the ILU(k) Preconditioner for Solving the Matrix-Vector Equation Ax=b" (cont.)
Xin Dong
This is a continuation from last week.

Nov. 2, 2006
"Parallelization of the ILU(k) Preconditioner for Solving the Matrix-Vector Equation Ax=b"
Xin Dong
There is no generally optimal way to speed up the solving of Ax = b when A is a sparse matrix. The direct LU factorization results in so many fill-ins that A often becomes dense. The iterative method requires a good preconditioner to reduce the number of iterations, but such preconditioning may not be easy to achieve. The incomplete LU factorization for level k (ILU(k)) is one such preconditioner. Although its computation is more efficient for small k, ILU(k) may still be a potential bottleneck even for small k. We implement an algorithm to compute ILU(k) on a cluster. In the area of linear solvers, many researchers believe it is difficult to get good speedup with the distributed memory model, due to the need for extensive communication and memory synchronization. However, our results show that a good balance between computation and communication leads to exciting speedups in practice.

Oct. 26, 2006
"Duplicate Detection in Very Large Search and Enumeration"
Dan Kunkle
And this week, we will finish up our extended review of duplicate detection methods.

Oct. 19, 2006
"Duplicate Detection in Very Large Search and Enumeration"
Dan Kunkle
Continued from two weeks ago.

Oct. 5, 2006
"Duplicate Detection in Very Large Search and Enumeration"
Continued from previous session.

Sept. 28, 2006
"Duplicate Detection in Very Large Search and Enumeration"
Dan Kunkle
We examine the challenges associated with the enumeration of very large implicit search spaces. An implicit search space is one which is defined by a start state and a number of generating functions, which produce a set of neighboring states. A fundamental problem for all such search spaces is the enumeration of a spanning tree defining a minimum length path from some identity element to all other elements of the search space.
The fundamental challenge in such enumerations is duplicate detection. That is, determining when any particular state has been previously generated. This is especially challenging when the search space does not fit in memory, requiring the use of disk. We examine a number of different duplicate detection strategies, most using some form of hashing and/or sorting. We conclude with current work towards a general framework for duplicate detection strategies in general.
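As a toy illustration of the sort-based strategies mentioned above, one BFS level of delayed duplicate detection can be sketched in Python (an in-RAM sketch of what, at scale, becomes a streaming merge against sorted files on disk; the function name and sample data are hypothetical):

```python
def delayed_duplicate_detection(frontier, visited_sorted):
    """One BFS level of sort-based delayed duplicate detection:
    sort the newly generated states, then do a single merge pass
    against the sorted visited list (which, on disk, would be
    streamed rather than held in RAM)."""
    candidates = sorted(set(frontier))   # external sort, in miniature
    new_states, i = [], 0
    for s in candidates:
        # advance the merge cursor through the visited list
        while i < len(visited_sorted) and visited_sorted[i] < s:
            i += 1
        if i == len(visited_sorted) or visited_sorted[i] != s:
            new_states.append(s)         # genuinely new state
    return new_states

visited = [1, 3, 5, 7]       # already-enumerated states, kept sorted
frontier = [5, 2, 9, 3, 2]   # states generated at this level (with dups)
print(delayed_duplicate_detection(frontier, visited))  # [2, 9]
```

The point of the sorted merge is that both sequences are read strictly in order, so on disk every access is a fast streaming read rather than a random seek.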

Sept. 21, 2006
"Very Large Disk-Based Search"
Eric Robinson
For large distributed search, duplicate detection through hashes will not always fit in main memory. We discuss these problems and the issues that arise in dealing with them effectively. We show that a solution that emphasizes the use of additional CPU power is not always as effective as one that is able to take advantage of the distributed disks of the cluster. We present a large distributed search problem where using distributed disk leads to a faster solution than any that could be obtained strictly through using distributed memory. In addition, we present other disk-based algorithms and compare them, determining under what conditions each is optimal.

Sept. 14, 2006
"Introductory Session"
Gene Cooperman
This will be a brief discussion of opportunities for research in high performance computing and in performance-related applications. High performance computing includes both parallel computing and architectural issues: balancing the resources of CPU/RAM/disk/network. CPU, RAM, disk, and network can potentially all operate simultaneously. Algorithms play a part both in reducing the difficulty of applications, and in adapting those algorithms to take advantage of parallel computing using a balance of architectural resources.

## Talks from Previous Semester (Spring, 2006)

Mar. 30, 2006
"How Does one Solve Rubik's Cube Optimally?"
Dan Kunkle
A classic search problem is finding shortest solutions to Rubik's cube. Researchers have looked at this problem with a single processor and external memory (disk-based algorithms), and with parallel processors. We ask how to use all of the weapons available (parallel processors and external memory) to do better than previously. This talk will describe some building blocks of such an algorithm. In the remainder of the talk, methods to put together the building blocks will be examined as a group activity, with audience participation.

Mar. 15, 2006 (Wed.)
"Symmetry, Search, and Research Problems in Applications to Sudoku and Latin Squares"
Ravi Sundaram, Dan Kunkle, and Gene Cooperman
For this week only, we hold the systems seminar on Wednesday. We will employ the often-used triple-speaker format --- choosing the variation where lectures are given in sequence rather than in parallel. Gene will present the formalisms of how to use computational group theory to efficiently support symmetries for pruning large search problems. Ravi Sundaram will present results on Sudoku and Latin Squares from the Theory Community, while Dan Kunkle will present results and open problems from outside the Theory Community.

Mar. 9, 2006
"The Issues of Choosing a Good Thesis Topic"
Dan Kunkle
I first considered applying a novel methodology in top-down style, in order to demonstrate it on a concrete problem (numerical analysis). I am now considering the opposite, bottom-up style: starting with a concrete "challenge problem" in search and applying novel out-of-core techniques to solve it. I would then use the eventual solution to enunciate a general methodology.

Mar. 2, 2006
"Task Oriented Parallel ILU(k) Factorization and Its Optimization"
Ye Wang
This is a continuation of the previous week's talk.

Feb. 23, 2006
"Task Oriented Parallel ILU(k) Factorization and Its Optimization"
Ye Wang
I will briefly introduce the classic (sequential) ILU factorization algorithm and the task-oriented parallel ILU(k) algorithm. Based on the features of the algorithm and of the sparse matrices, I will optimize the TOP-ILU algorithm.
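For readers unfamiliar with the classic algorithm: ILU(0) is Gaussian elimination restricted to the nonzero pattern of A, so no fill-in is created. A minimal dense-storage sketch (illustrative, not the talk's TOP-ILU code; a real implementation would use a sparse format). For a tridiagonal matrix there is no fill-in, so ILU(0) coincides with the exact LU factorization:

```python
import numpy as np

def ilu0(A):
    """Classic sequential ILU(0): eliminate as in LU, but update only
    entries inside the original nonzero pattern of A.  Returns
    unit-lower-triangular L and upper-triangular U."""
    A = A.astype(float).copy()
    n = A.shape[0]
    pattern = A != 0                 # the fixed sparsity pattern
    for i in range(1, n):
        for k in range(i):
            if pattern[i, k]:
                A[i, k] /= A[k, k]   # multiplier, stored in place
                for j in range(k + 1, n):
                    if pattern[i, j]:          # skip fill-in entries
                        A[i, j] -= A[i, k] * A[k, j]
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U

# Tridiagonal example: no fill-in, so L @ U reconstructs A exactly.
A = (np.diag([4.0, 4, 4, 4]) + np.diag([-1.0, -1, -1], 1)
     + np.diag([-1.0, -1, -1], -1))
L, U = ilu0(A)
```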

Dec. 7, 2005
"Checkpointing Master-Worker Parallel Computations"
Jason Ansel
I will explore a system-level, transparent method of checkpointing in which a snapshot is taken of only one node. In addition, I will talk about utilizing system core dumps as a way to checkpoint a process, and some of the issues encountered along the way.

Nov. 30, 2005
"Developing Task-Oriented Parallelism for Linear Algebra" (continued)
Ye Wang
This is a continuation of the previous talk in this seminar.

Nov. 16, 2005
"Developing Task-Oriented Parallelism for Linear Algebra"
Ye Wang
This talk will focus on solving important linear algebra problems, such as computing an ILU preconditioner, using a new tool, TOP-C. I will introduce the "classic" algorithm for the ILU preconditioner as well as my new task-oriented algorithm, and compare the two.

Nov. 9, 2005
"Computational Algorithms for Finding Short Solutions for Rubik's Cube"
Gene Cooperman
This lecture will be a "fun" lecture --- no prior experience necessary. Rubik's cube has long fascinated computer scientists, from Korf (in A.I.) to Shamir (in Theory). Finding good solutions (or the best solution) requires enormous computational resources in both space and time. This makes it a wonderful testbed for systems: stressing all of network, CPU, disk, and RAM bandwidth.

We will review a variety of algorithms, including: Sims's algorithm for finding some solution; Korf's modification for finding a reasonable solution (within a factor of two); Shamir's algorithm for finding the best solution (computationally feasible for solutions up to about 18 or 19 moves); my own enumeration of a subset of states of Rubik's cube using only two bits (0.25 bytes) per state in a Cayley graph, from which shortest move sequences within the subset can be recovered; algebraic bounds on the maximum solution length over all states of Rubik's cube; and the "shape" of any breadth-first enumeration of states, in terms of the size of each successive level.
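The two-bits-per-state idea can be sketched as follows: store only the BFS depth mod 3 of each state, with a fourth value meaning "unseen". From any state at depth d, some neighbor is stored with (d - 1) mod 3, and no neighbor at depth d or d + 1 can collide with that value, so the table suffices to walk a shortest path back to the start. A small in-memory sketch on a toy graph (one byte per state for brevity; a real implementation would pack four states per byte):

```python
def enumerate_mod3(start, neighbors, num_states):
    """BFS that records only depth mod 3 per state (3 = unseen).
    States are integers 0..num_states-1."""
    level = bytearray([3]) * num_states
    level[start] = 0
    frontier, depth = [start], 0
    while frontier:
        depth += 1
        nxt = []
        for s in frontier:
            for t in neighbors(s):
                if level[t] == 3:
                    level[t] = depth % 3
                    nxt.append(t)
        frontier = nxt
    return level

def shortest_path_back(state, neighbors, level):
    """Recover a shortest path to the start: repeatedly step to any
    neighbor stored with (current depth - 1) mod 3."""
    path = [state]
    while True:
        cur = level[path[-1]]
        for t in neighbors(path[-1]):
            if level[t] == (cur - 1) % 3:
                path.append(t)
                break
        else:
            return path   # no lower neighbor: this is the start

nb = lambda s: [(s + 1) % 10, (s - 1) % 10]   # toy graph: a 10-cycle
level = enumerate_mod3(0, nb, 10)
path = shortest_path_back(5, nb, level)       # walks 5 back to 0
```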

Nov. 2, 2005
"Automatic Algorithm Selection"
Dan Kunkle
I will describe my proposed automatic algorithm selection system: a system for determining the right algorithm and its parameters for a specific problem to solve. This method is particularly useful when there are many possible algorithm variants for a single problem, the appropriate algorithm depends on the actual problem instance being solved, there are no simple rules mapping problem instance to algorithm variant, and the algorithms are dependent on implementation and computing architecture. The system involves methods from multiobjective optimization, case-based reasoning, and supervised learning. I will highlight the process of constructing the automatic algorithm selection system with an illustrative example: high performance solving of linear systems of equations.
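At its simplest, the case-based reasoning component can be sketched as nearest-neighbor lookup over recorded (problem features, best algorithm) cases. Everything below (the features, the solver names, the distance function) is hypothetical, not from the proposed system:

```python
def select_algorithm(cases, features, distance):
    """Pick the algorithm recorded for the past case whose problem
    features are nearest to the new instance's features."""
    best_case = min(cases, key=lambda c: distance(c[0], features))
    return best_case[1]

# Hypothetical case base: (matrix size, sparsity) -> best solver seen.
cases = [((100, 0.9), "sparse-iterative"),
         ((100, 0.0), "dense-direct"),
         ((10, 0.5), "dense-direct")]
dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
choice = select_algorithm(cases, (120, 0.85), dist)
```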

Oct. 26, 2005
"Distributed Framework for Computation of Very Large Trees"
Eric Robinson
This talk focuses on the computation of trees too large to be computed on a single machine. For such trees, it is useful to have an abstract framework that turns a local tree computation into a distributed one.

We provide such a framework and show how it can be used to compute over two different trees: one used in the computation of matrix groups, and the other used to compute chess endgame trees.

## Talks from Previous Semester (Spring, 2005)

Feb. 3, 2005
"Ischemic Stroke and a new approach in Machine Learning"
Viet Ha Nguyen

For the introductory lecture this semester, Viet Ha Nguyen will talk on his work in learning and ischemic stroke. He will informally describe how his application-oriented thesis led to a new approach in machine learning.

Ha has worked closely with a medical research group at MGH (Mass. General Hospital) for six months. At the end of six months, he had solved the problem that the "doctors" had assigned him: improve the accuracy of a model predicting which parts of the brain were at risk of dying after ischemic stroke.

However, this striking success in a particular medical application still needed to be translated into a general computer science technique, suitable for a thesis. Ha will describe both his original result, and how he generalized that result, so as to apply to more than a single application.

Feb. 18, 2005 (note: Friday)
"Fast Query Processing Using Cooperative CPU Caching for Index Structures"
Xiaoqin Ma

Data-intensive applications on clusters often require that requests be sent quickly to the node managing the desired data. In many applications, one must search a sorted tree structure to determine the node responsible for accessing or storing the data. Examples include sensor networks, network routing tables, routing in publish-subscribe middleware, and database indexes. When the tree structure is larger than the CPU cache, the standard implementation potentially incurs a cache miss for the lookup at each successive level of the tree. This significantly degrades performance, because a single cache miss on one word forces an entire cache line to be loaded (e.g., 128 bytes in the case of the Pentium 4). As the CPU-RAM gap grows, this degradation will only become worse.
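One standard remedy is to pack many keys per node so that each level of the search costs one cache-line fetch rather than one miss per comparison. The two-level version below is a generic illustration of that layout idea (Python cannot, of course, expose the cache behavior itself); the block size and names are assumptions, not the talk's design:

```python
from bisect import bisect_left, bisect_right

BLOCK = 16  # keys per block, sized so one block spans about one cache line

def build_index(sorted_keys):
    """Two-level layout: a small array of block separators (hot, likely
    to stay in cache) plus contiguous key blocks.  A lookup touches the
    separators and then exactly one block, so it costs a constant
    number of cache-line fetches rather than one miss per tree level."""
    blocks = [sorted_keys[i:i + BLOCK]
              for i in range(0, len(sorted_keys), BLOCK)]
    separators = [b[0] for b in blocks]
    return separators, blocks

def lookup(index, key):
    separators, blocks = index
    i = max(bisect_right(separators, key) - 1, 0)  # which block
    j = bisect_left(blocks[i], key)                # position inside it
    return j < len(blocks[i]) and blocks[i][j] == key

index = build_index(list(range(0, 1000, 2)))       # even keys 0..998
```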

Mar. 10, 2005
"The Conquering of the Baby Monster: A Problem of Resource Balancing"
Eric Robinson

The Baby Monster is a permutation group in a well-known ladder of challenge problems, the sporadic simple groups. Solving for this group is difficult not only in terms of computation time, but also in terms of space. We provide a distributed disk-based solution and show that it is only marginally slower than an optimal solution given infinite disk and memory resources. In addition, we generalize this solution and show its application in the area of large tree discovery.

Mar. 31, 2005
"Some Research Problems in Systems"
Gene Cooperman

I will discuss some open research problems in systems. This will be an interactive talk. We will learn to propose research problems "by doing it". Some of the topics below are meant simply to be seeds for thought.

They include semi-automatic parallelization (one of my continuing favorites). The TOP-C model should make it exceptionally easy to parallelize for/while loops. There are standard programs for transforming fragments of C, such as CIL. What other constructs could be parallelized?
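For loops whose iterations are independent, the transformation is mechanical, which is what makes semi-automatic parallelization plausible. A generic Python sketch of the idea (not the TOP-C API; the loop body here is a stand-in):

```python
from concurrent.futures import ThreadPoolExecutor

def loop_body(x):
    # stand-in for an expensive, independent loop iteration
    return x * x

# The sequential loop `out = [loop_body(x) for x in seq]` maps
# mechanically onto a pool of workers once iterations are shown to be
# independent -- the kind of rewrite a source-transformation tool such
# as CIL could apply to C loops.
with ThreadPoolExecutor(max_workers=4) as pool:
    out = list(pool.map(loop_body, range(10)))   # order is preserved
```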

The MBRAM model can be used as the basis for choosing among compiler optimizations. How can one do that effectively? This could be a very hot area.

What about extending MBRAM to SMP or to distributed memory? For a dual-core chip, what can you say?

The clusters of tomorrow will probably be multiple CPUs at each node, just because of the trend toward dual and quad-core CPU chips. How does one reorganize a parallel computation (TOP-C or other) to take advantage of this?

Can the search methods of computational algebra be repackaged for other kinds of search? In Eric's case, chess endgames have been suggested.

In the area of neural nets, what would be an efficient memory access pattern if you have millions of neural net nodes at each layer?

The current rules of efficient memory access dictate different kinds of sorting algorithms. Can one produce a faster integer sorting algorithm than quicksort or mergesort? (Note that one can test one's ideas without coding by using the MBRAM model.)
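As one data point for the question above: LSD radix sort replaces the data-dependent comparisons and probes of quicksort or mergesort with purely sequential sweeps, an access pattern that the rules of efficient memory access reward. A minimal sketch for non-negative fixed-width integers:

```python
def radix_sort(nums, bits=32, radix_bits=8):
    """LSD radix sort: a fixed number of counting-sort passes, each a
    sequential read of the input and a sequential write-back -- a
    cache- (and disk-) friendly pattern, at the cost of extra space."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, bits, radix_bits):
        buckets = [[] for _ in range(mask + 1)]
        for x in nums:                          # sequential read
            buckets[(x >> shift) & mask].append(x)
        nums = [x for b in buckets for x in b]  # sequential write-back
    return nums

out = radix_sort([170, 45, 75, 90, 802, 24, 2, 66])
```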

Groebner bases are an interesting construct in computational algebra, which, in some sense, "search" for solutions to systems of polynomial equations. Can that search be executed in a more cache-efficient manner?

Cayley graphs are a kind of deterministic automaton. There are interesting types of graphs that are most easily constructed using Cayley graphs. Can one search for a good graph in such a space of graphs?

Apr. 7, 2005
"Towards a Mathematical Justification for Spatial Preconditioning in Spatially Distributed Feature Learning"
Gene Cooperman
This Thursday, I will talk on the current work of Viet Ha Nguyen (since he's in Japan). Ha had a nice applied solution to a problem in brain scans. By generalizing, he developed a new heuristic: spatial preconditioning. The idea is:
1. Learn the spatial influence first.
2. Then do feature learning locally, using the modified features based on the spatial influence of nearest neighbors.
The idea has been demonstrated to work nicely in experiments. We now need to look for a mathematical justification. The talk will discuss the first attempts in that direction. - Gene