Web Resources for CS7680 (Special Topics in Systems)

Instructor: Gene Cooperman
Spring, 2020

CS 7680 (Spring, 2020): Formalisms for Large Computer Systems

This course will be an inter-disciplinary study of formalisms for large computer systems. Of course, the best way to study a large system is to first study a medium-size system as an example. So, we will use the DMTCP project as a research vehicle. While DMTCP is only about 30,000 lines of code, it is an ideal study vehicle because it has similar goals to an operating system, a compiler, a large-scale formal verification platform, and other examples.
The goal of this course is an inter-disciplinary study bringing together multiple disciplines. Students from non-systems disciplines are also welcome. A basic familiarity with Linux system calls is required. All other necessary background will be provided in self-contained lectures. The knowledge and experience of students from other disciplines will be especially valuabe (see below, for some of our current inter-disciplinary collaborations). I am aware that many of the students will be advanced PhD students, and I will be respectful of your time commitments, and look for ways to integrate a course project with your ongoing research.
THE BASIC COURSE PROBLEM: DMTCP seeks to support all correct programs. The other examples have ambitions to: check the correctness of all programs; translate correct programs to assembly language; and execute semantically correct programs as processes. DMTCP stands for Distributed Multi-Threaded Checkpoint-Restart. It attempts to transparently save the state of a running process to a file (checkpoint), and later resume that process from the file on disk (restart).
DMTCP has a successful history of 15 years with widespread usage and support from the NSF, Intel Corp., Mentor Graphics, Raytheon, and others. While DMTCP outwardly appears successful, cracks in its design appear as DMTCP is thrown at at arbitrary use cases (third-party proprietary libraries, the newest system calls, new types of networks, privileged system services, poorly defined corner cases in the standards, new and looser memory consistency models, etc.
In its defense, the DMTCP team has developed new systems architectures to partially cope with this complexity. A very successful recent example is split processes, a new mechanism to isolate the application from the external world, and communicate to the rest of the world through a closely monitored API. two independent programs are loaded into a single address space. A lower-half program is tied to the hardware and system services that is never checkpointed. An upper-half application program is isolated from the outsdie world, except through a well-defined API (e.g., CUDA for GPUs, MPI or network interface) to the lower half. Only the application program is checkpointed (saved to a file): EASY! At restart time, a FRESH lower-half program is loaded in, and the upper-half program uses the API to restore the lower-half to a consistent state semantically equivalent to what was seen at checkpoint time.
This might suffice for a larger development team (e.g., for Linux or for GNU gcc). As a smaller team, we are reaching back to the roots of Linux:the original UNIx. We are planning a series of innovations with external collaborators who are eager to try out their tools. The collaborations include:

Inria (Prof. Martin Quinson, ENS-Rennes): SimGrid is a model-checking tool for simulation and formal verification. Unlike other systems of which I am aware, SimGrid uses the memory state of the process or processes as the model-checking state. Hence, instead of translating code to a formal model and checking it, we can directly check the code itself. As an example, come to me for a demo, and I'll show you a program that takes simple multi-threaded programs used to teach undergraduates, and it directly finds the race conditions, and displays a schedule leading to the race condition. SimGrid is also used with MPI. Ignacio Laguna at Lawrence Livermore National Laboratory is also keenly interested.
U. of Utah (Prof. Ganesh Gopalkrishnan): Prof. Gopalkrishnan's team has an initial tool for comparing two execution streams to decide whether they are semantically equivalent. This is critical for DMTCP, since we always want to compare execution of a program both natively and under DMTCP. This is joint with Prof. Martin Burtscher (Texas State U.).
U. of Tennesse, Chattanooga (Prof. Anthony Skjellum): Prof. Skjellum's team has developed a small implementation of MPI from first principles, based on formal specifications for the MPI calls. They have already seen a few successes in benchmarks that are competitive with the million-line production versions of MPI. (As always, if you want to understand a large system, then study a small system.) Using the MPI formal spec, a collaborator Martin Ruefenacht (student of Ajitha Rajan at U. of Edinburgh, probably taking job at to Los Alamso National Lab in Fall) can use a tool, LemonSpotter, to auto-generate a test suite from the spec. DMTCP needs this for the libc system calls. Second, the small MPI of Prof. Skjellum allows us to cleanly study scaling to exascale (largest supercomputers in the world). There is a joke that in High Performance Computing (HPC), we don't debug. (It's too expensive to reserve 250 computer nodes for one person to interactively debug at scale with GDB.) Maybe the new tools above, such as SmiGrid/DMTCP for MPI, can help us break that barrier and begin to debug.
Paul Grosu (Khoury, N.U.): Paul is a research scientist in my group. Just as Prof. Skjellum has developed a small MPI from first principles, Paul is developing a new type of DMTCP from first principles. It is based on an interpreted language that displays the high-level steps of checkpoint-restart, and allows users to interactively replace some steps by others. This more flexible platform will also be used to better integrate with formal verification techniques.

Finally, if you are curious about an earlier iteration of this course, you'll find it at: topics-sp17. However, in this iteration, the course will be more inter-disciplinary, and will have a richer offering of possible collaborations, for those students who don't already have their own research project in mind for the course.

I will determine office hours closer to the start of the course. I am also generous about offering other times to meet, and I encourage you to come to my office (no prior appointment) and interrupt me and ask when we can meet. Ideally, I will arrange that within half an hour if possible.

As befitting a PhD topics course, if a student actively participates in all areas (readings, presentation, class discussion, project, written project report), then the grade will be an A.

See below for:

Topics
Course Structure
Very Short Essay on Course Philosophy
The Importance of Writing (and oral presentation)
Course Projects
Example: DMTCP-style checkpointing: interposition on system calls

Topics:

(See the list of collaborations above for some idea of the topics, and more topics will be added based on the interests of the students. You may also look at the previous iteration of the course from three years ago for some older topics.)

Course Structure:

I am flexible on the course structure. But I want it to include an emphasis on these three elements:

Paper readings, including oral presentations in class
A medium-sized, but open-ended, exploratory Software Project (The project can have software targets; or it can be based on paper-and-pencil design with reference to existing software building blocks. The practical experience of designing systems is important for gaining insights, and for critically analyzing academic research papers.)
Frequent Updates to your own technical writing, which may cover a survey of some aspect of the paper readings, or a report on discoveries from your exploration through an open-ended software project, or both. Written technical communication is a key skill. There are rules (even recipes) for good technical writing, and I intend to teach those rules and recipes. (For example, there are global recipes, such as reading just the first sentence of each paragraph for content; and there are local recipes such as the "Structure of Prose" in "The Science of Scientific Writing" from American Scientist.) Good technical writing is something that you can take with you for a lifetime.

Very Short Essay on Course Philosophy:

I believe that all three platforms (the Cloud, HPC cluster, and the Data Center) are fast evolving and converging into a new future concept that none of us might fully recognize today. Evidence of this is interactive use of Slurm in a batch-oriented HPC cluster, "bare-metal" Clouds and other completely new entities, GPU-based clusters (deep learning, unified address space between GPU and CPU/host), clusters for big data (Hadoop and Spark, stream processing). Much of this vision will be radically transformed by the newer memory technologies, including SSDs and the 3D XPoint (aka Optane) of Intel and Micron, as well as tight integration with GPU-like "AI" chips (e.g., TPU: Tensor Processing Unit chips). How do we integrate all this information and prepare for this future convergence of today's paradigms?

Let's develop the course content together as a community project among interested students.

The Importance of Writing:

The motivation for paper readings is obvious. The motivation for medium-sized exploratory software is to "keep us honest" (see projects, below). It's easy to read papers and spin up castles in the air. But how do we know that those castles in the air are practical? What are the real challenges when we try to build our castle in the air?

And finally, the motivation for an emphasis on technical writing with frequent updates is that technical writing is a critical skill that is typically given low priority due to the pressure of other course concerns. But especially in today's environment of highly competitive conference, and the need to exploit the web to propagate your technical visions, it is critical to write well. There are rules to learn good technical writing, just as there are rules to learn good principles of writing software. It will be my responsibility to serve as the "software compiler", and give you frequent feedback on issues with your writing. Probably, I will use Overleaf to easily share a single source file with the commonly used C.S. standard: LaTeX. But I am open to other suggestions. My goal is to be able to comment on and modify your writing at any time of day or night.

One last comment on writing: Good technical writing has nothing to do with whether your native language is English or something else. The content of good technical writing will shine through a translation into any language at all. It is the skill of writing good technical content with a clear exposition that I am looking for.

Course projects:

I would like to see each student choose a course project on the general theme of virtualization. Ideally, if the student is already working on a thesis or other long-term project, then some aspect of that involving virtualization can be used as the course project. The project can even be a thesis proposal or a chapter in a thesis.

I am not concerned with the particular functionality of the project. The project may be based on actual software, or on a software paper design.

The project will be used as a vehicle for learning better technical communication skills. We will use Overleaf as a way to share a document between the professor and the student. In this way, I can asynchronously point out places where full communication could be improved. From time to time, I will also ask you to spontaneously communicate orally about the ideas of your projects. The goal is fluency in communication --- not polished presentation. Polished presentation will be emphasized closer to the end of the course.

Further, the emphasis of this software (or paper design) project will not be software engineering. I don't care if any of your software works at the end, or if any of your paper design has been implemented at the end. Instead, I want you to poke at large, complex software from the outside, in order to get insights into it. There are two kinds of "proof" in this world: mathematical proof and scientific proof. Mathematical proof is about formal proofs. It relates to formal verification and to semantics of programming language. It works best for programming "in the small". Scientific proof is evidence-based. It works best for programming "in the large". What are some scientific experiments that you can perform on this large software in order to gain evidence for its expected behavior? Virtualization is about interposing on large, complex software. How does one poke at it, without having to spend huge amounts of time reverse-engineering the internals of that software?

If you don't already have a project, or if you are interested in doing something different, here are some possible ideas. This is a short list that will grow over time. It is particularly DMTCP-centric because my group's research emphasizes that project. Students are welcome to substitute their favorite complex software (Hadoop?, Spark?, MPI?, a distributed system?) for DMTCP, and ask similar questions.

I will teach about DMTCP and a new theory of split processes in a self-contained manner in the course. But if you would like to get the flavor of it now, you could look at slides from a recent talk at Microsoft Research Labos, or a recent talk from the MVAPICH (MPI) User's Group.

Transparent checkpointing for Microsoft Windows. Transparent checkpointing still exists only for Linux. This is because Windows splits its process table into a kernel portion and a user-space portion (part of the user's address space). While user-space process information adds efficiency, it makes any scheme for saving and restoring the state of all of user-space memory very problematic. But applying the newer idea of split process provides some hope that it's finally possible for Windows.
DMTCP uses the dlsym system call to interpose on functions to create wrapper functions. This is one of the key requirements in the development of DMTCP. It is also and excellent way to do virtualization through interposition. However, Docker and the Go language use statically linked binaries. For statically linked binaries, dlsym has no effect. (After all, "dl" stands for "dynamically linked".) What are the alternatives, in order to support statically linked binaries? If you choose this, I will provide you with a tiny package that parses ELF libraries and then directly does interposition on ELF symbols. How easy is it to port DMTCP to use this new package instead of dlsym? In general, to what extent does static linking imply that we lose the ability to interpose (and hence virtualize), and to what extent can we get around this with ELF symbols, ELF relocation, trampolines, etc? (See these slides for an overview of DMTCP.)
In supercomputing, the InfiniBand network fabric has been the standard for over a decade. It is based on the concept of RDMA. There are now newer network fabrics that the world is moving toward. For example, Intel Omni-Path stands a good chance of becoming the next standard. There are also attempts to unify these standards (e.g., OpenFabric). Every time that a new network is chosen, most of the HPC software stack must be ported toward that new stack. Further, DMTCP must then create a new plugin for the new network. What is the potential for creating an abstraction for an RDMA network sufficient for interposition (and hence for virtualization). Can split processes help?
Other projects motivated by your own research?

DMTCP-style checkpointing:

DMTCP-style checkpointing (interposition on system calls)