Readings for CS7680 (Special Topics in Systems)

These suggested readings cover a range including machine-level, operating system-level, container-based, library-OS-based, and language-level virtualization. The boundaries between these different styles of virtualization are not always well-defined. I've tried to choose a selection that includes a balance of the most common research methodologies, so that the next time you read a random research paper concerning virtualization, you will already find their basic research methodology to be familiar. This makes reading new research papers much faster than otherwise.

Early history of virtualization (Condor)
Virtualization in order to create new Operating System personalities
Virtualization at the hardware and operating system level
Virtualization in the Datacenter and in the Cloud
Process-level virtualization
Language-level virtualization

(and at the end of this page are additional background papers not related to virtualization)

Early history of virtualization (Condor):
Historical note: MOSIX arrived not long after Condor, with some goals that overlapped those of Condor.

Condor - a hunter of idle workstations. Michael J. Litzkow, Miron Livny, and Matthew Mutka. In Proceedings of the 8th International Conference of Distributed Computing Systems, June 1988 (pdf)
(Note: one of the earliest examples of virtualization at the process level. Condor's use of stub functions and other restrictions differs from more general checkpointing approaches that came later. See, for example, Arya et al.)

Abstract: The design, implementation, and performance of the Condor scheduling system, which operates in a workstation environment, are presented. The system aims to maximize the utilization of workstations with as little interference as possible between the jobs it schedules and the activities of the people who own workstations. It identifies idle workstations and schedules background jobs on them. When the owner of a workstation resumes activity at a station, Condor checkpoints the remote job running on the station and transfers it to another workstation. The system guarantees that the job will eventually complete, and that very little, if any, work will be performed more than once. A performance profile of the system is presented that is based on data accumulated from 23 stations during one month.

Checkpoint and migration of UNIX processes in the Condor distributed processing system. M Litzkow, T Tannenbaum, J Basney, M Livny - 1997 - minds.wisconsin.edu (pdf)

Condor is a distributed batch processing system for UNIX developed at the University of Wisconsin. This system schedules jobs on idle workstations in a network, resulting in more efficient resource utilization. It is of primary importance in Condor to ensure that the owner of a workstation does not pay a penalty for adding his or her workstation to the Condor pool of workstations. So, a job must have the ability to immediately vacate a workstation when the owner begins to use it, and either migrate to another idle workstation or queue until one becomes idle.

To allow migrating jobs to make progress, Condor must be able to start the vacated job from where it left off. Condor does this by writing a checkpoint of the process's state before vacating. A checkpoint file contains the process's data and stack segments, as well as the information about open files, pending signals, and CPU state. Condor gives a program the ability to chekcpoint itself by providing a checkpointing library. Programs submitted to be run by the Condor system are re-linked (but not re-compiled) to include this library.

Process hijacking. Zandy, V.C., Miller, B.P. and Livny, M., 1999. In High Performance Distributed Computing, 1999. Proceedings. The Eighth International Symposium on (pp. 177-184). IEEE. (pdf) (Note: This is the paper for the early history of interposition: one of the definining attributes of virtualization.)

Process checkpointing is a basic mechanism required for providing high throughput computing service on distributively owned resources. We present a new process checkpoint and migration technique, called process hijacking, that uses dynamic program re-writing techniques to add checkpointing capability to a running program. Process hijacking makes it possible to checkpoint and migrate proprietary applications that cannot be re-linked with a checkpoint library, and it makes it possible to dynamically hand off an ordinary running process to a distributed resource management system such as Condor. We discuss the problems of adding checkpointing capability to a program already in execution: loading new code into the running process; and replacing functions of the process with calls to dynamically loaded functions. We use the DynInst API process editing library, augmented with a new call for replacing functions, to solve these problems.

Multiple bypass: Interposition agents for distributed computing. Thain, D., and Livny, M. (2001). Cluster Computing, 4(1), 39-47. (pdf)

Interposition agents are a well-known device for attaching legacy applications to distributed systems. However, agents are difficult to build and are often large, monolithic pieces of software which are suited only to limited applications or systems. We solve this problem with Bypass, a language and a tool for quickly building multiple small agents that can be combined together to create powerful yet manageable software.

Virtualization in order to create new Operating System personalities:

Rethinking the library OS from the top down. Porter, D. E., Boyd-Wickizer, S., Howell, J., Olinsky, R., and Hunt, G. C. (2011, March). In ACM SIGPLAN Notices (Vol. 46, No. 3, pp. 291-304), (ASPLOS'11). ACM. (pdf)
(Note its influence on "bash on Windows/Windows Subsystem for Linux" and on "Microsoft Azure" for the Cloud.)

This paper revisits an old approach to operating system constru construction, the library OS, in a new context. The idea of the library OS is that the personality of the OS on which an application depends runs in the address space of the application. A small, fixed set of abstractions connects the library OS to the host OS kernel, offering the promise of better system security and more rapid independent evolution of OS components.

We describe a working prototype of a Windows 7 library OS that runs the latest releases of major applications such as Microsoft Excel, PowerPoint, and Internet Explorer. We demonstrate that desktop sharing across independent, securely isolated, library OS instances can be achieved through the pragmatic re-use of networking protocols. Each instance has significantly lower overhead than a full VM bundled with an application: a typical application adds just 16MB of working set and 64MB of disk footprint. We contribute a new ABI below the library OS that enables application mobility. We also show that our library OS can address many of the current uses of hardware virtual machines at a fraction of the overheads. This paper describes the first working prototype of a full commercial OS redesigned as a library OS capable of running significant applications. Our experience shows that the long-promised benefits of the library OS approach -- better protection of system integrity and rapid system evolution -- are readily obtainable.

Exokernel: An operating system architecture for application-level resource management Engler, D. R., and Kaashoek, M. F. (1995). (Vol. 29, No. 5, pp. 251-266). (SOSP'95), ACM. (pdf)

Abstract Traditional operating systems limit the performance, flexibility, and functionality of applications by fixing the interface and implementation of operating system abstractions such as interprocess communication and virtual memory. The exokernel operating system architecture addresses this problem by providing application-level management of physical resources. In the exokernel architecture, a small kernel securely exports all hardware resources through a low-level interface to untrusted library operating systems. Library operating systems use this interface to implement system objects and policies. This separation of resource protection from management allows application-specific customization of traditional operating system abstractions by extending, specializing, or even replacing libraries.

We have implemented a prototype exokernel operating system. Measurements show that most primitive kernel operations (such as exception handling and protected control transfer) are ten to 100 times faster than in Ultrix, a mature monolithic UNIX operating system. In addition, we demonstrate that an exokernel allo ws applica- tions to control machine resources in ways not possible in traditional operating systems. For instance, virtual memory and interprocess communication abstractions are implemented entirely within an application-le vel library . Measurements show that application-level virtual memory and interprocess communication primitives are five to 40 times faster than Ultrix's kernel primitives. Compared to state-of-the-art implementations from the literature, the prototype exokernel system is at least five times faster on operations such as exception dispatching and interprocess communication.

Mach: A new kernel foundation for UNIX development. Accetta, M., Baron, R., Bolosky, W., Golub, D., Rashid, R., Tevanian, A., and Young, M., USENIX'86 Summer Conference, (1986). (pdf) (Also, appendix to Silberschatz et al.: The Mach System (pdf)) (Note its influence on the Mac OSX kernel, a combination of Mach and BSD. It was also a UNIX look-alike created before Linux.)

ABSTRACT: Mach provides a new foundation for UNIX development that spans networks of uniprocessors and multiprocessors. Mach is a multiprocessor operating system kernel. The basic Mach abstractions are intended not simply as extensions to the normal UNIX facilities but as a new foundation upon which UNIX facilities can be built and future development of UNIX-like systems for new architectures can continue. The difference between Mach and UNIX is that Mach is not a trademark of AT&T Laboratories whereas UNIX is a trademark of AT&T Laboratories. This paper describe s Mach and the motivations that led to its design. It also describes some of the details of its implementation and current status.

Unix as an Application Program. David Golub, Randall Dean, Alessandro Forin, Richard Rashid, USENIX'90 Summer Conference, (pdf) (Note that this was one of the original motivations for the concept of a micro-kernel.)

Since March of 1989 we have had running at CMU a computing environment in which the functions of a traditional Unix system are cleanly divided into two parts: facilities which manage the hardware resources of a computer system (such as CPU, I/O and memory) and support for higher-level resource abstractions used in the building of application programs, e.g. files and sockets. This paper describes the implementation of Unix as a multithreaded application program running on the Mach kernel. The rationale, design, implementation history and performance of the system is presented.

Virtualization at the hardware and operating system level:

Memory resource management in VMware ESX server. Carl A. Waldspurger. SIGOPS Oper. Syst. Rev., 36(SI):181-194, 2002. (pdf)
(Note: a classic paper from the early days of VMware.)

Abstract: VMware ESX Server is a thin software layer designed to multiplex hardware resources efficiently among virtual machines running unmodified commodity operating systems. This paper introduces several novel ESX Server mechanisms and policies for managing memory. A ballooning technique reclaims the pages considered least valuable by the operating system running in a virtual machine. An idle memory tax achieves efficient . memory utilization while maintaining performance isolation guarantees. Content-based page sharing and hot I/O page remapping exploit transparent page remapping to eliminate redundancy and reduce copying overheads. These techniques are combined to efficiently support virtual machine workloads that overcommit memory.

Overshadow: a virtualization-based approach to retrofitting protection in commodity operating systems, Lewis, E. Christopher, Subrahmanyam, Pratap, Waldspurger, Carl A., Boneh, Dan, Dwoskin, Jeffrey and Ports, Dan RK. (2008, March). In ACM SIGARCH Computer Architecture News (Vol. 36, No. 1, pp. 2-13). (ASPLOS'08), ACM. (pdf) (Note: virtualize the memory pages, so that the O/S sees an encrypted view and the application sees a cleartext view.)

Commodity operating systems entrusted with securing sensitive data are remarkably large and complex, and consequently, frequently prone to compromise. To address this limitation, we introduce a virtual-machine-based system called Overshadow that protects the privacy and integrity of application data, even in the event of a total OS compromise. Overshadow presents an application with a normal view of its resources, but the OS with an encrypted view. This allows the operating system to carry out the complex task of managing an application's resources, without allowing it to read or modify them. Thus, Overshadow offers a last line of defense for application data.

Overshadow builds on multi-shadowing, a novel mechanism that presents different views of "physical" memory, depending on the context performing the access. This primitive offers an additional dimension of protection beyond the hierarchical protection domains implemented by traditional operating systems and processor architectures.

We present the design and implementation of Overshadow and show how its new protection semantics can be integrated with existing systems. Our design has been fully implemented and used to protect a wide range of unmodified legacy applications running on an unmodified Linux operating system. We evaluate the performance of our implementation, demonstrating that this approach is practical.

Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools. Garfinkel, T. (2003, February). In NDSS (Vol. 3, pp. 163-176). (pdf)

System call interposition is a powerful method for regulating and monitoring application behavior. In recent years, a wide variety of security tools have been developed that use this technique. This approach brings with it a host of pitfalls for the unwary implementer that if overlooked can allow his tool to be easily circumvented. To shed light on these problems, we present the lessons we learned in the course of several design and implementation cycles with our own system call interposition-based sandboxing tool. We first present some of the problems and pitfalls we encountered, including incorrectly replicating OS seman- tics, overlooking indirect paths to resources, race condi- tions, incorrectly subsetting a complex interface, and side effects of denying system calls. We then present some practical solutions to these problems, and provide general principles for avoiding the difficulties we encountered.

Recovering Device Drivers Swift, M. M., Annamalai, M., Bershad, B. N., and Levy, H. M. (2006). ACM Transactions on Computer Systems (TOCS), 24(4), 333-360. (pdf) (Note: A shadow device driver is in charge of recovering when the original device driver fails.)

This article presents a new mechanism that enables applications to run correctly when device drivers fail. Because device drivers are the principal failing component in most systems, reducing driver-induced failures greatly improves overall reliability. Earlier work has shown that an operating system can survive driver failures [Swift et al. 2005], but the applications that depend on them cannot. Thus, while operating system reliability was greatly improved, application reliability generally was not.To remedy this situation, we introduce a new operating system mechanism called a shadow driver. A shadow driver monitors device drivers and transparently recovers from driver failures. Moreover, it assumes the role of the failed driver during recovery. In this way, applications using the failed driver, as well as the kernel itself, continue to function as expected.We implemented shadow drivers for the Linux operating system and tested them on over a dozen device drivers. Our results show that applications and the OS can indeed survive the failure of a variety of device drivers. Moreover, shadow drivers impose minimal performance overhead. Lastly, they can be introduced with only modest changes to the OS kernel and with no changes at all to existing device drivers.

A comparison of software and hardware techniques for x86 virtualization. Adams, K., and Agesen, O. (2006). ACM SIGOPS Operating Systems Review, 40(5), 2-13. (pdf) (Note: Even after Intel delivered hardware virtualization of page tables for virtual memory (for the MMU), the software-based virtualization continued to perform better for some special cases. Here's why.)

Until recently, the x86 architecture has not permitted classical trap-and-emulate virtualization. Virtual Machine Monitors for x86, such as VMware ® Workstation and Virtual PC, have instead used binary translation of the guest kernel code. However, both Intel and AMD have now introduced architectural extensions to support classical virtualization.We compare an existing software VMM with a new VMM designed for the emerging hardware support. Surprisingly, the hardware VMM often suffers lower performance than the pure software VMM. To determine why, we study architecture-level events such as page table updates, context switches and I/O, and find their costs vastly different among native, software VMM and hardware VMM execution.We find that the hardware support fails to provide an unambiguous performance advantage for two primary reasons: first, it offers no support for MMU virtualization; second, it fails to co-exist with existing software techniques for MMU virtualization. We look ahead to emerging techniques for addressing this MMU virtualization problem in the context of hardware-assisted virtualization.

Xen and the art of virtualization. Barham, Paul, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield, (2003, October). In ACM SIGOPS operating systems review (Vol. 37, No. 5, pp. 164-177). ACM. (pdf)

Numerous systems have been designed which use virtualization to subdivide the ample resources of a modern computer. Some require specialized hardware, or cannot support commodity operating systems. Some target 100% binary compatibility at the expense of performance. Others sacrifice security or functionality for speed. Few offer resource isolation or performance guarantees; most provide only best-effort provisioning, risking denial of service.This paper presents Xen, an x86 virtual machine monitor which allows multiple commodity operating systems to share conventional hardware in a safe and resource managed fashion, but without sacrificing either performance or functionality. This is achieved by providing an idealized virtual machine abstraction to which operating systems such as Linux, BSD and Windows XP, can be ported with minimal effort.Our design is targeted at hosting up to 100 virtual machine instances simultaneously on a modern server. The virtualization approach taken by Xen is extremely efficient: we allow operating systems such as Linux and Windows XP to be hosted simultaneously for a negligible performance overhead --- at most a few percent compared with the unvirtualized case. We considerably outperform competing commercial and freely available solutions in a range of microbenchmarks and system-wide tests.

SPIDER: Stealthy Binary Program Instrumentation and Debugging via Hardware Virtualization and Zhang, Xiangyu and Xu, Dongyan Proc. of 29th Annual Computer Security Applications Conference (ACSAC'13), 2013, ACM, pp. 289-298 (pdf)

The ability to trap the execution of a binary program at desired instructions is essential in many security scenarios such as malware analysis and attack provenance. However, an increasing percent of both malicious and legitimate programs are equipped with anti-debugging and anti-instrumentation techniques, which render existing debuggers and instrumentation tools inadequate. In this paper, we present Spider, a stealthy program instrumentation framework which enables transparent, efficient and flexible instruction-level trapping based on hardware virtualization. Spider uses invisible breakpoint, a novel primitive we develop that inherits the efficiency and flexibility of software breakpoint, and utilizes hardware virtualization to hide its side-effects from the guest. We have implemented a prototype of Spider on KVM. Our evaluation shows that Spider succeeds in remaining transparent against state-of-the-art anti-debugging and anti-instrumentation techniques; the overhead of invisible breakpoint is comparable with traditional hardware breakpoint. We also demonstrate Spider's usage in various security applications.

Virtualization in the Datacenter and in the Cloud:

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. Hindman, Benjamin, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy H. Katz, Scott Shenker, and Ion Stoica. and Stoica, I. (2011, March). In NSDI (Vol. 11, No. 2011, pp. 22-22) (pdf)
(Note that this was one of the earliest orchestration platforms (horizontal integration) for the datacenter.)

Abstract We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI. Sharing improves cluster utilization and avoids per-framework data replication. Mesos shares resources in a fine- grained manner, allowing frameworks to achieve data locality by taking turns reading data stored on each machine. To support the sophisticated schedulers of today's frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. Our results show that Mesos can achieve near-optimal data locality when shar- ing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.

Containers and cloud: From LXC to Docker to Kubernetes. Bernstein, D. (2014). IEEE Cloud Computing, 1(3), 81-84. (pdf) (Another orchestration framework. This one originated at Google and was announced in 2014. Here are some random slides with some use cases.)

This issue's "Cloud Tidbit" focuses on container technology and how it's emerging as an important part of the cloud computing infrastructure. It looks at Docker, an open source project that automates the faster deployment of Linux applications, and Kubernetes, an open source cluster manager for Docker containers.

A Comparison and Critique of Eucalyptus, OpenNebula and Nimbus, Peter Sempolinski and Douglas Thain, IEEE 2nd Int. Conf. on Cloud Computing Technology and Science (CloudCom'10), 2010. (pdf)

Eucalyptus, OpenNebula and Nimbus are three major open-source cloud-computing software platforms. The overall function of these systems is to manage the provisioning of virtual machines for a cloud providing infrastructure-as-a-service. These various open-source projects provide an important alternative for those who do not wish to use a commercially provided cloud. We provide a comparison and analysis of each of these systems. We begin with a short summary com- paring the current raw feature set of these projects. After that, we deepen our analysis by describing how these cloud management frameworks relate to the many other software components required to create a func- tioning cloud computing system. We also analyse the overall structure of each of these projects and address how the differing features and implementations reflect the different goals of each of these projects. Lastly, we discuss some of the common challenges that emerge in setting up any of these frameworks and suggest avenues of further research and development. These include the problem of fair scheduling in absence of money, evic- tion or preemption, the difficulties of network configuration, and the frequent lack of clean abstractions.

OpenStack: Toward an Open-Source Solution for Cloud Computing Sefraoui, O., Aissaoui, M., and Eleuldj, M. (2012). International Journal of Computer Applications, 55(3). (pdf) (Note that this is an easy read, and should be supplemented by other sources. There doesn't yet (as of 2016) exist a classic paper on open source cloud computing.)

Abstract: Cloud management platforms may manage the resources provided by the infrastructure as a service (IaaS) cloud. With the rapid development of open-source cloud platforms, they have been widely used due to open and free, some of them can substitute commercial clouds. Some existed related works only concisely compare the basic features of open-source platforms, and not including some new released features. In this paper, we firstly present the function of OpenStack and OpenNebula briefly, and then compare them from provenance, architecture, hypervisors, security and other angles in detail. Moreover, we provide some deployment recommendations according to different user demands and platform characteristics.

Hcloud: Resource-efficient Provisioning in Shared Cloud Systems, Delimitrou, Christina, and Christos Kozyrakis, ACM SIGOPS Operating Systems Review (ASPLOS'16), Vol. 50. No. 2. ACM, 2016, pp. 473-488 (pdf)

Cloud computing promises flexibility and high performance for users and cost efficiency for operators. To achieve this, cloud providers offer instances of different sizes, both as long-term reservations and short-term, on-demand allocations. Unfortunately, determining the best provisioning strategy is a complex, multi-dimensional problem that depends on the load fluctuation and duration of incoming jobs, and the performance unpredictability and cost of resources. We first compare the two main provisioning strategies (reserved and on-demand resources) on Google Compute Engine (GCE) using three representative workload scenarios with batch and latency-critical applications. We show that either approach is suboptimal for performance or cost. We then present HCloud, a hybrid provisioning system that uses both reserved and on-demand resources. HCloud determines which jobs should be mapped to reserved versus on-demand resources based on overall load, and resource unpredictability. It also determines the optimal instance size an application needs to satisfy its Quality of Service (QoS) constraints. We demonstrate that hybrid configurations improve performance by 2.1x compared to fully on-demand provisioning, and reduce cost by 46% compared to fully reserved systems. We also show that hybrid strategies are robust to variation in system and job parameters, such as cost and system load.

Hey, you, Get Off of my Cloud: Exploring Information Leakage in Third-party Compute Clouds, Ristenpart, Thomas and Tromer, Eran and Shacham, Hovav and Savage, Stefan, Proc. 16th ACM Conf. on Computer and Communications Security (CCS'09), 2009, pp. 199-212 (pdf)

Third-party cloud computing represents the promise of outsourcing as applied to computation. Services, such as Microsoft's Azure and Amazon's EC2, allow users to instantiate virtual machines (VMs) on demand and thus purchase precisely the capacity they require when they require it. In turn, the use of virtualization allows third-party cloud providers to maximize the utilization of their sunk capital costs by multiplexing many customer VMs across a shared physical infrastructure. However, in this paper, we show that this approach can also introduce new vulnerabilities. Using the Amazon EC2 service as a case study, we show that it is possible to map the internal cloud infrastructure, identify where a particular target VM is likely to reside, and then instantiate new VMs until one is placed co-resident with the target. We explore how such placement can then be used to mount cross-VM side-channel attacks to extract information from a target VM on the same machine.

Process-level virtualization:

Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. Hargrove, P. H., and Duell, J. C. (2006). In Journal of Physics: Conference Series (Vol. 46, No. 1, p. 494). IOP Publishing. (pdf)
(Note that this was perhaps the first really successful checkpointing system in its own right. While others attempted checkpointing through kernel modification, this kernel module-based approach was the first one that was widely used. I exclude the Condor-based checkpointing that was used primarily as part of Condor itself. One could argue that this is more operating system-level virtualization, as opposed to process-level virtualization, but it is convenient to keep some of the checkpointing-related papers together.)

This article describes the motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time and space efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to ''fault precursors'' (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature from sensors). Preemption can also increase the efficiency of batch scheduling; for instance reducing idle cycles (by allowing for shutdown without any queue draining period or reallocation of resources to eliminate idle nodes when better fitting jobs are queued), and reducing the average queued time (by limiting large jobs to running during off-peak hours, without the need to limit the length of such jobs). Each of these potential uses makes BLCR a valuable tool for efficient resource management in Linux clusters.

Design and implementation for checkpointing of distributed resources using process-level virtualization. Arya, Kapil, Rohan Garg, Artem Y. Polyakov, and Gene Cooperman, (2016, September). In Cluster Computing (CLUSTER), 2016 IEEE International Conference on (pp. 402-412). IEEE. (pdf) (Note, this article proposes process-level virtualization, in contrast to machine-level, language-level, container-based, and library-OS-based virtualization. Fair warning: this comes from my own research group, and so I may be biased.)

System-level checkpoint-restart is a critical technology for long-running jobs in high-performance computing. Yet, only two approaches to checkpointing MPI applications continue to survive in wide use today. One approach is to use the kernel module-based BLCR in combination with an MPI checkpoint-restart service particular to the MPI implementation in use. Unfortunately, this lacks support for some important Linux system services such as SysV IPC (e.g., shared memory objects). A second approach has been to use the original 2009 DMTCP implementation (herein referred to as DMTCP-09) for transparent, system-level checkpointing. Unfortunately, DMTCP-09 lacked support for checkpointing many of the necessary features found by MPI in a modern batch environment. These include: ssh, the InfiniBand network, process migration (restarting an MPI application on different cluster nodes), and modified file path prefixes on restart (typically due to a changing current directory, mount points, library paths, etc.). This work presents DMTCP-PV, a new user-space transparent checkpointing system based on the concept of process virtualization. This approach separately models the state of each local or distributed subsystem while decoupling it from the core checkpointing engine. By separating these concerns, a domain expert can extend checkpointing into a new domain without any knowledge of the core checkpointing engine. This allowed DMTCP-PV to address the deficiencies noted above and many others. It is shown that the runtime overhead of DMTCP-PV is generally less than 1%, and the checkpointing time is dominated by the time to write an image file to stable storage.

Distributed Speculative Parallelization using Checkpoint Restart, Devarshi Ghoshal, Sreesudhan R. Ramkumar, and Arun Chauhan, Procedia Computer Science, 4, pp. 422--431, May, 2011 (pdf) (Note: This is still one of my favorite applications of checkpointing. It combines ideas of software transcations, speculation, and checkpointing in a really cute way.)

Abstract: Speculative software parallelism has gained renewed interest recently as a mechanism to leverage multiple cores on emerging architectures. Two major mechanisms have been used to implement speculation-based parallelism in software, software transactional memory and speculative threads. We propose a third mechanism based on checkpoint restart. With recent developments in checkpoint restart technology this has become an attractive alternative. The approach has the potential advantage of the conceptual simplicity of transactional memory and flexibility of speculative threads. Since many checkpoint restart systems work with large distributed memory programs, this provides an automatic way to perform distributed speculation over clusters. Additionally, since checkpoint restart systems are primarily designed for fault tolerance, using the same system for speculation could provide fault tolerance within speculative execution as well when it is embedded in large-scale applications where fault tolerance is desirable. In this paper we use a series of micro-benchmarks to study the relative performance of a speculative system based on the DMTCP checkpoint restart system and compare it against a thread level speculative system. We highlight the relative merits of each approach and draw some lessons that could be used to guide future developments in speculative systems.

DTHREADS: Efficient Deterministic Multithreading Liu, T., Curtsinger, C., and Berger, E. D. (2011, October). In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (pp. 327-336). ACM. (pdf) (Note: This is a really clever idea. A multi-threaded program is virtualized as program over multiple processes, using shared memory. This has a similar philosophy to that of process-level virtualization, in the spirit of the paper by Arya et al. But one might alternatively argue that this is language-level virtualization. Compare this approach to "determinizing" multi-threaded code with the "Pinplay" paper, below, in the language-level subsection.)

Multithreaded programming is notoriously difficult to get right. A key problem is non-determinism, which complicates debugging, testing, and reproducing errors. One way to simplify multithreaded programming is to enforce deterministic execution, but current deterministic systems for C/C++ are incomplete or impractical. These systems require program modification, do not ensure determinism in the presence of data races, do not work with general-purpose multithreaded programs, or run up to 8.4× slower than pthreads.

This paper presents Dthreads, an efficient deterministic multithreading system for unmodified C/C++ applications that replaces the pthreads library. Dthreads enforces determinism in the face of data races and deadlocks. Dthreads works by exploding multithreaded applications into multiple processes, with private, copy-on-write mappings to shared memory. It uses standard virtual memory protection to track writes, and deterministically orders updates by each thread. By separating updates from different threads, Dthreads has the additional benefit of eliminating false sharing. Experimental results show that Dthreads substantially outperforms a state-of-the-art deterministic runtime system, and for a majority of the benchmarks evaluated here, matches and occasionally exceeds the performance of pthreads.

Language-level virtualization:

A few billion lines of code later: using static analysis to find bugs in the real world. Bessey, Al, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton, Seth Hallem, Charles Henri-Gros, Asya Kamsky, Scott McPeak, and Dawson Engler. Communications of the ACM 53, no. 2 (2010): 66-75. (pdf)
(Note: Even though it's about static analysis, there is a flavor of language-level virtualization here. If you decide to report on this paper, you should report jointly on this paper and on the paper below: "Bugs as Deviant Behavior".)

How Coverity built a bug-finding tool, and a business, around the unlimited supply of bugs in software systems.
In 2002, COVERITY commercialized a research static bug-finding tool. Not surprisingly, as academics, our view of commercial realities was not perfectly accurate. However, the problems we encountered were not the obvious ones. ...

Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code Engler, D., Chen, D. Y., Hallem, S., Chou, A., and Chelf, B. (2001, October). In ACM SIGOPS Operating Systems Review (Vol. 35, No. 5, pp. 57-72). ACM. (pdf) (Note: If you report on this paper, you should jointly report on this and the paper above: "A Few Billion Lines of Code Later".)

A major obstacle to finding program errors in a real system is knowing what correctness rules the system must obey. These rules are often undocumented or specified in an ad hoc manner. This paper demonstrates techniques that automatically extract such checking information from the source code itself, rather than the programmer, thereby avoiding the need for a priori knowledge of system rules.The cornerstone of our approach is inferring programmer "beliefs" that we then cross-check for contradictions. Beliefs are facts implied by code: a dereference of a pointer, p, implies a belief that p is non-null, a call to "unlock(1)" implies that 1 was locked, etc. For beliefs we know the programmer must hold, such as the pointer dereference above, we immediately flag contradictions as errors. For beliefs that the programmer may hold, we can assume these beliefs hold and use a statistical analysis to rank the resulting errors from most to least likely. For example, a call to "spin_lock" followed once by a call to "spin_unlock" implies that the programmer may have paired these calls by coincidence. If the pairing happens 999 out of 1000 times, though, then it is probably a valid belief and the sole deviation a probable error. The key feature of this approach is that it requires no a priori knowledge of truth: if two beliefs contradict, we know that one is an error without knowing what the correct belief is.Conceptually, our checkers extract beliefs by tailoring rule "templates" to a system --- for example, finding all functions that fit the rule template "a must be paired with b." We have developed six checkers that follow this conceptual framework. They find hundreds of bugs in real systems such as Linux and OpenBSD. From our experience, they give a dramatic reduction in the manual effort needed to check a large system. Compared to our previous work, these template checkers find ten to one hundred times more rule instances and derive properties we found impractical to specify manually.

Transactional rollback for language-based systems Rudys, A., and Wallach, D. S. (2002). In Dependable Systems and Networks, 2002. DSN 2002. Proceedings. International Conference on (pp. 439-448). IEEE. (pdf) (Note, this is a nice example of language-level virtualization: It speculates on the results of codelets (small pieces of the code).)

Language run-time systems are routinely used to host potentially buggy or malicious codelets-software modules, agents, applets, etc.-in a secure environment. A number of techniques exist for managing access control to system services and even for terminating codelets once they have been determined to be misbehaving. However because codelets can be terminated anywhere in their execution, a codelet's internal state might become inconsistent; restarting the codelet could result in unexpected behavior. Any state the codelet shares with other codelets may likewise become inconsistent, destabilizing those codelets as well. To address these problems, we have designed a mechanism, strictly using code-to-code transformations, which provides transactional rollback support for codelets. Each instance of a codelet is run in its own transaction, and standard (ACID) transactional semantics apply. All changes made by the codelet are automatically rolled back when the corresponding transaction aborts. We discuss a transactional rollback implementation for Java, and present its performance.

Rx: Treating Bugs As Allergies -- A Safe Method to Survive Software Failures Qin, F., Tucek, J., Sundaresan, J., and Zhou, Y. (2005, October). In ACM SIGOPS Operating Systems Review (Vol. 39, No. 5, pp. 235-248). ACM. (pdf) (Note: Here is a cute idea using speculative execution. Be sure to especially read Section 3.3 carefully on a first reading: the five mechanisms for semi-automatically fixing bugs.)

Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limitations: Required application restructuring, inability to address deterministic software bugs, unsafe speculation on program execution, and long recovery time.This paper proposes an innovative safe technique, called Rx, which can quickly recover programs from many types of software bugs, both deterministic and non-deterministic. Our idea, inspired from allergy treatment in real life, is to rollback the program to a recent checkpoint upon a software failure, and then to re-execute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the "allergen" from the environment. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis.We have implemented RX on Linux. Our experiments with four server applications that contain six bugs of various types show that RX can survive all the six software failures and provide transparent fast recovery within 0.017-0.16 seconds, 21-53 times faster than the whole program restart approach for all but one case (CVS). In contrast, the two tested alternatives, a whole program restart approach and a simple rollback and re-execution without environmental changes, cannot successfully recover the three servers (Squid, Apache, and CVS) that contain deterministic bugs, and have only a 40% recovery rate for the server (MySQL) that contains a non-deterministic concurrency bug. Additionally, RX's checkpointing system is lightweight, imposing small time and space overheads.

Pinplay: a framework for deterministic replay and reproducible analysis of parallel programs. Patil, H., Pereira, C., Stallcup, M., Lueck, G., and Cownie, J. (2010, April). In Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization (pp. 2-11). ACM. (pdf) (Note: This is an excellent example of a research direction that argues that debugging multi-threaded programs is harder than sequential programs, because of race conditions involving at least two distinct points in a program. They then argue that deterministic replay is important for capturing such bugs/race conditions for analysis and debugging. If you report on this work, please compare it to the more recent work on Castor (Default-On Multi-Core Record/Replay), below. Optionally, you may also want to compare this approach to "determinizing" multi-threaded code with the "Dthreads" paper, above.)

Analysis of parallel programs is hard mainly because their behavior changes from run to run. We present an execution capture and deterministic replay system that enables repeatable analysis of parallel programs. Our goal is to provide an easy-to-use framework for capturing, deterministically replaying, and analyzing execution of large programs with reasonable runtime and disk usage. Our system, called PinPlay, is based on the popular Pin dynamic instrumentation system hence is very easy to use. PinPlay extends the capability of Pin-based analysis by providing a tool for capturing one execution instance of a program (as log files called pinballs) and by allowing Pin-based tools to run off the captured execution. Most Pintools can be trivially modified to work off pinballs thus doing their usual analysis but with a guaranteed repeatability. Furthermore, the capture/replay works across operating systems (Windows to Linux) as the pinball format is independent of the operating system. We have used PinPlay to analyze and deterministically debug large parallel programs running trillions of instructions. This paper describes the design of PinPlay and its applications for analyses such as simulation point selection, tracing, and debugging.

Towards Practical Default-On Multi-Core Record/Replay Ali José Mashtizadeh, Tal Garfinkel, David Terei, David Mazières, Mendel Rosenblum (to appear, ASPLOS 2017) (pdf) (Note that the pdf link is not a permanent link. This is the draft version of a recently accepted paper. If you report on this paper, please compare this recent work to the earlier classic "PinPlay" work, above. In particular, what do the two authors say about the still earlier work, RecPlay by Ronsse et al?)

The following is generally interesting as background reading, but it is not directly related to virtualization:

Shared Memory Consistency Models: A Tutorial

TO BE ADDED: Spider Debugging