Suggested Team Projects for CS 5600
NOTE:
The Wiki page for the team projects has now been set up.
Please go to the Wiki page for further information.
The preferred team size is three students. Exceptions can be
made with proper justification. Please consider me an informal
fourth member of each team.
Each team will give a 5-minute oral summary of its progress in class
each week. At the end of the semester, each team will also give a full
oral presentation and submit full written documentation of its results.
Many of these projects are highly ambitious, and it is not necessarily
expected that each project will be completed within the semester.
Instead, the oral and written presentations should concentrate
on documenting what was achieved, what was not achieved, what new
information was learned in failing to achieve the desired goals, and
what new directions would be taken in the future in order to continue
the progress. This philosophy makes the project closer to the real world
(as opposed to an academic toy project). This style of work is typical of
some industrial production code (e.g., agile software development),
of industrial R&D, and of general research.
For the projects based around Mesos, see:
- Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
(also at USENIX NSDI 2011)
- Notes on Mesos and Docker (from Douglas Thain, U. of Notre Dame)
- Apache Mesos Documentation
- 1. DMTCP attach
- The hw3 homework showed some approaches to doing a DMTCP attach.
To date, no one has made a serious effort at creating an attach
feature for DMTCP.
- 2. DMTCP handling of static executables
- The hw3 homework also showed some approaches to handling static
executables. Handling static executables will most likely
require the use of trampolines (code patched into the entry points
of library functions to redirect calls to a wrapper).
To date, no one has made a serious effort at supporting static
executables using DMTCP.
- 3. General API for DMTCP Coordinator
- Right now, there is only one kind of DMTCP coordinator. (Actually,
there are two kinds, since dmtcp_launch --no-coordinator
causes DMTCP to create a built-in coordinator.) Can we extend
the concept of DMTCP plugins to create a general API that will
allow an end user to write a library against the API to define
a new kind of coordinator? Some options are: a tree of coordinators;
or a second standby coordinator that takes over if the first
coordinator dies; etc.
- 4. General API for DMTCP Plugin for Writing a Checkpoint
- Similar to the previous project, but here the API should allow
writing a checkpoint image to a remote computer, writing
an encrypted version of a checkpoint image, or writing
replicas of the checkpoint image for fault tolerance.
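As a starting point for discussion, here is a minimal sketch in C of what
such a checkpoint-writing API might look like. Every name in it (ckpt_sink,
replica_state, replica_sink_init) is hypothetical and is not part of DMTCP.
The sketch covers only the replication case: each buffer handed to the sink
is written to every replica.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* HYPOTHETICAL interface, not part of DMTCP: a destination for checkpoint data. */
typedef struct ckpt_sink {
    int (*write)(struct ckpt_sink *self, const void *buf, size_t len);
    int (*flush)(struct ckpt_sink *self);
    void *state;                /* private data owned by the implementation */
} ckpt_sink;

#define MAX_REPLICAS 4

/* State for a sink that copies every buffer to several already-open files. */
typedef struct {
    FILE  *files[MAX_REPLICAS];
    size_t count;
} replica_state;

static int replica_write(ckpt_sink *self, const void *buf, size_t len)
{
    replica_state *st = self->state;
    for (size_t i = 0; i < st->count; i++) {
        if (fwrite(buf, 1, len, st->files[i]) != len)
            return -1;          /* stop at the first replica that fails */
    }
    return 0;
}

static int replica_flush(ckpt_sink *self)
{
    replica_state *st = self->state;
    int rc = 0;
    for (size_t i = 0; i < st->count; i++) {
        if (fflush(st->files[i]) != 0)
            rc = -1;            /* keep flushing the rest, but report failure */
    }
    return rc;
}

/* Attach the replicating callbacks to caller-owned state. */
void replica_sink_init(ckpt_sink *sink, replica_state *st)
{
    assert(sink != NULL && st != NULL && st->count <= MAX_REPLICAS);
    sink->write = replica_write;
    sink->flush = replica_flush;
    sink->state = st;
}
```

A remote or encrypting writer would supply different write and flush
callbacks behind the same ckpt_sink structure, which is the reason for
routing all output through function pointers in this sketch.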
- 5. Versioned symbols for ELF
- Wrapper functions are a natural concept in computer science.
The Linux/POSIX library functions dlopen and
dlsym directly support wrappers. (See
man dlsym
and search for "wrapper".)
GNU libc 2.1 (glibc 2.1) introduced
symbol versioning. A newer library (.so file) can define both
a new version of a symbol (e.g., function) intended to
fix bugs and/or add new features; while at the same time
defining an older version of the symbol for backward compatibility.
The library function dlsym chooses the older version of the symbol,
while executables that dynamically link to a library will
usually receive the newer version of the symbol (actually,
the version that is informally considered the "default version").
The goal of this project is to learn more about ELF, and to use
that information to write a new function that will
choose the newer "default" version of a symbol.
You will find more information on these issues and a partial
implementation in DMTCP, in the DMTCP file
doc/dlsym_default.txt. I will provide additional
information, if a team chooses this project.
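To make the distinction concrete, the sketch below contrasts an ordinary
dlsym lookup with dlvsym, a GNU extension that requests one explicitly
named version of a symbol. The helper names lookup_plain and
lookup_versioned are hypothetical. Version strings differ by symbol and by
CPU architecture, so any string you pass in is an assumption to check
against your own libc (for example, with objdump -T).

```c
#define _GNU_SOURCE             /* exposes RTLD_DEFAULT and dlvsym() in glibc */
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Ordinary lookup: search the global scope for `name`, with no version given.
 * Returns NULL if the symbol is not found. */
void *lookup_plain(const char *name)
{
    assert(name != NULL);
    return dlsym(RTLD_DEFAULT, name);
}

/* Versioned lookup: ask for exactly the version `version` of `name`.
 * dlvsym is a GNU extension. Returns NULL if that version is not defined. */
void *lookup_versioned(const char *name, const char *version)
{
    assert(name != NULL && version != NULL);
    return dlvsym(RTLD_DEFAULT, name, version);
}
```

Comparing the two returned addresses for a symbol that has more than one
version is a quick way to see which version a plain lookup selected.
On glibc older than 2.34, link with -ldl.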
- 6. Checkpointing valgrind (valgrind attach)
-
Valgrind is a widely used
tool that excels at finding memory leaks. Its usage is
simple: valgrind a.out args.
Because running under valgrind is slower than native
execution (e.g., 10 times slower or worse),
many users have hoped for a "valgrind attach" feature.
This is probably impossible, since valgrind does not run the
executable natively: it runs the executable under software that
emulates the underlying machine instructions.
So, a next-best option is to run valgrind under DMTCP (or
other checkpointing tool) until the interesting point.
Then, one checkpoints. Finally, one can restart many times,
and direct the executable to choose different execution
paths (e.g., different application options) on each restart.
While a VM snapshot could checkpoint valgrind, that is a heavyweight
option. The goal of this project is to use a standard
checkpointing package (or your own custom one) to checkpoint
valgrind.
- 7. Checkpointing screen and/or tmux
-
GNU screen and
tmux
are commonly used for detaching a session from its
terminal, among other manipulations. Both programs rely on
pseudo-terminals (ptys).
At one time, DMTCP supported GNU screen, but that was
before the era of DMTCP plugins.
The goal of this project
is to produce a DMTCP plugin for supporting checkpointing of
either GNU screen or of tmux.
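For teams unfamiliar with pseudo-terminals, here is a minimal sketch of how
a program such as screen or tmux might obtain one using the standard POSIX
calls. The function name open_pty_master is hypothetical. The sketch only
creates the pty pair; saving and recreating it across checkpoint and
restart is the work of the project and is not attempted here.

```c
#define _XOPEN_SOURCE 600       /* posix_openpt, grantpt, unlockpt, ptsname */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Open a new pseudo-terminal pair.
 * On success, returns the master file descriptor and copies the path of the
 * slave device into slave_path. On failure, returns -1. */
int open_pty_master(char *slave_path, size_t len)
{
    assert(slave_path != NULL && len > 0);

    int master = posix_openpt(O_RDWR | O_NOCTTY);
    if (master < 0)
        return -1;

    /* The slave side must be granted and unlocked before it can be opened. */
    if (grantpt(master) != 0 || unlockpt(master) != 0) {
        close(master);
        return -1;
    }

    /* ptsname returns a pointer to static storage, so copy the result out. */
    const char *name = ptsname(master);
    if (name == NULL || strlen(name) >= len) {
        close(master);
        return -1;
    }
    strcpy(slave_path, name);
    return master;
}
```

The child process (a shell, say) would then open the path in slave_path as
its terminal, while the parent reads and writes the master descriptor.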
- 8. Checkpointing Hadoop (Big Data)
-
Hadoop was the first full-featured open-source implementation
of the MapReduce software from Google. Its architecture
typically assumes back-end disk nodes with large files,
and a front-end compute node on which resides the
Hadoop executable, and a Hadoop scheduler for the back end.
Checkpointing would be very useful, in order to put aside
a currently running job, when a newer, high-priority job
arrives. Since the files on the back end are large,
the intention is to copy the back end files to a temporary
region as part of the checkpoint, and then to copy them
back as part of the restart.
We have access to some software from INRIA that will manage
the back-end files. The goal of this project is to write
the front-end, including a DMTCP plugin, that will take
special actions at checkpoint and restart to save the
front-end Hadoop application and later restore it. We will
apply this only to the simpler Hadoop, version 1.
- 9. Checkpointing of Docker
-
Docker is sometimes
called a lightweight virtual machine, although it does not
include a separate "guest" Linux kernel. It uses the underlying
Linux kernel. Nevertheless, it has gained popularity in
many domains where virtual machines are also used.
Virtual machines have snapshots. The goal of this project
is to checkpoint Docker using DMTCP. (An alternate
checkpointing package that currently works on Docker is
CRIU.)
While Docker is normally compiled as a statically linked
executable under gc (the standard Go compiler), there is also
a dynamically linked executable for Docker built with
GNU gccgo.
(See
The Go Blog for more information.)
In principle, this should make it easy for
DMTCP to checkpoint Docker. However, DMTCP must be extended
to support Linux cgroups and pid namespaces.
There is already a partial implementation of checkpointing
of Docker within the DMTCP team. This will be made available
to a team that tackles this project.
Docker typically runs just a single process.
If time permits, the effort should be extended
to support Docker's
Supervisor package. Alternatively, the team may prefer
a different extension: the use of plugins
to integrate with the Docker daemon on checkpoint and restart.
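As a first step toward namespace support, it helps to see which namespaces
a process belongs to. On Linux, each entry under /proc/self/ns is a
symbolic link whose target names the namespace, in the form
pid:[<inode number>]. The following sketch reads that target;
read_namespace_id is a hypothetical helper name, and the code assumes that
/proc is mounted.

```c
#define _POSIX_C_SOURCE 200809L /* readlink */
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Copy the identifier of the calling process's namespace of the given kind
 * ("pid", "mnt", "net", ...) into buf as a NUL-terminated string.
 * Returns 0 on success, or -1 if the link cannot be read. */
int read_namespace_id(const char *kind, char *buf, size_t len)
{
    char path[64];
    assert(kind != NULL && buf != NULL && len > 1);

    int n = snprintf(path, sizeof path, "/proc/self/ns/%s", kind);
    if (n < 0 || n >= (int)sizeof path)
        return -1;              /* kind was too long for the path buffer */

    /* readlink does not NUL-terminate, so reserve one byte and add it. */
    ssize_t got = readlink(path, buf, len - 1);
    if (got < 0)
        return -1;
    buf[got] = '\0';
    return 0;
}
```

Two processes that report the same string for "pid" share a pid namespace,
so comparing this value before checkpoint and after restart is one simple
way to detect that a process has been restarted in a different namespace.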
- 10. Security: Multi-architecture Checkpoint-Restart
-
In defending against malware, it is useful to present
a dynamically shifting "attack surface" against attackers.
One such technique is multi-architecture checkpoint-restart.
An example of such work (as execution migration) is:
Execution Migration in a Heterogeneous-ISA Chip Multiprocessor.
The goal of this project is to checkpoint under one CPU instruction
set (e.g., Intel x86), and to restart under a different
CPU instruction set (e.g., ARM).
We will assume that we fully control the target application.
For example, we can compile it under both CPU architectures.
We can also compile it with research compilers such
as LLVM.
LLVM is the foundation for the well-known
clang compiler.
LLVM allows you to easily modify the compiler
to emit additional code, such as "landmarks" in the prolog
and epilog of a function, where it is acceptable to checkpoint.
Thus, one can checkpoint at one of these landmarks,
and replace the text segment with the text segment of the
other CPU architecture, and then restart at the corresponding
landmark in the alternative text segment. With a little luck,
we can persuade LLVM to emit an almost identical data segment under the
two CPU architectures. The remaining task is then to translate
the call frames of the stack from one CPU architecture to
another.
If a team takes on this project, we will provide additional
lectures on how to modify the LLVM compiler.
- 11. Mesos: Fault-tolerant Resource Scheduling
-
Many companies that operate at web scale spread their production
systems across data centers in different geographical locations.
This project will implement a feature in Apache Mesos that allows
the slaves in each data center to connect to a local Mesos master,
and that enables the Mesos masters of the different data centers to
handle automated failover among themselves.
If a team takes on this project, we will provide additional
lectures on this aspect of Mesos.
- 12. Mesos: Load Balancing
-
Apache Mesos operates in a master-slave hierarchy. If a leading
master fails, one of the standby masters will take over.
However, if a master is failing due to overload or network
congestion, failover to a single standby master is not an
appropriate solution. This project should create multiple
active masters to share the workload.
If a team takes on this project, we will provide additional
lectures on this aspect of Mesos.
I am still considering additional projects. Students are welcome
to propose additional projects in areas of their interest,
or modifications to the current projects.