Project Ideas

This list is for guidance only. Feel free to change the scope or application of any of these projects, or to find one on your own. See the CS5600 (Fall'15) project page and the BU/NU Cloud Computing Course project page (Spring'16) for more ideas.

Checkpoint-Restart

  • Fault-tolerant DMTCP coordinator

    Currently, DMTCP has a centralized coordinator that communicates with all worker nodes. Because it is centralized, the coordinator is a single point of failure for the computation. One way to address this is to use distributed coordinator processes with leader election. These coordinators can share the current state using a built-in consensus protocol or an external tool such as ZooKeeper. One can extend the idea further to a tree of coordinators that handles tens of thousands of worker nodes simultaneously. One can also make the coordinator multi-threaded to improve scalability.
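    As a rough illustration of the leader-election idea, the sketch below simulates one common pattern in plain Python, with no ZooKeeper client involved: each coordinator registers and receives an increasing sequence number, and the live coordinator with the lowest number acts as leader. The `Registry` class and the coordinator names are hypothetical and are not part of DMTCP.

```python
import itertools


class Registry:
    """In-memory stand-in for a shared registry such as ZooKeeper (hypothetical)."""

    def __init__(self):
        self._next_seq = itertools.count()
        self._members = {}  # coordinator name -> sequence number

    def join(self, name):
        """Register a coordinator and hand it the next sequence number."""
        self._members[name] = next(self._next_seq)

    def leave(self, name):
        """Drop a coordinator, e.g. after it crashes or its session expires."""
        self._members.pop(name, None)

    def leader(self):
        """Return the live coordinator with the lowest sequence number, or None."""
        if not self._members:
            return None
        return min(self._members, key=self._members.get)


registry = Registry()
for name in ["coord-a", "coord-b", "coord-c"]:
    registry.join(name)

first_leader = registry.leader()
registry.leave(first_leader)       # simulate failure of the current leader
second_leader = registry.leader()  # the next-oldest coordinator takes over
```

    A real implementation would also have to replicate the coordinator state and detect failures, which this sketch leaves out.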
  • Add checkpoint capabilities to Mesos

    Apache Mesos is a two-level scheduler that can enhance resource utilization in a datacenter. The majority of workloads for Mesos are stateless, so one can kill tasks without a high performance penalty when a node has to be taken down. With stateful applications, however, the situation is different: simply killing a task to relocate it results in a heavy performance penalty. This problem can be solved by adding checkpoint-restart capabilities to Mesos for task migration.
  • Checkpointing valgrind (valgrind attach)

    Valgrind is a widely used tool that excels at finding memory leaks. Its usage is simple: valgrind a.out args. Because running under valgrind is slower than native execution (e.g., 10 times slower or worse), many users have hoped for a "valgrind attach" feature. This is probably impossible, since valgrind runs the executable in software that emulates the underlying assembly language. So, a next-best option is to run valgrind under DMTCP (or another checkpointing tool) until the interesting point, and then to checkpoint. Finally, one can restart many times and direct the executable to choose a different execution path (e.g., different application options) on each restart. While a VM snapshot could checkpoint valgrind, that is a heavyweight option. The goal of this project is to use a standard checkpointing package (or your own custom one) to checkpoint valgrind.
  • Checkpointing Hadoop (Big Data)

    Hadoop was the first full-featured open source version of the MapReduce software from Google. Its architecture typically assumes back-end disk nodes with large files, and a front-end compute node on which resides the Hadoop executable, and a Hadoop scheduler for the back end.
    Checkpointing would be very useful for putting aside a currently running job when a newer, high-priority job arrives. Since the files on the back end are large, the intention is to copy the back-end files to a temporary region as part of the checkpoint, and then to copy them back as part of the restart.
    We have access to some software from INRIA that will manage the back-end files. The goal of this project is to write the front-end, including a DMTCP plugin, that will take special actions at checkpoint and restart to save the front-end Hadoop application and later restore it. We will apply this only to the simpler Hadoop, version 1.
  • Checkpoint support for Docker/Appc Containers

    Docker is sometimes called a lightweight virtual machine, although it does not include a separate "guest" Linux kernel. It uses the underlying Linux kernel. Nevertheless, it has gained popularity in many domains where virtual machines are also used.
    Virtual machines have snapshots. The goal of this project is to checkpoint Docker using DMTCP. (An alternate checkpointing package that currently works on Docker is CRIU.)
    While Docker is normally compiled as a statically linked executable with the standard Go compiler (gc), there is also a dynamically linked executable for Docker built with GNU gccgo. (See The Go Blog for more information.) In principle, this should make it easy for DMTCP to checkpoint Docker. However, DMTCP must be extended to support Linux cgroups and pid namespaces.
    There is already a partial implementation of checkpointing of Docker within the DMTCP team. This will be made available to a team that tackles this project.
    Docker typically runs just a single process. If time permits, the effort should be extended to support Docker's Supervisor package. Alternatively, the team may prefer a different extension: the use of plugins to integrate with the Docker daemon on checkpoint and restart.
  • DMTCP/CRIU Integration

    Use CRIU as the single-process checkpointer in DMTCP. This will allow one to quickly checkpoint a set of containers distributed across the network.
  • Security: Multi-architecture Checkpoint-Restart

    In defending against malware, it is useful to present a dynamically shifting "attack surface" against attackers. One such technique is multi-architecture checkpoint-restart. An example of such work (as execution migration) is: Execution Migration in a Heterogeneous-ISA Chip Multiprocessor.
    The goal of this project is to checkpoint under one CPU instruction set (e.g., Intel), and to restart under a different CPU instruction set (e.g., ARM).
    We will assume that we fully control the target application. For example, we can compile it under both CPU architectures. We can also compile it with research compilers such as LLVM. LLVM is the foundation for the well-known clang compiler. LLVM allows you to easily modify the compiler to emit additional code, such as "landmarks" in the prolog and epilog of a function, where it is acceptable to checkpoint. Thus, one can checkpoint at one of these landmarks, and replace the text segment with the text segment of the other CPU architecture, and then restart at the corresponding landmark in the alternative text segment. With a little luck, we can persuade LLVM to emit an almost identical data segment under the two CPU architectures. The remaining task is then to translate the call frames of the stack from one CPU architecture to another.
    If a team takes on this project, we will provide additional lectures on how to modify the LLVM compiler.

Scheduling

  • Investigate Linux CFS scheduling algorithm for datacenters

    The goal is to investigate and recommend best practices for resource sharing and performance isolation for datacenter workloads managed by the Linux CFS scheduling algorithm. In particular, we want to avoid undesired and unpredictable scheduling delays for latency-critical workloads with sub-millisecond SLOs [3], and to avoid underutilizing cores by assigning them exclusively to latency-critical workloads even during periods of low load [1]. Our specific target is to identify the right CFS settings and CPU bandwidth controls for scenarios where different latency-critical and batch workloads execute concurrently on the same server [2].
    [1] Large-scale cluster management at Google with Borg
    [2] CPU Bandwidth Control for CFS
    [3] Reconciling High Server Utilization and Sub-millisecond Quality-of-Service
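    CFS bandwidth control expresses a CPU limit as a quota of run time allowed per enforcement period. The following minimal sketch shows the arithmetic for turning a desired CPU share into a quota, and for estimating how much of a group's demand would be throttled. The function names are hypothetical helpers; the 100 ms default period is assumed here to match the kernel default.

```python
def cfs_quota_us(cpu_limit, period_us=100_000):
    """Return the per-period quota (in microseconds) for a limit of `cpu_limit` CPUs.

    For example, 0.5 means half of one CPU and 2.0 means two full CPUs.
    """
    if cpu_limit <= 0 or period_us <= 0:
        raise ValueError("cpu_limit and period_us must be positive")
    return int(round(cpu_limit * period_us))


def throttled_fraction(demand_us, quota_us):
    """Fraction of the CPU time demanded in one period that exceeds the quota."""
    if demand_us <= 0:
        return 0.0
    return max(0.0, demand_us - quota_us) / demand_us


batch_quota = cfs_quota_us(1.5)                           # 1.5 CPUs, 100 ms period
short_period_quota = cfs_quota_us(1.5, period_us=10_000)  # same share, 10 ms period
```

    Under cgroup v1 these two values correspond to the cpu.cfs_quota_us and cpu.cfs_period_us files. A shorter period keeps the same average share but should shorten the time a throttled group waits for its quota to be refilled, which is one of the trade-offs this project would measure.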
  • Mesos: Load-balanced scheduling with multiple active masters

    Apache Mesos operates in a master-slave hierarchy with one master node communicating with many slave nodes. If the leading master fails, one of the optional standby masters takes over. While this is crucial for the high availability and fault tolerance of a Mesos cluster, the full load and all traffic always go to the leading master. When a master fails due to overload or network congestion, the new master is prone to being overloaded and failing in a similar way. The idea is to explore the possibility of having multiple active masters share the workload in large-scale Apache Mesos clusters. The work involves analyzing the current data flow and shared state, implementing a prototype that lets frameworks be serviced by multiple masters, and evaluating its viability for real-world use.

Storage

  • Distributed NAS

    Build a basic scalable distributed NAS using multi-node communication. Nodes communicate with each other using a quorum protocol (standard or custom) to elect a master, which serves as the actual NFS server endpoint. (In a real solution, nodes could also implement RAID-level striping and IP takeover for seamless interaction with clients.) For simplicity, each node is connected to the same physical storage (either NFS or disks), and each node runs as a thread on the same machine.
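    One simple custom quorum rule is a strict-majority vote: a node becomes master only if more than half of the whole cluster votes for it. The sketch below shows just that counting step; the function name and node names are hypothetical, and a real protocol would also need terms or epochs, timeouts, and retries.

```python
from collections import Counter


def elect_master(votes, cluster_size):
    """Return the node that holds a strict majority of the cluster, else None.

    `votes` maps each voting node to the node it votes for. Nodes that are
    down simply do not appear in `votes`.
    """
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    quorum = cluster_size // 2 + 1
    return candidate if count >= quorum else None


# Five-node cluster, one node down: n1 collects three of five votes.
winner = elect_master({"n1": "n1", "n2": "n1", "n3": "n1", "n4": "n2"}, 5)

# Split vote: no candidate reaches the quorum of three.
no_winner = elect_master({"n1": "n1", "n2": "n1", "n3": "n2", "n4": "n2"}, 5)
```

    Requiring a majority of the full cluster, rather than of the nodes that responded, is what prevents two partitions from each electing their own master.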
  • Client-side SSD Cache

    Optimize the performance of a NAS client using a client-side SSD cache. (Very open-ended. Can implement multiple caching algorithms.)
  • Server-side SSD Cache

    Optimize the performance of a NAS server using a server-side SSD cache. (Very open-ended. Can implement multiple caching algorithms.)
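    As one possible starting point for either cache project, the sketch below implements least-recently-used (LRU) eviction, a common baseline against which other caching algorithms can be compared. The `LRUBlockCache` class and the `backend_read` callback are hypothetical; in a real system the cached blocks would live on the SSD and `backend_read` would go to the NAS.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Fixed-capacity block cache that evicts the least recently used block."""

    def __init__(self, capacity, backend_read):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._backend_read = backend_read  # called on a miss to fetch the block
        self._blocks = OrderedDict()       # block_id -> data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self._blocks:
            self._blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
            return self._blocks[block_id]
        self.misses += 1
        data = self._backend_read(block_id)
        self._blocks[block_id] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)    # evict the least recently used
        return data

    def cached_ids(self):
        """Return cached block ids, from least to most recently used."""
        return list(self._blocks)


cache = LRUBlockCache(capacity=2, backend_read=lambda block_id: f"data-{block_id}")
for block_id in [1, 2, 1, 3, 2]:
    cache.read(block_id)
```

    Tracking hits and misses per access trace makes it straightforward to compare this baseline with other policies on the same workload.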
  • Backup application

    Implement a simple file backup application using cloud storage (box, dropbox, etc.). Add-on bonus: implement stretchable backup that spreads data across multiple cloud storage providers to use space efficiently.
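    For the stretchable backup bonus, one simple placement policy (an assumption here, not a required design) is to fill the provider with the most free space first and spill the remainder to the next one. The `plan_backup` function below is a hypothetical sketch of that policy and does not call any real cloud storage API.

```python
def plan_backup(file_size, free_space):
    """Split `file_size` bytes across providers, most free space first.

    `free_space` maps provider name -> free bytes. Returns a list of
    (provider, bytes) pairs, or raises if the combined space is too small.
    """
    if file_size > sum(free_space.values()):
        raise ValueError("not enough combined free space")
    plan = []
    remaining = file_size
    by_free = sorted(free_space.items(), key=lambda kv: kv[1], reverse=True)
    for provider, free in by_free:
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            plan.append((provider, take))
            remaining -= take
    return plan


# A 7-byte file does not fit on either provider alone, so it is split.
split_plan = plan_backup(7, {"box": 5, "dropbox": 4})

# A 3-byte file fits entirely on the provider with the most free space.
single_plan = plan_backup(3, {"box": 5, "dropbox": 4})
```

    Other policies, such as balancing usage evenly or keeping redundant copies, fit the same interface and could be compared against this one.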