Project Ideas
      This list is for guidance only. Feel free to change the scope/application
      of any of these projects or find one on your own.
      See 
        CS5600 (Fall'15) project page and 
         
           BU/NU Cloud Computing Course project page (Sprint'16) for more
         ideas. 
      
Checkpoint-Restart
      
        - 
          Fault-tolerant DMTCP coordinatorCurrently, DMTCP has a centralized coordinator that communicates with
          all worker nodes. The centralized nature makes it a single point of
          failure for the computation. One way to address is to use distributed
          coordinator processes with leader-election. These coordinators can
          share the current state using some in-built consensus protocol or
          using some external tool such as Zookeeper. One can further extend the
          idea to have a tree of coordinators to handle tens of thousands or
          worker nodes simultaneously. One can also make it multi-threaded to
          further the scalability goals.
- 
          Add checkpoint capabilities to MesosApache Mesos is a two-level scheduler that can enhance resource
          utilization in a datacenter. Majority of workloads for Mesos are
          stateless and thus one can kill tasks without a high performance
          penalty if one has to take down a node. However, with stateful
          application, the situation is different. Simply killing a task to
          relocate it result in a heavy performance penalty. This problem can be
          solved by adding checkpoint-restart capabilities to Mesos for doing
          task migration.
- 
          Checkpointing valgrind (valgrind attach)Valgrind is a widely used software that excels at finding memory
          leaks. Its usage is simple: valgrind a.out args. Because running under
          valgrind is slower than native execution (e.g., 10 times slower or
          worse), many users have hoped for a "valgrind attach" feature. This is
          probably impossible, since valgrind runs the executable in software
          that emulates the underlying assembly language.  So, a next-best
          option is to run valgrind under DMTCP (or other checkpointing tool)
          until the interesting point. Then, one checkpoints. Finally, one can
          restart many times, and direct the executable to choose different
          execution paths (e.g., different application options) on each restart.
          While a VM snapshot could checkpoint valgrind, that is a heavyweight
          option. The goal of this project is to use a standard checkpointing
          package (or your own custom one) to checkpoint valgrind.
- 
          Checkpointing Hadoop (Big Data)Hadoop was the first full-featured open source version
          of the MapReduce software from Google.  Its architecture
          typically assumes back-end disk nodes with large files,
          and a front-end compute node on which resides the
          Hadoop executable, and a Hadoop scheduler for the back end.
 Checkpointing would be very useful, in order to put aside
          a currently running job, when a newer, high-priority job
          arrives.  Since the files on the back end are large,
          the intention is to copy the back end files to a temporary
          region as part of the checkpoint, and then to copy them
          back as part of the restart.
 We have access to some software from INRIA that will manage
          the back-end files.  The goal of this project is to write
          the front-end, including a DMTCP plugin, that will take
          special actions at checkpoint and restart to save the
          front-end Hadoop application and later restore it.  We will
          apply this only to the simpler Hadoop, version 1.
- 
          Checkpoint support for Docker/Appc ContainersDocker is sometimes called a
          lightweight virtual machine, although it does not include a separate
          "guest" Linux kernel.  It uses the underlying Linux kernel.
          Nevertheless, it has gained popularity in many domains where virtual
          machines are also used.
 Virtual machines have snapshots.  The goal
          of this project is to checkpoint Docker using DMTCP.  (An alternate
          checkpointing package that currently works on Docker is
          CRIU.)
 While Docker is normally compiled as a statically linked executable
          under GC, there is also a dynamically linked executable for Docker
          using GNU GCCGO.
          (See  The Go
            Blog for more information.) In principle, this should make it
          easy for DMTCP to checkpoint Docker.  However, DMTCP must be extended
          to support Linux cgroups and pid namespaces.
 There is already a partial implementation of checkpointing of Docker
          within the DMTCP team.  This will be made available to a team that
          tackles this project.
 Docker typically runs just a single process.
          If time permits, the effort should be extended to support
          Docker's 
            Supervisor package.  Alternatively, the team may prefer a
          different extension:  the use of plugins to integrate with the Docker
          daemon on checkpoint and restart.
- 
          DMTCP/CRIU IntegrationUse CRIU as the single process checkpointer in DMTCP. This will allow
          one to quickly checkpoint a network of containers across the network.
- 
           Security: Multi-architecture Checkpoint-RestartIn defending against malware, it is useful to present
          a dynamically shifting "attack surface" against attackers.
          One such technique is multi-architecture checkpoint-restart.
          An example of such work (as execution migration) is:
          
            Execution Migration in a Heterogeneous-ISA Chip
            Multiprocessor.
 The goal of this project is to checkpoint under one CPU instruction
          set (e.g., Intel), and to restart under a different
          CPU instruction set (e.g., ARM).
 We will assume that we fully control the target application.
          For example, we can compile it under both CPU architectures.
          We can also compile it with research compilers such
          as LLVM.
          LLVM is the foundation for the well-known
          clang compiler.
          LLVM allows you to easily modify the compiler
          to emit additional code, such as "landmarks" in the prolog
          and epilog of a function, where it is acceptable to checkpoint.
          Thus, one can checkpoint at one of these landmarks,
          and replace the text segment with the text segment of the
          other CPU architecture, and then restart at the corresponding
          landmark in the alternative text segment.  With a little luck,
          we can persuade LLVM to emit an almost identical data segment under the
          two CPU architectures.  The remaining task is then to translate
          the call frames of the stack from one CPU architecture to
          another.
 If a team takes on this project, we will provide additional
          lectures on how to modify the LLVM compiler.
Scheduling
      
      
        - 
          Investigate Linux CFS scheduling algorithm for datacentersThe goal is to investigate and recommend best practices for resource
          sharing and performance isolation for datacenter workloads managed by
          the Linux CFS scheduling algorithm. In particular, we want to avoid
          undesired and unpredictable scheduling delays for latency-critical
          workloads with sub millisecond SLOs [3] or underutilizing cores by
          assigning them exclusively to latency-critical workloads even during
          periods of low load [1]. Our specific target is to identify the right
          CFS settings and control of CPU bandwidth for scenarios where
          different latency-critical and batch workloads are executing
          concurrently on the same server [2].  
          [1] 
            Large-scale cluster management at Google with Borg
          [2] 
            CPU Bandwidth Control for CFS
          [3] 
          Reconciling High Server Utilization and Sub-millisecond
          Quality-of-Service
- 
           Mesos: Load-balanced scheduling with multiple active mastersApache Mesos operates in a master-slave hierarchy with one master node
          communicating with many slave nodes. In case the leading master fails,
          one of the optional standby masters will take over. While this is
          crucial for the high-availability and fault-tolerant property of a
          Mesos cluster, full load and all traffic will always go to the leading
          master. In cases where a master fails due to overload or network
          congestion, new masters are prone to be overloaded and fail in similar
          ways.
          The idea is to explore the possibility of having multiple
          active masters sharing the workload in large-scale Apache Mesos
          clusters. The work involves analysis of current data flow and shared
          state, implementing a prototype which lets frameworks be serviced by
          multiple masters and evaluating its viability for real-world use.
Storage
      
        - 
          Distributed NASBuild basic scalable distributed NAS using multi-node communication.
          Nodes communicate with each other using some quorum protocols
          (standard or custom) to elect a master which will serve as Actual NFS
          server end point. (In real solution, nodes also can implement raid
          level striping and IP takeover for seamless interaction with clients.
          For simplicity, each machine is connected to same physical storage
          (either NFS or Disks) and all nodes are running as a thread on the
          same machine.
- 
          Client-side SSD CacheOptimize performance of NAS client using Client side SSD Cache (Very
          open ended.  Can implement multiple caching algorithms.)
- 
          Server-side SSD CacheOptimize performance of NAS server using server-side SSD Cache (Very
          open ended.  Can implement multiple caching algorithms.)
- 
          Backup applicationImplement simple file backup application using cloud storage (box,
          dropbox, etc.). Addonbonus: Implement stretchable backup using
          multiple cloud storage providers to utilize space efficiently.