Growing up: How Big Data processing can cope with limited bandwidth and complex code

  • Date
    March 11, 2014
  • Time
    10:30am
  • Location
    366 WVH

Abstract

We are now entering an era in which organizations collect and process unprecedented data volumes. This “big data” is handled using distributed systems of correspondingly large scale. My work addresses problems in effectively deploying and managing these systems. In this talk, I will focus on two parts of my research.

First, I describe my work building wide-area data collection and analytics pipelines to cope with large and variable bandwidth demands.

Second, I describe my work on better configuration management for the increasingly complex software seen in these environments.

In wide-area contexts, available bandwidth can vary over time. Current analytics systems require users to specify in advance the data to be collected. As a consequence, systems are provisioned for the worst case, which is costly and inflexible. We are building a distributed analytics system, JetStream, designed for the wide area. JetStream lets users specify an explicit policy for how the system should respond to varying data volumes and bandwidth availability. As a result, the system can make optimal use of the resources available at each point in time.
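One way to picture such a policy is as a rule that trades data fidelity for bandwidth. The sketch below is purely illustrative (the function name, window sizes, and the linear shrinkage assumption are all invented; this is not the JetStream API): when the link cannot carry the full stream, the policy coarsens the aggregation window instead of stalling.

```python
# Hypothetical sketch of a bandwidth-adaptive degradation policy
# (names and numbers invented for illustration; not JetStream's API).

def choose_aggregation_window(stream_rate_kbps, available_kbps,
                              windows_sec=(1, 5, 30, 300)):
    """Pick the smallest aggregation window whose output fits the link.

    Simplifying assumption: aggregating over an n-second window
    shrinks the stream by roughly a factor of n.
    """
    for window in windows_sec:
        if stream_rate_kbps / window <= available_kbps:
            return window
    return windows_sec[-1]  # degrade as far as the policy allows

# Plenty of bandwidth: ship data at full resolution (1-second windows).
print(choose_aggregation_window(1000, 2000))  # -> 1
# Constrained link: coarsen to 30-second windows to fit under 40 kbps.
print(choose_aggregation_window(1000, 40))    # -> 30
```

The point of making the policy explicit is that the degradation behavior is chosen by the user in advance, rather than emerging implicitly from worst-case provisioning.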

As data grows, so does the complexity of the software used to manage it. Modern software stacks are increasingly complex and correspondingly difficult to configure. Users and administrators are left resorting to trial and error or internet searches when difficulties arise. My research in this area tames system configuration by applying static analysis. The analysis determines the dependencies between configuration options and error messages. As a result, system failures can be quickly traced to a small set of potentially responsible options, giving users immediate feedback on how to resolve configuration errors.
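The end result of such an analysis can be thought of as a table mapping error messages to the options that can influence the code emitting them. The sketch below is only illustrative: the table here is written by hand with invented entries, whereas in the research it would be derived automatically by static analysis of the program.

```python
# Illustrative only: a hand-written stand-in for the error-to-option
# dependency table that static analysis would compute automatically.
ERROR_TO_OPTIONS = {
    "Connection refused": ["server.port", "server.host"],
    "Permission denied": ["data.dir", "run.as.user"],
    "OutOfMemoryError": ["heap.size.mb"],
}

def diagnose(error_message):
    """Return the small set of options that may explain the error."""
    suspects = []
    for pattern, options in ERROR_TO_OPTIONS.items():
        if pattern in error_message:
            suspects.extend(options)
    return suspects

print(diagnose("java.net.ConnectException: Connection refused"))
# -> ['server.port', 'server.host']
```

With such a table in hand, a failure no longer sends the user to trial and error: the error message itself points to the handful of options worth inspecting.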

Brief Biography

Ariel Rabkin is interested in techniques for building and debugging complex software systems. He is currently a postdoctoral researcher at Princeton University, working with Michael Freedman and Vivek Pai. He received his PhD in Computer Science from UC Berkeley in May 2012, where he was advised by Randy Katz. He previously attended Cornell University (AB 2006, MEng 2007). He is a contributor to several open-source projects, including Hadoop, the Chukwa log collection framework, and the JChord program analysis toolset.