Although data processing systems now execute queries faster than ever before, they address only the first half of the data analysis cycle. The latter half, which involves interpreting results in order to clean the data, formulating new queries, generating hypotheses, and summarizing and presenting findings, is ill-served by existing systems. In this talk, I will describe two examples of systems that “close the loop” by letting users query the results of their data analysis.
The first, Scorpion, answers “why are these results outliers?” in the context of aggregation queries. Aggregation is commonly used to reduce large data sets to a manageable size, but it also makes it hard to distinguish the input records that are correlated with outliers from those that are not. Scorpion identifies the input records that contributed most to an outlier value and generates predicates that describe their common properties.
The second, SubZero, answers “what records generated this result?” in the context of scientific workflows. For example, astronomers want to know which pixels in the set of all input images were used to detect an interesting star. Naively storing input-output relationships (lineage) for every pixel in each step of the workflow can incur significant storage and runtime costs. SubZero is a workflow system that efficiently tracks lineage information while also meeting user-specified storage and runtime overhead constraints.
Eugene Wu is a Ph.D. student in the database group at MIT, advised by Samuel Madden and Michael Stonebraker. He is broadly interested in building systems for data management and has contributed to research in a wide variety of areas including data cleaning, core database performance, human computation, and complex event processing.