
Data-intensive science is emerging as a new paradigm concerned with collecting, archiving, and analyzing the vast amounts of data produced and accumulated by modern science. Turning raw scientific data into knowledge will be key to future scientific discoveries. A typical data-intensive science project has the following major steps:
The focus of Scolopax is on exploratory analysis. An essential aspect of exploratory analysis is the use of non-parametric (or semi-parametric) data mining techniques. These techniques enable scientists to train accurate prediction models of complex processes even without a complete understanding of those processes. Using such models for analysis is preferable to summarizing the raw data directly: a model with good generalization performance does not overfit to noise or to a specific data sample, and hence captures the natural process more accurately. Unfortunately, complex prediction models are not intelligible per se. They cannot be used directly to answer questions like 'Which environmental features have the strongest effect on bird abundance and how do they interact?'. To make data mining models 'digestible' and to provide end users with new hypotheses, we need to 'open up the black box', i.e., provide tools for determining important relationships that the model has learned. This can be done by summarizing a complex model with simpler patterns such as partial dependence functions. The number of such model summaries is overwhelming: each 'slice' or 'dice' of a lower-dimensional subspace of the original data space could contain an interesting model summary.
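To make the notion of a model summary more concrete, the following minimal Python sketch estimates a one-dimensional partial dependence function for an arbitrary black-box model. It is an illustration only and not Scolopax code: the names `partial_dependence`, `predict`, and `toy_model` are hypothetical, and the model is assumed to be any function that maps a feature vector to a number.

```python
from typing import Callable, List, Sequence

Row = List[float]


def partial_dependence(
    predict: Callable[[Row], float],
    data: Sequence[Row],
    feature: int,
    grid: Sequence[float],
) -> List[float]:
    """Estimate the partial dependence of `predict` on one feature.

    For each grid value, the chosen feature is overwritten in every data
    row, the model is evaluated, and the predictions are averaged. The
    other features keep their observed values, so their effect is
    averaged out over the data sample.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in data:
            modified = list(row)       # copy so the input data stays untouched
            modified[feature] = value  # fix the feature of interest
            total += predict(modified)
        curve.append(total / len(data))
    return curve


def toy_model(row: Row) -> float:
    """Hypothetical stand-in for a trained black-box model."""
    return 2.0 * row[0] + row[1]


if __name__ == "__main__":
    sample = [[0.0, 1.0], [0.0, 3.0]]
    print(partial_dependence(toy_model, sample, feature=0, grid=[0.0, 1.0, 2.0]))
```

Each call evaluates the model once per data row and grid point, which hints at why computing very large numbers of such summaries becomes a data management problem.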

The Scolopax project addresses various data management challenges to enable exploratory analysis (see system overview above). Scientists will be able to express their exploration preferences in a user-friendly language. The preference specification will then be automatically transformed into a formal query, for which Scolopax finds an efficient execution plan for a multi-processor environment such as a cluster or cloud. Other Scolopax components are concerned with post-processing of the discovered patterns and efficient training of data mining models. Our approach is validated through our ongoing collaboration with the Cornell Lab of Ornithology, using citizen science data and other data resources organized by the ornithological community in the Avian Knowledge Network (AKN).
There are currently two Scolopax demo versions. Both generate and rank summaries for a data mining model that was trained on a large high-dimensional data set containing bird sightings reported by citizen scientists through the eBird project. Processing happens in parallel on a 40-core cluster running the Hadoop version of MapReduce. Currently only pre-defined ranking measures are supported. In the future a more flexible search language will be implemented.
Prototype 1: version with frames or version without frames. This is the latest version. It uses HBase to store and manage query results. If a newly submitted query and ranking measure are found in the database, the old results are re-used, which speeds up processing significantly. If the same query is submitted with a different ranking measure, some speedup is still achieved: the existing summaries are re-used and only have to be re-ranked on the fly. For new queries, all summaries are computed from scratch and ranked on the fly.
Prototype 2: version without frames. This is the previous version, which always computes summaries from scratch and ranks them on the fly. It has been tested more extensively than Prototype 1 and should therefore be more stable.
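The three re-use cases described for Prototype 1 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the prototype's implementation: plain in-memory dictionaries stand in for the HBase tables, and `SummaryCache`, `compute_summaries`, and the ranking functions are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

Summary = Dict[str, float]


class SummaryCache:
    """Re-use stored summaries and rankings where possible.

    Assumption: two dictionaries play the role of the result store.
    """

    def __init__(self, compute_summaries: Callable[[str], List[Summary]]):
        self._compute = compute_summaries
        self._summaries: Dict[str, List[Summary]] = {}
        self._ranked: Dict[Tuple[str, str], List[Summary]] = {}
        self.computations = 0  # number of from-scratch computations

    def run(
        self,
        query: str,
        measure_name: str,
        measure: Callable[[Summary], float],
    ) -> List[Summary]:
        key = (query, measure_name)
        # Case 1: same query and same ranking measure -> return stored result.
        if key in self._ranked:
            return self._ranked[key]
        # Case 3: new query -> compute all summaries from scratch.
        if query not in self._summaries:
            self._summaries[query] = self._compute(query)
            self.computations += 1
        # Case 2 (and the tail of case 3): rank the summaries on the fly.
        ranked = sorted(self._summaries[query], key=measure, reverse=True)
        self._ranked[key] = ranked
        return ranked
```

In this sketch a repeated query with a new ranking measure costs one sort instead of a full recomputation, which mirrors the partial speedup described above.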
[A. Okcan and M. Riedewald. Processing Theta-Joins using MapReduce. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 949-960, 2011]
To find related summaries, we need flexible join operators, not just standard
equi-joins. We developed novel techniques for efficiently computing arbitrary
theta-joins in parallel, with particular focus on MapReduce systems. Our most
general algorithm is randomized and provably achieves near-optimal latency.
For popular join predicates, including equi-, inequality-, and epsilon-joins,
we present specialized techniques that work well no matter how skewed the data
distribution is.
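As a rough illustration of how a randomized parallel theta-join can work, the sketch below simulates one possible scheme in a single Python process. It is an assumption-laden toy and not necessarily the algorithm from the paper: the join matrix of the two inputs is covered by a grid of regions (one per worker), each tuple of the first input is sent to all regions of one randomly chosen grid row, and each tuple of the second input to all regions of one randomly chosen grid column. Every pair of tuples then meets in exactly one region, so any predicate can be evaluated locally. The names `randomized_theta_join`, `grid_rows`, and `grid_cols` are hypothetical.

```python
import random
from collections import defaultdict
from typing import Any, Callable, Dict, List, Sequence, Tuple

Region = Tuple[int, int]


def randomized_theta_join(
    s: Sequence[Any],
    t: Sequence[Any],
    theta: Callable[[Any, Any], bool],
    grid_rows: int,
    grid_cols: int,
    seed: int = 0,
) -> List[Tuple[Any, Any]]:
    """Simulate a randomized grid-partitioned theta-join in one process."""
    rng = random.Random(seed)
    regions_s: Dict[Region, List[Any]] = defaultdict(list)
    regions_t: Dict[Region, List[Any]] = defaultdict(list)

    # "Map" phase: replicate each tuple to one random row or column of regions.
    for x in s:
        row = rng.randrange(grid_rows)
        for col in range(grid_cols):
            regions_s[(row, col)].append(x)
    for y in t:
        col = rng.randrange(grid_cols)
        for row in range(grid_rows):
            regions_t[(row, col)].append(y)

    # "Reduce" phase: each region joins only the tuples it received.
    result = []
    for region, s_part in regions_s.items():
        for x in s_part:
            for y in regions_t.get(region, []):
                if theta(x, y):
                    result.append((x, y))
    return result
```

Because the assignment ignores the tuple values, the expected load per region does not depend on how the join attribute is distributed; the price is that every input tuple is replicated to a full row or column of regions.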
[B. Panda, M. Riedewald, and D. Fink. The Model Summary Problem and a Solution for Trees. In Proc. IEEE Int. Conf. on Data Engineering (ICDE), pages 449-460, 2010]
Model summaries form the basis for exploratory analysis. For a typical analysis,
millions to billions of such summaries have to be created. We show how to
exploit workload properties to reduce computation time asymptotically, perform
fast batch computation, and effectively parallelize the workload in MapReduce.
[A. Lachmann and M. Riedewald. Finding Relevant Patterns in Bursty Sequences. In Proc. of the VLDB Endowment (PVLDB), 1(1):78-89, 2008]
Finding relevant frequent patterns in bursty sequences is expensive and suffers
from a large number of uninteresting patterns with high support. We propose a
novel approach that addresses both problems, and we prove important properties
regarding the preservation of interesting sequences.
[D. Sorokina, R. Caruana, M. Riedewald, and D. Fink. Detecting Statistical Interactions with Additive Groves of Trees. In Proc. International Conference on Machine Learning (ICML), pages 1000-1007, 2008]
[D. Sorokina, R. Caruana, M. Riedewald, W. M. Hochachka, and S. Kelling. Detecting and Interpreting Variable Interactions in Observational Ornithology Data. In Proc. IEEE Int. Workshop on Domain Driven Data Mining (DDDM), 2009]
Model summaries inherently lose information compared to the full model. To
better understand when a summary might be hiding important information, we need
to understand which variables strongly interact. Our techniques identify such
variables using a mostly non-parametric approach.
[B. Panda, M. Riedewald, J. Gehrke, and S. B. Pope. High-Speed Function Approximation. In Proc. IEEE Int. Conf. on Data Mining (ICDM), pages 613-618, 2007]
While traditional data mining research has usually focused on model accuracy and
training cost, exploratory analysis shifts the bottleneck to the prediction
phase, when the model is actually being used. We propose approximation techniques
that significantly reduce prediction time while
maintaining high prediction accuracy.
In addition to core computer science contributions, this project (and its predecessor) has also contributed to domain science results:
[D. Fink, W. M. Hochachka, B. Zuckerberg, D. W. Winkler, B. Shaby, M. A. Munson, G. Hooker, M. Riedewald, D. Sheldon, and S. Kelling. Spatiotemporal Exploratory Models for Broad-Scale Survey Data. Ecological Applications, 20(8):2131-2147, 2010]
[S. Kelling, W. M. Hochachka, D. Fink, M. Riedewald, R. Caruana, G. Ballard, and G. Hooker. Data Intensive Science: A New Paradigm for Biodiversity Studies. BioScience, 59(7):613-620, 2009]
[W. M. Hochachka, R. Caruana, D. Fink, A. Munson, M. Riedewald, D. Sorokina, and S. Kelling. Data-Mining Discovery of Pattern and Process in Ecological Systems. Journal of Wildlife Management, 71(7):2427-2437, 2007]
Mirek Riedewald
Daniel Fink (Cornell Lab of Ornithology)
Alper Okcan (Northeastern U. Ph.D. student)
Yue Huang (Northeastern U. Ph.D. student)
Wesley M. Hochachka (Cornell Lab of Ornithology)
Giles Hooker (Cornell Dept. of Biological Statistics and Computational Biology)
Steve Kelling (Cornell Lab of Ornithology)
Kevin Webb (Cornell Lab of Ornithology)
Gawande Pratik Bhagwat (Northeastern U. MS student while working on the project)
Sahib S. Dhindsa (Cornell ISST undergrad student while working on the project)
Alexander Lachmann (visiting Cornell CS undergrad student while working on the project)
Shweta S. Memane (Northeastern U. MS student while working on the project)
Biswanath Panda (Cornell Ph.D. student while working on the project)
Baturalp Torun (Northeastern U. MS student while working on the project)
This material is based upon work supported by the National Science Foundation under Grant Nos. 0612031, 0920869, and 1017793. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.