Data-intensive science is emerging as a new paradigm concerned with collecting, archiving, and analyzing the vast amounts of data produced and accumulated by modern science. Turning raw scientific data into knowledge will be the key to future scientific discoveries. A typical data-intensive science project proceeds through several major steps.
While there are challenging research problems in each step, the focus of this project is on exploratory analysis.
An essential part of exploratory analysis is the use of non-parametric (or semi-parametric) data mining techniques. These techniques enable scientists to train accurate prediction models of complex natural processes even in the absence of a complete understanding of these processes. Using such models for analysis is preferable to summarizing the raw data directly: a model with good generalization performance does not overfit to noise or to a specific data sample and hence captures the natural process more accurately. Unfortunately, complex prediction models are not intelligible per se. They cannot be used directly to answer questions like 'Which environmental features have the strongest effect on bird abundance and how do they interact?'. To make data mining models 'digestible' and to provide end users with new hypotheses, we need to 'open up the black box', i.e., provide tools for determining important relationships that the model has learned. This can be done by summarizing a complex model with simpler patterns like partial dependence functions. The number of such model summaries is overwhelming: each 'slice' or 'dice' of a lower-dimensional subspace of the original data space could contain an interesting model summary.
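To make the idea of a partial dependence function concrete, here is a minimal sketch in Python. It is an illustration only, not the project's implementation: `toy_model`, the two feature names, and the data values are all hypothetical stand-ins for a trained black-box model and real observations. The function forces one feature to each value on a grid, leaves all other features as observed, and averages the model's predictions.

```python
from typing import Callable, List, Sequence

Row = Sequence[float]


def partial_dependence(
    predict: Callable[[Row], float],
    data: List[Row],
    feature: int,
    grid: Sequence[float],
) -> List[float]:
    """Average prediction over the data with one feature forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in data:
            modified = list(row)
            modified[feature] = value  # override only the feature of interest
            total += predict(modified)
        curve.append(total / len(data))
    return curve


# Hypothetical stand-in for a trained black-box model (an assumption for
# illustration): feature 0 = temperature, feature 1 = forest cover.
def toy_model(row: Row) -> float:
    temperature, forest_cover = row
    return 2.0 * forest_cover + 0.5 * temperature * forest_cover


# Made-up observations, not real survey data.
data = [(10.0, 0.2), (20.0, 0.5), (30.0, 0.8)]
pd_forest = partial_dependence(toy_model, data, feature=1, grid=[0.0, 0.5, 1.0])
print(pd_forest)
```

Each (feature, grid) choice yields one such curve, and restricting `data` to a subset of the data space yields yet another, which is why the number of candidate summaries grows so quickly.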
The goal of this project is to develop techniques for finding the most 'interesting' model summaries automatically and efficiently. The work proceeds in three steps. First, we formalize the notion of interestingness for a wide variety of pattern types. Second, we develop a declarative language for specifying these interestingness measures. With a declarative language, users define what they find interesting, but they need not specify how to find it efficiently. Third, we develop an optimizing compiler for a small language fragment. A major research challenge is to strike the right balance between the expressiveness of the language and its amenability to effective query optimization.
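The separation between what is interesting and how to find it can be sketched as follows. This is a hypothetical illustration in Python, not the project's declarative language: the `TopKQuery` class, the measure names `effect_range` and `max_step`, and the sample curves are all assumptions. The user states only a measure and a result size; the `evaluate` function is free to choose the execution strategy (here, a simple heap-based top-k selection).

```python
import heapq
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Curve = List[float]

# Hypothetical interestingness measures; names and definitions are illustrative.
MEASURES: Dict[str, Callable[[Curve], float]] = {
    # Difference between the largest and smallest value of a summary curve.
    "effect_range": lambda c: max(c) - min(c),
    # Largest change between two neighboring grid points.
    "max_step": lambda c: max((abs(b - a) for a, b in zip(c, c[1:])), default=0.0),
}


@dataclass(frozen=True)
class TopKQuery:
    """Declarative part: WHAT the user finds interesting, not HOW to find it."""
    measure: str
    k: int


def evaluate(query: TopKQuery, summaries: Dict[str, Curve]) -> List[Tuple[str, float]]:
    """Execution part: score every summary and keep the k highest-scoring ones."""
    score = MEASURES[query.measure]
    scored = ((name, score(curve)) for name, curve in summaries.items())
    return heapq.nlargest(query.k, scored, key=lambda item: item[1])


# Made-up summary curves, keyed by the feature they describe.
summaries = {
    "temperature": [1.0, 1.5, 2.0],
    "forest_cover": [0.0, 6.0, 12.0],
    "elevation": [3.0, 3.1, 3.0],
}
top = evaluate(TopKQuery(measure="effect_range", k=2), summaries)
print(top)
```

Because the query names only the measure and `k`, an optimizer could replace the exhaustive scan in `evaluate` with a cheaper plan without changing what the user wrote.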
The results of this project will pave the way for powerful exploratory analysis tools. They will also enable future research on optimizers and user-friendly interfaces for the declarative language. The approach will be validated using the rich data resources being organized by the ornithological community in the Avian Knowledge Network (AKN). This will substantially improve the ability to identify the most significant environmental variables that affect biodiversity on the planet. A fragment of the language will be available to the public through Web services on the AKN Web site. This will enable a broad audience, from educators and land managers to researchers and interested citizens, to derive novel knowledge from the gathered data resources. For example, land managers could discover the possible impact of their decisions on an ecosystem's health.
A simple prototype of a pattern search engine for complex data mining models is already online. This prototype assumes that all model summaries have been precomputed. It also relies on a relational database running on a single CPU for pattern ranking. We are in the process of completing a new prototype that implements all functionality (training models, creating summaries, ranking summaries, and generating visualizations of selected summaries) in parallel on a MapReduce cluster running Hadoop.
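As a rough illustration of how summary ranking might be phrased in the MapReduce style, the following Python sketch simulates the map, shuffle, and reduce phases in a single process. It does not use Hadoop and is not the prototype's code: the record layout, the species and feature names, and the scoring rule (range of the summary curve) are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str, List[float]]  # (species, feature, summary curve)
Scored = Tuple[float, str]             # (interestingness score, feature)


def map_phase(record: Record) -> Iterator[Tuple[str, Scored]]:
    """Score one summary and emit it under its species key."""
    species, feature, curve = record
    yield species, (max(curve) - min(curve), feature)


def reduce_phase(species: str, values: List[Scored], k: int) -> Tuple[str, List[Scored]]:
    """Keep the k highest-scoring summaries for one species."""
    return species, sorted(values, reverse=True)[:k]


def run(records: Iterable[Record], k: int) -> Dict[str, List[Scored]]:
    # Shuffle step: group mapper output by key, as a MapReduce framework would.
    groups: Dict[str, List[Scored]] = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(key, values, k) for key, values in groups.items())


# Made-up records for illustration only.
records = [
    ("wood_thrush", "forest_cover", [0.0, 6.0, 12.0]),
    ("wood_thrush", "elevation", [3.0, 3.5, 3.0]),
    ("wood_thrush", "temperature", [1.0, 1.5, 2.0]),
    ("horned_lark", "forest_cover", [5.0, 2.0, 1.0]),
]
ranked = run(records, k=2)
print(ranked)
```

On a real cluster the mappers and reducers would run on different machines and each reducer would see only the summaries for its own keys; the single-process loop here only mimics that data flow.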
This material is based upon work supported by the National Science Foundation under Grant Nos. 0427914, 0748626, and 0920869. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
B. Panda, M. Riedewald, and D. Fink. The Model Summary Problem and a Solution for Trees. To appear in Proc. IEEE Int. Conf. on Data Engineering (ICDE), 2010.
D. Fink, W. M. Hochachka, B. Zuckerberg, D. W. Winkler, B. Shaby, M. A. Munson, G. Hooker, M. Riedewald, D. Sheldon, and S. Kelling. Spatiotemporal Exploratory Models for Broad-Scale Survey Data. Under review.
A. Lachmann and M. Riedewald. Finding Relevant Patterns in Bursty Sequences. In Proc. of the VLDB Endowment (PVLDB), 1(1):78-89, 2008.
M. Riedewald. Finding Patterns in Large-Scale Observational Data. Invited talk at the Int. Conf. on Computational Sustainability, Working Group on Species Distribution, June 2009.
M. Riedewald. Finding Patterns in Large-Scale Observational Data. Poster presentation at the Int. Conf. on Computational Sustainability, June 2009.
D. Fink, W. Hochachka, M. Herzog, and N. Nur. Predicting Abundance from Bird Monitoring Data Across Large Landscapes: Strategies for Spatial Interpolation Using Environmental Information. 125th Meeting of the American Ornithologists' Union, Portland, Oregon, 2008.
D. Fink, W. Hochachka, and N. Nur. Exploring bird monitoring data to guide management and research decisions: Predicting relative abundance with decision trees. 4th International Partners in Flight Conference, McAllen, Texas, 2008.
M. Riedewald, R. Caruana, D. Fink, W. Hochachka, S. Kelling, A. Munson, B. Shaby, and D. Sorokina. Tracking Environmental Change through the Data Resources of the Bird-monitoring Community. Poster presentation at the Microsoft eScience Workshop at RENCI, 2007.
Mirek Riedewald (PI)
Daniel Fink (co-PI)
Alper Okcan (Northeastern U. Ph.D. student)
Wesley M. Hochachka (Cornell Lab of Ornithology)
Giles Hooker (Cornell Dept. of Biological Statistics and Computational Biology)
Steve Kelling (Cornell Lab of Ornithology)
Kevin Webb (Cornell Lab of Ornithology)
Biswanath Panda (Cornell Ph.D. student while working on the project)
Sahib S. Dhindsa (Cornell ISST undergrad student while working on the project)
Alexander Lachmann (visiting CS undergrad student while working on the project)