CSG 339: Google, eScience, and the Cloud--Scalable Techniques for Massive Data

The "How much Information?" project at UC Berkeley in 2003 estimated that stored information worldwide grew by about 30% a year between 1999 and 2002. Internet search companies, large online retailers, and social networking sites are looking for new approaches to managing their large data collections. At the same time, data-intensive science is emerging as a new paradigm concerned with collecting, archiving, and analyzing the vast amounts of data produced and accumulated by modern science. This includes gigabytes, even petabytes, of data generated by scientific instruments and sensors, human observers, and computer simulations. Turning raw scientific data into knowledge will be key to future scientific discoveries.

Turning raw data into knowledge and making it broadly available requires innovative approaches to scalable data management and data mining. One major difference from previous decades is that the era of exponentially increasing processor speed has ended. Hence, the current trend is toward multi-core and cluster architectures as the main platforms for scalable data processing.

In this course we will discuss recent papers that introduce innovative solutions coming from both industry and academia. Where appropriate, classic papers and landmark results will be presented to provide the necessary background for the latest research results.

Course Information

Instructor: Mirek Riedewald

Meeting times: Tue, Fri 11:45 - 1:25

Requirements

This course is ideally suited for Ph.D. students at all levels, in particular first- and second-year students looking for research topics. There are no formal prerequisites. Knowledge of important database and operating system concepts, e.g., as covered by an undergraduate course, will be helpful but is not essential.

Interested Master's students need to contact the instructor prior to enrolling.

Coursework

Project Milestones

Lectures

Jan 6: Google's MapReduce

Jan 9: Google File System

Jan 13: Bigtable

Jan 16: Sawzall

Jan 20: Pig and Pig Latin

Jan 23: From MapReduce to SQL, MapReduce performance study

Jan 27: Dryad and DryadLINQ

Jan 30: Overview of Cloud computing and the Grid

Feb 3: Introduction to data mining for eScience

Feb 6: Tree models

Feb 10: X-raying complex mining models

Feb 13: Parallel data mining

Feb 17, 20: Sequence mining

Feb 24, 27, Mar 10: Parallel databases

Mar 13: Distributed DBMS fault tolerance

Mar 17: Consensus in distributed systems

Mar 20: Classic DB solutions and new consistency notions

Mar 24: Amazon's technology

Mar 27: Potpourri (due to re-scheduling)

Mar 31: Yahoo! technology

Apr 3: Clustera and Sinfonia

Apr 7: Managing models in a DBMS

Apr 10: Probabilistic DBMS

Apr 14: Other interesting DB trends

Apr 17: Other interesting DB trends (cont.)