CS 6240: Parallel Data Processing in MapReduce

This course covers techniques for analyzing very large data sets. We introduce the MapReduce programming model and the core technologies it relies on in practice, such as a distributed file system. Related approaches and technologies from distributed databases and Cloud Computing will also be introduced. Particular emphasis is placed on practical examples and hands-on programming experience. Both plain MapReduce and database-inspired advanced programming models running on top of a MapReduce infrastructure will be used.


News

Link to Piazza discussion forum: https://piazza.com/northeastern/fall2012/cs6240/home

Acknowledgment: This course was kindly supported by an AWS in Education Coursework Grant award from Amazon.com, Inc.

[12/11/2012] Lecture audio updated


Lectures

(Future lectures and events are tentative.)

Date | Topic | Remarks and Reading Assignments
Sep 11 | Introduction, simple algorithms, measures of success |
Sep 18 | MapReduce, word count, equi-join, handling failures | Read the Google MapReduce paper. Look carefully at the word count and equi-join examples and make sure you can explain how the computation works.
Sep 25 | Reverse Web graph, inverted index, sorting, Google File System | Read the relevant chapters in White's book. Read the Google File System paper.
Oct 2 | Hadoop specifics, MapReduce Design Patterns | Read the relevant chapters in White's book. Read the appropriate sections in the Lin/Dyer book (see below). Try to re-write the word count example so that it uses the Local Aggregation design pattern.
Oct 9 | Design Patterns | Read the appropriate sections in the Lin/Dyer book (see below). Go through the Order Inversion design pattern in detail by using an example like the relative bird color counts we discussed in class.
Oct 16 | Design Patterns, Theta-Joins in MapReduce | Read the appropriate sections in the Lin/Dyer book (see below). For the joins, take a look at our paper.
Oct 23 | Graph Algorithms | Read the appropriate sections in the Lin/Dyer book (see below). Create a small example graph and manually run the MapReduce programs on the example to better understand what happens in each iteration.
Oct 30 | Graph Algorithms; Pig | Read the Pig paper and the corresponding chapter in the Tom White book.
Nov 6 | Midterm Exam | Same time and location as lecture.
Nov 13 | HW 2 discussion; Databases |
Nov 20 | Project and midterm discussion; Databases, HBase, and Hive; Reducing Map-to-Reduce data transfer | Read more about HBase and Hive in the books by Tom White and Lars George (see below).
Nov 27 | Project progress presentations |
Dec 4 | MapReduce for Machine Learning; Parallel Computing Landscape | Read more about MapReduce for machine learning in this paper. LLNL has a good overview of high-performance computing and MPI.
Dec 11 | Final project presentations |
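
For readers who want a concrete picture of the word count computation listed for Sep 18, the following is a minimal single-machine sketch in Python. It is not Hadoop code and not the version used in class; the names map_fn, reduce_fn, and run_word_count are hypothetical, and the "shuffle" is only an in-memory dictionary standing in for what the framework would do across machines.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit (word, 1) for every whitespace-separated word in one line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts that were grouped under one word."""
    return word, sum(counts)


def run_word_count(lines):
    """Simulate map, shuffle, and reduce in a single process (illustration only)."""
    # Shuffle: group intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for word, one in map_fn(offset, line):
            groups[word].append(one)
    # Reduce: one call per distinct key, visited in sorted key order.
    return dict(reduce_fn(word, counts) for word, counts in sorted(groups.items()))


if __name__ == "__main__":
    print(run_word_count(["a b a", "b c"]))
```

Each map call sees one line and knows nothing about the others; all coordination happens in the grouping step, which is why the same two functions could in principle run on many machines.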

Course Information

Instructor: Mirek Riedewald

TA: Alper Okcan

Meeting times: Tue 6 - 9 PM
Meeting location: 425 Shillman Hall

Prerequisites

CS 5800 or CS 7800, or consent of instructor

Grading

Reading Materials

  1. "Hadoop: The Definitive Guide" by Tom White, 3rd edition. (Available from Safari Books Online at http://0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu/.)
  2. "Hadoop in Action" by Chuck Lam. (Available from Safari Books Online at http://0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu/.)
  3. "Data-Intensive Text Processing with MapReduce" by Jimmy Lin and Chris Dyer. (Available online, see http://www.umiacs.umd.edu/~jimmylin/book.html for info.)
  4. "HBase: The Definitive Guide" by Lars George. (Available from Safari Books Online at http://0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu/.)
  5. Check out Yahoo!'s Hadoop tutorial for additional information. Note that it uses the old MapReduce API.

Academic Integrity Policy

A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.

For more information, please refer to the Academic Integrity Web page.