CS 6240: Parallel Data Processing in MapReduce

This course covers techniques for analyzing very large data sets. We introduce the MapReduce programming model and the core technologies it relies on in practice, such as a distributed file system. Related approaches and technologies from distributed databases and Cloud Computing will also be introduced. Particular emphasis is placed on practical examples and hands-on programming experience. Both plain MapReduce and database-inspired advanced programming models running on top of a MapReduce infrastructure will be used.


News

[12/01/2011] Materials from Nov 30 lecture posted
[11/10/2011] Materials from Nov 9 lecture posted
[11/03/2011] Slides and audio from Nov 2 lecture posted
[10/20/2011] Audio from Oct 19 lecture posted on Blackboard


Lectures

Larger version of slides (2 per page)

(Future lectures and events are tentative.)

Date | Topic | Remarks and Reading Assignments
September 7 | Introduction and first parallel algorithms |
September 14 | More parallel algorithms; MapReduce | Read the Google MapReduce paper. Look carefully at the word count and equi-join examples and make sure you can explain how the computation works.
September 21 | MapReduce algorithm examples; handling failures | Go over all the MapReduce algorithms we discussed in class and read the discussion about sorting in chapter 8 of the Tom White book. Then try to write the pseudo-code for Map and Reduce for all problems without looking at the lecture notes. Finally, look at the Grep.java and Sort.java examples that come with the Hadoop distribution. Match the Java code with your pseudo-code and execute it for some example data.
September 26 | HW 1 due at 11pm | Submit it through Blackboard.
September 28 | MapReduce; Google File System; Hadoop specifics | Read the Google File System paper. Read chapters 1, 2, 3, 4, 5, 6, and 7 in the Tom White book.
October 5 | Pig; MapReduce design patterns | Read the Pig paper. Read chapters 8 and 11 in the Tom White book.
October 12 | MapReduce design patterns; joins | Read the appropriate sections in the Lin/Dyer book (see below). For the joins, take a look at our paper.
October 19 | Joins | Read our paper.
October 20 | HW 2 due at 11pm |
October 26 | Midterm exam (6-8pm in usual classroom) |
November 2 | Graph algorithms | Read the appropriate sections in the Lin/Dyer book (see below). Create a small example graph and manually run the MapReduce program on the example to better understand what happens in each iteration.
November 3 | Project proposals due at 11pm |
November 9 | Dryad; databases | Read the papers on Dryad, DryadLINQ, and parallel databases. For more information about transactions, consult any standard database textbook.
November 16 | Project progress presentations in class |
November 23 | No class: Thanksgiving. |
November 30 | GPU computing (by Perhaad Mistry); Parallel computing classics | It is important that you go through this excellent tutorial on parallel computing from LLNL. Also make sure you read this overview article on GPU computing. Perhaad's research is discussed in more detail in this paper. And MPI and OpenMP are discussed in two other nice tutorials at LLNL.
December 2 | Project reports due at 11pm |
December 7 | Final lecture | Project presentations in class
December 14 | Final exam 6-8pm in usual classroom |

Course Information

Instructor: Mirek Riedewald

TA: none

Meeting times: Wed 6 - 9 PM
Meeting location: Ryder Hall 429

Prerequisites

CS 5800 or CS 7800, or consent of instructor

Grading

Reading Materials

  1. "Hadoop: The Definitive Guide" by Tom White, 2nd edition. (Available from Safari Books Online at http://0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu/.)
  2. "Hadoop in Action" by Chuck Lam. (Available from Safari Books Online at http://0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu/.)
  3. "Data-Intensive Text Processing with MapReduce" by Jimmy Lin and Chris Dyer. (Available online, see http://www.umiacs.umd.edu/~jimmylin/book.html for info.)

Academic Integrity Policy

A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.

For more information, please refer to the Academic Integrity Web page.