CS 6240: Parallel Data Processing in MapReduce

This course covers techniques for analyzing very large data sets. We introduce the MapReduce programming model and the core technologies it relies on in practice, such as a distributed file system. Related approaches and technologies from distributed databases and Cloud Computing will also be introduced. Particular emphasis is placed on practical examples and hands-on programming experience. Both plain MapReduce and database-inspired advanced programming models running on top of a MapReduce infrastructure will be used.


News

Link to Piazza discussion forum: https://piazza.com/northeastern/spring2013/cs6240/home

Acknowledgment: This course was kindly supported by an AWS in Education Coursework Grant award from Amazon.com, Inc.

[04/08/2013] All lecture audio and slides up to Apr 2 lecture on Blackboard


Lectures

(Future lectures and events are tentative.)

Date Topic Remarks and Reading Assignments
Jan 8 Introduction, simple algorithms, measures of success  
Jan 15 Performance measures, Amdahl's law, MapReduce overview, word count, combiner, partitioner Read the Google MapReduce paper. Look carefully at the word count example and make sure you can explain how the computation works.
Jan 22 Failure handling, equi-join, reverse Web graph, inverted index, anatomy of a MapReduce execution, Google File System Read the relevant chapters in White's book: 1. Meet Hadoop, 2. MapReduce, 3. The Hadoop Distributed File System, 4. Hadoop I/O.
Read the Google File System paper.
Jan 29 Hadoop specifics, sorting, in-mapper combining, Pairs and Stripes Read the relevant chapters in White's book: 5. Developing a M-R Application, 6. How M-R Works, 7. M-R Types and Formats, 8. M-R Features.
Consult the Lin/Dyer and the Miner/Shook books about the design patterns.
Feb 5 Relative frequencies through order inversion, secondary sort through value-to-key conversion, Pig and PigLatin Read the relevant chapters in White's book: 11. Pig.
Consult the Lin/Dyer and the Miner/Shook books about the design patterns.
Feb 12 Relational databases, HBase, Hive Read the relevant chapters in White's book: 12. Hive, 13. HBase. For more details about HBase, consult the George book.
Feb 19 HW 2 discussion, Graph algorithms: single source shortest path, PageRank introduction Read the appropriate sections in the Lin/Dyer book (see below). Create a small example graph and manually run the MapReduce programs on the example to better understand what happens in each iteration.
Feb 26 PageRank, data mining in MapReduce: sampling, clustering Read more about PageRank here. For more information about data mining, check out my CS 6220 page. There are slides summarizing various mainstream data mining approaches and a list of recommended textbooks.
Mar 5 No class: Spring Break  
Mar 12 Data mining in MapReduce: classification and prediction For more information about data mining, check out my CS 6220 page. There are slides summarizing various mainstream data mining approaches and a list of recommended textbooks.
Mar 19 Midterm exam Same time and location as lecture.
Mar 26 Discussion of midterm solutions; ensemble predictions and how to cover all combinations; matrix multiplication and machine learning For more information about machine learning techniques that rely of matrix manipulations read this paper.
Apr 2 Testing and tuning; theta joins in MapReduce Read more about testing and tuning in White's book. The theta-join technique is discussed in our paper.
Apr 9 Project progress presentations  
Apr 16    
Apr 23 Project presentations  

Course Information

Instructor: Mirek Riedewald

TA: Moonyoung Kang

Meeting times: Tue 6 - 9 PM
Meeting location: 210 Shillman Hall

Prerequisites

CS 5800 or CS 7800, or consent of instructor

Grading

Reading Materials

  1. "Hadoop: The Definitive Guide" by Tom White, 3rd edition. (Available from Safari Books Online at http://0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu/.)
  2. "MapReduce Design Patterns" by Donald Miner and Adam Shook (Available from Safari Books Online at http://0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu/.)
  3. "Hadoop in Practice" by Alex Holmes (Available from Safari Books Online at http://0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu/.)
  4. "Hadoop in Action" by Chuck Lam  (Available from Safari Books Online at http://0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu/.)
  5. "Data-Intensive Text Processing with MapReduce" by Jimmy Lin and Chris Dyer. (Available online, see http://www.umiacs.umd.edu/~jimmylin/book.html for info.)
  6. "HBase: The Definitive Guide" by Lars George. (Available from Safari Books Online at http://0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu/.)
  7. Check out Yahoo!'s Hadoop tutorial for additional information. Notice that it uses the old MapReduce API.

Academic Integrity Policy

A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.

For more information, please refer to the Academic Integrity Web page.