CS 6240: Parallel Data Processing in MapReduce

This course covers techniques for analyzing very large data sets. We introduce the MapReduce programming model and the core technologies it relies on in practice, such as a distributed file system. Related approaches and technologies from distributed databases and Cloud Computing will also be introduced. Particular emphasis is placed on practical examples and hands-on programming experience. Both plain MapReduce and database-inspired advanced programming models running on top of a MapReduce infrastructure will be used.


News

Link to Piazza discussion forum: https://piazza.com/northeastern/fall2013/cs6240/home

Acknowledgment: This course was kindly supported by an AWS in Education Grant award from Amazon.com, Inc.

[11/20/2013] Audio and slides from 11/19 lecture available on Blackboard


Lectures

(Future lectures and events are tentative.)

Date Topic Remarks and Reading Assignments
Sep 10 Syllabus and overview, introduction, simple algorithms Read more about data centers and "data center as a computer" here.
Sep 17 Measures of success, Amdahl's Law, Google File System, MapReduce overview, Word Count Read the Google File System paper. Read the Google MapReduce paper. Look carefully at the word count example and make sure you can explain how the computation works. Read the following chapters in White's book: 1. Meet Hadoop, 2. MapReduce, 3. The Hadoop Distributed File System, 4. Hadoop I/O.
Sep 24 Anatomy of a MapReduce execution, combiner, partitioner, failure handling, Hadoop specifics Read the following chapters in White's book: 5. Developing a M-R Application, 6. How M-R Works, 7. M-R Types and Formats, 8. M-R Features.
Oct 1 Algorithm examples and helper functions (sorting, sampling etc.), design patterns (in-mapper combining) Make sure you can explain in detail how the sorting algorithm works. Consult the Miner/Shook book about some of the helper functions discussed. Consult the Lin/Dyer and the Miner/Shook books about the design patterns.
Oct 8 More algorithm examples (quantiles, equi-join) and more design patterns (Pairs and Stripes) Consult the Miner/Shook book about some of the algorithms discussed. Consult the Lin/Dyer and the Miner/Shook books about the design patterns.
Oct 15 More design patterns (order inversion, secondary sort), Pig and PigLatin Consult the Lin/Dyer and the Miner/Shook books about the design patterns. Read the following chapter in White's book: 11. Pig.
Oct 22 Graph algorithms Read the appropriate sections in the Lin/Dyer book. Create a small example graph and manually run the MapReduce programs on the example to better understand what happens in each iteration. Read more about PageRank here.
Oct 29 Relational databases, HBase  Take a look at the appropriate chapters in [M. Tamer Ozsu and Patrick Valduriez. Principles of Distributed Database Systems. Springer, 2011. Third edition.] to learn more about relational databases in a distributed context. Read the following chapter in White's book: 13. HBase. For more details about HBase, consult the George book.
Nov 5 Midterm exam Same time and location as lecture.
Nov 12 Hive, data mining in MapReduce (clustering, classification) Read the following chapter in White's book: 12. Hive. For more information about data mining, check out my CS 6220 page. There are slides summarizing various mainstream data mining approaches and a list of recommended textbooks.
Nov 19 Data mining in MapReduce (ensemble methods, matrix manipulation for machine learning) For more information about data mining, check out my CS 6220 page. There are slides summarizing various mainstream data mining approaches and a list of recommended textbooks. For more information about machine learning techniques that rely on matrix manipulations read this paper.
Nov 26 Project progress presentations  
Dec 3 Theta joins in MapReduce, testing and tuning, classic view of parallel computing vs. MapReduce  The theta-join technique is discussed in our paper. Read more about testing and tuning in White's book. Take a look at the nice parallel computing tutorial by LLNL. There are similar tutorials about MPI and OpenMP.
Dec 10 Project presentations  

Course Information

Instructor: Mirek Riedewald

TAs:

Meeting times: Tue 6 - 9 PM
Meeting location: check registrar system for up-to-date info

Prerequisites

CS 5800 or CS 7800, or consent of instructor

Grading

Reading Materials

  1. "Hadoop: The Definitive Guide" by Tom White, 3rd edition. (Available from Safari Books Online.)
  2. "MapReduce Design Patterns" by Donald Miner and Adam Shook (Available from Safari Books Online.)
  3. "Hadoop in Practice" by Alex Holmes (Available from Safari Books Online.)
  4. "Hadoop in Action" by Chuck Lam  (Available from Safari Books Online.)
  5. "Data-Intensive Text Processing with MapReduce" by Jimmy Lin and Chris Dyer. (Available online, see http://www.umiacs.umd.edu/~jimmylin/book.html for info.)
  6. "HBase: The Definitive Guide" by Lars George. (Available from Safari Books Online.)
  7. Check out Yahoo!'s Hadoop tutorial for additional information. Notice that it uses the old MapReduce API.

Safari Books Online at NEU: http://proquest.safaribooksonline.com.ezproxy.neu.edu/ (might have changed in the meantime)

Academic Integrity Policy

A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.

For more information, please refer to the Academic Integrity Web page.