CS 6240: Parallel Data Processing in MapReduce

This course covers techniques for analyzing very large data sets. We introduce the MapReduce programming model and the core technologies it relies on in practice, such as a distributed file system. We also introduce related approaches and technologies from distributed databases and cloud computing. Particular emphasis is placed on practical examples and hands-on programming experience. We use both plain MapReduce and database-inspired advanced programming models that run on top of a MapReduce infrastructure.


News

Link to Piazza discussion forum: https://piazza.com/northeastern/spring2014/cs6240/home

Acknowledgment: This course was kindly supported by an AWS in Education Grant award from Amazon.com, Inc.

[04/11/2014] Reminder: no regular class on 4/15. Instead we have a double class on 4/22, from 11:30am until 4:35pm in our regular lecture hall.
[04/11/2014] All slides and audio of all lectures are now on Blackboard.


Lectures

(Future lectures and events are tentative.)

Date | Topic | Remarks and Reading Assignments
Jan 7 | Syllabus and overview; introduction; simple algorithms; measures of success; Amdahl's Law | Read more about data centers and "data center as a computer" here.
Jan 14 | MapReduce overview: distributed file system, Word Count, anatomy of a MapReduce execution, partitioner, failure handling, Hadoop specifics | Read the Google File System paper. Read the Google MapReduce paper. Look carefully at the word count example and make sure you can explain how the computation works. For a detailed discussion, consult the relevant chapters in White's book. For a more compact discussion, consult the Lin/Dyer book.
Jan 21 | Fundamental techniques: combiner and in-mapper combining, sorting, secondary sort | Make sure you can explain in detail how the sorting algorithm works. For a detailed discussion about sorting, consult the relevant chapters in White's book. For in-mapper combining and secondary sort, consult the Lin/Dyer book.
Jan 28 | Algorithm examples and helper functions (order inversion, sampling, quantiles etc.) | Consult the Miner/Shook book about some of the helper functions discussed.
Feb 4 | More algorithm examples (equi-join); Pig and PigLatin | Consult the Miner/Shook book about some of the algorithms discussed. Consult the Lin/Dyer and the Miner/Shook books about the design patterns. Read the following chapter in White's book: 11. Pig.
Feb 11 | Relational databases; CAP; HBase; Hive | Take a look at the appropriate chapters in [M. Tamer Ozsu and Patrick Valduriez. Principles of Distributed Database Systems. Springer, 2011. Third edition.] to learn more about relational databases in a distributed context. Read the following chapters in White's book: 12. Hive, 13. HBase. For more details about HBase, consult the George book.
Feb 18 | Graph algorithms | Read the appropriate sections in the Lin/Dyer book. Create a small example graph and manually run the MapReduce programs on the example to better understand what happens in each iteration. Read more about PageRank here.
Feb 25 | Intelligent partitioning: Pairs and Stripes, theta-joins | Read more about Pairs and Stripes in the Lin/Dyer book. The theta-join technique is discussed in our research paper.
Mar 4 | No class: Spring Break |
Mar 11 | Midterm exam | Same time and location as lecture.
Mar 18 | Data mining in MapReduce (clustering, classification) | For more information about data mining, check out my CS 6220 page. There are slides summarizing various mainstream data mining approaches and a list of recommended textbooks.
Mar 25 | Data mining in MapReduce (ensemble methods, regression, matrix manipulation for machine learning) | For more information about machine learning techniques that rely on matrix manipulations read this paper.
Apr 1 | Testing, tuning, and analysis; case studies: search log analysis, HBase for indexing/sorting | Read more about testing and tuning in White's book.
Apr 8 | Classic view of parallel computing vs. MapReduce | Take a look at the parallel computing tutorial by LLNL. There are similar tutorials about MPI and OpenMP.
Apr 15 | No class: Moved to Apr 22 |
Apr 22 | Project presentations | Double class: 11:30am to 4:35pm
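
The Jan 14 reading asks you to be able to explain how the Word Count computation works. As a study aid, here is a minimal in-memory sketch of the three stages (map, shuffle, reduce) in plain Python. It is not Hadoop code and is not taken from the lecture material: the function names (map_phase, shuffle, reduce_phase, word_count) and the whitespace-based tokenization are assumptions made for illustration only.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (hypothetical stand-in
    for what the framework does between the map and reduce stages)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the list of counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a dict of
    {doc_id: text}. A real cluster would run many map and reduce
    tasks in parallel on different machines."""
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_phase(doc_id, text))
    grouped = shuffle(intermediate)
    # Visit keys in sorted order, mirroring the sorted key order
    # in which a reducer receives its input.
    return dict(reduce_phase(w, c) for w, c in sorted(grouped.items()))


if __name__ == "__main__":
    docs = {"d1": "a b a", "d2": "b c"}
    print(word_count(docs))
```

Tracing a small input like the one in the main block by hand, and writing down the intermediate (word, 1) pairs and the grouped lists, is a quick way to check your understanding before reading the detailed discussion in the books.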

Course Information

Instructor: Mirek Riedewald

TAs:

Meeting times: Tue 1:35 - 4:35 PM
Meeting location: check registrar system for up-to-date info

Prerequisites

CS 5800 or CS 7800, or consent of instructor

Grading

Reading Materials

  1. "Hadoop: The Definitive Guide" by Tom White, 3rd edition. (Available from Safari Books Online.)
  2. "MapReduce Design Patterns" by Donald Miner and Adam Shook. (Available from Safari Books Online.)
  3. "Hadoop in Practice" by Alex Holmes. (Available from Safari Books Online.)
  4. "Hadoop in Action" by Chuck Lam. (Available from Safari Books Online.)
  5. "Data-Intensive Text Processing with MapReduce" by Jimmy Lin and Chris Dyer. (Available online, see http://www.umiacs.umd.edu/~jimmylin/book.html for info.)
  6. "HBase: The Definitive Guide" by Lars George. (Available from Safari Books Online.)
  7. Check out Yahoo!'s Hadoop tutorial for additional information. Note that it uses the old MapReduce API.

Safari Books Online at NEU: http://proquest.safaribooksonline.com.ezproxy.neu.edu/ (might have changed in the meantime)

Academic Integrity Policy

A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.

For more information, please refer to the Academic Integrity Web page.