CS 6240: Parallel Data Processing in MapReduce

This course covers techniques for analyzing very large data sets. We introduce the MapReduce programming model and the core technologies it relies on in practice, such as a distributed file system. Related approaches and technologies from distributed databases and Cloud Computing will also be introduced. Particular emphasis is placed on practical examples and hands-on programming experience. Both plain MapReduce and database-inspired advanced programming models running on top of a MapReduce infrastructure will be used.

News

Link to Piazza discussion forum: https://piazza.com/northeastern/spring2014/cs6240/home

Acknowledgment: This course was kindly supported by an AWS in Education Grant award from Amazon.com, Inc.

[04/11/2014] Reminder: no regular class on 4/15. Instead we have a double class on 4/22, from 11:30am until 4:35pm in our regular lecture hall.
[04/11/2014] All slides and audio of all lectures are now on Blackboard.

Lectures

(Future lectures and events are tentative.)

Date	Topic	Remarks and Reading Assignments
Jan 7	Syllabus and overview; introduction; simple algorithms; measures of success; Amdahl's Law	Read more about data centers and "data center as a computer" here.
Jan 14	MapReduce overview: distributed file system, Word Count, anatomy of a MapReduce execution, partitioner, failure handling, Hadoop specifics	Read the Google File System paper. Read the Google MapReduce paper. Look carefully at the word count example and make sure you can explain how the computation works. For a detailed discussion, consult the relevant chapters in White's book. For a more compact discussion, consult the Lin/Dyer book.
Jan 21	Fundamental techniques: combiner and in-mapper combining, sorting, secondary sort	Make sure you can explain in detail how the sorting algorithm works. For a detailed discussion about sorting, consult the relevant chapters in White's book. For in-mapper combining and secondary sort, consult the Lin/Dyer book.
Jan 28	Algorithm examples and helper functions (order inversion, sampling, quantiles etc.)	Consult the Miner/Shook book about some of the helper functions discussed.
Feb 4	More algorithm examples (equi-join); Pig and PigLatin	Consult the Miner/Shook book about some of the algorithms discussed. Consult the Lin/Dyer and the Miner/Shook books about the design patterns. Read the following chapter in White's book: 11. Pig.
Feb 11	Relational databases; CAP; HBase; Hive	Take a look at the appropriate chapters in [M. Tamer Ozsu and Patrick Valduriez. Principles of Distributed Database Systems. Springer, 2011. Third edition.] to learn more about relational databases in a distributed context. Read the following chapters in White's book: 12. Hive, 13. HBase. For more details about HBase, consult the George book.
Feb 18	Graph algorithms	Read the appropriate sections in the Lin/Dyer book. Create a small example graph and manually run the MapReduce programs on the example to better understand what happens in each iteration. Read more about PageRank here.
Feb 25	Intelligent partitioning: Pairs and Stripes, theta-joins	Read more about Pairs and Stripes in the Lin/Dyer book. The theta-join technique is discussed in our research paper.
Mar 4	No class: Spring Break
Mar 11	Midterm exam	Same time and location as lecture.
Mar 18	Data mining in MapReduce (clustering, classification)	For more information about data mining, check out my CS 6220 page. There are slides summarizing various mainstream data mining approaches and a list of recommended textbooks.
Mar 25	Data mining in MapReduce (ensemble methods, regression, matrix manipulation for machine learning)	For more information about machine learning techniques that rely on matrix manipulations read this paper.
Apr 1	Testing, tuning, and analysis; case studies: search log analysis, HBase for indexing/sorting	Read more about testing and tuning in White's book.
Apr 8	Classic view of parallel computing vs. MapReduce	Take a look at the parallel computing tutorial by LLNL. There are similar tutorials about MPI and OpenMP.
Apr 15	No class: Moved to Apr 22
Apr 22	Project presentations	Double class: 11:30am to 4:35pm
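
The Jan 14 reading asks you to explain how the word count computation works. As a rough study aid, here is a minimal single-process Python sketch of the map, shuffle, and reduce steps. It is not Hadoop code and does not model the distributed file system, partitioner, or failure handling. The function names (map_phase, shuffle, reduce_phase, word_count) and the whitespace tokenization are assumptions made for illustration only.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: emit one (word, 1) pair per word in a single input record."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key and sort the keys, which is the job the
    framework performs between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(word, counts):
    """Reducer: sum the list of counts received for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over {doc_id: text}."""
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_phase(doc_id, text))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    docs = {"d1": "the quick fox", "d2": "the lazy dog the end"}
    for word, total in word_count(docs).items():
        print(word, total)
```

Tracing a two-document input by hand through each of the three functions is one way to check that you can explain what each phase contributes.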

Course Information

Instructor: Mirek Riedewald

Office hours: Monday 3:00-4:30pm in 332 WVH
Send an email (copying the TAs) to set up an appointment if you cannot attend during these times.

TAs:

Rundong Li
- Office hours: Thursday 1:00-2:00pm and Friday 10am-noon in 472 WVH
Tejal Borkar
- Office hours: Wednesday 11am-1pm and Thursday 11am-noon in the CCIS computer lab

Meeting times: Tue 1:35 - 4:35 PM
Meeting location: check registrar system for up-to-date info

Prerequisites

CS 5800 or CS 7800, or consent of instructor

Grading

Homework/project: 50%
Midterm exam: 40%
Participation: 5%
Review quizzes: 5%

Reading Materials

"Hadoop: The Definitive Guide" by Tom White, 3rd edition. (Available from Safari Books Online.)
"MapReduce Design Patterns" by Donald Miner and Adam Shook. (Available from Safari Books Online.)
"Hadoop in Practice" by Alex Holmes. (Available from Safari Books Online.)
"Hadoop in Action" by Chuck Lam. (Available from Safari Books Online.)
"Data-Intensive Text Processing with MapReduce" by Jimmy Lin and Chris Dyer. (Available online, see http://www.umiacs.umd.edu/~jimmylin/book.html for info.)
"HBase: The Definitive Guide" by Lars George. (Available from Safari Books Online.)
Check out Yahoo!'s Hadoop tutorial for additional information. Notice that it uses the old MapReduce API.

Safari Books Online at NEU: http://proquest.safaribooksonline.com.ezproxy.neu.edu/ (might have changed in the meantime)

Academic Integrity Policy

A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.

For more information, please refer to the Academic Integrity Web page.