CS 6240: Parallel Data Processing in MapReduce

This course covers techniques for analyzing very large data sets. We introduce the MapReduce programming model and the core technologies it relies on in practice, such as a distributed file system. We also introduce related approaches and technologies from distributed databases and cloud computing. Particular emphasis is placed on practical examples and hands-on programming experience. We will use both plain MapReduce and database-inspired advanced programming models that run on top of a MapReduce infrastructure.


News

Acknowledgment: This course was kindly supported by an AWS in Education Grant award from Amazon.com, Inc.


Lectures

(Future lectures and events are tentative.)

Week of | Topic | Remarks
Jan 13 | Syllabus; Overview: data and hardware trends; Cloud computing |
Jan 20 | Scalability and metrics; Amdahl's Law; Google File System, Hadoop's HDFS | Assignment 1 out. Due Feb 1.
Jan 27 | MapReduce and Hadoop |
Feb 3 | Fundamental Techniques (Includes: in-mapper combining, sorting, secondary sorting) | Assignment 2 out. Due Feb 15.
Feb 10 | Basic Algorithms (Includes: order inversion, per-record computation, group-by, global counters, random sampling and shuffling, quantiles, top-k) |
Feb 17 | Basic Algorithms: Advanced (Includes: reduce-side join, replicated join, semi-join with Bloom filter) | Assignment 3 out. Due Mar 1.
Feb 24 | Pig and Pig Latin |
Mar 3 | Relational Databases | Assignment 4 out. Due Mar 22.
Mar 10 | No class. Spring Break. |
Mar 17 | CAP theorem; HBase; Hive | Project starts: team forming, proposal. Due Mar 29.
Mar 24 | Midterm exam |
Mar 31 | Graph Algorithms (Includes: single-source shortest path, PageRank) | Project progress report assignment out. Due Apr 12.
Apr 7 | Intelligent Partitioning (Includes: Pairs and Stripes, theta-join) |
Apr 14 | Data Mining 1: clustering, classification | Project final report assignment out. Due Apr 26. Project presentation assignment out. Due Apr 27.
Apr 21 | Data Mining 2: ensemble methods, regression, matrix manipulation |
Apr 28 | Project presentations | Same time and location as lecture.

Course Information

Instructor: Mirek Riedewald

TAs:

Meeting times and location: check registrar system for up-to-date info

Prerequisites

CS 5800 or CS 7800, or consent of instructor

Grading

Reading Materials

  1. "Hadoop: The Definitive Guide" by Tom White, 3rd edition. (Available from Safari Books Online.)
  2. "MapReduce Design Patterns" by Donald Miner and Adam Shook. (Available from Safari Books Online.)
  3. "Hadoop in Practice" by Alex Holmes. (Available from Safari Books Online.)
  4. "Hadoop in Action" by Chuck Lam. (Available from Safari Books Online.)
  5. "Data-Intensive Text Processing with MapReduce" by Jimmy Lin and Chris Dyer. (Available online, see http://www.umiacs.umd.edu/~jimmylin/book.html for info.)
  6. "HBase: The Definitive Guide" by Lars George. (Available from Safari Books Online.)
  7. For additional information, see Yahoo!'s Hadoop tutorial. Note that it uses the old MapReduce API.

Safari Books Online at NEU: http://proquest.safaribooksonline.com.ezproxy.neu.edu/ (this URL may have changed in the meantime)

Academic Integrity Policy

A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.

For more information, please refer to the Academic Integrity Web page.