CS 6240: Large-Scale Parallel Data Processing

Covers big-data analysis techniques that scale out with an increasing number of compute nodes, e.g., for cloud computing. Focuses on approaches for problem and data partitioning that distribute work effectively while keeping the total cost of computation and data transfer low. Deterministic and randomized algorithms from a variety of domains, including graphs, data mining, linear algebra, and information retrieval, are studied and analyzed in terms of their cost, scalability, and robustness against skew. Coursework emphasizes hands-on programming experience with state-of-the-art big-data processing technology. Students who do not meet course prerequisites may seek permission of the instructor.


Course Material

Week 1

Introduction (read by end of week 1)

Parallel Processing Basics (read by end of week 1). Answer all self-check quiz questions by end of week 1.

Week 2

Introduction to Distributed Services (read by end of week 1)

Distributed File System (read by end of week 1)

Resource and Application Management (read by end of week 1)

Start working on HW 1 (due by the end of week 3).

Week 3

Overview of MapReduce and Spark (read by end of week 2)

Submit HW 1 by the end of week 3. Check for a possible early submission bonus.

Week 4

Fundamental Techniques (read by end of week 3)

Handout: Defining New Value and Key Types in Hadoop MapReduce

Start working on HW 2 (due by the end of week 5).

Week 5

Joins (read by end of week 4)

Submit HW 2 by the end of week 5. Check for a possible early submission bonus.

Week 6

Common Algorithm Building Blocks (read by end of week 5)

Start working on HW 3 (due by the end of week 7).

Week 7

Graph Algorithms (read by end of week 6)

Submit HW 3 by the end of week 7. Check for a possible early submission bonus.

Start exploring project topics and finding teammates. You can already start working on your project.

Week 8

Data Mining 1: K-means, decision trees (read by end of week 7)

Start working on HW 4 (due by the end of week 9).

Week 9

Data Mining 2: Ensembles (read by end of week 8)

Submit HW 4 by the end of week 9. Check for a possible early submission bonus.

Week 10

Intelligent Partitioning (read by end of week 9)

By now you should be working on the project. You must have your project team and topic selected by the end of week 10.

Week 11

More about Spark (read by end of week 10)

Keep working on the project.

Week 12

Exam

Submit project intermediate report by the end of week 12. Check for a possible early submission bonus.

Week 13

Beyond MapReduce and Spark: CAP, HBase, and Hive (read after discussion in lecture)

Week 14

Submit project deliverables by the end of week 14. Check for a possible early submission bonus.

Week 15

Project presentations