CS 6240: Large-Scale Parallel Data Processing

Covers big-data analysis techniques that scale out with an increasing number of compute nodes, e.g., for cloud computing. Focuses on approaches for problem and data partitioning that distribute work effectively while keeping the total cost of computation and data transfer low. Deterministic and randomized algorithms from a variety of domains, including graphs, data mining, linear algebra, and information retrieval, are studied and analyzed in terms of their cost, scalability, and robustness against skew. Coursework emphasizes hands-on programming experience with modern, state-of-the-art big-data processing technology. Students who do not meet course prerequisites may seek permission of instructor.


Course Material

Week 1

Introduction (read ASAP in week 1)

Parallel Processing Basics (read ASAP in week 1)

Start working on HW 1 (due by the end of week 2).

Week 2

Introduction to Distributed Services (read by end of week 1)

Distributed File System (read by end of week 1)

Resource and Application Management (read by end of week 1)

Submit HW 1 by the end of week 2. Check for a possible early submission bonus.

Week 3

Overview of MapReduce and Spark (read by end of week 2)

Start working on HW 2 (due by the end of week 4).

Week 4

Joins (read by end of week 3)

Submit HW 2 by the end of week 4. Check for a possible early submission bonus.

Week 5

Fundamental Techniques (read by end of week 4)

Start working on HW 3 (due by the end of week 6).

Week 6

Common Algorithm Building Blocks (read by end of week 5)

Submit HW 3 by the end of week 6. Check for a possible early submission bonus.

Week 7

Graph Algorithms (read by end of week 6)

Start working on HW 4 (due by the end of week 8).

Start exploring project topics and looking for teammates. You may begin working on your project now.

Week 8

Data Mining 1: K-means, decision trees (read by end of week 7)

Submit HW 4 by the end of week 8. Check for a possible early submission bonus.

Week 9

Data Mining 2: Ensembles (read by end of week 8)

By now you should be working on the project. You must have your project team and topic selected by the end of week 9.

Week 10

Intelligent Partitioning (read by end of week 9)

Keep working on the project.

Week 11

More about Spark (read by end of week 10)

Submit project intermediate report by the end of week 11. Check for a possible early submission bonus.

Week 12

Exam

Week 13

Beyond MapReduce and Spark: CAP, HBase, and Hive (you may read this after the discussion in class)

Week 14

Submit project deliverables by the end of week 14. Check for a possible early submission bonus.

Week 15

Project presentations