CS 6240: Large-Scale Parallel Data Processing

Covers big-data-analysis techniques that scale out with an increasing number of compute nodes, e.g., for cloud computing. Focuses on approaches for problem and data partitioning that distribute work effectively while keeping the total cost of computation and data transfer low. Deterministic and randomized algorithms from a variety of domains, including graphs, data mining, linear algebra, and information retrieval, are studied and analyzed in terms of their cost, scalability, and robustness against skew. Coursework emphasizes hands-on programming experience with modern, state-of-the-art big-data-processing technology. Students who do not meet the course prerequisites may seek permission of the instructor.


News

Most aspects of the course are managed through Northeastern Canvas (https://northeastern.instructure.com/). There you will find links to most other relevant services.

In addition to Northeastern Canvas, we will use Amazon's Canvas instance. This is how Amazon now makes its free AWS credits available for courses.

Please do not email course-related questions or concerns. By default, post course-related inquiries using the Discussions feature in Canvas. For questions about protected information such as your grades, use the Canvas Inbox feature and address the instructor and/or TAs directly. For re-grade requests, use the corresponding Gradescope feature.


Course Material

Please read the syllabus carefully.

Go to this page for the online modules. Please make sure you go through the material before the week in which it is discussed in class.

Office Hours

See Zoom meeting links and course calendar in Canvas.

Important Dates