Time: Mondays 6pm - 9pm
Room: Snell Library 035
Jan-Willem van de Meent [personal page]
Email:
Phone: +1 617 373 7696
Office Hours: WVH 478, Monday 4.00pm - 5.30pm (or by appointment)
Yuan Zhong [personal page]
Email:
Office Hours: WVH 462 & 466B, Friday 3.00pm - 4.30pm
This course introduces a range of techniques in data mining and unsupervised machine learning.
This course is designed for MS students in computer science. Lectures will focus on developing a mathematical and algorithmic understanding of the methods commonly employed to solve unsupervised machine learning and data mining problems. Homework problem sets will ask students to implement algorithms and/or work out examples.
Students will also collaborate on a project in which they must complete a data mining task from start to finish, including preprocessing of data, analysis, and visualization of results.
CS 5800 or CS 7800, or consent of instructor. Students without the prerequisites should email a CV and transcripts to the instructor. If these materials are acceptable, the student will be asked to complete the self-test prior to admission to the course.
Students are expected to have a good working knowledge of basic linear algebra, probability, statistics, and algorithms. A self-test will be provided in the first lecture to assess background knowledge. Students who have not taken CS 5800 or CS 7800 should email a CV and transcript to the instructor and will then be asked to complete the self-test prior to admission to the course.
This class is not structured to directly follow the outline of a textbook. The schedule will list chapters from a number of textbooks as background reading for each lecture, as well as additional materials. Students are expected to read the materials in preparation for each lecture.
[HTF] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer 2013. [pdf]
[LRU] Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2014 [pdf]
[TSK] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, 2005. [ch6, ch8]
[Aggarwal] Charu C. Aggarwal, Data Mining: The Textbook, Springer 2015. [pdf]
The HTF and LRU books are freely available from the authors’ websites. The Aggarwal book is available online to Northeastern students.
The homework in this class will consist of 5 problem sets. Submissions must be made via Blackboard by 11.59pm on the due date. Please upload a single ZIP file containing both the source code for programming problems and the PDF files for math problems. Name this zip file:
Please observe the following guidelines:
Math Problems
Please submit math exercises as PDF files (preferably in LaTeX).
Programming Problems
You may use any programming language you like, as long as your submission contains clear instructions on how to compile and run the code.
Data File Path: Do not use absolute paths for data files in your code. Add a data folder to your project and refer to it using a relative path.
Third-Party JARs: If you use any third-party JAR, make sure you include it in your submission.
Clarity: When coding up multiple variants of an algorithm, ensure that your code is properly factored into small, readable and clearly commented functions.
The TAs can deduct points for submissions that do not meet these guidelines at their discretion.
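As an illustration of the data file path guideline above, here is a minimal sketch in Python (any language is acceptable for submissions; Python is used here only as an example). The helper name `data_path` and the file name `ratings.csv` are hypothetical and not part of the course materials.

```python
from pathlib import Path
from typing import Optional


def data_path(filename: str, base: Optional[Path] = None) -> Path:
    """Build the path to a file inside the project's data folder.

    `base` is the project folder. If it is omitted, the folder that
    contains this script is used, so no machine-specific location is
    ever hard-coded in the source.
    """
    root = base if base is not None else Path(__file__).resolve().parent
    return root / "data" / filename


# Hypothetical usage: "ratings.csv" stands in for whatever file a
# problem set provides.
# with open(data_path("ratings.csv")) as f:
#     rows = f.readlines()
```

Resolving the data folder relative to the script, rather than the current working directory, is one possible design choice; it lets the code run unchanged regardless of where the grader unpacks the ZIP file or launches it from.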
The goal of the project is to gain hands-on experience with analysis of a real-life dataset of your choice. You should select a problem and a dataset that can be analyzed using methods covered in class. The project should be conducted in groups of 2-4 people. Each group should work independently, but you are welcome to discuss technical issues on Piazza. Completion of the project will include a project proposal, two milestone project updates, a report, and a review of the project by another team.
Students are expected to attend lectures and actively participate by asking questions. While students are required to complete homework programming exercises individually, helping fellow students by explaining course material is encouraged. At the end of the semester, students will be able to indicate which of their peers most contributed to their understanding of the material, and bonus points will be awarded based on this feedback.
The final grade for this course will be weighted as follows:
Bonus points earned through class participation will be used to adjust the final grade upwards at the discretion of the instructor.
Students will be asked to indicate the amount of time spent on each homework, as well as the project. They will also be able to indicate what they think went well and what they think did not go well. There will also be an opportunity to provide feedback on the class after the midterm exam.
Note: This schedule is subject to change and will be adjusted as needed throughout the semester.
| Day | Lectures | Homework & Project | Reading |
|---|---|---|---|
| 09 Jan | Introduction & Background | Self-test out | Aggarwal: 1.1-2.3; LRU: 1-2; Murray & Ghahramani Crib Sheet; CS 229 Notes on Probability & Linear Algebra |
| 16 Jan | (no class: Martin Luther King Day) | HW1 out; Self-test due (Fri) | |
| 23 Jan | Frequent Item Sets & Association Rules | HW2 out; HW1 due (Fri) | TSK: 6.1-6.6 |
| 30 Jan | Bayesian Regression, Gaussian Processes | Teams due (Fri) | Rasmussen & Williams: 1, 2, 4.1-4.2 |
| 06 Feb | Kernel Regression (cont), Linear Dimensionality Reduction | HW3 out; HW2 due (Fri) | LRU: 11.1-11.3; HTF 14.5; Cunningham & Ghahramani; van der Maaten & Hinton |
| 13 Feb | K-means, Hierarchical Clustering, DBSCAN | Abstracts due (Fri) | TSK: 8.1-8.4 |
| 20 Feb | (no class: Presidents' Day) | HW3 due (Fri) | |
| 27 Feb | Evaluating Clustering, Gaussian Mixtures, EM | HW4 out; Proposals due (Fri) | Grosse & Srivastava |
| 06 Mar | (no class: Spring Break) | | |
| 13 Mar | Midterm exam (1h10m) | Milestone 1 due (Sun) | |
| 20 Mar | Topic Models (2h), Project Pitches (1h) | HW4 due (Sun) | Blei |
| 27 Mar | Community Detection | Milestone 2 due (Sun) | LRU: 10.1-10.3; Fortunato: I-VII |
| 03 Apr | Link Analysis, Hidden Markov Models | | LRU: 5; HMM notes |
| 10 Apr | Recommender Systems | Reports due (Sun) | Aggarwal: 18.5 |
| 17 Apr | (no class: Patriots' Day) | Peer reviews due (Sun) | |
| 24 Apr | Exam week starts | | |