CS 6220: Data Mining Techniques

This course covers various aspects of data mining including data preprocessing, classification, ensemble methods, association rules, sequence mining, and cluster analysis. The class project involves hands-on practice of mining useful knowledge from a large data set.


News

[04/12/2012] All class material is now available online. Good luck with the project and final exam preparation!


Lectures

(Future lectures and events are tentative.)

| Date | Topic | Remarks and Homework |
|---|---|---|
| Jan 9 | Introduction; Data Preprocessing | Read chapters 1 (introduction), 2 (getting to know your data), and 3 (preprocessing) in the textbook. To get started with Weka and experience what we discussed in the lecture, do the optional homework that is available on Blackboard. |
| Jan 16 | No class: MLK Day | |
| Jan 23 | Classification/Prediction: decision trees and overfitting | Read relevant sections in chapter 8 (classification: basic concepts). Make sure you know what overfitting means and that you understand the overfitting example (big versus small tree) we discussed in class. |
| Jan 30 | Decision trees; statistical learning theory | Read relevant sections in chapter 8 (classification: basic concepts). For more information, also look at references [1] for trees and [5] for statistical decision theory (see below). |
| Feb 6 | Statistical learning theory; nearest neighbor; Bayes' theorem | Read relevant sections in chapter 8 (classification: basic concepts) and 9 (classification: advanced methods). For more information, also look at reference [5] for statistical decision theory (see below). |
| Feb 7 | HW 1 due (11pm) | |
| Feb 13 | Naive Bayes; joint distribution; Bayesian networks | Read relevant sections in chapter 8 (classification: basic concepts) and 9 (classification: advanced methods). For more information about Bayesian classification, also look at the other references, e.g., [2] and [6]. Go over the Naive Bayes example and the examples for computing probabilities of interest from the joint probability table. Make sure you can compute such probabilities for a given example. |
| Feb 20 | No class: Presidents' Day | |
| Feb 24 | Extra class: makeup day for Monday holidays. Bayesian networks; artificial neural networks | Read relevant sections in chapter 9 (classification: advanced methods). For more information about Bayesian classification, also look at the other references, e.g., [2] and [6]. Go carefully over the Bayesian network computation examples and also the backpropagation example in the textbook. |
| Feb 27 | SVMs; regression | Read relevant sections in chapter 9 (classification: advanced methods). If you are interested in more information about SVMs, let me know and I can point you to an interesting survey article. |
| Mar 1 | HW 2 due (11pm) | |
| Mar 5 | No class: Spring Break | |
| Mar 12 | Midterm Exam | Same time and location as lectures. |
| Mar 19 | Accuracy and error measures; ensemble methods | Study the various model quality measures. Read section 8.5 (model evaluation and selection) and the beginning of section 8.6 (techniques to improve classification accuracy). |
| Mar 26 | Ensemble methods; frequent pattern mining | Read section 8.6 (techniques to improve classification accuracy) and relevant sections in chapter 6 (mining frequent patterns, associations, and correlations). Run the Apriori algorithm manually on a small example. Observe when and how it is pruning the search space. |
| Apr 2 | Frequent pattern mining | Read the relevant sections in chapters 6 (mining frequent patterns, associations, and correlations) and 7 (advanced frequent pattern mining). For FP-growth and sequence mining, focus on the main ideas, not the algorithmic details. |
| Apr 7 | Project report 1 due (11pm) | |
| Apr 9 | Clustering | Read the relevant sections in chapters 10 (cluster analysis: basic concepts and methods) and 11 (advanced cluster analysis). |
| Apr 14 | Project report 2 due (11pm) | |
| Apr 16 | No class: Patriots' Day | |
| Apr 19 | Final project report due (11pm) | |
| Apr 23 | Final exam | |
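
For the Mar 26 exercise (running Apriori manually on a small example), a minimal Python sketch like the one below may help you check your hand computation. It is not part of the official course material. The toy transactions, the `min_support` value, and the function name `apriori` are all hypothetical choices made for illustration. The printed line per level shows how many candidates the join step generated and how many remained after subset-based pruning.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets.

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        pruned = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        print(f"k={k}: {len(candidates)} candidates, {len(pruned)} after pruning")
        # Count support only for the candidates that survived pruning.
        counts = {c: sum(1 for t in transactions if c <= t) for c in pruned}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy data set, not taken from the course.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(transactions, min_support=3)
for itemset, count in sorted(frequent_itemsets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

At level 3 of this toy example, two of the three joined candidates are discarded without scanning the data because one of their 2-item subsets is infrequent. This is the pruning behavior the exercise asks you to observe.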

Course Information

Instructor: Mirek Riedewald

TA: We have no TA this semester :-(

Lecture times: Mon 6 - 9 PM
Lecture location: Ryder Hall 429

Prerequisites

CS 5800 or CS 7800, or consent of instructor

Grading

Textbook

Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann, 2011

Recommended books for further reading:

  1. "Data Mining" by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (http://www-users.cs.umn.edu/~kumar/dmbook/index.php)
  2. "Machine Learning" by Tom Mitchell (http://www.cs.cmu.edu/~tom/mlbook.html)
  3. "Introduction to Machine Learning" by Ethem Alpaydin (http://www.cmpe.boun.edu.tr/~ethem/i2ml/)
  4. "Pattern Classification" by Richard O. Duda, Peter E. Hart, and David G. Stork (http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471056693.html)
  5. "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (http://www-stat.stanford.edu/~tibs/ElemStatLearn/)
  6. "Pattern Recognition and Machine Learning" by Christopher M. Bishop (http://research.microsoft.com/en-us/um/people/cmbishop/prml/)

Academic Integrity Policy

A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.

For more information, please refer to the Academic Integrity Web page.