CS 6220: Data Mining Techniques

This course covers various aspects of data mining including data preprocessing, classification, ensemble methods, association rules, sequence mining, and cluster analysis. The class project involves hands-on practice of mining useful knowledge from a large data set.

News

[04/12/2012] All class material is now available online. Good luck with the project and final exam preparation!

Lectures

(Future lectures and events are tentative.)

Date	Topic	Remarks and Homework
Jan 9	Introduction; Data Preprocessing	Read chapters 1 (introduction), 2 (getting to know your data), and 3 (preprocessing) in the textbook. To get started with Weka and experience what we discussed in the lecture, do the optional homework that is available on Blackboard.
Jan 16	No class: MLK Day
Jan 23	Classification/Prediction: decision trees and overfitting	Read relevant sections in chapter 8 (classification: basic concepts). Make sure you know what overfitting means and that you understand the overfitting example (big versus small tree) we discussed in class.
Jan 30	Decision trees; statistical learning theory	Read relevant sections in chapter 8 (classification: basic concepts). For more information, also look at references [1] for trees and [5] for statistical decision theory (see below).
Feb 6	Statistical learning theory; nearest neighbor; Bayes' theorem	Read relevant sections in chapters 8 (classification: basic concepts) and 9 (classification: advanced methods). For more information, also look at reference [5] for statistical decision theory (see below).
Feb 7	HW 1 due (11pm)
Feb 13	Naive Bayes; joint distribution; Bayesian networks	Read relevant sections in chapters 8 (classification: basic concepts) and 9 (classification: advanced methods). For more information about Bayesian classification, also look at the other references, e.g., [2] and [6]. Go over the Naive Bayes example and the examples for computing probabilities of interest from the joint probability table. Make sure you can compute such probabilities for a given example.
Feb 20	No class: Presidents' Day
Feb 24	Extra class (makeup day for Monday holidays): Bayesian networks; artificial neural networks	Read relevant sections in chapter 9 (classification: advanced methods). For more information about Bayesian classification, also look at the other references, e.g., [2] and [6]. Go carefully over the Bayesian network computation examples and also the backpropagation example in the textbook.
Feb 27	SVMs; regression	Read relevant sections in chapter 9 (classification: advanced methods). If you are interested in more information about SVMs, let me know and I can point you to an interesting survey article.
Mar 1	HW 2 due (11pm)
Mar 5	No class: Spring Break
Mar 12	Midterm Exam	Same time and location as lectures.
Mar 19	Accuracy and error measures; ensemble methods	Study the various model quality measures. Read section 8.5 (model evaluation and selection) and the beginning of section 8.6 (techniques to improve classification accuracy).
Mar 26	Ensemble methods; frequent pattern mining	Read section 8.6 (techniques to improve classification accuracy) and relevant sections in chapter 6 (mining frequent patterns, associations, and correlations). Run the Apriori algorithm manually on a small example. Observe when and how it is pruning the search space.
Apr 2	Frequent pattern mining	Read the relevant sections in chapters 6 (mining frequent patterns, associations, and correlations) and 7 (advanced frequent pattern mining). For FP-growth and sequence mining, focus on the main ideas, not the algorithmic details.
Apr 7	Project report 1 due (11pm)
Apr 9	Clustering	Read the relevant sections in chapters 10 (cluster analysis: basic concepts and methods) and 11 (advanced cluster analysis).
Apr 14	Project report 2 due (11pm)
Apr 16	No class: Patriots' Day
Apr 19	Final project report due (11pm)
Apr 23	Final exam
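
For the Mar 26 exercise (running Apriori by hand and observing when it prunes the search space), the following minimal Python sketch may help you check your manual work. It is not course material: the function name `apriori`, the toy transactions, and the support threshold are all assumptions made for illustration, and the candidate generation is a simple union-based join rather than the optimized version described in the textbook.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets.

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count every single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: a candidate survives only if every (k-1)-subset
        # is itself frequent (the Apriori property).
        survivors = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        print(f"k={k}: {len(candidates)} candidates, "
              f"{len(candidates) - len(survivors)} pruned before counting")
        # Count support only for the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in survivors}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy data set, chosen only for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(transactions, min_support=3)
for itemset, count in sorted(frequent_itemsets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The per-level message shows how many candidates are discarded by the subset check alone, before any pass over the data. Comparing those numbers with your hand-run trace is one way to see where the pruning happens.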

Course Information

Instructor: Mirek Riedewald

Office hours: Wednesday 10:30am-noon in 332 WVH
Send email to set up an appointment if you cannot make it during these times.

TA: We have no TA this semester :-(

Lecture times: Mon 6 - 9 PM
Lecture location: Ryder Hall 429

Prerequisites

CS 5800 or CS 7800, or consent of instructor

Grading

Homework: 40%
Midterm exam: 25%
Final exam: 30%
Participation: 5%

Textbook

Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann, 2011

Academic Integrity Policy

A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.

For more information, please refer to the Academic Integrity Web page.