CS 6220: Data Mining Techniques

This course covers various aspects of data mining including data preprocessing, classification, ensemble methods, association rules, sequence mining, and cluster analysis. The class project involves hands-on practice of mining useful knowledge from a large data set.


News

[04/22/2010] Slides from April 21 lecture and new homework posted
[04/20/2010] Final Milestone 4 results are now available
[04/19/2010] Test set 2 for Milestone 4 is now available
[04/18/2010] New Milestone 4 results on Test data 1 posted
[04/18/2010] Quiz 6 solution is now available
[04/16/2010] Slides from April 14 lecture and new homework posted
[04/13/2010] Current Milestone 4 results on Test data 1 posted
[04/12/2010] Slides with comments about project milestones and quizzes posted
[04/08/2010] Slides from April 7 lecture and new homework posted
[04/07/2010] Milestone 4 training and first test data set are now posted
[04/06/2010] Project Milestone 4 and data description document posted
[04/02/2010] Slides from March 31 lecture and new homework posted
[04/02/2010] Quiz 5 solution is now available
[03/26/2010] Slides from March 24 lecture and new homework posted
[03/19/2010] Slides from March 17 lecture and new homework posted
[03/19/2010] Project Milestone 3 posted
[03/12/2010] Quiz 4 solution is now available
[03/12/2010] Slides from March 10 lecture and new homework posted
[03/09/2010] Quiz 3 solution is now available
[02/28/2010] Project Milestone 2 posted
[02/26/2010] Slides from February 24 lecture and new homework posted
[02/22/2010] Quiz 2 solution is now available
[02/18/2010] Slides from February 17 lecture and new homework posted
[02/07/2010] Project Milestone 1 and Java code for creating plots posted
[02/05/2010] Quiz 1 solution is now available
[02/05/2010] Slides from February 3 lecture and new homework posted
[01/29/2010] Slides from January 27 lecture and new homework posted
[01/26/2010] Homework for January 13 and 20 lectures posted
[01/25/2010] ***Room change***; starting this week, lectures will take place in 108 WVH
[01/21/2010] Slides from January 20 lecture posted
[01/20/2010] Office hours are now determined
[01/15/2010] Slides from January 13 lecture posted


Lectures

(Future lectures and events are tentative.)

Version of slides with space for notes:

Date Topic Remarks and Homework
January 13 Introduction; Data Preprocessing Read chapters 1 and 2 in the book.
January 20 Data Preprocessing (cont.); Classification and Prediction Read relevant sections in chapter 6.

Homework exercises: 2.2, 2.4(a,b,d,f,g), 2.6, 2.7(a), 2.9, 2.12, 2.13, 2.14, 6.1

For the example on slide 30, train the corresponding decision tree "by hand" using information gain for selecting a split attribute. Make sure you have at least one path of length 3 (root, inner node, leaf) or longer.
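If you want to double-check the arithmetic of your manual computation, the following small Java sketch shows how entropy and information gain are computed for one candidate split. The class counts in main() are made-up illustrative numbers, not the data from slide 30.

public class InfoGain {
    // Entropy (in bits) of a class distribution given by absolute counts.
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    public static void main(String[] args) {
        int[] parent = {9, 5};                        // class counts before the split (illustrative)
        int[][] branches = {{2, 3}, {4, 0}, {3, 2}};  // class counts in each branch after the split
        int total = 14;

        double infoBefore = entropy(parent);
        double infoAfter = 0.0;
        for (int[] b : branches) {
            int n = b[0] + b[1];
            infoAfter += (double) n / total * entropy(b);
        }
        System.out.println("Information gain = " + (infoBefore - infoAfter));
    }
}

Computing this gain for every candidate attribute and picking the largest one is exactly the selection step you perform at each node of the tree.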
January 27 Classification and Prediction Read relevant sections in chapter 6. For the statistical decision theory part, take a look at the Hastie et al. book (book 5 below). However, it does not show the derivation.

Homework exercises: 6.2
February 3 Classification and Prediction Read relevant sections in chapter 6. For Bayesian networks, the book by Alpaydin (book 3 below) has a little more information than the textbook.

Homework exercises: 6.5

Compute P(buys_computer = 'yes' | X) and P(buys_computer = 'no' | X) for the data sample X and the training data given on slide 84. Do this without looking at the solution on slide 85. Once you are done, compare your results.

For the same data on slide 84, estimate the following conditional probabilities: P(age > 40 AND income = 'low' | buys_computer = 'no') and P(age > 40 AND income = 'low' AND student = 'no' | buys_computer = 'no'). How much would you trust these estimates to be close to the true probabilities?

Compute P(wealth = 'poor' | gender = 'female') and P(wealth = 'poor' | gender = 'male') for the example data on slide 102. Now compute P(wealth = 'rich' | gender = 'female'). What is the relationship between P(wealth = 'poor' | gender = 'female') and P(wealth = 'rich' | gender = 'female')?
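If you would like to verify your conditional-probability computations, here is a minimal Java sketch of naive Bayesian classification for categorical attributes. The tiny training table and the sample X below are illustrative only; they are not the data from slide 84, and no smoothing is applied.

public class NaiveBayesSketch {
    public static void main(String[] args) {
        // Each row: age, income, student, buys_computer (class label). Illustrative data only.
        String[][] data = {
            {"<=30",   "high",   "no",  "no"},
            {"<=30",   "high",   "no",  "no"},
            {"31..40", "high",   "no",  "yes"},
            {">40",    "medium", "no",  "yes"},
            {">40",    "low",    "yes", "yes"},
            {">40",    "low",    "yes", "no"},
        };
        String[] x = {">40", "low", "yes"};   // sample X to classify (age, income, student)

        for (String label : new String[]{"yes", "no"}) {
            long nClass = java.util.Arrays.stream(data)
                    .filter(r -> r[3].equals(label)).count();
            // P(C) times the product of P(x_i | C), assuming the attributes
            // are conditionally independent given the class (naive Bayes).
            double score = (double) nClass / data.length;
            for (int i = 0; i < x.length; i++) {
                final int attr = i;
                long nMatch = java.util.Arrays.stream(data)
                        .filter(r -> r[3].equals(label) && r[attr].equals(x[attr]))
                        .count();
                score *= (double) nMatch / nClass;
            }
            System.out.println("P(X | " + label + ") * P(" + label + ") = " + score);
        }
    }
}

The class with the larger score is the predicted class; dividing each score by their sum gives the estimated posterior probabilities P(buys_computer | X).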
February 10 No class due to snow storm (campus closed after noon)  
February 17 Classification and Prediction Read relevant sections in chapter 6. For understanding neural networks in more depth, Mitchell's Machine Learning book (book 2 below) is one of the best resources.

Homework exercises: 6.3

Practice inference with the Bayesian network on slides 114 and 115 for different examples, e.g., P(S | R and ~T). Compute P(M and L and R and S and T).

For the function (X1 AND X2), go through a few iterations of the perceptron training algorithm on slide 140. Use the training data set {(1,1,1), (1,-1,-1), (-1, 1, -1), (-1, -1, -1)} (each tuple encodes (X1, X2, Y), where Y is the target output), learning rate eta=0.15, and initial weights w0 = -1, w1 = -1, and w2 = 1. Remember that you have to apply the sign function to the linear function to get a prediction of +1 or -1 (see slide 137). Visualize what happens by plotting the points and the decision boundary in a 2-dimensional coordinate system as on slide 138 (X1 on the horizontal axis, X2 on the vertical axis). Plot the decision boundary every time you have completely processed a single training tuple. Remember that you might have to iterate through the training data set multiple times. Consider writing a small program that computes the new weights (a sketch is given after this set of exercises).

Now try training a perceptron for the same setup, except that you are now using function (X1 XOR X2) with training tuples {(1,1,-1), (1,-1,1), (-1, 1, 1), (-1, -1, -1)}. What happens?

Now use the case-updating algorithm and gradient descent (slides 142, 143) for the same XOR data set. Remember that function f is now the unthresholded simple weighted sum. Just run a few iterations and see how this algorithm is different from using the perceptron training rule. Again, writing a small program that computes the new weights when processing a training tuple might be worthwhile, both to understand the algorithm and to avoid tedious manual computations.
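Here is a minimal Java sketch of the perceptron exercise above, assuming the standard training rule w_i <- w_i + eta*(y - sign(w^T*x))*x_i with a constant bias input of 1; check slide 140 for the exact form used in class. It uses the AND training data, eta = 0.15, and the initial weights from the exercise. For the case-updating gradient-descent variant, replace the thresholded prediction in the weight update by the unthresholded weighted sum.

public class PerceptronSketch {
    public static void main(String[] args) {
        double[][] X = {{1, 1}, {1, -1}, {-1, 1}, {-1, -1}};  // (X1, X2)
        int[] Y = {1, -1, -1, -1};                            // target output for X1 AND X2
        double eta = 0.15;
        double w0 = -1, w1 = -1, w2 = 1;                      // initial weights from the exercise

        for (int epoch = 0; epoch < 10; epoch++) {
            int errors = 0;
            for (int t = 0; t < X.length; t++) {
                double out = w0 + w1 * X[t][0] + w2 * X[t][1];
                int pred = out >= 0 ? 1 : -1;                 // apply the sign function (slide 137)
                if (pred != Y[t]) {
                    errors++;
                    w0 += eta * (Y[t] - pred);                // bias input is always 1
                    w1 += eta * (Y[t] - pred) * X[t][0];
                    w2 += eta * (Y[t] - pred) * X[t][1];
                }
                System.out.printf("epoch %d, tuple %d: w = (%.2f, %.2f, %.2f)%n",
                        epoch, t, w0, w1, w2);
            }
            if (errors == 0) break;   // converged: AND is linearly separable
        }
    }
}

Running the same loop on the XOR data never reaches an error-free epoch, which is exactly the point of the second part of the exercise.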
February 19   Project Milestone 1 due at 4pm
Resources: Milestone 1 description and plotter code
February 24 Classification and Prediction Read relevant sections in chapter 6. More in-depth information about SVMs can be found in Hastie et al. (book 5 below) and in a nice survey paper by Burges (or search the web for "tutorial on support vector machines for pattern recognition"). Notice how the two use different notation to denote the same concepts.

Homework:

To appreciate why the optimization criterion of Quadratic Programming is indeed quadratic, do the following: Consider a vector u = (u1, u2, u3)^T, i.e., a vector with 3 rows and 1 column. Set q = 2, d = (1, 2, 3)^T, and let R be a matrix with 3 rows and 3 columns, such that each entry in the matrix is equal to 1. Now compute the expression q + d^T*u + u^T*R*u, where operator "*" is the standard matrix multiplication.
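To check your result: with the given values, d^T*u = 1*u1 + 2*u2 + 3*u3, and since every entry of R is 1, u^T*R*u = (u1 + u2 + u3)^2. The whole expression is therefore 2 + u1 + 2*u2 + 3*u3 + (u1 + u2 + u3)^2, i.e., a quadratic function of the entries of u.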

In class we discussed how to make a set of tuples (x,y), where x is the input attribute and y is the output, linearly separable by using a simple quadratic basis function (slides 199-201). Now consider again the (X1 XOR X2) example from last week. Propose a transformation phi(X1, X2) that uses one or more polynomial basis functions (i.e., adds attributes that are computed as polynomials over X1 and X2) to make the four training tuples linearly separable. Try to find the simplest possible transformation, e.g., see if you can do it by adding a single new dimension. Recall that for linear separability you need to be able to separate the tuples of class 1 from those of class -1 with a hyperplane. Which hyperplane would work for your transformation? Can you also find the maximum-margin hyperplane for your solution?

For the same XOR classification problem above, write down the complete SVM optimization problem--optimization goal and all constraints--using the primal version with the slack variables (like on slide 192).
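For reference, the general soft-margin primal with slack variables has the following standard form (the notation on slide 192 may differ slightly):

\min_{w,\,b,\,\xi}\ \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i\,(w^T x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n

Here C is the penalty parameter that trades off margin width against the total slack. For the XOR exercise you only need to instantiate this with the four (transformed) training tuples.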
March 3 No class: Spring Break  
March 10 Classification and Prediction Read relevant sections in chapter 6.

Homework exercises: 6.8

Create the ROC curve for the data on slide 258 without looking at the solution on slide 259.

What does the ROC curve look like for a classifier that only outputs either "+" or "-" for each test tuple?
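A small Java sketch of how the ROC points are generated: sort the test tuples by the predicted probability of the positive class and lower the threshold one tuple at a time. The scores and labels below are illustrative, not the data from slide 258.

import java.util.Arrays;
import java.util.Comparator;

public class RocSketch {
    public static void main(String[] args) {
        // Predicted probability of the positive class and true label (1 = "+", 0 = "-") per test tuple.
        double[] score = {0.95, 0.85, 0.78, 0.66, 0.60, 0.55, 0.43, 0.42, 0.41, 0.40};
        int[] label    = {1,    1,    0,    1,    1,    0,    1,    0,    0,    0};

        // Sort tuple indexes by decreasing score.
        Integer[] idx = new Integer[score.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> score[i]).reversed());

        int P = 0, N = 0;
        for (int y : label) { if (y == 1) P++; else N++; }

        int tp = 0, fp = 0;
        System.out.println("FPR\tTPR");
        for (int i : idx) {                  // lower the threshold past one tuple at a time
            if (label[i] == 1) tp++; else fp++;
            System.out.printf("%.2f\t%.2f%n", (double) fp / N, (double) tp / P);
        }
    }
}

Each printed (FPR, TPR) pair is one point of the ROC curve; connecting the points, starting from (0,0), gives the curve.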

Why are Gini and information gain not useful when determining the split attribute for a node in a regression tree?

Given a set of training tuples that reach the leaf of a decision (=classification) tree, how do you determine the value that should be returned for a test tuple reaching the leaf? How do you determine it for a regression tree? Propose two options for each tree type.

March 15   Project Milestone 2 due at 4pm
Resources: Milestone 2 description
March 17 Classification and Prediction; Frequent Patterns in Sets and Sequences Read relevant sections in chapters 5 and 6. If you are curious to learn more about Additive Groves, you can take a look at our paper.

Homework exercises: 5.1

What are the key similarities and differences between bagging and boosting?

If a classifier has an accuracy of 95%, does that mean it is a good classifier? Justify your answer.

If a classifier has an area under the ROC curve of 0.95, does that mean it is a good classifier? Justify your answer.

What is the "Apriori principle" of frequent itemset mining? How does it help reduce the cost of frequent itemset mining?

For the database of market-basket transactions on slide 5, manually (= using paper and pencil) run the Apriori algorithm to find all itemsets with support of 2 or greater. In particular, for every k, perform the self-join Lk*Lk to generate Ck+1 and then prune itemsets in Ck+1 by checking if all their subsets of length k are in Lk. For support counting, you do not have to use the hash-tree. Just count the support of the remaining candidates in Ck+1 directly in the database.
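The self-join and prune steps are the part that is easiest to get wrong when doing this by hand, so here is a small Java sketch of exactly those two steps (candidate generation only, no support counting). The example L_2 in main() is illustrative, not the result for the transactions on slide 5; itemsets are kept as lexicographically sorted lists.

import java.util.*;

public class AprioriCandidateGen {
    // Generate C_{k+1} from L_k: self-join L_k with itself, then prune.
    static List<List<String>> generate(List<List<String>> Lk) {
        Set<List<String>> LkSet = new HashSet<>(Lk);
        List<List<String>> candidates = new ArrayList<>();
        int k = Lk.get(0).size();
        for (List<String> a : Lk) {
            for (List<String> b : Lk) {
                // Self-join: first k-1 items identical, last item of a < last item of b.
                if (a.subList(0, k - 1).equals(b.subList(0, k - 1))
                        && a.get(k - 1).compareTo(b.get(k - 1)) < 0) {
                    List<String> c = new ArrayList<>(a);
                    c.add(b.get(k - 1));
                    // Prune: every k-subset of the candidate must itself be in L_k.
                    boolean allSubsetsFrequent = true;
                    for (int drop = 0; drop < c.size(); drop++) {
                        List<String> sub = new ArrayList<>(c);
                        sub.remove(drop);
                        if (!LkSet.contains(sub)) { allSubsetsFrequent = false; break; }
                    }
                    if (allSubsetsFrequent) candidates.add(c);
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<List<String>> L2 = Arrays.asList(
                Arrays.asList("A", "B"), Arrays.asList("A", "C"),
                Arrays.asList("B", "C"), Arrays.asList("B", "E"),
                Arrays.asList("C", "E"));
        System.out.println("C_3 = " + generate(L2));   // prints [[A, B, C], [B, C, E]]
    }
}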

March 24 Frequent Patterns in Sets and Sequences Read relevant sections in chapter 5; for more information also take a look at the Tan et al. book (book 1 below).

Homework exercises: 5.3, 5.13, 5.14 (read the discussion in the book about the lift measure first)
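As a quick reference for exercise 5.14, the lift of a rule A -> B is P(A and B) / (P(A)*P(B)); a value of 1 means A and B are independent. The small Java sketch below uses made-up counts and shows that a rule can have high confidence and still have lift below 1 (negative correlation).

public class LiftSketch {
    public static void main(String[] args) {
        int total   = 1000;   // number of transactions (illustrative counts)
        int countA  = 600;    // transactions containing A
        int countB  = 750;    // transactions containing B
        int countAB = 400;    // transactions containing both A and B

        double supA  = (double) countA  / total;
        double supB  = (double) countB  / total;
        double supAB = (double) countAB / total;

        double confidence = supAB / supA;       // P(B | A), about 0.67
        double lift = supAB / (supA * supB);    // about 0.89 < 1: A and B are negatively correlated
        System.out.println("confidence = " + confidence + ", lift = " + lift);
    }
}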

Which task of the Apriori algorithm is improved by using the hash tree?

Why is FP-Growth often more efficient than Apriori?

What are maximal and closed frequent itemsets and why do we care about them?

For the table on slide 51 and min_sup=2, do the following without looking further at slides 51 and 52: (1) create the itemset lattice, (2) annotate each lattice node with the IDs of the transactions supporting the itemset in that node, and (3) identify all frequent, closed, and maximal frequent itemsets in the lattice.

Go step-by-step through the example on slide 79, creating the C_i and L_i sets from scratch. Prune candidates in C_i also by using the anti-monotonicity of SUM (assuming all values are non-negative). Now do the same for the constraint AVG(S.price)<3. Notice that AVG is not anti-monotone. What is different compared to the example with the constraint SUM(S.price)<5?
March 31 Frequent Patterns in Sets and Sequences; Cluster Analysis Read relevant sections in chapters 5, 8.3, and 7. If you are interested in some recent work that deals with a common sequence mining challenge--bursts of common events--take a look here.

Homework: 7.1, 7.2, 7.3

Starting with the 51 length-2 candidates on slide 96 and assuming min_sup=2, do the following: (1) find all frequent length-2 sequences (L_2), then (2) create the set of all length-3 candidate sequences (C_3), and (3) prune all those length-3 candidate sequences that contain a length-2 subsequence that is not in L_2.

Run PrefixSpan for the example data on slide 100 with min_sup = 2. Try to find all frequent sub-sequences that start with "a" and all those that start with "b". How is PrefixSpan's way of exploring sub-sequences different from GSP? Think about differences and similarities between Apriori versus GSP and FP-growth versus PrefixSpan.
April 1   Project Milestone 3 due at 4pm
Resources: Milestone 3 description
April 7 Cluster Analysis

Comments about project milestones and quizzes
Read relevant sections in chapter 7. For additional material, also take a look at the Tan et al. book (book 1 below).

Homework: 7.6, 7.7

In class we discussed how the proximity matrix for hierarchical agglomerative clustering needs to be updated after each step (slides 67 to 69). The naive way of doing this would be to compute the distances between (C2 UNION C5) and the other clusters from scratch based on the individual data points in the clusters. Propose a more efficient update algorithm that can compute the new proximity matrix by only using the entries from the current proximity matrix, without having to access the individual data points in the clusters. Do this for the following similarity metrics: MIN, MAX, Group Average. If the computation is not possible based on the current proximity matrix alone, could you make it possible by adding a little extra information about each cluster?
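One possible answer, as a sketch (check it against the definitions on slides 67 to 69): when C_i and C_j are merged, their distance to any other cluster C_k can be computed from the old entries d(C_i, C_k) and d(C_j, C_k) alone for MIN and MAX; for Group Average you additionally need the cluster sizes, which is the "little extra information" hinted at above. In Java:

public class ProximityUpdate {
    // dIK = d(C_i, C_k), dJK = d(C_j, C_k); ni and nj are the sizes of C_i and C_j.
    static double mergedDistance(String metric, double dIK, double dJK, int ni, int nj) {
        switch (metric) {
            case "MIN": return Math.min(dIK, dJK);                   // single link
            case "MAX": return Math.max(dIK, dJK);                   // complete link
            case "AVG": return (ni * dIK + nj * dJK) / (ni + nj);    // group average (size-weighted)
            default: throw new IllegalArgumentException("unknown metric: " + metric);
        }
    }

    public static void main(String[] args) {
        // Illustrative numbers: d(C_i, C_k) = 0.4, d(C_j, C_k) = 0.7, |C_i| = 3, |C_j| = 2.
        System.out.println(mergedDistance("MIN", 0.4, 0.7, 3, 2));   // 0.4
        System.out.println(mergedDistance("MAX", 0.4, 0.7, 3, 2));   // 0.7
        System.out.println(mergedDistance("AVG", 0.4, 0.7, 3, 2));   // 0.52
    }
}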

Choose a few data points in 2-D space and then run the hierarchical agglomerative clustering algorithm on these points, creating the dendrogram. Try to find examples that illustrate how using the MIN and MAX metric results in different clusterings.

BIRCH first creates a rough hierarchical partitioning of the data, then runs a clustering algorithm of choice to refine the partitions. What is the motivation for this approach? Try to give more than one argument.
April 14 Cluster Analysis Read relevant sections in chapter 7. For additional material, also take a look at the Tan et al. book (book 1 below).

Homework: 7.8, 7.10, 7.15 (except CLARA, ROCK)

Explain the reason for the clustering results on slides 116 and 119 (DBSCAN problems).

What is the relationship between the Apriori algorithm for frequent itemset mining and the CLIQUE algorithm?

What is a similarity matrix, and how does it help us determine whether a clustering is good or not? How is this different from using the correlation between the similarity matrix and the cluster incidence matrix?
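The following Java sketch illustrates the correlation idea mentioned in the last question: build the "ideal" incidence matrix (entry 1 if two points are in the same cluster, 0 otherwise) and correlate its entries with the corresponding similarities. The similarity matrix and cluster assignment below are tiny illustrative examples.

public class ClusterValiditySketch {
    // Pearson correlation of two equally long vectors.
    static double pearson(double[] a, double[] b) {
        int n = a.length;
        double ma = 0, mb = 0;
        for (int i = 0; i < n; i++) { ma += a[i]; mb += b[i]; }
        ma /= n; mb /= n;
        double cov = 0, va = 0, vb = 0;
        for (int i = 0; i < n; i++) {
            cov += (a[i] - ma) * (b[i] - mb);
            va  += (a[i] - ma) * (a[i] - ma);
            vb  += (b[i] - mb) * (b[i] - mb);
        }
        return cov / Math.sqrt(va * vb);
    }

    public static void main(String[] args) {
        double[][] sim = {              // pairwise similarities of 4 points (illustrative)
            {1.0, 0.9, 0.2, 0.1},
            {0.9, 1.0, 0.3, 0.2},
            {0.2, 0.3, 1.0, 0.8},
            {0.1, 0.2, 0.8, 1.0}};
        int[] cluster = {0, 0, 1, 1};   // cluster assignment of each point

        // Flatten the upper triangles of the similarity and incidence matrices.
        java.util.List<Double> s = new java.util.ArrayList<>();
        java.util.List<Double> inc = new java.util.ArrayList<>();
        for (int i = 0; i < sim.length; i++) {
            for (int j = i + 1; j < sim.length; j++) {
                s.add(sim[i][j]);
                inc.add(cluster[i] == cluster[j] ? 1.0 : 0.0);
            }
        }
        double[] sa = s.stream().mapToDouble(Double::doubleValue).toArray();
        double[] ia = inc.stream().mapToDouble(Double::doubleValue).toArray();

        // A correlation close to 1 indicates that pairs within the same cluster
        // are indeed the most similar pairs, i.e., the clustering fits the data well.
        System.out.println("correlation = " + pearson(sa, ia));
    }
}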
April 12   Preliminary Milestone 4 predictions for Test data 1 due at 11:59pm
April 16   New Preliminary Milestone 4 predictions for Test data 1 due at 11:59pm
April 19   Final project report due at 11:59pm
Resources: Milestone 4 description, data description, Training data, Test data 1, Test data 2
April 21 Cluster Analysis; Course Summary Read relevant sections in chapter 7. For additional material, also take a look at the Tan et al. book (book 1 below).

Study for the final exam.
April 28 Final Exam 6-8pm in 108 WVH

Results for Milestone 4 (ACC = prediction accuracy in percent)

Team Name April 12 ACC   Team Name April 16 ACC   Team Name April 19 ACC   Team Name Test 2 ACC
T. N. 79.45   Team A 79.95   J57Klassy-Fire 81.54   J57Klassy-Fire 80.39
JZ 77.79   T. N. 79.60   Team A 79.95   P. T. and D. N. 79.59
Team A 77.74   J57Klassy-Fire 79.11   VOTERs 79.85   M. E. 78.67
J57Klassy-Fire 77.15   X. C. and X. C. 78.87   T. N. 79.69   Kone 78.46
Paras 76.53   P. T. and D. N. 78.81   JZ 79.29   Team A 78.15
X. C. and X. C. 76.30   JZ 78.81   M. E. 79.23   X. C. and X. C. 77.90
P. T. and D. N. 75.61   Kone 78.22   X. C. and X. C. 79.08   T. N. 77.00
Kone 75.48   A. P. 77.95   P. T. and D. N. 78.81   N. W. and P. G. 76.91
M. E. 74.16   VOTERs 77.28   Kone 78.75   VOTERs 76.52
A. P. 74.03   DK 77.12   A. P. 77.95   A. P. 75.34
V. P. 73.11   N. W. and P. G. 77.00   Paras 77.30   JZ 75.15
Spring 70.77   M. E. 76.61   DK 77.12   DK 73.97
N. W. and P. G. 70.39   Paras 76.53   N. W. and P. G. 77.00   Spring 73.59
VOTERs 68.72   V. P. 75.29   Spring 76.60   Paras 73.45
DK 67.56   Spring 75.23   V. P. 75.29   V. P. 70.57

Course Information

Instructor: Mirek Riedewald

TA: Bahar Qarabaqi

Meeting times: Wed 6 - 9 PM
Meeting location: 108 WVH

Prerequisites

CS 5800 or CS 7800, or consent of instructor

Grading

Textbook

Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann, 2006

Recommended books for further reading:

  1. "Data Mining" by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (http://www-users.cs.umn.edu/~kumar/dmbook/index.php)
  2. "Machine Learning" by Tom Mitchell (http://www.cs.cmu.edu/~tom/mlbook.html)
  3. "Introduction to Machine Learning" by Ethem ALPAYDIN (http://www.cmpe.boun.edu.tr/~ethem/i2ml/)
  4. "Pattern Classification" by Richard O. Duda, Peter E. Hart, David G. Stork (http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471056693.html)
  5. "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (http://www-stat.stanford.edu/~tibs/ElemStatLearn/)

Academic Integrity Policy

A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.

For more information, please refer to the Academic Integrity Web page.