CS 6220: Data Mining Techniques

This course covers various aspects of data mining, including OLAP technology, classification, ensemble methods, association rules, sequence mining, and cluster analysis. The class project provides hands-on practice in mining useful knowledge from a large database.


News

[12/04/2009] Slides from December 3 lecture posted
[12/04/2009] Quiz 6 solution is now available
[11/20/2009] Slides from November 19 lecture and new homework posted
[11/20/2009] Quiz 5 solution is now available
[11/13/2009] Slides from November 12 lecture and new homework posted
[11/08/2009] Quiz 4 solution is now available
[11/08/2009] Slides from November 5 lecture and new homework posted
[11/02/2009] Slides from October 29 lecture and new homework posted
[10/27/2009] Project Milestone 1 requirements posted
[10/24/2009] Quiz 3 solution is now available
[10/24/2009] Slides from October 22 lecture and new homework posted
[10/19/2009] New homework posted
[10/16/2009] Slides from October 15 lecture posted
[10/09/2009] Quiz 2 solution is now available
[10/09/2009] Slides from October 8 lecture and new homework posted
[10/09/2009] Graded Quiz 2 can be picked up from instructor's office
[10/07/2009] Project description document posted
[10/05/2009] Quiz 1 solution is now available
[10/02/2009] Find a teammate for the class project ASAP. Email the TA and the instructor once you have formed a team or if you need help.
[10/02/2009] Slides from October 1 lecture and new homework posted
[10/01/2009] Graded Quiz 1 can be picked up from instructor's office
[09/28/2009] Slides from September 24 lecture and new homework posted
[09/18/2009] Slides from September 17 lecture and new homework posted
[09/16/2009] Room change: Starting September 17, lectures will take place in WVH 108.
[09/14/2009] Slides from September 10 lecture posted


Lectures

(Future lectures and events are tentative.)

Date | Topic | Remarks and Homework
September 10 | Introduction; Data Preprocessing | Read chapters 1 and 2 in the book.
September 17 | Data Preprocessing (cont.); Classification and Prediction | Read relevant sections in chapter 6.

Homework exercises: 2.1, 2.4, 2.6, 2.7, 2.9(c,e), 2.13(a,b), 2.14, 6.1, 6.11(a,b)

September 24 | Classification and Prediction | Read relevant sections in chapter 6.

Homework exercises: 6.2, 6.5; also work thoroughly through the Naive Bayes example on slides 76/77
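
If you want to check your understanding of the Naive Bayes computation, below is a minimal Python sketch on a small hypothetical categorical data set (the actual example lives on slides 76/77, which are not reproduced here). It estimates the prior P(C) and the class-conditional probabilities P(attribute = value | C) from counts and multiplies them to score a test tuple; zero counts would normally be handled with a Laplacian correction, as discussed in the book.

    from collections import Counter, defaultdict

    # Hypothetical training tuples (attribute1, attribute2, class label);
    # this is NOT the data from slides 76/77.
    train = [
        ("sunny", "false", "no"), ("sunny", "true", "no"),
        ("overcast", "false", "yes"), ("rain", "false", "yes"),
        ("rain", "true", "no"), ("overcast", "true", "yes"),
    ]

    priors = Counter(label for *_, label in train)      # class counts
    cond = defaultdict(int)                              # (attr index, value, class) counts
    for *attrs, label in train:
        for i, v in enumerate(attrs):
            cond[(i, v, label)] += 1

    def naive_bayes_score(attrs, label):
        # P(label) * product over attributes of P(attr_i = v_i | label)
        score = priors[label] / len(train)
        for i, v in enumerate(attrs):
            score *= cond[(i, v, label)] / priors[label]
        return score

    test = ("rain", "true")
    for label in priors:
        print(label, naive_bayes_score(test, label))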

October 1 | Classification and Prediction | Read relevant sections in chapter 6.

Homework exercises: 6.3

Practice inference with the Bayesian network on slides 107 and 108 for different examples, e.g., P(S | R and ~T). Compute P(M and L and R and S and T).

For the function (X1 AND X2), go through a few iterations of the perceptron training algorithm on slide 132. Use the training data set {(1,1,1), (1,-1,-1), (-1,1,-1), (-1,-1,-1)} (each tuple encodes (X1, X2, Y), where Y is the target output), learning rate eta = 0.15, and initial weights w0 = -1, w1 = -1, and w2 = 1. Remember that you have to apply the sign function to the linear function to get a prediction of +1 or -1 (see slide 129). Visualize what happens by plotting the points and the decision boundary in a 2-dimensional coordinate system as on slide 130 (X1 on the horizontal axis, X2 on the vertical axis). Plot the decision boundary every time you have completely processed a training tuple. Remember that you might have to iterate through the training data multiple times.

Now try training a perceptron for the same setup, except that you are now using function (X1 XOR X2) with training tuples {(1,1,-1), (1,-1,1), (-1, 1, 1), (-1, -1, -1)}. What happens?
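
If you would like to check your manual iterations, here is a small Python sketch of the training loop, assuming the standard perceptron rule w_i <- w_i + eta*(y - y_hat)*x_i with a sign activation (the exact notation on slides 129-132 may differ slightly). With the AND data it converges after a few epochs; with the XOR data it keeps cycling because the classes are not linearly separable.

    # Perceptron training sketch for the exercises above. Assumes the standard
    # update w_i <- w_i + eta * (y - y_hat) * x_i with sign activation.

    def sign(z):
        return 1 if z >= 0 else -1

    def train_perceptron(data, w, eta, max_epochs=10):
        for epoch in range(max_epochs):
            errors = 0
            for x1, x2, y in data:
                y_hat = sign(w[0] + w[1] * x1 + w[2] * x2)   # w0 is the bias weight
                if y_hat != y:
                    w[0] += eta * (y - y_hat)                # bias input is 1
                    w[1] += eta * (y - y_hat) * x1
                    w[2] += eta * (y - y_hat) * x2
                    errors += 1
                print(f"epoch {epoch}, after ({x1},{x2},{y}): w = {w}")
            if errors == 0:                                  # all tuples classified correctly
                return w
        return w

    and_data = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, -1)]
    xor_data = [(1, 1, -1), (1, -1, 1), (-1, 1, 1), (-1, -1, -1)]
    print(train_perceptron(and_data, [-1.0, -1.0, 1.0], eta=0.15))   # converges
    print(train_perceptron(xor_data, [-1.0, -1.0, 1.0], eta=0.15))   # never converges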

October 8 | Classification and Prediction | Read relevant sections in chapter 6. Take a look at the other recommended books to find out more about neural networks and SVMs.

Homework exercises: 6.11(c,d,e)

For the (X1 AND X2) and for the (X1 XOR X2) problem above, apply a few iterations of the epoch updating algorithm (slides 133-135). Now do the same for the case updating algorithm. Observe how they differ from each other and how the case updating algorithm is different from the perceptron training algorithm we used in the last homework.
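
Slides 133-135 are not reproduced here, so the sketch below assumes the textbook's definitions: case updating applies the weight change right after each training tuple, while epoch updating accumulates the changes over all tuples and applies them once at the end of the epoch.

    # Epoch vs. case updating sketch, assuming the textbook's definitions
    # (case: update after every tuple; epoch: accumulate and update once per epoch).

    def sign(z):
        return 1 if z >= 0 else -1

    def one_epoch(data, w, eta, mode):
        acc = [0.0, 0.0, 0.0]                      # accumulated changes (epoch mode only)
        for x1, x2, y in data:
            x = (1, x1, x2)                        # leading 1 feeds the bias weight w0
            y_hat = sign(sum(wi * xi for wi, xi in zip(w, x)))
            delta = [eta * (y - y_hat) * xi for xi in x]
            if mode == "case":
                w = [wi + di for wi, di in zip(w, delta)]
            else:
                acc = [ai + di for ai, di in zip(acc, delta)]
        if mode == "epoch":
            w = [wi + ai for wi, ai in zip(w, acc)]
        return w

    and_data = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, -1)]
    w_case, w_epoch = [-1.0, -1.0, 1.0], [-1.0, -1.0, 1.0]
    for _ in range(3):
        w_case = one_epoch(and_data, w_case, 0.15, "case")
        w_epoch = one_epoch(and_data, w_epoch, 0.15, "epoch")
        print("case:", w_case, "  epoch:", w_epoch)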

In class we discussed how to make a set of tuples (x,y), where x is the input attribute and y is the output, linearly separable by using a simple quadratic basis function (slides 190-193). Now consider the (X1 XOR X2) example above. Propose a transformation phi(X1, X2) that uses one or more polynomial basis functions (i.e., adds attributes that are computed as polynomials over X1 and X2) to make the four training tuples linearly separable. Try to find the simplest possible transformation, e.g., see if you can do it by adding a single new dimension. Recall that for linear separability you need to be able to separate the tuples of class 1 from those of class -1 with a hyperplane. Which hyperplane would work for your transformation? Can you find the maximum-margin hyperplane for your solution?
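
The in-class 1-D example is on slides 190-193 and is not reproduced here, but the following sketch illustrates the same idea on hypothetical 1-D data whose class depends on the magnitude of x: no threshold on x alone separates the classes, yet after adding the quadratic basis function x^2 as a second dimension, the hyperplane x^2 = 2 does. The same recipe, with a suitable polynomial over X1 and X2, applies to the XOR tuples.

    # Quadratic-basis-function illustration on hypothetical 1-D data
    # (the actual in-class example is on slides 190-193).

    data = [(-2.0, 1), (-0.5, -1), (0.5, -1), (2.0, 1)]    # (x, class)

    def phi(x):
        return (x, x * x)                                   # add x^2 as a new dimension

    for x, y in data:
        z = phi(x)
        predicted = 1 if z[1] > 2 else -1                   # hyperplane x^2 = 2 in the new space
        print(f"x = {x:4.1f}   phi(x) = {z}   class = {y:2d}   correct: {predicted == y}")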

October 15 | Classification and Prediction | Read relevant sections in chapter 6. Take a look at the other recommended books to find out more about SVMs.

Homework exercises: 6.8

Create the ROC curve for the data on slide 248 without looking at the solution on slide 249.
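
The data for this exercise is on slide 248 and is not reproduced here, but the general procedure can be sketched on hypothetical (probability, true class) pairs: sort by decreasing score and sweep the threshold downwards, recording the true positive rate and false positive rate after each example.

    # ROC-curve construction on hypothetical (score, true_class) pairs.
    examples = [(0.95, 1), (0.85, 1), (0.78, 0), (0.66, 1), (0.60, 0),
                (0.55, 1), (0.53, 0), (0.43, 0), (0.42, 1), (0.25, 0)]

    P = sum(1 for _, y in examples if y == 1)        # total positives
    N = len(examples) - P                            # total negatives

    tp = fp = 0
    roc = [(0.0, 0.0)]                               # list of (FPR, TPR) points
    for score, y in sorted(examples, reverse=True):  # threshold sweeps from high to low
        if y == 1:
            tp += 1
        else:
            fp += 1
        roc.append((fp / N, tp / P))
    print(roc)                                       # plot FPR on the x-axis, TPR on the y-axis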

Why are Gini and information gain not useful when determining the split attribute for a node in a regression tree?

Given a set of training tuples that reach the leaf of a decision (=classification) tree, how do you determine the value that should be returned for a test tuple reaching the leaf? How do you determine it for a regression tree? Propose two options for each tree type.
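
As a hint, here is one common choice per tree type, derived from the training tuples reaching the leaf (the exercise asks for two options each; further possibilities include returning a class probability distribution or the median).

    from collections import Counter
    from statistics import mean, median

    leaf_classes = ["yes", "no", "yes", "yes"]            # labels reaching a classification leaf
    print(Counter(leaf_classes).most_common(1)[0][0])     # majority vote -> "yes"

    leaf_targets = [12.0, 15.5, 14.0, 80.0]               # numeric targets reaching a regression leaf
    print(mean(leaf_targets), median(leaf_targets))       # mean is standard; median is more robust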

How do you measure the distance between two tuples that contain categorical attribute values and attributes with diverse ranges? For example, how can you compute reasonable distances for (salary, age, jobTitle)-tuples (30K, 25, Sales), (40K, 60, Marketing), and (50K, 30, Sales)? Assume that salaries range from 10K to 100K and age from 0 to 100.
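
One reasonable (not the only) answer is a Gower-style dissimilarity: min-max normalize each numeric attribute using the stated ranges, use a 0/1 mismatch indicator for the categorical attribute, and average the per-attribute distances. A sketch for the three example tuples:

    # Mixed-attribute distance for (salary, age, jobTitle) tuples:
    # normalize numeric attributes to [0,1], 0/1 mismatch for the categorical one,
    # then average the three per-attribute distances.

    SAL_MIN, SAL_MAX = 10_000, 100_000
    AGE_MIN, AGE_MAX = 0, 100

    def dist(t1, t2):
        sal = abs(t1[0] - t2[0]) / (SAL_MAX - SAL_MIN)
        age = abs(t1[1] - t2[1]) / (AGE_MAX - AGE_MIN)
        job = 0.0 if t1[2] == t2[2] else 1.0
        return (sal + age + job) / 3

    a = (30_000, 25, "Sales")
    b = (40_000, 60, "Marketing")
    c = (50_000, 30, "Sales")
    print(dist(a, b), dist(a, c), dist(b, c))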

October 22 | Ensemble Methods (end of classification and prediction series); Frequent Patterns in Sets and Sequences | Read relevant sections in chapters 5 and 6. If you are curious to learn more about Additive Groves, you can take a look at our paper.

Homework exercises: 5.1

What are the key similarities and differences between bagging and boosting?

If a classifier has an accuracy of 95%, does that mean it is a good classifier? Justify your answer.

If a classifier has an area under the ROC curve of 0.95, does that mean it is a good classifier? Justify your answer.

What is the "Apriori principle" of frequent itemset mining? How does it help reduce the cost of frequent itemset mining?

For the database of market-basket transactions on slide 5, manually (= using paper and pencil) run the Apriori algorithm to find all itemsets with support of 2 or greater. In particular, for every k, perform the self-join L_k * L_k to generate C_{k+1} and then prune itemsets in C_{k+1} by checking if all their subsets of length k are in L_k. For support counting, you do not have to use the hash tree. Just count the support of the remaining candidates in C_{k+1} directly in the database.
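
If you want to verify your paper-and-pencil run, here is a compact Python sketch of the same join-and-prune loop. The transactions below are hypothetical, since the slide 5 database is not reproduced here, and the self-join is written in a simplified form (union any two k-itemsets whose union has k+1 items), which produces the same frequent itemsets as the textbook's sorted-prefix join.

    from itertools import combinations

    # Hypothetical market-basket transactions (the exercise uses the data on slide 5).
    transactions = [{"bread", "milk"},
                    {"bread", "diapers", "beer", "eggs"},
                    {"milk", "diapers", "beer", "cola"},
                    {"bread", "milk", "diapers", "beer"},
                    {"bread", "milk", "diapers", "cola"}]
    MIN_SUP = 2

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # L_1: frequent 1-itemsets
    Lk = {frozenset([i]) for t in transactions for i in t if support(frozenset([i])) >= MIN_SUP}
    k = 1
    while Lk:
        print(f"L_{k}:", sorted(sorted(s) for s in Lk))
        # Self-join (simplified): combine two k-itemsets if their union has k+1 items
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune: every length-k subset of a candidate must itself be in L_k
        Ck1 = {c for c in Ck1 if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count support of the surviving candidates directly in the database
        Lk = {c for c in Ck1 if support(c) >= MIN_SUP}
        k += 1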

October 29 | Frequent Patterns in Sets and Sequences | Read relevant sections in chapter 5.

Homework exercises: 5.3, 5.13, 5.14 (read the discussion about the lift measure for 5.14)

How can the hash tree help speed up the Apriori algorithm?

Why is FP-Growth often more efficient than Apriori?

What are maximal and closed frequent itemsets, and why do we care about them? For the table on slide 51 and min_sup=2, do the following without looking further at slides 51 and 52: (1) create the itemset lattice, (2) annotate each lattice node with the IDs of the transactions supporting the itemset in that node, and (3) identify all frequent, closed frequent, and maximal frequent itemsets in the lattice.
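
To check part (3) of your answer, here is a small sketch that, given all frequent itemsets and their supports (hypothetical numbers below, not the slide 51 table), identifies the closed ones (no proper superset with the same support) and the maximal ones (no frequent proper superset).

    # Identify closed and maximal frequent itemsets from (itemset -> support);
    # the supports below are hypothetical, not those of the slide 51 table.
    frequent = {
        frozenset("a"): 4, frozenset("b"): 3, frozenset("c"): 3,
        frozenset("ab"): 3, frozenset("ac"): 2, frozenset("bc"): 2,
        frozenset("abc"): 2,
    }

    def proper_supersets(s):
        return [t for t in frequent if s < t]

    closed = [s for s in frequent
              if all(frequent[t] < frequent[s] for t in proper_supersets(s))]
    maximal = [s for s in frequent if not proper_supersets(s)]

    print("closed: ", [sorted(s) for s in closed])    # no superset with the same support
    print("maximal:", [sorted(s) for s in maximal])   # no frequent proper superset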

Go step-by-step through the examples on slides 78 and 79, creating the C_i and L_i sets from scratch. Now do the same for constraint AVG(S.price)<3. What is different compared to the example with constraint SUM(S.price)<5?

Starting with the 51 length-2 candidates on slide 96 and assuming min_sup=2, do the following: (1) find all frequent length-2 sequences (L_2), then (2) create the set of all length-3 candidate sequences (C_3), and (3) prune all those length-3 candidate sequences that contain a length-2 subsequence that is not in L_2.

Run PrefixSpan for the example data on slide 100 with min_sup = 2. Try to find all frequent sub-sequences that start with "a" and all those that start with "b". How is PrefixSpan's way of exploring sub-sequences different from GSP's? Think about similarities and differences between Apriori and GSP, and between FP-growth and PrefixSpan.

November 5 | Frequent Patterns in Sets and Sequences (cont.); Cluster Analysis | Read relevant sections in chapters 5 and 7, as well as section 8.3. You can find out more about the algorithm for mining bursty sequences here.

Homework: 7.1, 7.2, 7.3

Event bursts can lead to high sequence mining cost and irrelevant results. Describe in a few sentences the main idea of the algorithm we discussed: How does it reduce mining cost and also eliminate irrelevant results? Does it guarantee that all relevant results will be found?

November 9 | | Milestone 1 report due at 11:59pm
November 12 | Cluster Analysis | Read relevant sections in chapter 7.

Homework: 7.6, 7.7

In class we discussed how one can compute the proximity-matrix entry for the MIN distance between (C2 union C5) and C1 from the entries for the MIN distance between C2 and C1 and the MIN distance between C5 and C1 (see slides 66-68). Show formally that this is also possible for the MAX distance. Can you also do this for Group Average and Distance Between Centroids? If not, would it be possible by storing some auxiliary information for each entry in the proximity matrix?
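
The following is only a numeric sanity check, not the formal argument the exercise asks for: for the MAX distance the merged entry equals the maximum of the two stored entries, and for Group Average it can be recovered if the cluster sizes are kept as auxiliary information (the points below are hypothetical 1-D values).

    # Numeric sanity check (not a proof) for the proximity-matrix update exercise.
    from itertools import product

    C1, C2, C5 = [0.0, 1.0], [5.0, 7.0], [10.0, 11.0, 13.0]   # hypothetical 1-D clusters

    def d(a, b):
        return abs(a - b)

    def d_max(A, B):                      # MAX (complete link) distance
        return max(d(a, b) for a, b in product(A, B))

    def d_avg(A, B):                      # Group Average distance
        return sum(d(a, b) for a, b in product(A, B)) / (len(A) * len(B))

    merged = C2 + C5
    print(d_max(merged, C1), "==", max(d_max(C2, C1), d_max(C5, C1)))
    lhs = d_avg(merged, C1)
    rhs = (len(C2) * d_avg(C2, C1) + len(C5) * d_avg(C5, C1)) / (len(C2) + len(C5))
    print(lhs, "==", rhs)                 # works because the sizes |C2| and |C5| are known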

For the example on slide 96, compute link({a,b,e}, {a,f,g}) and link({a,b,e}, {b,d,e}). Compare this to their Jaccard similarity scores, i.e., Jaccard({a,b,e}, {a,f,g}) and Jaccard({a,b,e}, {b,d,e}).
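
The Jaccard part can be checked directly; link() additionally requires the neighbor threshold used on slide 96, which is not reproduced here.

    # Jaccard similarity check for the two pairs in the exercise.
    def jaccard(A, B):
        return len(A & B) / len(A | B)

    print(jaccard({"a", "b", "e"}, {"a", "f", "g"}))   # |{a}| / |{a,b,e,f,g}| = 1/5
    print(jaccard({"a", "b", "e"}, {"b", "d", "e"}))   # |{b,e}| / |{a,b,d,e}| = 2/4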

November 19 | Cluster Analysis | Read relevant sections in chapter 7.

Homework: 7.8, 7.10, 7.15 (except CLARA)

Explain the reason for the clustering results on slides 115 and 118 (DBSCAN problems).

What is the effect of increasing or decreasing the values of the parameters sigma and xi for DENCLUE with a Gaussian influence function?
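
To build intuition for this question, here is a sketch of the density estimate with a Gaussian influence function, f(x) = sum_i exp(-d(x, x_i)^2 / (2 sigma^2)), on hypothetical 1-D data; a larger sigma smooths and flattens the density (merging attractors), while xi is the threshold below which density attractors and the points attracted to them are discarded as outliers.

    import math

    # DENCLUE-style density with a Gaussian influence function on hypothetical 1-D data.
    data = [1.0, 1.2, 1.4, 5.0, 5.1, 9.0]

    def density(x, sigma):
        return sum(math.exp(-((x - p) ** 2) / (2 * sigma ** 2)) for p in data)

    for sigma in (0.3, 1.0, 3.0):
        values = [round(density(x, sigma), 2) for x in (1.2, 5.05, 7.0, 9.0)]
        print(f"sigma = {sigma}:", values)   # larger sigma -> smoother, flatter density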

November 26 | No class: Thanksgiving |
December 3 | Cluster Analysis (cont.); Data Warehousing and OLAP | Read relevant sections in chapters 7 and 3.
December 7 | | Final project report due at 11:59pm
December 10 | OLAP (cont.); Course Summary |
December 17 | Final Exam |

Course Information

Instructor: Mirek Riedewald

TA: Alper Okcan

Meeting times: Thu 6 - 9 PM
Meeting location: WVH 108

Prerequisites

CS 5800 or CS 7800, or consent of instructor

Grading

Textbook

Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann, 2006

Recommended books for further reading:

  1. "Data Mining" by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (http://www-users.cs.umn.edu/~kumar/dmbook/index.php)
  2. "Machine Learning" by Tom Mitchell (http://www.cs.cmu.edu/~tom/mlbook.html)
  3. "Introduction to Machine Learning" by Ethem ALPAYDIN (http://www.cmpe.boun.edu.tr/~ethem/i2ml/)
  4. "Pattern Classification" by Richard O. Duda, Peter E. Hart, David G. Stork (http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471056693.html)
  5. "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (http://www-stat.stanford.edu/~tibs/ElemStatLearn/)

Academic Integrity Policy

A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.

For more information, please refer to the Academic Integrity Web page.