This course covers various aspects of data mining, including data preprocessing, classification, ensemble methods, association rules, sequence mining, and cluster analysis. The class project involves hands-on practice of mining useful knowledge from a large data set.
[04/22/2010] Slides from April 21 lecture and new homework posted
[04/20/2010] Final Milestone 4 results are now available
[04/19/2010] Test set 2 for Milestone 4 is now available
[04/18/2010] New Milestone 4 results on Test data 1 posted
[04/18/2010] Quiz 6 solution is now available
[04/16/2010] Slides from April 14 lecture and new homework posted
[04/13/2010] Current Milestone 4 results on Test data 1 posted
[04/12/2010] Slides with comments about project milestones and quizzes posted
[04/08/2010] Slides from April 7 lecture and new homework posted
[04/07/2010] Milestone 4 training and first test data set are now posted
[04/06/2010] Project Milestone 4 and data description document posted
[04/02/2010] Slides from March 31 lecture and new homework posted
[04/02/2010] Quiz 5 solution is now available
[03/26/2010] Slides from March 24 lecture and new homework posted
[03/19/2010] Slides from March 17 lecture and new homework posted
[03/19/2010] Project Milestone 3 posted
[03/12/2010] Quiz 4 solution is now available
[03/12/2010] Slides from March 10 lecture and new homework posted
[03/09/2010] Quiz 3 solution is now available
[02/28/2010] Project Milestone 2 posted
[02/26/2010] Slides from February 24 lecture and new homework posted.
[02/22/2010] Quiz 2 solution is now available.
[02/18/2010] Slides from February 17 lecture and new homework posted
[02/07/2010] Project Milestone 1 and Java code for creating plots posted
[02/05/2010] Quiz 1 solution is now available
[02/05/2010] Slides from February 3 lecture and new homework posted
[01/29/2010] Slides from January 27 lecture and new homework posted
[01/26/2010] Homework for January 13 and 20 lectures posted
[01/25/2010] ***Room change***; starting this week, lectures will take place in 108 WVH
[01/21/2010] Slides from January 20 lecture posted
[01/20/2010] Office hours are now determined
[01/15/2010] Slides from January 13 lecture posted
(Future lectures and events are tentative.)
Date  Topic  Remarks and Homework 
January 13  Introduction; Data Preprocessing  Read chapters 1 and 2 in the book. 
January 20  Data Preprocessing (cont.); Classification and Prediction  Read relevant sections in chapter 6. Homework exercises: 2.2, 2.4(a,b,d,f,g), 2.6, 2.7(a), 2.9, 2.12, 2.13, 2.14, 6.1. For the example on slide 30, train the corresponding decision tree "by hand" using information gain for selecting a split attribute. Make sure you have at least one path of length 3 (root, inner node, leaf) or longer. 
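When training a tree by hand, it helps to check the entropy and information-gain arithmetic with a short script. Below is a minimal Python sketch; the toy dataset is made up for illustration (it is not the slide-30 example).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Information gain of splitting on the attribute at attr_index."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: attributes (outlook, windy), class label "play"
rows = [("sunny", "yes"), ("sunny", "no"), ("rain", "yes"), ("rain", "no")]
labels = ["no", "no", "yes", "yes"]
print(info_gain(rows, labels, 0))  # outlook separates classes perfectly
print(info_gain(rows, labels, 1))  # windy carries no information here
```

Compute the gain for every candidate attribute at a node and split on the largest value, exactly as in the manual procedure.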
January 27  Classification and Prediction  Read relevant sections in chapter 6. For the statistical decision theory part, take a look at the Hastie et al. book (book 5 below). However, it does not show the derivation. Homework exercises: 6.2 
February 3  Classification and Prediction  Read relevant sections in chapter 6. For Bayesian networks, the book by Alpaydin (book 3 below) has a little more information than the textbook. Homework exercises: 6.5 Compute P(buys_computer = 'yes' | X) and P(buys_computer = 'no' | X) for the data sample X and the training data given on slide 84. Do this without looking at the solution on slide 85. Once you are done, compare your results. For the same data on slide 84, estimate the following conditional probabilities: P(age > 40 AND income = 'low' | buys_computer = 'no') and P(age > 40 AND income = 'low' AND student = 'no' | buys_computer = 'no'). How much would you trust these estimates to be close to the true probabilities? Compute P(wealth = 'poor' | gender = 'female') and P(wealth = 'poor' | gender = 'male') for the example data on slide 102. Now compute P(wealth = 'rich' | gender = 'female'). What is the relationship between P(wealth = 'poor' | gender = 'female') and P(wealth = 'rich' | gender = 'female')? 
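Conditional probabilities like the ones above are just relative frequencies over the training data, which is easy to double-check with a few lines of Python. The mini data set below is hypothetical, not the slide-84 or slide-102 table.

```python
def cond_prob(data, cond, given):
    """Relative-frequency estimate of P(cond | given) from a list of
    attribute-value dicts. Returns None if the condition `given` never occurs."""
    matches_given = [t for t in data if all(t[k] == v for k, v in given.items())]
    if not matches_given:
        return None
    hits = [t for t in matches_given if all(t[k] == v for k, v in cond.items())]
    return len(hits) / len(matches_given)

# Hypothetical mini training set
data = [
    {"age": "<=30", "income": "high", "buys": "no"},
    {"age": "<=30", "income": "low",  "buys": "yes"},
    {"age": ">40",  "income": "low",  "buys": "yes"},
    {"age": ">40",  "income": "low",  "buys": "no"},
]
print(cond_prob(data, {"income": "low"}, {"buys": "yes"}))  # 1.0
print(cond_prob(data, {"buys": "no"}, {"age": ">40"}))      # 0.5
```

Note how quickly such estimates become untrustworthy: conditioning on several attributes at once leaves very few matching tuples, which is exactly the concern raised in the exercise.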
February 10  No class due to snow storm (campus closed after noon)  
February 17  Classification and Prediction  Read relevant sections in chapter 6. For understanding neural networks in more depth, Mitchell's Machine Learning book (book 2 below) is one of the best resources. Homework exercises: 6.3 Practice inference with the Bayesian network on slides 114 and 115 for different examples, e.g., P(S | R and ~T). Compute P(M and L and R and S and T). For function (X1 AND X2), go through a few iterations of the perceptron training algorithm on slide 140. Use training data set {(1,1,1), (1,-1,-1), (-1,1,-1), (-1,-1,-1)} (each tuple encodes (X1, X2, Y), where Y is the target output), learning rate eta=0.15, and initial weights w0 = 1, w1 = 1, and w2 = 1. Remember that you have to apply the sign function to the linear function to get a prediction of +1 or -1 (see slide 137). Visualize what happens by plotting the points and the decision boundary in a 2-dimensional coordinate system like on slide 138 (X1 on horizontal axis, X2 on vertical axis). Plot the decision boundary every time you have completely processed a single training tuple. Remember that you might have to iterate through the training data set multiple times. Consider writing a small program that computes the new weights. Now try training a perceptron for the same setup, except that you are now using function (X1 XOR X2) with training tuples {(1,1,-1), (1,-1,1), (-1,1,1), (-1,-1,-1)}. What happens? Now use the case-updating algorithm and gradient descent (slides 142, 143) for the same XOR data set. Remember that function f is now the unthresholded simple weighted sum. Just run a few iterations and see how this algorithm is different from using the perceptron training rule. Again, writing a small program that computes the new weights when processing a training tuple might be worthwhile, both to understand the algorithm and to avoid tedious manual computations. 
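The exercise suggests writing a small program for the weight updates. Here is one possible Python sketch of the perceptron training rule (w_i += eta * (y - prediction) * x_i), using the AND truth table in +/-1 encoding and the eta and initial weights from the exercise; the epoch limit and sign(0) = +1 convention are implementation choices, not from the slides.

```python
def sign(z):
    """Threshold the weighted sum; ties at 0 go to +1 (a chosen convention)."""
    return 1 if z >= 0 else -1

def train_perceptron(data, w, eta=0.15, epochs=10):
    """Perceptron training rule. Each tuple is (x1, x2, y);
    w = [w0, w1, w2], where w0 is the weight of the constant input 1."""
    for _ in range(epochs):
        updated = False
        for x1, x2, y in data:
            pred = sign(w[0] + w[1] * x1 + w[2] * x2)
            if pred != y:
                for i, xi in enumerate((1, x1, x2)):
                    w[i] += eta * (y - pred) * xi
                updated = True
        if not updated:  # whole pass without a misclassification: converged
            break
    return w

# AND function in +/-1 encoding, as in the exercise
and_data = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, -1)]
w = train_perceptron(and_data, [1.0, 1.0, 1.0])
print([sign(w[0] + w[1] * x1 + w[2] * x2) == y for x1, x2, y in and_data])
```

Swapping in the XOR tuples shows the expected failure: no pass through the data ever becomes update-free, because no line separates the two classes.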
February 19  Project Milestone 1 due at 4pm Resources: Milestone 1 description and plotter code 

February 24  Classification and Prediction  Read relevant sections in chapter 6. More in-depth information about SVMs can be found in Hastie et al. (book 5 below) and in a nice survey paper by Burges (or search the web for "tutorial on support vector machines for pattern recognition"). Notice how both use different notation to denote the same concepts. Homework: To appreciate why the optimization criterion of Quadratic Programming is indeed quadratic, do the following: Consider a vector u = (u_{1},u_{2},u_{3})^{T}, i.e., a vector with 3 rows and 1 column. Set q = 2, d = (1,2,3)^{T}, and let R be a matrix with 3 rows and 3 columns, such that each entry in the matrix is equal to 1. Now compute the expression q + d^{T}*u + u^{T}*R*u, where operator "*" is the standard matrix multiplication. In class we discussed how to make a set of tuples (x,y), where x is the input attribute and y is the output, linearly separable by using a simple quadratic basis function (slides 199-201). Now consider again the (X1 XOR X2) example from last week. Propose a transformation phi(X1, X2) that uses one or more polynomial basis functions (i.e., adds attributes that are computed as polynomials over X1 and X2) to make the four training tuples linearly separable. Try to find the simplest possible transformation, e.g., see if you can do it by adding a single new dimension. Recall that for linear separability you need to be able to separate the tuples of class +1 from those of class -1 with a hyperplane. Which hyperplane would work for your transformation? Can you also find the maximum-margin hyperplane for your solution? For the same XOR classification problem above, write down the complete SVM optimization problem (optimization goal and all constraints) using the primal version with the slack variables (like on slide 192). 
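To check your hand computation of q + d^T*u + u^T*R*u, the expression can be evaluated with plain Python lists (no matrix library needed for a 3x3 case). The values of q, d, and R are the ones from the exercise; the choice of u is up to you.

```python
def quad_form(q, d, R, u):
    """Evaluate q + d^T*u + u^T*R*u for plain Python lists.
    The last term is the double sum over u[i] * R[i][j] * u[j],
    which is where the criterion gets its quadratic character."""
    n = len(u)
    lin = sum(di * ui for di, ui in zip(d, u))
    quad = sum(u[i] * R[i][j] * u[j] for i in range(n) for j in range(n))
    return q + lin + quad

q, d = 2, [1, 2, 3]
R = [[1] * 3 for _ in range(3)]  # 3x3 matrix of all ones
print(quad_form(q, d, R, [1, 1, 1]))  # 2 + 6 + 9 = 17
print(quad_form(q, d, R, [1, 0, 0]))  # 2 + 1 + 1 = 4
```

Every term of the expanded expression is at most degree 2 in the u_i, which is exactly why the criterion is called quadratic.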
March 3  No class: Spring Break  
March 10  Classification and Prediction  Read relevant sections in chapter 6. Homework exercises: 6.8 Create the ROC curve for the data on slide 258 without looking at the solution on slide 259. What does the ROC curve look like for a classifier that only outputs either "+" or "-" for each test tuple? Why are Gini and information gain not useful when determining the split attribute for a node in a regression tree? Given a set of training tuples that reach the leaf of a decision (= classification) tree, how do you determine the value that should be returned for a test tuple reaching the leaf? How do you determine it for a regression tree? Propose two options for each tree type. 
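The standard ROC construction sorts the test tuples by decreasing classifier score and emits one (FPR, TPR) point per tuple. A minimal Python sketch (score ties are not merged here, for simplicity; the scored tuples below are made up, not the slide-258 data):

```python
def roc_points(scored):
    """scored: list of (score, label) with labels '+' / '-'.
    Returns the ROC curve as (FPR, TPR) points, starting at (0, 0)."""
    scored = sorted(scored, key=lambda t: -t[0])  # decreasing score
    P = sum(1 for _, y in scored if y == "+")
    N = len(scored) - P
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for _, y in scored:          # lower the threshold one tuple at a time
        if y == "+":
            tp += 1
        else:
            fp += 1
        pts.append((fp / N, tp / P))
    return pts

# Hypothetical scored test tuples
print(roc_points([(0.9, "+"), (0.8, "+"), (0.7, "-"), (0.6, "+"), (0.5, "-")]))
```

A classifier that outputs only "+" or "-" gives just a single point between (0,0) and (1,1), which answers the second question above: its "curve" is the two segments through that one point.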
March 15  Project Milestone 2 due at 4pm Resources: Milestone 2 description 

March 17  Classification and Prediction; Frequent Patterns in Sets and Sequences  Read relevant sections in chapters 5 and 6. If you are curious to learn more about Additive Groves, you can take a look at our paper. Homework exercises: 5.1 What are the key similarities and differences between bagging and boosting? If a classifier has an accuracy of 95%, does that mean it is a good classifier? Justify your answer. If a classifier has an area under the ROC curve of 0.95, does that mean it is a good classifier? Justify your answer. What is the "Apriori principle" of frequent itemset mining? How does it help reduce the cost of frequent itemset mining? For the database of market-basket transactions on slide 5, manually (= using paper and pencil) run the Apriori algorithm to find all itemsets with support of 2 or greater. In particular, for every k, perform the self-join L_{k}*L_{k} to generate C_{k+1} and then prune itemsets in C_{k+1} by checking if all their subsets of length k are in L_{k}. For support counting, you do not have to use the hash tree. Just count the support of the remaining candidates in C_{k+1} directly in the database. 
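The paper-and-pencil procedure above (self-join, subset pruning, direct support counting without the hash tree) can be sketched in a few lines of Python, useful for checking intermediate C_k and L_k sets. The transaction database below is a made-up example, not the slide-5 data.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: candidates via self-join L_k * L_k on a shared
    (k-1)-prefix, pruning via the Apriori principle, direct support counting."""
    def support(itemset):
        return sum(1 for t in transactions if set(itemset) <= t)

    items = sorted({i for t in transactions for i in t})
    L = [(i,) for i in items if support((i,)) >= min_sup]
    frequent = list(L)
    k = 1
    while L:
        # self-join: merge itemsets agreeing on the first k-1 items
        C = sorted({tuple(sorted(set(a) | set(b)))
                    for a in L for b in L if a[:k - 1] == b[:k - 1] and a < b})
        # prune candidates having an infrequent length-k subset
        C = [c for c in C if all(s in L for s in combinations(c, k))]
        L = [c for c in C if support(c) >= min_sup]
        frequent.extend(L)
        k += 1
    return frequent

trans = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]  # hypothetical
print(apriori(trans, 2))
```

In this example every pair is frequent but {a, b, c} has support 1, so it survives candidate generation and pruning yet is eliminated by support counting.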
March 24  Frequent Patterns in Sets and Sequences  Read relevant sections in chapter 5; for more information also take a look at the Tan et al. book (book 1 below). Homework exercises: 5.3, 5.13, 5.14 (read the discussion in the book about the lift measure first) Which task of the Apriori algorithm is improved by using the hash tree? Why is FP-Growth often more efficient than Apriori? What are maximal and closed frequent itemsets and why do we care about them? For the table on slide 51 and min_sup=2, do the following without further looking at slides 51 and 52: (1) create the itemset lattice, (2) annotate each lattice node with the IDs of the transactions supporting the itemset in that node, and (3) identify all frequent, closed, and maximal frequent itemsets in the lattice. Go step-by-step through the example on slide 79, creating the C_i and L_i sets from scratch. Prune candidates in C_i also by using the anti-monotonicity of SUM (assuming all values are non-negative). Now do the same for constraint AVG(S.price)<3. Notice that AVG is not anti-monotone. What is different compared to the example with constraint SUM(S.price)<5? 
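Once you have the frequent itemsets and their supports from the lattice, the closed/maximal classification follows directly from the definitions: closed means no proper superset has the same support; maximal means no proper superset is frequent at all. A small Python sketch (the support values below are invented, not the slide-51 table):

```python
def closed_and_maximal(freq):
    """freq maps frozenset itemsets to support counts (frequent itemsets only).
    Returns (closed, maximal): closed = no proper superset with equal support,
    maximal = no frequent proper superset at all."""
    closed, maximal = set(), set()
    for s, sup in freq.items():
        supersets = [t for t in freq if s < t]
        if all(freq[t] < sup for t in supersets):
            closed.add(s)
        if not supersets:
            maximal.add(s)
    return closed, maximal

# Hypothetical frequent itemsets with supports
freq = {frozenset("a"): 4, frozenset("b"): 3, frozenset("ab"): 3}
closed, maximal = closed_and_maximal(freq)
print(sorted(map(sorted, closed)), sorted(map(sorted, maximal)))
```

Note that every maximal itemset is also closed, but not vice versa: here {a} is closed yet not maximal, while {b} is neither, because {a, b} has the same support.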
March 31  Frequent Patterns in Sets and Sequences; Cluster Analysis  Read relevant sections in chapters 5 and 7, as well as section 8.3. If you are interested in some recent work that deals with a common sequence mining challenge, bursts of common events, take a look here. Homework: 7.1, 7.2, 7.3 Starting with the 51 length-2 candidates on slide 96 and assuming min_sup=2, do the following: (1) find all frequent length-2 sequences (L_2), then (2) create the set of all length-3 candidate sequences (C_3), and (3) prune all those length-3 candidate sequences that contain a length-2 subsequence that is not in L_2. Run PrefixSpan for the example data on slide 100 with min_sup = 2. Try to find all frequent subsequences that start with "a" and all those that start with "b". How is PrefixSpan's way of exploring subsequences different from GSP? Think about differences and similarities between Apriori versus GSP and FP-Growth versus PrefixSpan. 
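The support counting behind both GSP and PrefixSpan rests on one operation: testing whether a candidate sequence of itemsets is contained in a data sequence, with elements matched in order. A minimal Python sketch (the tiny sequence database is hypothetical, not the slide-100 data):

```python
def is_subsequence(sub, seq):
    """True if sequence `sub` (a list of itemsets) is contained in `seq`:
    each element of sub must be a subset of some later element of seq,
    with the order of elements preserved."""
    pos = 0
    for s in sub:
        while pos < len(seq) and not set(s) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1  # next element of sub must match strictly later
    return True

def seq_support(db, sub):
    """Number of data sequences in db that contain sub."""
    return sum(1 for seq in db if is_subsequence(sub, seq))

# Hypothetical sequence database: each sequence is a list of itemsets
db = [[{"a"}, {"b", "c"}, {"d"}],
      [{"a"}, {"c"}]]
print(seq_support(db, [{"a"}, {"c"}]))  # both sequences contain <a, c>
```

GSP calls such a containment test for every candidate against every sequence; PrefixSpan instead grows patterns by projecting the database on a prefix, which is the difference the homework question asks about.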
April 1  Project Milestone 3 due at 4pm Resources: Milestone 3 description 

April 7  Cluster Analysis  Comments about project milestones and quizzes. Read relevant sections in chapter 7. For additional material, also take a look at the Tan et al. book (book 1 below). Homework: 7.6, 7.7 In class we discussed how the proximity matrix for hierarchical agglomerative clustering needs to be updated after each step (slides 67 to 69). The naive way of doing this would be to compute the distances between (C2 UNION C5) and the other clusters from scratch based on the individual data points in the clusters. Propose a more efficient update algorithm that can compute the new proximity matrix by only using the entries from the current proximity matrix, without having to access the individual data points in the clusters. Do this for the following similarity metrics: MIN, MAX, Group Average. If the computation is not possible based on the current proximity matrix alone, could you make it possible by adding a little extra information about each cluster? Choose a few data points in 2D space and then run the hierarchical agglomerative clustering algorithm on these points, creating the dendrogram. Try to find examples that illustrate how using the MIN and MAX metric results in different clusterings. BIRCH first creates a rough hierarchical partitioning of the data, then runs a clustering algorithm of choice to refine the partitions. What is the motivation for this approach? Try to give more than one argument. 
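One way the matrix-only update from the exercise can be sketched: for MIN and MAX, the distance from the merged cluster to any other cluster k is the min/max of the two old entries; for group average, keeping the cluster sizes as the "little extra information" suffices. The 3-cluster distance matrix below is a made-up example.

```python
def merge_update(D, sizes, i, j, metric):
    """Distances from the merged cluster (i UNION j) to every other cluster k,
    computed from the current matrix D alone; group average ("AVG") additionally
    needs the cluster sizes. Returns {k: new distance}."""
    out = {}
    for k in range(len(D)):
        if k in (i, j):
            continue
        if metric == "MIN":          # single link
            out[k] = min(D[i][k], D[j][k])
        elif metric == "MAX":        # complete link
            out[k] = max(D[i][k], D[j][k])
        elif metric == "AVG":        # group average, size-weighted
            out[k] = (sizes[i] * D[i][k] + sizes[j] * D[j][k]) / (sizes[i] + sizes[j])
    return out

# Hypothetical symmetric distance matrix for 3 clusters; merge clusters 0 and 1
D = [[0, 2, 5],
     [2, 0, 3],
     [5, 3, 0]]
sizes = [1, 2, 1]
print(merge_update(D, sizes, 0, 1, "MIN"))
```

The AVG case shows why sizes are needed: averaging pairwise point distances requires weighting each old entry by how many points its cluster contributes.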
April 14  Cluster Analysis  Read relevant sections in chapter 7. For additional material, also take a look at the Tan et al. book (book 1 below). Homework: 7.8, 7.10, 7.15 (except CLARA, ROCK) Explain the reason for the clustering results on slides 116 and 119 (DBSCAN problems). What is the relationship between the Apriori algorithm for frequent itemset mining and the CLIQUE algorithm? What is a similarity matrix and how does it help us determine if a clustering is good or not? How is it different from using the correlation between similarity matrix and cluster incidence matrix? 
April 12  Preliminary Milestone 4 predictions for Test data 1 due at 11:59pm  
April 16  New Preliminary Milestone 4 predictions for Test data 1 due at 11:59pm  
April 19  Final project report due at 11:59pm Resources: Milestone 4 description, data description, Training data, Test data 1, Test data 2 

April 21  Cluster Analysis; Course Summary  Read relevant sections in chapter 7. For additional material, also take a look at the Tan et al. book (book 1 below). Study for the final exam. 
April 28  Final Exam  6-8 PM in 108 WVH 
Team Name  April 12 ACC  Team Name  April 16 ACC  Team Name  Apr 19 ACC  Team Name  Test 2 ACC  

T. N.  79.45  Team A  79.95  J57KlassyFire  81.54  J57KlassyFire  80.39  
JZ  77.79  T. N.  79.60  Team A  79.95  P. T. and D. N.  79.59  
Team A  77.74  J57KlassyFire  79.11  VOTERs  79.85  M. E.  78.67  
J57KlassyFire  77.15  X. C. and X. C.  78.87  T. N.  79.69  Kone  78.46  
Paras  76.53  P. T. and D. N.  78.81  JZ  79.29  Team A  78.15  
X. C. and X. C.  76.30  JZ  78.81  M. E.  79.23  X. C. and X. C.  77.90  
P. T. and D. N.  75.61  Kone  78.22  X. C. and X. C.  79.08  T. N.  77.00  
Kone  75.48  A. P.  77.95  P. T. and D. N.  78.81  N. W. and P. G.  76.91  
M. E.  74.16  VOTERs  77.28  Kone  78.75  VOTERs  76.52  
A. P.  74.03  DK  77.12  A. P.  77.95  A. P.  75.34  
V. P.  73.11  N. W. and P. G.  77.00  Paras  77.30  JZ  75.15  
Spring  70.77  M. E.  76.61  DK  77.12  DK  73.97  
N. W. and P. G.  70.39  Paras  76.53  N. W. and P. G.  77.00  Spring  73.59  
VOTERs  68.72  V. P.  75.29  Spring  76.60  Paras  73.45  
DK  67.56  Spring  75.23  V. P.  75.29  V. P.  70.57 
Instructor: Mirek Riedewald
TA: Bahar Qarabaqi
Meeting times: Wed 6-9 PM
Meeting location: 108 WVH
Prerequisites: CS 5800 or CS 7800, or consent of instructor
Textbook: Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann, 2006
Recommended books for further reading:
A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.
For more information, please refer to the Academic Integrity Web page.