CS6140 Machine Learning
    HW7 - Local Methods and Clustering
      
    Make sure you check the syllabus for the due date. 
    
    
    
    
    PROBLEM 1 [50 points] Local Classification (notes, slides)
    
For the Spambase
dataset, use local classification on the test set (10-fold cross
validation). The principle: predict a label for each test point based
on the neighbours within a window. For this to work, you first have
to decide on a similarity/kernel function K(x,z) between datapoints -
for example Gaussian, Laplacian, identity, etc.
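For concreteness, here is a minimal sketch of two common kernel
choices; the bandwidth gamma is a hypothetical parameter you would
tune, e.g. by cross validation:

import numpy as np

def gaussian_kernel(x, z, gamma=0.05):
    # K(x, z) = exp(-gamma * ||x - z||_2^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def laplacian_kernel(x, z, gamma=0.05):
    # K(x, z) = exp(-gamma * ||x - z||_1)
    return np.exp(-gamma * np.sum(np.abs(x - z)))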
    
A) Fixed Window. Fix an appropriate window size W around the test
datapoint, and predict the label as the majority or average of the
training neighbours within the window.
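A minimal sketch of the fixed-window rule, assuming Euclidean
distance and 0/1 labels; the window size W below is a placeholder
value to tune:

import numpy as np

def fixed_window_predict(X_train, y_train, x_test, W=3.0):
    # Euclidean distance from the test point to every training point
    d = np.linalg.norm(X_train - x_test, axis=1)
    inside = y_train[d <= W]             # neighbours within the window
    if inside.size == 0:                 # empty window: fall back to global majority
        inside = y_train
    return int(np.round(inside.mean()))  # majority vote over 0/1 labels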
    
B) K Nearest Neighbours: pick K (for example K=9) and predict the
label as the average or majority vote among the closest K training
neighbours.
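The corresponding k-NN sketch, again assuming Euclidean distance and
0/1 labels (K=9 as in the example above):

import numpy as np

def knn_predict(X_train, y_train, x_test, K=9):
    d = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(d)[:K]                    # indices of the K closest training points
    return int(np.round(y_train[nearest].mean()))  # majority vote over 0/1 labels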
    
C) Kernel density estimation. Separately for each label class C
(+/-), estimate P(x|C) using the density estimator given by the
kernel K restricted to the training points from class C (no need for
explicit windows, as the kernel weights each point by similarity).
Then predict the class with the largest Bayesian probability
P(C|x) = P(C) * P(x|C) / P(x).
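One possible sketch with a Gaussian kernel: P(x|C) is estimated as
the average kernel value over the class-C training points, and the
comparison uses the unnormalized posterior P(C)*P(x|C), since P(x) is
the same for both classes (gamma is again an assumed bandwidth):

import numpy as np

def kde_predict(X_train, y_train, x_test, gamma=0.05):
    best_class, best_score = None, -np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        # P(x|C): average Gaussian-kernel similarity to the class-C points
        p_x_given_c = np.mean(np.exp(-gamma * np.sum((Xc - x_test) ** 2, axis=1)))
        p_c = len(Xc) / len(y_train)   # class prior P(C)
        score = p_c * p_x_given_c      # proportional to P(C|x); P(x) cancels
        if score > best_score:
            best_class, best_score = c, score
    return best_class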
    
PROBLEM 2 Clustering the 20Newsgroups dataset [Extra Credit]
     
First, decide on a similarity measure for the 20Newsgroups
dataset. Use entropy and purity for evaluation (against the true
class labels). For this problem, to get 50 points you only need to
make clustering work for one of A), B), C) below, not all three.
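A sketch of the two evaluation measures, assuming assignments holds
the cluster index of each document and labels the true
(integer-coded) newsgroup:

import numpy as np

def purity_and_entropy(assignments, labels):
    assignments, labels = np.asarray(assignments), np.asarray(labels)
    n, purity, entropy = len(labels), 0.0, 0.0
    for c in np.unique(assignments):
        members = labels[assignments == c]
        _, counts = np.unique(members, return_counts=True)
        purity += counts.max() / n           # mass of the dominant true class
        p = counts / counts.sum()
        entropy += (len(members) / n) * -(p * np.log2(p)).sum()  # size-weighted class entropy
    return purity, entropy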
    
A) Use K-Means clustering on the 20Newsgroups datapoints. You can
choose K appropriately in advance (knowing the true number of
classes), but you will have to play with the initial centroids.
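A minimal Lloyd's-algorithm sketch with random initial centroids and
Euclidean distance (for tf-idf document vectors you may prefer cosine
similarity; K=20 assumes one cluster per newsgroup). Rerunning with
different seeds is one way to play with the initialization:

import numpy as np

def kmeans(X, K=20, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)].astype(float)  # random init
    for _ in range(n_iter):
        # assignment step: nearest centroid for every point
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        for k in range(K):
            if np.any(assign == k):
                centroids[k] = X[assign == k].mean(axis=0)
    return assign, centroids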
    
B) Use hierarchical clustering on the 20Newsgroups dataset, with
average-link distance between clusters.
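A sketch using SciPy's agglomerative routines with average linkage;
cosine distance is an assumed choice for document vectors, and the
dendrogram is cut into 20 flat clusters:

from scipy.cluster.hierarchy import linkage, fcluster

def average_link_clusters(X, n_clusters=20):
    Z = linkage(X, method='average', metric='cosine')       # bottom-up, average-link merges
    return fcluster(Z, t=n_clusters, criterion='maxclust')  # cluster index per point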
    
C) [Extra Credit] Apply K-medoids, where the cluster centers are not
average centroids but instead "middle" points: actual datapoints that
minimize the total distance to the rest of their cluster.
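A PAM-style sketch: points are assigned to the nearest medoid, and
each medoid then moves to the cluster member with the smallest total
distance to its cluster. The full pairwise-distance matrix makes this
an O(n^2)-memory sketch, not a production implementation:

import numpy as np

def kmedoids(X, K=20, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(X), K, replace=False)             # indices of initial medoids
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairwise distances
    for _ in range(n_iter):
        assign = D[:, medoids].argmin(axis=1)                  # nearest medoid per point
        for k in range(K):
            members = np.where(assign == k)[0]
            if members.size:
                # new medoid: the member closest in total to its own cluster
                medoids[k] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
    return assign, medoids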
    
PROBLEM 3 Clustering images [Extra Credit]
Same as problem 2 (both K-Means and hierarchical clustering), on the
digits dataset (training data, labels; testing data, labels). You can
choose K=10 in advance.
    
    PROBLEM 4 Clustering with EM [Extra Credit]
    
Clustering: Apply the EM algorithm on the entire Spambase training
set (both positive and negative datapoints) using a mixture of K=9
Gaussians, each Gaussian 57-dimensional. For each cluster = Gaussian
component, consider the training datapoints most associated with it
(highest probability in the mixture, out of the K probabilities), and
take the majority label of these points - positive or negative; this
is the prediction label of the cluster.
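A sketch using scikit-learn's EM implementation of Gaussian mixtures;
diagonal covariances are an assumption to keep the 57-dimensional
Gaussians well conditioned, and each cluster gets the majority label
of its hard-assigned training points:

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_labeled_gmm(X_train, y_train, K=9, seed=0):
    gmm = GaussianMixture(n_components=K, covariance_type='diag', random_state=seed)
    gmm.fit(X_train)             # EM on the full training set, labels unused
    hard = gmm.predict(X_train)  # most probable component per training point
    cluster_labels = np.array([
        int(np.round(y_train[hard == k].mean())) if np.any(hard == k) else 0
        for k in range(K)        # majority class (0/1) within each cluster
    ])
    return gmm, cluster_labels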
    
Use the following testing schema: assign each test datapoint to the
cluster given by the highest-probability Gaussian component. Use the
cluster label to make a prediction. Present the accuracy.
    
Alternative testing: use each cluster label to produce a
mixture-score prediction, weighted by the probability of each
component: F(x) = label_cluster1 * P(x|cluster1) + label_cluster2 *
P(x|cluster2) + ... + label_clusterK * P(x|clusterK). Present the
AUC.
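Both testing schemas in one sketch. For the mixture score, the
posterior responsibilities from predict_proba are used as the
per-component weights; this differs from the raw P(x|cluster_k) only
by the mixing priors and the common normalizer P(x), and is
numerically more stable in 57 dimensions:

import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_gmm(gmm, cluster_labels, X_test, y_test):
    # schema 1: hard-assign each test point to its most probable component
    pred = cluster_labels[gmm.predict(X_test)]
    accuracy = (pred == y_test).mean()
    # schema 2: mixture score F(x) = sum_k label_k * P(cluster_k | x)
    F = gmm.predict_proba(X_test) @ cluster_labels
    auc = roc_auc_score(y_test, F)
    return accuracy, auc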
    
PROBLEM 5 [Extra Credit]
Prove the harmonic-functions property discussed in class, based on
this paper. Specifically, prove that to minimize the energy function

E(f) = (1/2) * sum_{i,j} w_ij * (f(i) - f(j))^2

f must be harmonic, i.e. for all unlabeled datapoints j it must
satisfy

f(j) = (1/d_j) * sum_i w_ij * f(i),   where d_j = sum_i w_ij.
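A hint for the key step (a sketch, assuming the standard quadratic
energy from the paper; verifying that the stationary point is indeed
the minimum is left for your write-up):

\[
\frac{\partial E}{\partial f(j)} \;=\; \sum_{i} w_{ij}\,\bigl(f(j)-f(i)\bigr) \;=\; 0
\qquad\Longrightarrow\qquad
f(j) \;=\; \frac{1}{d_j}\sum_{i} w_{ij}\, f(i),
\quad d_j = \sum_{i} w_{ij},
\]

which is exactly the weighted-average (harmonic) condition at every
unlabeled point j.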
    