CS6140 Machine Learning
Make sure you check the syllabus for the due date.
PROBLEM 1 [50 points] Local Classification (notes, slides)
For the Spambase dataset, use local classification on the test set
(10-fold cross validation). The principle is: predict a label for each
test point based on the neighbours within a window. For this to work,
you first have to decide on a similarity/kernel function between
datapoints K(x,z), for example Gaussian, Laplacian, identity, etc.
A) Fixed Window. Fix an appropriate window size W around the test
datapoint, and predict the label as the majority or average of the training
neighbours within the window.
B) K Nearest Neighbours. Pick K (for example K=9) and
predict the label as the average or majority among the
closest K training neighbours.
C) Kernel density estimation. Separately for each label class C (+/-),
estimate P(x|C) using the density estimator given by the kernel K
restricted to the training points from class C (no need for physical windows, as
the kernel is weighting each point by similarity). Then predict the
class with the largest Bayesian posterior probability P(C|x) = P(C) * P(x|C) / P(x).
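The three variants A), B), C) can be sketched on toy 2-D data as follows. This is a minimal illustration, not a reference solution: the function names, the Gaussian kernel with bandwidth sigma, the Euclidean window radius, the default label for an empty window, and the toy data are all assumptions made for the example. In the KDE variant the kernel's normalizing constant and P(x) are dropped, since both are the same for every class (when sigma is shared) and do not change which class has the largest posterior.

```python
import math
from collections import Counter

def sq_dist(x, z):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, z))

def gaussian_kernel(x, z, sigma=1.0):
    """Unnormalized Gaussian similarity K(x, z)."""
    return math.exp(-sq_dist(x, z) / (2.0 * sigma ** 2))

def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def predict_fixed_window(X, y, x, radius, default=0):
    """A) Majority label among training points within `radius` of x.
    Falls back to `default` (an assumption) when the window is empty."""
    inside = [lab for xi, lab in zip(X, y) if sq_dist(xi, x) <= radius ** 2]
    return majority(inside) if inside else default

def predict_knn(X, y, x, k=3):
    """B) Majority label among the k closest training points."""
    nearest = sorted(zip(X, y), key=lambda p: sq_dist(p[0], x))[:k]
    return majority([lab for _, lab in nearest])

def predict_kde(X, y, x, sigma=1.0):
    """C) Pick the class C maximizing P(C) * P(x|C), where P(x|C) is the
    average kernel value between x and the training points of class C."""
    scores = {}
    for c in set(y):
        pts = [xi for xi, lab in zip(X, y) if lab == c]
        prior = len(pts) / len(X)
        density = sum(gaussian_kernel(x, xi, sigma) for xi in pts) / len(pts)
        scores[c] = prior * density
    return max(scores, key=scores.get)

# Toy training set: class 0 near the origin, class 1 near (5, 5).
X_train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y_train = [0, 0, 0, 1, 1, 1]

near_origin = (0.5, 0.5)
near_five = (5.5, 5.5)
print(predict_fixed_window(X_train, y_train, near_origin, radius=1.5))
print(predict_knn(X_train, y_train, near_five, k=3))
print(predict_kde(X_train, y_train, near_five, sigma=1.0))
```

On Spambase the same functions would be called once per test point inside each cross-validation fold; the window size, K, and sigma are tuning choices the sketch does not settle.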
PROBLEM 2 [50 points] Clustering
First, decide on a similarity measure for the 20Newsgroups dataset. Use entropy and purity for evaluation (using the true class labels). For this problem, to get 50 points, you need to make clustering work for one of A), B), C) below, not all three.
A) Use K-Means clustering on the 20Newsgroups datapoints. You can choose
K appropriately in advance (knowing the proper number of classes), but
you'll have to play with the initial centroids.
B) Use hierarchical clustering on the 20Newsgroups dataset, with average-link distance between clusters.
C) [Extra Credit] Apply K-medoids, where the K-Means centers are not average centroids, but instead "middle" points.
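Whichever clustering method is used, the evaluation by purity and entropy can be sketched as below. The definitions used here are one common choice and are an assumption, since the assignment does not spell them out: purity is the fraction of points that carry the majority true label of their cluster, and entropy is the cluster-size-weighted average of the per-cluster label entropy (in bits). Function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def cluster_counts(cluster_ids, true_labels):
    """Map each cluster id to a Counter of the true labels it contains."""
    counts = defaultdict(Counter)
    for c, t in zip(cluster_ids, true_labels):
        counts[c][t] += 1
    return counts

def purity(cluster_ids, true_labels):
    """Fraction of points matching the majority label of their cluster."""
    counts = cluster_counts(cluster_ids, true_labels)
    return sum(max(cnt.values()) for cnt in counts.values()) / len(true_labels)

def entropy(cluster_ids, true_labels):
    """Size-weighted average of per-cluster label entropy, in bits."""
    counts = cluster_counts(cluster_ids, true_labels)
    n = len(true_labels)
    total = 0.0
    for cnt in counts.values():
        size = sum(cnt.values())
        h = -sum((v / size) * math.log2(v / size) for v in cnt.values())
        total += (size / n) * h
    return total

# Toy example: cluster 0 mixes two classes, cluster 1 is pure.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["a", "a", "b", "b", "b", "b"]
print(purity(toy_clusters, toy_labels))
print(entropy(toy_clusters, toy_labels))
```

Higher purity and lower entropy both indicate clusters that agree better with the true classes; a perfect clustering gives purity 1 and entropy 0.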
PROBLEM 3 [extra credit]
Same as Problem 2: run both K-Means and hierarchical clustering on the digits dataset from HW5 (training data, labels; testing data, labels). You can choose K=10 in advance.
PROBLEM 4 [extra credit]
Clustering: Apply the EM algorithm on the entire Spambase dataset
(both positive and negative) using a mixture of K=9 Gaussians,
each Gaussian 57-dimensional. For each cluster (= Gaussian component), consider
the training datapoints that are most associated with it (highest
probability in the mixture, out of K probabilities), and take the
majority label among these points, positive or negative: this is the
prediction label of the cluster.
Use the following testing scheme: Assign each test datapoint to the
cluster given by the highest-probability Gaussian component. Use
the cluster label to make a prediction. Present the accuracy.
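The cluster-labelling and hard-assignment steps above can be sketched as follows, assuming EM has already been run and has produced, for every datapoint, a list of K component probabilities. The EM fit itself is not shown; the function names, the default label for a component that wins no training point, and the toy probabilities (K=3 instead of 9) are assumptions made for the illustration.

```python
from collections import Counter

def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda k: values[k])

def label_clusters(train_probs, train_labels, n_clusters, default=0):
    """Give each component the majority label of the training points
    for which it is the highest-probability component."""
    votes = [Counter() for _ in range(n_clusters)]
    for probs, lab in zip(train_probs, train_labels):
        votes[argmax(probs)][lab] += 1
    # A component that wins no point gets `default` (an assumption).
    return [v.most_common(1)[0][0] if v else default for v in votes]

def predict_hard(test_probs, cluster_labels):
    """Assign each test point to its top component and use that label."""
    return [cluster_labels[argmax(probs)] for probs in test_probs]

def accuracy(pred, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

# Toy component probabilities for K=3 (rows: datapoints, columns: components).
train_probs = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1],
               [0.2, 0.6, 0.2], [0.6, 0.3, 0.1]]
train_labels = [1, 1, 0, 0, 0]
test_probs = [[0.9, 0.05, 0.05], [0.1, 0.7, 0.2], [0.2, 0.2, 0.6]]
test_labels = [1, 0, 1]

cluster_labels = label_clusters(train_probs, train_labels, n_clusters=3)
predictions = predict_hard(test_probs, cluster_labels)
print(cluster_labels, predictions, accuracy(predictions, test_labels))
```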
Alternative testing: use each cluster label to produce a mixture-score
prediction, weighted by probability for each component:
F(x) = label_cluster1 * P(x|cluster1) + label_cluster2 * P(x|cluster2) + ... + label_clusterK * P(x|clusterK).
Present the AUC.
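A minimal sketch of the mixture score and its AUC follows. It takes the formula literally, F(x) as the sum over components of the cluster label times the component probability, and assumes labels coded as 0/1 (with -1/+1 labels the same code applies unchanged). The AUC is computed as the fraction of (positive, negative) pairs ranked correctly, with ties counted as one half; function names and toy numbers are hypothetical.

```python
def mixture_score(component_probs, cluster_labels):
    """F(x) = sum over k of label_k * P(x | cluster k)."""
    return sum(lab * p for lab, p in zip(cluster_labels, component_probs))

def auc(scores, truth):
    """Fraction of (positive, negative) pairs where the positive point
    has the higher score; a tie counts as half a correct pair."""
    pos = [s for s, t in zip(scores, truth) if t == 1]
    neg = [s for s, t in zip(scores, truth) if t != 1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example with K=3 components; only component 0 is labelled positive.
toy_cluster_labels = [1, 0, 0]
toy_test_probs = [[0.9, 0.05, 0.05], [0.1, 0.7, 0.2],
                  [0.2, 0.2, 0.6], [0.5, 0.4, 0.1]]
toy_truth = [1, 0, 1, 0]

toy_scores = [mixture_score(p, toy_cluster_labels) for p in toy_test_probs]
print(toy_scores, auc(toy_scores, toy_truth))
```

The pairwise AUC is quadratic in the number of test points, which is acceptable for a dataset the size of Spambase; a rank-based computation would be the usual choice for larger sets.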