CS6220 Unsupervised Data Mining

HW2: KMeans and Gaussian Mixtures

Make sure you check the syllabus for the due date. Please use the notations adopted in class, even if the problem is stated in the book using a different notation.

We are not looking for very long answers (if you find yourself writing more than one or two pages of typed text per problem, you are probably on the wrong track). Try to be concise; also keep in mind that good ideas and explanations matter more than exact details.

Submit all code files to Dropbox (create a folder named HW2 or similar). Results can be pdf or txt files, including plots/tables if any.

"Paper" exercises: submit using Dropbox as pdf, either typed or scanned handwritten.


DATASET : SpamBase: emails (54-feature vectors) classified as spam/non-spam

DATASET : 20 NewsGroups : news articles

DATASET : MNIST : 28x28 digit B/W images

DATASET : FASHION : 28x28 B/W clothing images (Fashion-MNIST)

https://en.wikipedia.org/wiki/MNIST_database
http://yann.lecun.com/exdb/mnist/
https://www.kaggle.com/zalando-research/fashionmnist

PROBLEM 1: KMeans Theory

Given the KMeans objective discussed in class with Euclidean distance,
J(\pi, \mu) = \sum_{n=1}^{N} \sum_{k=1}^{K} \pi_{nk} \|x_n - \mu_k\|^2,
where \pi_{nk} \in \{0,1\} indicates whether point x_n is assigned to cluster k (with \sum_k \pi_{nk} = 1) and \mu_k denotes the centroid of cluster k:
A) Prove that the E-step update on the memberships (\pi) achieves the minimum objective given the current centroids (\mu).
B) Prove that the M-step update on the centroids (\mu) achieves the minimum objective given the current memberships (\pi).
C) Explain why KMeans has to stop (converge), though not necessarily at the global minimum of the objective.
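
For reference, the two updates in question can be restated in the class notation as follows (this is only the statement of the updates; the proofs are the exercise):

\text{E step:}\quad \pi_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_n - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}
\qquad
\text{M step:}\quad \mu_k = \frac{\sum_n \pi_{nk}\, x_n}{\sum_n \pi_{nk}}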

PROBLEM 2 : KMeans on data

Use Euclidean distance or dot-product similarity (choose one per dataset; you can also try other similarity metrics).
A) run KMeans on the MNIST Dataset, try K=10
B) run KMeans on the FASHION Dataset, try K=10
C) run KMeans on the 20NG Dataset, try K=20
You can use a library for distance/similarity computations, but you have to implement KMeans yourself (the EM steps, termination criteria, etc.); see the sketch after this problem.
For all three datasets, evaluate the KMeans objective for a higher K (for example, double) or a smaller K (for example, half).
For all three datasets, evaluate external clustering performance using the data labels and the performance metrics Purity and Gini Index (see book [A], section 6.9.2).
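
As a rough guide only (not the required solution), here is a minimal NumPy sketch of the KMeans loop and of the Purity/Gini evaluation; it assumes the data has already been loaded into an (n, d) float array X with integer-coded labels y, and all function and variable names are illustrative:

import numpy as np

def kmeans(X, K, max_iter=300, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centroids as K distinct random data points
    mu = X[rng.choice(len(X), K, replace=False)].astype(float)
    prev_obj = np.inf
    for _ in range(max_iter):
        # E step: squared Euclidean distances via ||x||^2 - 2 x.mu + ||mu||^2
        d2 = (X ** 2).sum(1, keepdims=True) - 2.0 * X @ mu.T + (mu ** 2).sum(1)
        z = d2.argmin(1)                      # nearest-centroid memberships
        obj = d2[np.arange(len(X)), z].sum()  # current KMeans objective
        # M step: each centroid becomes the mean of its members
        for k in range(K):
            if (z == k).any():
                mu[k] = X[z == k].mean(0)
        if prev_obj - obj < tol:              # terminate when the objective stalls
            break
        prev_obj = obj
    return mu, z, obj

def purity_and_gini(z, y):
    # external evaluation against the true labels y (book [A], section 6.9.2)
    n, pur, gini = len(y), 0.0, 0.0
    for k in np.unique(z):
        counts = np.bincount(y[z == k])
        nk = counts.sum()
        pur += counts.max()                              # dominant-label count
        gini += nk / n * (1.0 - ((counts / nk) ** 2).sum())
    return pur / n, gini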

PROBLEM 3 : Gaussian Mixture on toy data

You are required to implement the main EM loop yourself, but you can use the math API/functions provided by your language to compute normal densities, covariance matrices, etc. (a rough sketch follows the notes below).
A) The 2-dimensional data in the file 2gaussian.txt has been generated using a mixture of two Gaussians with the parameters below. Run the EM algorithm with random initial values to recover the parameters.
mean_1 = [3,3]; cov_1 = [[1,0],[0,3]]; n1 = 2000 points
mean_2 = [7,4]; cov_2 = [[1,0.5],[0.5,1]]; n2 = 4000 points
You should obtain a result that looks like the reference plot (you don't necessarily have to plot it).

B) Same problem for the 2-dimensional data in the file 3gaussian.txt, generated using a mixture of three Gaussians. Verify your findings against the true parameters, used to generate the data, given below.
mean_1 = [3,3]; cov_1 = [[1,0],[0,3]]; n1 = 2000
mean_2 = [7,4]; cov_2 = [[1,0.5],[0.5,1]]; n2 = 3000
mean_3 = [5,7]; cov_3 = [[1,0.2],[0.2,1]]; n3 = 5000

Additional notes helpful for implementing Gaussian Mixtures:
https://xcorr.net/2008/06/11/log-determinant-of-positive-definite-matrices-in-matlab/
http://andrewgelman.com/2016/06/11/log-sum-of-exponentials/
https://hips.seas.harvard.edu/blog/2013/01/09/computing-log-sum-exp/
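
Putting those notes together, here is a minimal EM sketch for parts A and B; it assumes the files contain whitespace-separated coordinates loadable with np.loadtxt, and it uses scipy.stats.multivariate_normal for the densities (allowed above). The normalization step uses the log-sum-exp trick from the linked notes; all names are illustrative:

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, max_iter=200, tol=1e-8, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # random initialization: means from data, identity covariances, uniform weights
    mu = X[rng.choice(n, K, replace=False)].astype(float)
    cov = np.array([np.eye(d)] * K)
    w = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E step: responsibilities r[i,k] proportional to w_k * N(x_i | mu_k, cov_k)
        logp = np.stack([multivariate_normal.logpdf(X, mu[k], cov[k])
                         for k in range(K)], axis=1) + np.log(w)
        m = logp.max(1, keepdims=True)
        log_norm = m + np.log(np.exp(logp - m).sum(1, keepdims=True))  # log-sum-exp
        r = np.exp(logp - log_norm)
        ll = log_norm.sum()                    # data log-likelihood
        # M step: update weights, means, covariances from the responsibilities
        nk = r.sum(0)
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        for k in range(K):
            xc = X - mu[k]
            # small ridge keeps the covariance positive definite
            cov[k] = (r[:, k, None] * xc).T @ xc / nk[k] + 1e-6 * np.eye(d)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return w, mu, cov, ll

# e.g. w, mu, cov, ll = gmm_em(np.loadtxt("2gaussian.txt"), K=2)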


PROBLEM 4 : EM for coin flips

Three coins A, B, C with head probabilities p_A, p_B, p_C can be chosen for each of N sessions. Once a coin is chosen for a session, that coin is flipped D times.
For D=20 and N=50, and fixed non-uniform coin-selection probabilities pi_A, pi_B, pi_C that sum to 1, we have this outcome; each row corresponds to a session of 20 binary outcomes (1 = head, 0 = tail).

Estimate the probability of selecting each coin for a session (the three mixture "pi" probabilities), as well as the head-bias of each coin (the three "p" parameters).

HINT: for each session, since the D flips are independent of each other, all that matters is the number of heads out of the batch of size D. If the chance of a head is p_ on each flip, then the probability of observing x heads is binomial(x | p_, D).
In English: binomial(x | p_, D) = the probability of getting x heads out of D coin flips, if the coin has head-bias p_.
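
A minimal sketch of the corresponding EM loop, assuming the session outcomes have been loaded into an (N, D) binary array flips (the loading step is omitted, since the outcome data is provided separately):

import numpy as np
from scipy.stats import binom

def coin_em(flips, K=3, max_iter=500, tol=1e-10, seed=0):
    heads = flips.sum(1)              # sufficient statistic: heads per session
    N, D = flips.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)          # coin-selection ("pi") probabilities
    p = rng.uniform(0.2, 0.8, K)      # head-bias ("p") of each coin
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E step: responsibility of coin k for each session, via binomial(x | p_k, D)
        like = binom.pmf(heads[:, None], D, p[None, :]) * pi  # shape (N, K)
        r = like / like.sum(1, keepdims=True)
        ll = np.log(like.sum(1)).sum()
        # M step: reestimate selection probabilities and biases
        nk = r.sum(0)
        pi = nk / N
        p = (r * heads[:, None]).sum(0) / (nk * D)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, p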

PROBLEM 5 [optional, no credit] : Gaussian Mixture on real data

Run EM to obtain a Gaussian Mixture on the FASHION dataset (probably won't work) and on the SPAMBASE dataset (should work). Use a library/package (such as scikit-learn), and at first use the option that imposes a diagonal covariance matrix. Sampling the data might be necessary to complete the run.
Supervised GDA: train a different GMM for each label (separately on the train points with that label), each GMM with a small K. For each test point x, apply each GMM to x and use Bayes' theorem to obtain the label with the maximum posterior probability. Evaluate with accuracy on the test data (a sketch follows below).
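
One possible shape for the supervised-GDA part, sketched with scikit-learn's GaussianMixture (diagonal covariances, as suggested above); the train/test arrays Xtr, ytr, Xte, yte are assumed to be prepared elsewhere:

import numpy as np
from sklearn.mixture import GaussianMixture

def gda_fit(Xtr, ytr, K=3):
    # one small GMM per class, plus the empirical class prior P(label)
    models, priors = {}, {}
    for c in np.unique(ytr):
        Xc = Xtr[ytr == c]
        models[c] = GaussianMixture(n_components=K, covariance_type="diag",
                                    random_state=0).fit(Xc)
        priors[c] = len(Xc) / len(Xtr)
    return models, priors

def gda_predict(models, priors, Xte):
    classes = sorted(models)
    # log posterior (up to a constant): log prior + per-class GMM log-likelihood
    scores = np.stack([np.log(priors[c]) + models[c].score_samples(Xte)
                       for c in classes], axis=1)
    return np.array(classes)[scores.argmax(1)]

# accuracy = (gda_predict(models, priors, Xte) == yte).mean()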