CS6140 Machine Learning

HW3

EM class notes,   Other EM notes1, notes2, notes3, notes4, notes5

Multivariate Normal,   Multivariate - mariginal and conditional,    Correlation Coefficient

You have to write your own code, and to understand how 2-dim gaussians work (languages like Python or Matlab can help in dealing with multidimensional data, including the means and covariances in 2-dim). However before you do, you can look for other EM code to help you understand the E and M steps.




PROBLEM 1 [50]

Run the Gaussian Discriminant Analysis on the spambase data. Use the k-folds from the previous problem (1 for testing, k-1 for training, for each fold)
Since you have 57 real value features, each of the  2gaussians (for + class and for - class) will have a mean  vector with 57 components, and a they will have
(you can use a Matlab or Python of Java built in function to estimated covariance matrices, but the estimator is easy to code up). Looking at the result, does it appear that the gaussian assumption (normal distributed data) holds for this particular dataset?

PROBLEM 2 [50]

In this assignment, you will create a Naive Bayes classifier for detecting e-mail spam, and you will test your classifier on a publicly available spam dataset. You will experiment with three methods for modeling the distribution of features, and you will test your classifier using 10-fold cross-validation.

PROBLEM 3 [50 points]

Rerun the HW2 Naive Bayes classifier on Spam Data. For each feature (1-dim) use a mixture of K Gaussians as your model ; run the EM to estimate the 3*K parameters for each feature : mean1, var1, w1; mean2,var2, w2;... meanK,varK,wK; constrained by w1+w2 +...+wK=1. (You would need a separate mixture for positive and negative data, for each feature). We observed best results for K=9.
    Testing: for each testing point , apply the Naive Bayes classifier as before:  take the log-odd of product of probabilities over features mixtures (and the prior), separately for positive side and negative side; use the overall probability ratio as an output score, and compute the AUC for testing set.  Do this for a 10-fold cross validation setup. Is the overall 10-fold average AUC better than before, when each feature model was a single gaussian?


PROBLEM 4 [50 points]

A) The gaussian 2-dim data on file  2gaussian.txt  has been generated  using a mixture  of  two Gaussians, each  2-dim, with the parameters below. Run the EM algorithm with random initial values to recover the parameters.
mean_1 [3,3]); cov_1 = [[1,0],[0,3]]; n1=2000 points
mean_2 =[7,4]; cov_2 = [[1,0.5],[0.5,1]]; ; n2=4000 points
You should obtain a result visually like this (you dont have necessarily to plot it)

A) Same problem for 2-dim data on file 3gaussian.txt , generated using a mixture of three Gaussians. Verify your  findings against the true parameters used generate the data below.
mean_1 = [3,3] ; cov_1 = [[1,0],[0,3]]; n1=2000
mean_2 = [7,4] ; cov_2 = [[1,0.5],[0.5,1]] ; n2=3000
mean_3 = [5,7] ; cov_3 = [[1,0.2],[0.2,1]]    ); n3=5000




PROBLEM 5 - EXTRA CREDIT [50 points]

Cheng's  note summarizing E an M steps for this problem.

    A.Generate mixture data (coin flips). Pick a p,r,pi as in the EM example discussed in class (or in notes). Say p=.75, r=.4, pi=.8, but you should try this for several sets of values. Generate the outcome of the coin experiment by first picking a coin (pi probability for first coin, 1-pi probability for the second coin), then flip that coin K times (use K=10) with probability of head (p if first coin is picked,  r if the second coin is picked) and finally write down a 1 if the head is seen, 0 for the tail. Repeat this M=1000 times or more; so your outcome is a stream of M sequences of K flips : (1001110001; 0001110001; 1010100101 etc)

    B.Infer parameters from data.
Now using the stream of 1 and 0 observed, recover p,r,pi using the EM algorithm; K is known in advance. Report in a table the parameter values found by comparison with the ones used to generate data; try several sets of (p,r,pi). Here are some useful notes,  and  other readings (1 , 2 , 3 , 4) for the coin mixture.

    C(more coins). 
Repeat A) and B) with T coins instead of two. You will need more mixture parameters.



PROBLEM 6 [20 points]

Print this pdf of the EM code demoed in class. (Alternatively, you can print the code yourself from an editor).
Annotate  on the right  the sections of the code with your explanation of what the code does. Submit on paper.


PROBLEM 7 [20 points]

a) Prove that

bayes_cond


b) You are given a coin which you know is either fair or double-headed. You believe
that the a priori odds of it being fair are F to 1; i.e., you believe that the a priori probability of the coin
being fair is F/(F+1) . You now begin to flip the coin in order to learn more. Obviously, if you ever see a tail,
you know immediately that the coin is fair. As a function of F, how many heads in a row would you need to see before becoming convinced that there is a better than even chance that the coin is double-headed?

PROBLEM 8 [20 points]

a) Somebody tosses a fair coin and if the result is heads, you get nothing, otherwise you get $5. How much would you be pay to play this game? What if the win is $500 instead of $5?

b) Suppose you play instead the following game: At the beginning of each game you pay an entry fee of $100. A coin is tossed until a head appears, counting n = the number of tosses it took to see the first head. Your reward is 2^n (that is: if a head appears first at the 4th toss, you get $16). Would you be willing to play this game (why)?

c) Lets assume you answered "yes" at part b (if you did not, you need to fix your math on expected values). What is the probability that you make a profit in one game?

d) [ExtraCredit] After about how many games (estimate) the probability of making a profit overall is bigger than 50% ?



PROBLEM 9 [20 points]

DHS CH2, Pb 43

PROBLEM 9 part 2 [Extra Credit]

a) DHS CH2, Pb 45




PROBLEM 10 [Extra Credit]

b) DHS CH2, Pb 44



PROBLEM 11 [EXTRA CREDIT, requireds independent study of Hidden Markov Models, DHS ch 3.10]

DHS Pb 3.50 (page 154-155)
The standard method for calculating the probability of a sequence in a given HMM is to use the forward probabilities αi(t).