CS6140 Machine Learning

HW3

Make sure you check the syllabus for the due date. Please use the notations adopted in class, even if the problem is stated in the book using a different notation.



PROBLEM 1 [20 points]

Print this pdf of the EM code demoed in class. (Alternatively, you can print the code yourself from an editor).
Annotate  on the right  the sections of the code with your explanation of what the code does. Submit on paper.

PROBLEM 2 [20 points]

    A.Generate mixture data (coin flips). Pick a p,r,pi as in the EM example discussed in class (or in notes). Say p=.75, r=.4, pi=.8, but you should try this for several sets of values. Generate the outcome of the coin experiment by first picking a coin (pi probability for first coin, 1-pi probability for the second coin), then flip that coin K times (use K=10) with probability of head (p if first coin is picked,  r if the second coin is picked) and finally write down a 1 if heads are seen, 0 if tails. Repeat this M=1000 times or more; so your outcome is a stream of M sequences of K flips : (1001110001; 0001110001; 1010100101 etc)

    B.Infer parameters from data.
Now using the stream of 1 and 0 observed, recover p,r,pi using the EM algorithm; K is known in advance. Report in a table the parameter values found by comparison with the ones used to generate data; try several sets of (p,r,pi). Here are some useful notes,  and  other readings (1 , 2 , 3 , 4) for the coin mixture.

    C. [Extra Credit]
Repeat A and B with C coins instead of two. You will need more mixture parameters.

PROBLEM 3 [50 points]

UCI datasets: AGR, BAL, BAND, CAR, CMC, CRX, MONK, NUR, TIC, VOTE. (These are archives which I downloaded a while ago. For more details and more datasets visit http://archive.ics.uci.edu/ml/)

The relevant files in each folder are only two:  
    * .config : # of datapoints, number of discrete attributes, # of continuous (numeric) attributes. For the discrete ones, the possible values are provided, in order, one line for each attribute. The next line in the config file is the number of classes and their labels.
    * .data: following the .config convention the datapoints are listed, last column are the class labels.
You should write a reader that given the .config file, reads the data from the .data file.

Use K-fold cross-validation :
* split into K folds
* run your algorithm K times each time training on K-1 of them and testing on the remaining one
* average the error across the K runs.

A. Implement the Adaboost algorithm with decision stumps as weak learners, as described in class. Run it on the MCI data and report the results. The datasets CAR, CRX, VOTE are required, the others are Extra Credit.

B. Run the algorithm for each dataset using c% of the datapoints chosen randomly for training, for several c values: 5, 10, 15, 20, 30, 50, 80. Test on a fixed fold (not used for training). For statistical significance, you can repeat the experiment with different randomly selected data or you can use cross-validation.


PROBLEM 3 [EXTRA CREDIT]

Do PB1 with weak learners being  full decision trees  instead of stumps. The final classifier is referred as "boosted trees". Compare the results.

PROBLEM 4 [30 points]

Run your code from PB3 to perform active learning. Specifically:

- start with a training set of about 5% of the data (selected randomly)

- iterate M episodes: train the Adaboost for T rounds; from the datapoints not in the training set, select the one that is closest to the separation surface (smallest discriminant) and add it to the training set. Repeat until the size of the training set reaches 50% of the data.

How is the performance improving with the training set increase? Compare the performance of the Adaboost algorithm on the c% randomly selected training set with c% actively-built training set for several values of c : 5, 10, 15, 20, 30, 50. Perhaps you can obtain results like these


PROBLEM 5 [20 points]

What is the VC dimension for the class of hyphothesis of:

a) unions of two rectangles

b) circles

c) triangles

d) multidimensional "sphere" given by f(x) = sign [(x-c)(x-c) -b] in the Euclidian space with m dimensions m . Justify your answers !





PROBLEM 6 [EXTRA CREDIT, requireds independent study of Hidden Markov Models, DHS ch 3.10]

DHS Pb 3.50 (page 154-155)
The standard method for calculating the probability of a sequence in a given HMM is to use the forward probabilities αi(t).