CS6140 Machine Learning

HW2

Make sure you check the syllabus for the due date. Please use the notation adopted in class, even if the problem is stated in the book using a different notation.

We are not looking for very long answers (if you find yourself writing more than 1-2 pages of typed text per problem, you are probably on the wrong track). Try to be concise; also keep in mind that good ideas and explanations matter more than exact details.



PROBLEM 1 [50 points]

Consider the following neural network (left graph), with 8 input units (for data with 8 features), 3 hidden units and 8 output units, and assume the nonlinear functions are all sigmoid.

a) The 8 training data inputs are identical to the outputs, as shown in the table on the right. Implement this network and the backpropagation algorithm to compute all the network weights; you should initialize the weights with nontrivial values (i.e., not values that already minimize the error).

HINT: on the trained network, you should obtain values for the hidden units somewhat similar to the ones shown in the table (up to symmetry). Feel free to make changes to the algorithm that suit your implementation, but briefly document them.
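For concreteness, here is a minimal NumPy sketch of such an 8-3-8 network trained with plain backpropagation on squared error. This is an illustration only, not the required implementation; the learning rate, iteration count, and initialization scale are arbitrary choices.

```python
import numpy as np

# Minimal 8-3-8 autoencoder trained with plain backpropagation on squared error.
# The learning rate, iteration count, and init scale are arbitrary choices.
rng = np.random.default_rng(0)
X = np.eye(8)                        # 8 one-hot training inputs; targets = inputs

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Nontrivial random initialization (not values that already minimize the error)
W1 = rng.normal(scale=0.5, size=(8, 3)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(3, 8)); b2 = np.zeros(8)
lr = 2.0

for step in range(20000):
    H = sigmoid(X @ W1 + b1)         # hidden activations, shape (8, 3)
    Y = sigmoid(H @ W2 + b2)         # reconstructed outputs, shape (8, 8)

    dY = (Y - X) * Y * (1 - Y)       # output-layer deltas for E = 0.5*sum((Y-X)^2)
    dH = (dY @ W2.T) * H * (1 - H)   # hidden-layer deltas (backpropagated)

    W2 -= lr * H.T @ dY;  b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH;  b1 -= lr * dH.sum(axis=0)

print(np.round(H, 2))                # hidden codes, one row per training pattern
```

After training, each row of H should settle near a distinct pattern of values close to 0 and 1, which is the "up to symmetry" behavior the hint refers to.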

b) Since the outputs and inputs are identical for each data point, one can view this network as an encoder-decoder mechanism (this is one of the uses of neural networks). In this context, explain the purpose of the training algorithm. (I expect a rather nontechnical -- but documented -- answer.)

c) Can this encoder-decoder scheme work with 1 hidden unit? Can it work with 2 hidden units? What if there are more than 8 inputs and outputs? Justify your answers mathematically.

Submit a concise report.

PROBLEM 2 [50 points]

In this assignment, you will create a Naive Bayes classifier for detecting e-mail spam, and you will test your classifier on a publicly available spam dataset. You will experiment with three methods for modeling the distribution of features, and you will test your classifier using 10-fold cross-validation.
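As a sketch of the overall pipeline -- under the assumptions that the data file is spambase.data, comma-separated, with the 0/1 label in the last column, and using a Bernoulli feature model with Laplace smoothing as just one of the three feature models you might choose:

```python
import numpy as np

# Sketch: Bernoulli Naive Bayes with Laplace smoothing, 10-fold cross-validation.
# Assumes spambase.data is comma-separated with the 0/1 label in the last column.
data = np.loadtxt("spambase.data", delimiter=",")
X, y = data[:, :-1], data[:, -1].astype(int)
Xb = (X > 0).astype(float)                     # binarize: feature present/absent

rng = np.random.default_rng(0)
folds = np.array_split(rng.permutation(len(y)), 10)

accs = []
for k in range(10):
    test = folds[k]
    train = np.concatenate(folds[:k] + folds[k + 1:])
    logpriors, thetas = [], []
    for c in (0, 1):
        Xc = Xb[train][y[train] == c]
        logpriors.append(np.log(len(Xc) / len(train)))
        thetas.append((Xc.sum(axis=0) + 1) / (len(Xc) + 2))   # Laplace smoothing
    # Log-posterior (up to a constant) for each test point and class
    scores = np.stack([logpriors[c]
                       + Xb[test] @ np.log(thetas[c])
                       + (1 - Xb[test]) @ np.log(1 - thetas[c])
                       for c in (0, 1)], axis=1)
    accs.append(np.mean(scores.argmax(axis=1) == y[test]))

print("10-fold mean accuracy: %.3f" % np.mean(accs))
```

The same cross-validation skeleton can be reused for the other two feature models by swapping out the per-class parameter estimates and the log-likelihood terms.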


PROBLEM 2.1 [Extra Credit]

Run Gaussian Discriminant Analysis on the Spambase data. Use the k folds from the previous problem (1 for testing, k-1 for training, for each fold).
Since you have 57 real-valued features, each Gaussian will have a mean vector with 57 components, and the classes will share a common covariance matrix of size 57x57. Looking at the results, does it appear that the Gaussian assumption (normally distributed data) holds for this particular dataset?
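A minimal sketch of the shared-covariance fit and the resulting (linear) decision rule, assuming X, y, and the folds from the previous sketch are already in scope; the small ridge added to the pooled covariance is an assumption to keep the 57x57 matrix invertible:

```python
import numpy as np

def gda_shared(Xtr, ytr, Xte):
    """Gaussian Discriminant Analysis with a pooled (shared) covariance matrix."""
    d = Xtr.shape[1]
    mus, logpriors = [], []
    S = np.zeros((d, d))
    for c in (0, 1):
        Xc = Xtr[ytr == c]
        mus.append(Xc.mean(axis=0))
        logpriors.append(np.log(len(Xc) / len(Xtr)))
        S += (Xc - mus[c]).T @ (Xc - mus[c])    # within-class scatter
    S = S / len(Xtr) + 1e-6 * np.eye(d)         # pooled covariance + ridge (assumption)
    P = np.linalg.inv(S)
    # With a shared covariance the quadratic terms cancel: the rule is linear in x.
    scores = np.stack([Xte @ P @ mus[c] - 0.5 * mus[c] @ P @ mus[c] + logpriors[c]
                       for c in (0, 1)], axis=1)
    return scores.argmax(axis=1)
```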

PROBLEM 3 [20 points]

a) Prove the conditional form of Bayes' rule:

$$P(A \mid B, C) = \frac{P(B \mid A, C)\, P(A \mid C)}{P(B \mid C)}$$


b) You are given a coin which you know is either fair or double-headed. You believe that the a priori odds of it being fair are F to 1; i.e., you believe that the a priori probability of the coin being fair is F/(F+1). You now begin to flip the coin in order to learn more. Obviously, if you ever see a tail, you know immediately that the coin is fair. As a function of F, how many heads in a row would you need to see before becoming convinced that there is a better-than-even chance that the coin is double-headed?
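One way to organize the bookkeeping (a sketch of the odds-form Bayes update, not a complete solution): after observing n heads in a row,

$$\frac{P(\text{fair} \mid n \text{ heads})}{P(\text{double-headed} \mid n \text{ heads})} = \frac{(1/2)^n}{1^n} \cdot \frac{F}{1} = \frac{F}{2^n},$$

since a double-headed coin produces heads with probability 1 on every flip. The question then reduces to when this ratio first drops below 1.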

PROBLEM 4 [15 points]

DHS CH2, Pb 43

PROBLEM 4 part 2 [Extra Credit]

DHS CH2, Pb 45

PROBLEM 6 [15 points]

Read Prof. Andrew Ng's lecture on ML practice advice. Write a brief summary (1 page) explaining the quantities in the lecture and the advice.

PROBLEM 7 [15 points]

DHS CH2, Pb 44

PROBLEM 8 [15 points]

DHS CH6, Pb 1
Show that if the transfer function of the hidden units is linear, a three-layer network is equivalent to a two-layer one. Explain why, therefore, a three-layer network with linear hidden units cannot solve a non-linearly separable problem such as XOR or n-bit parity.
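The key algebraic step, in generic matrix notation (a sketch; DHS's own symbols may differ): a hidden layer with a linear transfer function composes two affine maps,

$$y = W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\, x + (W_2 b_1 + b_2) = W x + b,$$

so the three-layer network computes exactly what some two-layer network computes. Its decision boundaries are therefore linear, and XOR and n-bit parity are not linearly separable.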

PROBLEM 9 [Extra Credit]

For a real-valued function $f(x_1, x_2, \ldots, x_n)$, the "Hessian" is the matrix of second partial derivatives, $H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$.

Consider the log-likelihood function for logistic regression,

$$\ell(\theta) = \sum_{i=1}^{m} \left[\, y_i \log h_\theta(x_i) + (1 - y_i) \log\bigl(1 - h_\theta(x_i)\bigr) \right], \qquad h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}.$$

Show that its Hessian matrix $H$ is negative semidefinite, i.e., for any vector $z$ it satisfies $z^T H z \le 0$.

Remark: This fact is sometimes written $H \preceq 0$ and implies that the log-likelihood function is concave.

Hint:
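A standard route (an assumption about the intended hint, using the notation above with $h_i = h_\theta(x_i)$): differentiate twice, using $\sigma'(u) = \sigma(u)(1 - \sigma(u))$, to get

$$\nabla_\theta \ell(\theta) = \sum_{i=1}^{m} (y_i - h_i)\, x_i, \qquad H = -\sum_{i=1}^{m} h_i (1 - h_i)\, x_i x_i^T,$$

and then observe that $z^T H z = -\sum_{i=1}^{m} h_i (1 - h_i)\,(x_i^T z)^2 \le 0$.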