# HW1

Make sure you check the syllabus for the due date. Please use the notations adopted in class, even if the problem is stated in the book using a different notation.

We are not looking for very long answers (if you find yourself writing more than one or two pages of typed text per problem, you are probably on the wrong track). Try to be concise; also keep in mind that good ideas and explanations matter more than exact details.

### PROBLEM 1 [60 points]

Housing data, training and testing sets (description). The last column contains the labels. For both subproblems, submit a report with a plot (if necessary), results, and comments. DO NOT SUBMIT CODE, but have it ready for a demo if we ask. The report is expected to be about 2 pages or so (be concise). Generally, the grading on programming assignments will be done during the demo.

a) Using the housing data, build a decision tree (or regression tree) from the training set. Since the features are numeric, you will need to use a threshold mechanism for the splits. How well does the tree do on the testing set?
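
For part a), one common way to pick a threshold split is to try midpoints between consecutive feature values and keep the split that most reduces the squared error. A minimal NumPy sketch (the helper name `best_threshold_split` is illustrative, not a required interface):

```python
import numpy as np

def best_threshold_split(X, y):
    """Find the (feature, threshold) pair that most reduces the sum of
    squared errors, trying midpoints between consecutive unique values."""
    best = (None, None, np.inf)  # (feature index, threshold, SSE)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2:  # candidate midpoints
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            sse = ((left - left.mean()) ** 2).sum() \
                + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, t, sse)
    return best
```

Recursing on the two resulting subsets (with a stopping rule such as a minimum node size) gives the full tree.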

b) Using the housing data, apply regression on the training set to find a linear fit to the labels. Implement both the gradient descent regression and the exact linear algebra solution. Compare the training and testing errors (sum of squared differences between predictions and actual labels).

TIP: You can try to normalize the columns (features) if the gradient descent does not seem to converge to the exact solution. Do not normalize the labels.
TIP2: The learning rate necessary for gradient descent to work might be smaller than you think (0.0001 or so); better yet, try a variable learning rate that changes across iterations.
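
For part b), the comparison might be structured like this (a NumPy sketch under illustrative names and a fixed learning rate; adapt to your own setup):

```python
import numpy as np

def fit_exact(X, y):
    """Exact least-squares weights via the normal equations
    w = (X^t X)^{-1} X^t y, with a bias column appended."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

def fit_gd(X, y, lr=1e-4, iters=100_000):
    """Batch gradient descent on the sum-of-squares error."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        w -= lr * Xb.T @ (Xb @ w - y)  # gradient of 0.5 * ||Xb w - y||^2
    return w
```

With a small enough learning rate the two solutions should agree; if gradient descent diverges, normalize the feature columns first, as the tip above suggests.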

### PROBLEM 2 [20 points]

DHS chapter 8, Pb 1. An arbitrary decision tree might have repeated queries (splits) on some root-to-leaf paths. Prove that there exists an equivalent decision tree with only distinct splits on each path.

DHS chapter 8,
a) Prove that for any arbitrary tree, with possibly unequal branching ratios throughout, there exists a binary tree that implements the same classification functionality.
b) Consider a tree with just two levels: a root node connected to B leaf nodes (B >= 2). What are the upper and lower limits on the number of levels in a functionally equivalent binary tree, as a function of B?
c) As in b), what are the upper and lower limits on the number of nodes in a functionally equivalent binary tree?

### PROBLEM 3 [15 points]

DHS chapter 8,
Consider training a binary decision tree using entropy splits.
a) Prove that the decrease in entropy from a split on a binary yes/no feature can never be greater than 1 bit.
b) Generalize this result to the case of arbitrary branching B > 1.
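
Not a substitute for the proof, but a quick numeric sanity check of the claim (helper names are my own): the gain from a binary split reaches exactly 1 bit when the split perfectly separates two equally likely classes, and no test case should ever exceed it.

```python
import numpy as np

def entropy_bits(counts):
    """Shannon entropy (in bits) of a discrete distribution given by counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent_counts, child_counts):
    """Entropy decrease of a split: H(parent) minus the weighted child entropies."""
    total = float(np.sum(parent_counts))
    gain = entropy_bits(parent_counts)
    for c in child_counts:
        gain -= (np.sum(c) / total) * entropy_bits(c)
    return gain
```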

### PROBLEM 4 [15 points]

DHS chapter 5 (for this problem in particular, it is better if you first read DHS chapter 5).
The figure below illustrates the two most popular methods for designing a c-category classifier from linear boundary segments. Another method is to save the full (c choose 2) linear ωi/ωj boundaries and classify any point by taking a vote based on all these boundaries. Prove whether the resulting decision regions must be convex. If they need not be convex, construct a non-pathological example yielding at least one non-convex decision region.

### PROBLEM 5 [Extra Credit]

DHS chapter 5
A classifier is said to be a piecewise linear machine if its discriminant functions have the form

g_i(x) = max_{j = 1,...,n_i} g_ij(x),  i = 1,...,c

where

g_ij(x) = w_ij^t x + w_ij0,  j = 1,...,n_i
(a) Indicate how a piecewise linear machine can be viewed in terms of a linear machine for classifying subclasses of patterns.
(b) Show that the decision regions of a piecewise linear machine can be non convex and even multiply connected.
(c) Sketch a plot of g_ij(x) for a one-dimensional example in which n1 = 2 and n2 = 1 to illustrate your answer to part (b).

### PROBLEM 6 [Extra Credit]

DHS chapter 5,
The convex hull of a set of vectors x_i, i = 1,...,n is the set of all vectors of the form

x = α_1 x_1 + α_2 x_2 + ... + α_n x_n
where the coefficients αi are nonnegative and sum to one. Given two sets of vectors, show that either they are linearly separable or their convex hulls intersect. (Hint: Suppose that both statements are true, and consider the classification of a point in the intersection of the convex hulls.)

### PROBLEM 7 [10 points]

Write down explicit formulas for the normal equations solution presented in class for the case of one input dimension.

(Essentially, assume the data is (x_i, y_i), i = 1, 2, ..., m, and you are looking for h(x) = ax + b that realizes the minimum mean squared error. The problem asks you to write down explicit formulas for a and b.)

HINT: Do not simply copy the formulas from here (but do read the article): either take the general formula derived in class and do the calculations (inverse, multiplication, transpose) for one dimension, or derive the formulas for a and b from scratch; in either case, show the derivations. You can compare your final formulas with the ones linked above.
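
Once you have derived a and b, a quick way to sanity-check them numerically (this checker and its use of `np.polyfit` are a suggestion, not part of the assignment):

```python
import numpy as np

def check_fit(x, y, a, b, tol=1e-8):
    """Compare hand-derived slope a and intercept b against NumPy's
    degree-1 least-squares fit; True if both agree within tol."""
    a_np, b_np = np.polyfit(x, y, 1)  # coefficients, highest degree first
    return abs(a - a_np) < tol and abs(b - b_np) < tol
```

Plug your formulas into a small synthetic dataset where the answer is known and confirm the checker passes.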

### PROBLEM 8 [15 points]

a) Somebody tosses a fair coin, and if the result is heads, you get nothing; otherwise you get \$5. How much would you pay to play this game? What if the win is \$500 instead of \$5?

b) Suppose you instead play the following game: at the beginning of each game you pay an entry fee of \$100. A coin is tossed until a head appears; let n be the number of tosses it took to see the first head. Your reward is \$2^n (that is, if a head first appears on the 4th toss, you get \$16). Would you be willing to play this game (why)?

c) Let's assume you answered "yes" at part b) (if you did not, you need to fix your math on expected values). What is the probability that you make a profit in one game?

d) [Extra Credit] After about how many games (estimate) does the probability of making a profit overall become bigger than 50%?
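
A Monte Carlo sketch can help you sanity-check your answers to b) and c) before committing to the math (the \$100 entry fee is from the problem statement; the function names are illustrative):

```python
import random

def one_game(rng):
    """Toss a fair coin until the first head; the reward is 2**n dollars,
    where n is the number of tosses it took."""
    n = 1
    while rng.random() < 0.5:  # tails: toss again
        n += 1
    return 2 ** n

def profit_probability(entry_fee=100, trials=200_000, seed=0):
    """Monte Carlo estimate of P(reward > entry_fee) for a single game."""
    rng = random.Random(seed)
    wins = sum(one_game(rng) > entry_fee for _ in range(trials))
    return wins / trials
```

The simulated single-game profit probability should match whatever closed-form value you derive for part c).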

### PROBLEM 9 [Extra Credit]

With the notation used in class (and notes), prove that