# HW4

Make sure you check the syllabus for the due date. Please use the notations adopted in class, even if the problem is stated in the book using a different notation.

### PROBLEM 1 [50 points]

UCI datasets: AGR, BAL, BAND, CAR, CMC, CRX, MONK, NUR, TIC, VOTE. (These are archives which I downloaded a while ago. For more details and more datasets visit http://archive.ics.uci.edu/ml/)

The relevant files in each folder are only two:
* .config : the number of datapoints, the number of discrete attributes, and the number of continuous (numeric) attributes. For the discrete attributes, the possible values are listed in order, one line per attribute. The next line in the config file gives the number of classes and their labels.
* .data : the datapoints, listed one per line following the .config conventions; the last column holds the class labels.

You should write a reader that, given the .config file, reads the data from the .data file.

Use K-fold cross-validation:
* split the data into K folds
* run your algorithm K times, each time training on K-1 of the folds and testing on the remaining one
* average the error across the K runs.
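The splitting and averaging above can be sketched as follows; this is a minimal illustration assuming a numpy data matrix, and the helper names (`kfold_indices`, `cross_validate`) are my own, not part of the handout:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cross_validate(train_and_test, X, y, k=10):
    """train_and_test(X_tr, y_tr, X_te, y_te) -> error rate on the test fold.
    Returns the error averaged over the k runs."""
    folds = kfold_indices(len(y), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(train_and_test(X[train_idx], y[train_idx],
                                     X[test_idx], y[test_idx]))
    return float(np.mean(errors))
```

Any learner that reports a test-fold error rate can be plugged in as `train_and_test`.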

A. Implement the Adaboost algorithm with decision stumps as weak learners, as described in class. Run it on the UCI data and report the results. The datasets CRX and VOTE are required.
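A minimal sketch of Adaboost with exhaustive-search decision stumps, assuming a numeric numpy feature matrix and labels in {-1, +1} (discrete attributes would first need encoding); follow the exact version described in class, this is only one common formulation:

```python
import numpy as np

def best_stump(X, y, w):
    """Pick the (feature, threshold, sign) stump minimizing weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] <= thr, 1, -1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, thr, s)
    return best

def adaboost(X, y, T=50):
    """Adaboost with decision stumps; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    H = []
    for _ in range(T):
        err, j, thr, s = best_stump(X, y, w)
        err = max(err, 1e-12)                      # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)
        pred = s * np.where(X[:, j] <= thr, 1, -1)
        w *= np.exp(-alpha * y * pred)             # up-weight the mistakes
        w /= w.sum()
        H.append((alpha, j, thr, s))
    return H

def boosting_score(H, X):
    """F(x) = sum_t alpha_t h_t(x); sign(F(x)) is the final classifier."""
    F = np.zeros(len(X))
    for alpha, j, thr, s in H:
        F += alpha * s * np.where(X[:, j] <= thr, 1, -1)
    return F
```

Keeping `boosting_score` separate is convenient later: Problem 3 reuses the raw F(x) values, not just their sign.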

B. Run the algorithm for each of the required datasets using c% of the datapoints chosen randomly for training, for several c values: 5, 10, 15, 20, 30, 50, 80. Test on a fixed fold (not used for training). For statistical significance, you can repeat the experiment with different randomly selected data or you can use cross-validation.

C (extra credit). Run boosting on the other datasets. Some of them are multiclass, so you will need a multiclass boosting implementation. The easiest "multiclass" approach is to run binary boosting one-vs-the-others separately for each class.
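The one-vs-the-others scheme can be written generically, as a sketch; here `train` would be your binary boosting trainer and `score` its real-valued F(x), but any binary learner with a margin works (the demo learner in the usage note is hypothetical, just for illustration):

```python
import numpy as np

def one_vs_rest(train, score, X, y, classes):
    """Train one binary model per class (class c -> +1, all others -> -1).
    train(X, y_pm) fits a binary model on +/-1 labels;
    score(model, Xq) returns its real-valued scores.
    Prediction picks the class whose model scores highest."""
    models = {c: train(X, np.where(y == c, 1, -1)) for c in classes}

    def predict(Xq):
        scores = np.stack([score(models[c], Xq) for c in classes])
        return np.asarray(classes)[np.argmax(scores, axis=0)]

    return predict
```

For example, a toy nearest-mean "learner" (`train` returns the positive-class mean, `score` the negative distance to it) already recovers the right classes on well-separated data.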

### PROBLEM 2 [Extra Credit]

Do PB1 with weak learners being full decision trees instead of stumps. The final classifier is referred to as "boosted trees". Compare the results.

Hints: there are two possible ways of doing this.

Option 1. The weights are taken into account when we decide the best split. This requires you to change the decision tree training algorithm.

Option 2. We can simulate the weights by sampling. In each round, when we want to train the decision tree, we construct a new data set based on the old one: we sample from the old data set k times with replacement, where each draw picks datapoint x_i with probability w_i. k can be as large as the size of the old data set, or smaller; we only need enough sampled datapoints for a decision tree to be trained reliably. Then we train a decision tree on the new data set without considering weights, since the weights are already accounted for by the sampling. In this way, you don't need to modify the decision tree training algorithm; more generally, any weak learner, whether it can handle weights naturally or not, can be trained like this. Once the decision tree is trained, the new data set is discarded. The only use of the newly constructed data set is in building the tree; any other computation is based on the original data set.
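Option 2 above is just weighted sampling with replacement; a minimal sketch (the function name is mine):

```python
import numpy as np

def resample_by_weights(X, y, w, k=None, seed=0):
    """Simulate example weights by sampling: draw k points with replacement,
    datapoint i chosen with probability proportional to w_i. The weak learner
    is then trained on the returned (X_new, y_new) with no weight handling."""
    rng = np.random.default_rng(seed)
    k = len(y) if k is None else k
    idx = rng.choice(len(y), size=k, replace=True, p=w / w.sum())
    return X[idx], y[idx]
```

The resampled set is used only to fit the tree for the current round and then discarded, exactly as the hint prescribes.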

### PROBLEM 3 [50 points]

Run your code from PB1 to perform active learning. Specifically:

- iterate M episodes: train Adaboost for T rounds; from the datapoints not in the training set, select the 2% that are closest to the separation surface (boosting score F(x) closest to 0) and add these to the training set (with labels). Repeat until the size of the training set reaches 50% of the data.
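The selection step of each episode can be sketched as below; this assumes you already have the boosting scores F(x) for the pooled (non-training) points, and the function name is my own:

```python
import numpy as np

def select_closest_to_boundary(F_scores, pool_idx, frac=0.02):
    """Return the indices of the frac (e.g. 2%) pool points whose boosting
    score F(x) is closest to 0, i.e. closest to the separation surface."""
    k = max(1, int(np.ceil(frac * len(pool_idx))))
    order = np.argsort(np.abs(F_scores))   # smallest |F(x)| first
    return pool_idx[order[:k]]
```

The returned indices are then moved (with their labels) from the pool into the training set before the next episode.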

How does the performance improve as the training set grows? Compare the performance of the Adaboost algorithm on a c% randomly selected training set with a c% actively-built training set for several values of c: 5, 10, 15, 20, 30, 50.

### PROBLEM 4 [50 points]

Run Boosting with ECOC functions on the 20Newsgroup dataset with unigram features extracted by Cheng. As an extra credit problem, you can also try the Letter Recognition Data Set.

ECOC are a better multiclass approach than one-vs-the-rest. Each ECOC function partitions the multiclass dataset into two labels; then Boosting runs as a binary problem. Having K ECOC functions means having K binary boosting models. At prediction time, each of the K models predicts 0/1, so the prediction is a "codeword" of length K, e.g. 11000110101..., from which the actual class has to be identified.
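A minimal sketch of the coding and decoding steps, assuming a random coding matrix and nearest-codeword (Hamming distance) decoding; other codes and decoders exist, and the function names are mine:

```python
import numpy as np

def ecoc_codebook(n_classes, K, seed=0):
    """Random coding matrix: row c is the K-bit codeword for class c.
    Each column defines one binary partition of the classes."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=(n_classes, K))

def ecoc_decode(pred_bits, codebook):
    """Map each predicted K-bit codeword (one row per datapoint) to the
    class whose codebook row is nearest in Hamming distance."""
    dists = (pred_bits[:, None, :] != codebook[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)
```

Training fills `pred_bits` by running one binary boosting model per column of the codebook; decoding then tolerates a few bit errors, which is the point of ECOC.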

### PROBLEM 5 [Extra Credit]

Run gradient boosting with regression stumps/trees on the 20Newsgroup dataset.
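A minimal sketch of gradient boosting with regression stumps under squared loss, where each stump fits the current residuals (the negative gradient); the learning rate, function names, and loss choice are my assumptions, not part of the assignment:

```python
import numpy as np

def fit_regression_stump(X, r):
    """Least-squares stump: the (feature, threshold) split minimizing the
    squared error of a piecewise-constant fit to the residuals r."""
    best = (np.inf, 0, 0.0, 0.0, 0.0)   # (sse, feature, threshold, left_val, right_val)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:   # last value would leave the right side empty
            left = X[:, j] <= thr
            vl, vr = r[left].mean(), r[~left].mean()
            sse = ((r[left] - vl) ** 2).sum() + ((r[~left] - vr) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, thr, vl, vr)
    return best[1:]

def gradient_boost(X, y, T=100, lr=0.1):
    """Squared-loss gradient boosting: each stump fits y - F(x)."""
    F = np.zeros(len(y))
    model = []
    for _ in range(T):
        j, thr, vl, vr = fit_regression_stump(X, y - F)
        F += lr * np.where(X[:, j] <= thr, vl, vr)
        model.append((j, thr, vl, vr))
    return model, F
```

For classification on 20Newsgroup you would replace the squared loss with a classification loss (e.g. logistic), but the residual-fitting loop keeps the same shape.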

### PROBLEM 6 [20 points]

What is the VC dimension of the following hypothesis classes:

a) unions of two rectangles

b) circles

c) triangles

d) the multidimensional "sphere" given by f(x) = sign[(x-c)·(x-c) - b] in the Euclidean space with m dimensions, $\mathbb{R}^m$. Justify your answers!