CS6140 Machine Learning

HW4B Ensembles: Gradient Boosting, Bagging, VC dimension

Make sure you check the syllabus for the due date. Please use the notations adopted in class, even if the problem is stated in the book using a different notation.




PROBLEM 5 [optional, no credit]

What is the VC dimension of each of the following hypothesis classes? (You may define the class either so that the plus side must be inside the shape, or so that either side may be inside.)

a) unions of two axis-aligned rectangles (edges vertical/horizontal, not angled)

b) circles

c) triangles

d) multidimensional "spheres" given by f(x) = sign[(x − c)·(x − c) − b] in m-dimensional Euclidean space. Justify your answers!

For this problem we care more about the existential side of VC dimension: if K is the answer, show that there is a set of K points that can be shattered. You are not required to formally prove that no set of K+1 points can be shattered. (Recall the canonical example: for single axis-aligned rectangles in the plane, four points arranged in a diamond can be labeled in all 2^4 ways, so the VC dimension is at least 4.)

PROBLEM 6 [50 points] Bagging

Bagging setup:
  • Training: Train your from-scratch Decision Tree classifier separately T=50 times. Each Decision Tree is trained on a sample-with-replacement set drawn from the training dataset (every Decision Tree has its own sampled training set; see the sketch after this list). You can limit the depth of the tree to simplify computation.
  • Sampling with replacement: Say there are N datapoints in the training set. Sampling with replacement, uniformly, for a sample of size N works as follows: in a sequence, independently of each other, select randomly and uniformly from the training datapoints N times. Once a datapoint is selected, it remains available for further sampling (hence the "with replacement" methodology). Each sampled training set will have N datapoints; some points will be sampled more than once (duplicates) while others will not be sampled at all.
  • Testing: for a test datapoint, run all T Decision Trees and average their predictions to obtain the final prediction.
  • Run bagging on the Spambase dataset. Compare the results with boosting.
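
A minimal sketch of this setup, assuming numpy arrays, 0/1 labels (as in Spambase), and scikit-learn's DecisionTreeClassifier standing in for the from-scratch tree; all names (X_train, y_train, T, max_depth) are placeholders:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X_train, y_train, T=50, max_depth=5, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X_train)
    trees = []
    for _ in range(T):
        # Sample N indices uniformly with replacement: duplicates are kept,
        # and on average ~37% of the points are left out of each sample.
        idx = rng.integers(0, N, size=N)
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(X_train[idx], y_train[idx])
        trees.append(tree)
    return trees

def bagging_predict(trees, X_test):
    # Average the T predictions; threshold at 0.5 for 0/1 labels.
    votes = np.mean([tree.predict(X_test) for tree in trees], axis=0)
    return (votes >= 0.5).astype(int)
```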



PROBLEM 7 [50 points] Gradient Boosted Trees for Regression

Run gradient boosting with regression trees on the housing dataset. Essentially, repeat the following procedure for i = 1:10 rounds on the training set. Start in round i=1 with labels Y_x as the original labels. In each round i, fit a regression tree h_i to the current labels, then update Y_x ← Y_x − h_i(x): the residuals become the labels for round i+1.

The overall prediction function is the sum of the trees. Report the training and testing error.
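
A minimal sketch of that loop under squared loss, with scikit-learn's DecisionTreeRegressor standing in for your own regression tree (X_train, y_train, rounds, max_depth are placeholder names):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbt_train(X_train, y_train, rounds=10, max_depth=2):
    residuals = y_train.astype(float)        # round i=1: the original labels
    trees = []
    for _ in range(rounds):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X_train, residuals)         # fit the current labels
        residuals = residuals - tree.predict(X_train)  # labels for round i+1
        trees.append(tree)
    return trees

def gbt_predict(trees, X):
    # The overall prediction is the sum of the individual trees' outputs.
    return np.sum([tree.predict(X) for tree in trees], axis=0)
```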


PROBLEM 8 [Extra Credit]

Run Boosting with ECOC functions on the Letter Recognition Data Set (also a multiclass dataset).
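
A hedged sketch of the ECOC wrapper: each of the 26 letter classes gets a random ±1 codeword, one binary booster is trained per bit, and a test point is assigned the class whose codeword agrees with the predicted bits the most. It assumes integer labels 0..25, a random code with L=20 bits (an arbitrary choice), and scikit-learn's AdaBoostClassifier standing in for your boosting code:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def ecoc_train(X_train, y_train, n_classes=26, L=20, seed=0):
    rng = np.random.default_rng(seed)
    code = rng.choice([-1, 1], size=(n_classes, L))  # random code matrix
    boosters = []
    for j in range(L):
        bits = code[y_train, j]  # relabel each point by bit j of its class codeword
        clf = AdaBoostClassifier(n_estimators=100)
        clf.fit(X_train, bits)
        boosters.append(clf)
    return code, boosters

def ecoc_predict(code, boosters, X_test):
    bits = np.stack([clf.predict(X_test) for clf in boosters], axis=1)
    # Hamming-style decoding: the class whose codeword agrees on the most bits.
    return np.argmax(bits @ code.T, axis=1)
```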


PROBLEM 9 [Extra Credit] Boosted Decision Trees

Do PB1 with weak learners being full decision trees instead of stumps. The final classifier is referred to as "boosted trees". Compare the results. Hint: there are two possible ways of doing this.
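
One plausible reading of the hint is that each tree can consume the boosting weights either (a) directly, as instance weights inside the splitting criterion, or (b) indirectly, by resampling the training set proportionally to the weights each round. A minimal sketch of the resampling route, assuming ±1 labels and scikit-learn's tree in place of your own (X, y, T are placeholders):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_full_trees(X, y, T=20, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    w = np.full(N, 1.0 / N)
    trees, alphas = [], []
    for _ in range(T):
        # Resampling route: draw a bootstrap sample according to the current
        # weights, so each full tree sees a different dataset.
        idx = rng.choice(N, size=N, p=w)
        tree = DecisionTreeClassifier()      # full tree (no depth limit)
        tree.fit(X[idx], y[idx])
        pred = tree.predict(X)
        eps = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)
        w *= np.exp(-alpha * y * pred)       # up-weight mistakes
        w /= w.sum()
        trees.append(tree)
        alphas.append(alpha)
    return trees, alphas

def boosted_trees_predict(trees, alphas, X):
    scores = sum(a * t.predict(X) for a, t in zip(trees, alphas))
    return np.sign(scores)
```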


PROBLEM 10 [Extra Credit] Rankboost

Implement the RankBoost algorithm following the RankBoost paper and run it on TREC queries.
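
A minimal sketch of the core RankBoost loop with binary threshold weak rankers h(x) = 1[x_f > theta], maintaining a distribution over "crucial pairs" (i should rank above j). The pair construction, candidate thresholds, and names here are simplifying assumptions, not the paper's full setup; for TREC you would build pairs per query and subsample (the enumeration below is quadratic):

```python
import numpy as np

def rankboost(X, y, T=20):
    # Crucial pairs (i, j): instance i should rank above instance j.
    I, J = np.nonzero(y[:, None] > y[None, :])
    D = np.full(len(I), 1.0 / len(I))        # distribution over pairs
    ranker = []
    for _ in range(T):
        best = (0.0, 0, 0.0)
        for f in range(X.shape[1]):
            for theta in np.unique(X[:, f]):
                h = (X[:, f] > theta).astype(float)
                r = np.sum(D * (h[I] - h[J]))  # weighted agreement with order
                if abs(r) > abs(best[0]):
                    best = (r, f, theta)
        r, f, theta = best
        r = np.clip(r, -0.999, 0.999)        # guard the log below
        alpha = 0.5 * np.log((1 + r) / (1 - r))
        h = (X[:, f] > theta).astype(float)
        D *= np.exp(alpha * (h[J] - h[I]))   # up-weight misordered pairs
        D /= D.sum()
        ranker.append((alpha, f, theta))
    return ranker  # score(x) = sum of alpha * 1[x[f] > theta]
```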


PROBLEM 11 [Extra Credit] Gradient Boosted Trees

Run gradient boosting with regression stumps/trees on the 20 Newsgroups dataset.