CS1500 Algorithms and Data Structures for Engineering, FALL 2012

HW5 Decision Tree


Write a non-recursive DFS Traversal function.


A. In machine Learning, a dataset consists on records (or objects, or datapoints, or instances) each organized with features (or attributes). For example: if each datapoint is a house, the features might be the square ft, number of bedrooms, year of construction, purchase price, etc. A special attribute denoted "label" or "target" forms usually the goal of the classification: knowing the other attributes, the task is to predict the label.
Read data form here (coming from here) into a matrix M such that M[i][j] is the j-th attribute value of datapoint i. There are 178 records, each with 13 attributes - so your M should be about 178 rows and 13 columns. The first attribute is the label, with three possible label values : 1,2 or 3.

B. Implement a Decision Tree Structure (Machine Learning).  The tree nodes are like the one in class (parent, lchild, rchild pointers), but the data consists in an array of 178 values either 0 or 1 indicating which datapoints from the set are "present" at the node. Each node also contains a value, an indice of one of the 13 attributes, and a real value threshold.
class treenode{
    short data[178];
    double value;
    int splitfeature;
    double splitthreshold;
    treenode* parent;
    treenode* lchild;
    treenode* rchild;
At root, all datapoints are present so the data array at root has all  178 values 1. When a node has children, we call that it "split": each datapoint present at the parent node has to go either left or right. (so if the root has two children nodes, some datapoints will be present at leftchild but on not at the rightchild and viceversa). A split is characterized by an attribute A (one of the 13) and a threshold T: out of the datapoints present at the node, the ones for which the attribute A has a value smaller than T go "present" at the leftchild (and not present at rightchild); similarly the datapoints that have attribute A value bigger than T go "present" at the right child.
To get you started, for each split read an attribute A and a threshold T from the keyboard, then perform the split at the current node.

(EXTRA CREDIT) C. The goal of the tree so to create splits such that the children nodes are less ambiguous on the label attribute than the parent node. That is, say one node has 30 datapoints, 10 labeled "1", 10 labeled "2" and 10 labeled "3". A "good split" example (by a certain attribute A and threshold T) is a split that puts on the leftchild the 10 datapoints labeled "1" and at the rightchild the 20 labeled "2" or "3". If a node has datapoints of only one label, then no split is necessary.
Your task is to greedily choose the attribute A and threshold T at each node such that the tree becomes less ambiguous as fast as possible ( each leaf node is dominated by a particular label)