MODULE 1: NORMAL-EQ REGRESSION, DECISION TREES - STICKY POINTS

* FEATURE NORMALIZATION
  make sure to normalize the entire column
  it is a good idea to look at your values after normalization
  use the same normalization for both training and test
  save the normalization parameters so the same transform can be reapplied later
  (see the normalization sketch below)

* DATA PARTITION INTO TRAIN/TEST
  split randomly
  typically 90/10 or 80/20
  (see the split sketch below)

* DATA PARTITION: CROSS VALIDATION
  k folds, use k=10
  train on each set of k-1 folds, test on the omitted fold
  requires 10 separate training runs
  average the errors across the folds
  (see the k-fold sketch below)

* DT: INFORMATION GAIN (OR ENTROPY BEFORE AND AFTER SPLIT)
  entropy formula: H = -sum_i p_i log2(p_i) over the class proportions p_i
  entropy is a measure of randomness
  less entropy = less randomness = more consistency = better prediction = less need for splitting the data at the node
  (see the entropy sketch below)

* DT: REGRESSION SPLIT CRITERIA
  least squares, same as the regression error or objective; also the same as the empirical variance
  (see the split-criterion sketch below)

* DT: HOW TO LOOK FOR BEST THRESHOLD
  without trying all possible feature values: sample candidate thresholds by buckets or ranges
  always try all features
  (see the threshold sketch below)

* DT: CACHING “BEST FEATURE/THRESHOLD” NOT ACTUALLY USED
  later, if the same node is tried for splitting, the best feature/threshold is already calculated

* DT: AVOID DEEP TREES
  deep trees can eventually work on very small/focused sets of datapoints, making generalization difficult - thus overfitting
  depth is not a problem if the dataset at the node is still reasonably large

* REGRESSION/NORMALEQ: DERIVATION OPTIONAL
  no big worries if a student doesn't follow the matrix manipulations; it's really for math-loving people

* REGRESSION/NORMALEQ: HOW TO COMPUTE THE PSEUDOINVERSE
  definitely use a procedure/package, don't implement the pseudoinverse yourself
  try to run the pseudoinverse operation separately, perhaps first on smaller matrices
  (see the normal-equations sketch below)

* REGRESSION/NORMALEQ: NUMERICAL STABILITY
  look at the eigenvalues: small eigenvalues can cause trouble
  (see the stability sketch below)

* REGRESSION/NORMALEQ: ADD BIAS “1” COLUMN
  this adds one more dimension for the “free” regression coefficient; total dimensions are d+1, so d+1 regression coefficients
  make sure to also add the 1 column to test sets (better: add it to all data before partitioning)
  (see the normal-equations sketch below)
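
Normalization sketch. A minimal NumPy sketch assuming z-score (mean/std) normalization; the function names are illustrative, not from the course materials. Parameters are fit on the training columns only, saved, and reused on the test set.

```python
import numpy as np

def fit_normalization(X_train):
    # Compute per-column mean and std on the TRAINING data only,
    # and save these parameters for later reuse.
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant columns
    return mu, sigma

def apply_normalization(X, mu, sigma):
    # Apply the SAME saved transform to train, test, and any future data.
    return (X - mu) / sigma
```

After normalizing, it is worth eyeballing the result (e.g., print `X.min(axis=0)` and `X.max(axis=0)`) to confirm the columns look sensible, per the note above.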
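
Split sketch. A sketch of a random train/test partition; `test_fraction=0.2` gives 80/20 and `0.1` gives 90/10. The `seed` parameter is an added assumption for reproducibility.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    # Shuffle indices, then carve off the test portion.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]
```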
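
K-fold sketch. A sketch of k-fold cross validation with k=10, matching the note: k separate training runs, each tested on the omitted fold, errors averaged. `train_fn` and `error_fn` are hypothetical callables standing in for whatever model and error measure the module uses.

```python
import numpy as np

def k_fold_error(X, y, train_fn, error_fn, k=10, seed=0):
    # Shuffle once, split the indices into k folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])          # one of k training runs
        errors.append(error_fn(model, X[test], y[test]))
    return float(np.mean(errors))                      # average error
```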
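
Entropy sketch. A sketch of the entropy formula and the information gain of a split; `labels` is assumed to be a 1-D array of class labels and `left_mask` a boolean array marking the left child.

```python
import numpy as np

def entropy(labels):
    # H = -sum_i p_i * log2(p_i) over the class proportions p_i.
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, left_mask):
    # Entropy before the split minus the size-weighted entropy after.
    left, right = labels[left_mask], labels[~left_mask]
    n = len(labels)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - after
```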
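
Split-criterion sketch. A sketch of the regression split criterion from the note: the summed squared deviation from the node mean, which equals n times the empirical variance; a lower total over the two children means a better split.

```python
import numpy as np

def squared_error(y):
    # Least-squares objective at a node: sum of squared deviations
    # from the mean, i.e. n times the empirical variance.
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def regression_split_cost(y, left_mask):
    # Combined least-squares error of the two children after the split.
    return squared_error(y[left_mask]) + squared_error(y[~left_mask])
```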
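
Threshold sketch. A sketch of sampling candidate thresholds by buckets, here interpreted as quantile boundaries; the bucket count is an illustrative choice. This avoids trying every distinct feature value, and is repeated for all features.

```python
import numpy as np

def candidate_thresholds(values, n_buckets=20):
    # Interior quantile boundaries serve as candidate thresholds;
    # np.unique drops duplicates from ties in the data.
    qs = np.linspace(0.0, 1.0, n_buckets + 1)[1:-1]
    return np.unique(np.quantile(values, qs))
```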
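
Normal-equations sketch. A sketch covering the bias column and the pseudoinverse fit: add the “1” column to all data before partitioning, then use the library pseudoinverse (np.linalg.pinv) rather than implementing it yourself.

```python
import numpy as np

def add_bias_column(X):
    # One extra "1" dimension for the free coefficient: d+1 columns,
    # hence d+1 regression coefficients. Do this BEFORE partitioning,
    # so train and test sets both carry the column.
    return np.hstack([np.ones((len(X), 1)), X])

def normal_eq_fit(X, y):
    # w = pinv(X) @ y, using the packaged pseudoinverse.
    return np.linalg.pinv(X) @ y
```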
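
Stability sketch. One way to "look at the eigenvalues" as the note suggests: inspect the eigenvalues of X^T X and watch for very small ones. The printed report format is my own choice, not prescribed by the notes.

```python
import numpy as np

def stability_check(X):
    # eigvalsh returns the eigenvalues of the symmetric matrix X^T X
    # in ascending order; small ones signal numerical trouble.
    eig = np.linalg.eigvalsh(X.T @ X)
    print("smallest eigenvalue:", eig[0])
    print("largest eigenvalue: ", eig[-1])
```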