Here we will work with the Spambase dataset from HW02, testing your implementations using Fold 1 as described in HW02.
Once you have computed mu and sd (separately for each feature), the z-score for any feature value x is simply
(x - mu)/sd.
For example, Feature 1 has a mean of 0.10455 and a standard deviation of 0.30536. If a particular example has a Feature 1 value of 0.5, the corresponding z-score would be
(0.5 - 0.10455)/0.30536 = 1.295.
Note that the z-score is simply the number of standard deviations a feature value is above or below the mean value of that feature.
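The per-feature z-scoring above can be sketched in a few lines of Python; the function name and the list-of-lists data representation here are illustrative choices, not requirements (note also that this uses the population standard deviation — whichever convention you pick, apply it consistently to training and test data):

```python
def zscore_columns(rows):
    """Z-score each feature (column) using its mean and standard deviation.

    rows: list of feature vectors (lists of floats).
    Returns the z-scored rows plus the per-feature means and
    (population) standard deviations, so the same mu and sd can be
    reused to transform test data.
    """
    m = len(rows)
    n = len(rows[0])
    mus = [sum(r[j] for r in rows) / m for j in range(n)]
    sds = [(sum((r[j] - mus[j]) ** 2 for r in rows) / m) ** 0.5
           for j in range(n)]
    zs = [[(r[j] - mus[j]) / sds[j] for j in range(n)] for r in rows]
    return zs, mus, sds
```

Remember to compute mu and sd on the training fold only, and then apply those same values when z-scoring the test fold.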
Note: The Spambase data set consists of a large block of spam messages followed by a large block of non-spam messages. Depending on how you create your folds, your training and testing data sets will also have this property. For any batch learning algorithm, this is not a problem, since the learner or updates will look at the entire data set at once (consider, for example, batch gradient descent). However, for an on-line learner (such as stochastic gradient descent), this can be problematic, since the algorithm will process (and make updates) on all positive examples followed by all negative examples---convergence will likely be hindered. To improve convergence for stochastic gradient descent, you may consider randomizing the lines in your training set (i.e., after you have created Fold 1). There are a number of Unix tools that can accomplish this easily, such as rl, sort -R, and various short bash scripts.
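If you would rather shuffle within your own code than with a Unix tool, a small Python sketch follows; the file paths and the fixed seed are illustrative (seeding just makes the shuffle reproducible between runs):

```python
import random

def shuffled_copy(in_path, out_path, seed=0):
    """Write a randomly permuted copy of a line-oriented training file."""
    with open(in_path) as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    with open(out_path, "w") as f:
        f.writelines(lines)
```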
Initialization: Start with all weights being zero.
Learning rate parameter: One of the challenges in implementing gradient descent is choosing a good fixed learning rate lambda or devising a variable learning rate schedule. If the learning rate is set too low, then gradient descent will converge very slowly; on the other hand, if the learning rate is set too high, gradient descent may actually diverge. A compromise is to set an initial "fast" learning rate and to decrease the learning rate as gradient descent converges.
For this assignment, you will explore the effect of different fixed learning rates on gradient descent. (You are welcome to explore the use of a variable learning rate, if you like.)
In order to analyze your convergence rate, after each complete pass through the training data, you should compute the value of your error function
J(w) = sum_t (h_w(x_t) - y_t)^2
J(w) is your overall sum squared error or SSE, but an easier number to interpret is root mean squared error or RMSE. RMSE is computed by starting with the sum squared error, dividing by the number of training examples m to obtain the mean squared error (MSE), and finally taking the square root to obtain RMSE.
RMSE = sqrt(SSE/m)
Note that optimizing for SSE, MSE, or RMSE are all equivalent, and since SSE is the simplest, we typically use that for J(w) in our gradient descent formulation. (In fact, to simplify things even further, we often use one-half the SSE as our J(w), as this eliminates the constant 2 that we would otherwise have to drag around in our calculations.)
However, RMSE has the advantage that it is independent of the training set size (unlike SSE) and measured in the same units as the underlying data (unlike MSE and SSE). Thus, if you're trying to predict a house price in dollars, your RMSE is measured in dollars. Furthermore, RMSE is an estimate of the standard deviation of your error, and one can then use this value to obtain confidence intervals. For example, given an unbiased predictor with normally distributed errors, you'd expect 95% of your predictions to be within +/- 2*RMSE of the actual value. Thus, if your RMSE is $1,500, then you might expect 95% of your house price predictions to be within +/- $3,000 of their actual values. See discussions of standard error for a complete explanation.
Note: One would not ordinarily evaluate the error function or RMSE after every pass through the data, as this effectively doubles the computation required by gradient descent. We are doing this in order to explore the effect of the learning rate parameter on the convergence of gradient descent.
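To make the setup above concrete, here is one possible sketch of batch gradient descent for linear regression that starts from all-zero weights and records RMSE after every complete pass; the function name, the fixed learning rate lam, and the pass count are illustrative defaults, not assignment requirements:

```python
def batch_gd(X, y, lam=0.01, passes=100):
    """Batch gradient descent for linear regression with J(w) = (1/2)*SSE.

    X: list of feature vectors (prepend a constant-1 "offset" feature
    yourself if you want an intercept); y: list of targets; lam: fixed
    learning rate. Returns the learned weights and the RMSE measured
    after each pass, for plotting convergence.
    """
    m, n = len(X), len(X[0])
    w = [0.0] * n                          # initialization: all-zero weights
    rmse_history = []
    for _ in range(passes):
        # gradient of (1/2)*SSE is sum_t (h_w(x_t) - y_t) * x_t
        grad = [0.0] * n
        for xt, yt in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xt)) - yt
            for j in range(n):
                grad[j] += err * xt[j]
        for j in range(n):
            w[j] -= lam * grad[j]
        sse = sum((sum(wj * xj for wj, xj in zip(w, xt)) - yt) ** 2
                  for xt, yt in zip(X, y))
        rmse_history.append((sse / m) ** 0.5)
    return w, rmse_history
```

The stochastic variant differs only in applying the update inside the per-example loop (using that single example's gradient) rather than accumulating over the whole pass.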
The AUC can be calculated fairly simply using the trapezoidal rule for estimating the area under a curve: If our m ROC data points are
(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)
in order, where
(x_1, y_1) = (0, 0)
and
(x_m, y_m) = (1, 1)
then the AUC is calculated as follows:
AUC = (1/2) sum_{k=2}^{m} (x_k - x_{k-1}) (y_k + y_{k-1}).
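This trapezoidal-rule formula translates directly into a few lines of Python (the function name is illustrative):

```python
def trapezoidal_auc(points):
    """Area under an ROC curve by the trapezoidal rule.

    points: list of (x, y) pairs in order, starting at (0, 0) and
    ending at (1, 1). Sums (x_k - x_{k-1}) * (y_k + y_{k-1}) over
    consecutive pairs, then halves the total.
    """
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y1 + y0)
    return auc / 2.0
```

As a sanity check, a perfect classifier's curve through (0, 1) gives an AUC of 1.0, and the diagonal from (0, 0) to (1, 1) gives 0.5.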
Calculate the AUC of your classifier, and compare this value to the AUC of your classifiers from HW02.
Compare the convergence rates for batch vs. stochastic gradient descent. How many passes through the data are required to obtain a "good" predictor and/or RMSE?
Repeat Steps 3 through 6 above for logistic regression with both stochastic and batch gradient descent. Note that this should involve only a minor change in your code, as discussed in class.
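One common way this minor change works out in code (your class discussion is authoritative here; this is just a sketch with illustrative function names) is that only the hypothesis function changes: the linear weighted sum is passed through a sigmoid. When the per-example gradient keeps the (h_w(x_t) - y_t) * x_t form, as it does for logistic regression with the cross-entropy objective, the descent loop itself needs no other edits.

```python
import math

def h_linear(w, x):
    """Linear regression hypothesis: a plain weighted sum."""
    return sum(wj * xj for wj, xj in zip(w, x))

def h_logistic(w, x):
    """Logistic regression hypothesis: the same weighted sum squashed
    through the sigmoid, giving an output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-h_linear(w, x)))
```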
Here you will create a perceptron learning algorithm, as described in class, and test it on a linearly separable data set that will be supplied.
w1 x1 + w2 x2 + w3 x3 + w4 x4 = 1
(You will create the normalized weights by dividing your perceptron weights w1, w2, w3, and w4 by -w0, the weight corresponding to the special "offset" feature.)
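A minimal sketch of the perceptron update and the weight normalization described above, assuming +/-1 labels and a constant-1 "offset" feature prepended to each example (the names and stopping convention are illustrative, not the required interface):

```python
def perceptron_train(X, y, max_iters=100):
    """Classroom-style perceptron. X: feature vectors; y: labels in {-1, +1}.

    Each example is prepended with a constant 1, so w[0] plays the role
    of the offset weight w0. Stops after a full mistake-free pass.
    """
    n = len(X[0]) + 1
    w = [0.0] * n
    for _ in range(max_iters):
        mistakes = 0
        for xt, yt in zip(X, y):
            xt1 = [1.0] + list(xt)
            # misclassified (or on the boundary): nudge w toward the example
            if yt * sum(wj * xj for wj, xj in zip(w, xt1)) <= 0:
                for j in range(n):
                    w[j] += yt * xt1[j]
                mistakes += 1
        if mistakes == 0:
            break
    return w

def normalize(w):
    """Divide w1..wn by -w0, expressing the boundary as w . x = 1."""
    return [wj / -w[0] for wj in w[1:]]
```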
[jaa@jaa-laptop Perceptron]$ perceptron.pl perceptronData.txt
Iteration 1, total mistakes 152
Iteration 2, total mistakes 225
Iteration 3, total mistakes 283
Iteration 4, total mistakes 339
Iteration 5, total mistakes 341
Iteration 6, total mistakes 341
Classifier weights: -17 1.62036704608359 3.27065807088159 4.63999040888332 6.79421449422058 8.26056991916346 9.36697370729981
Normalized with threshold: 0.0953157085931524 0.192391651228329 0.272940612287254 0.399659676130622 0.485915877597851 0.550998453370577
(Note: The output above corresponds to running on a different data set than yours which has six dimensions as opposed to four. Your results will be different, but you should convey the same information as above.)
You should prepare a report describing your results above for (1) linear regression and logistic regression with stochastic and batch gradient descent and (2) the perceptron algorithm. You should also submit your code. You may hand in your report on paper or via e-mail, and you should submit your code via e-mail.