Here we will work with the Spambase dataset from HW02, testing your implementations using Fold 1 as described in HW02.
Once you have computed mu and sd (separately for each feature), the z-score for any feature value x is simply
(x - mu)/sd.
For example, Feature 1 has a mean of 0.10455 and a standard deviation of 0.30536. If a particular example has a Feature 1 value of 0.5, the corresponding z-score would be
(0.5 - 0.10455)/0.30536 = 1.295.
Note that the z-score is simply the number of standard deviations a feature value is above or below the mean value of that feature.
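The per-feature z-scoring above can be sketched in a few lines of Python; the function name and the list-of-lists data representation here are illustrative choices, not requirements (note also that this uses the population standard deviation — whichever convention you pick, apply it consistently to training and test data):

```python
def zscore_columns(rows):
    """Z-score each feature (column) using its mean and standard deviation.

    rows: list of feature vectors (lists of floats).
    Returns the z-scored rows plus the per-feature means and
    (population) standard deviations, so the same mu and sd can be
    reused to transform test data.
    """
    m = len(rows)
    n = len(rows[0])
    mus = [sum(r[j] for r in rows) / m for j in range(n)]
    sds = [(sum((r[j] - mus[j]) ** 2 for r in rows) / m) ** 0.5
           for j in range(n)]
    zs = [[(r[j] - mus[j]) / sds[j] for j in range(n)] for r in rows]
    return zs, mus, sds
```

Remember to compute mu and sd on the training fold only, and then apply those same values when z-scoring the test fold.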
Note: The Spambase data set consists of a large block of spam messages followed by a large block of non-spam messages. Depending on how you create your folds, your training and testing data sets will also have this property. For any batch learning algorithm, this is not a problem, since the learner or updates will look at the entire data set at once (consider, for example, batch gradient descent). However, for an on-line learner (such as stochastic gradient descent), this can be problematic, since the algorithm will process (and make updates) on all positive examples followed by all negative examples---convergence will likely be hindered. To improve convergence for stochastic gradient descent, you may consider randomizing the lines in your training set (i.e., after you have created Fold 1). There are a number of Unix tools that can accomplish this easily, such as rl, sort -R, and various short bash scripts.
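If you would rather shuffle within your own code than with a Unix tool, a small Python sketch follows; the file paths and the fixed seed are illustrative (seeding just makes the shuffle reproducible between runs):

```python
import random

def shuffled_copy(in_path, out_path, seed=0):
    """Write a randomly permuted copy of a line-oriented training file."""
    with open(in_path) as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    with open(out_path, "w") as f:
        f.writelines(lines)
```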
Initialization: Start with all weights being zero.
Learning rate parameter: One of the challenges in implementing gradient descent is choosing a good fixed learning rate lambda or devising a variable learning rate schedule. If the learning rate is set too low, then gradient descent will converge very slowly; on the other hand, if the learning rate is set too high, gradient descent may actually diverge. A compromise is to set an initial "fast" learning rate and to decrease the learning rate as gradient descent converges.
For this assignment, you will explore the effect of different fixed learning rates on gradient descent. (You are welcome to explore the use of a variable learning rate, if you like.)
In order to analyze your convergence rate, after each complete pass through the training data, you should compute the value of your error function
J(w) = sum_t (h_w(x_t) - y_t)^2
J(w) is your overall sum squared error or SSE, but an easier number to interpret is root mean squared error or RMSE. RMSE is computed by starting with the sum squared error, dividing by the number of training examples m to obtain the mean squared error (MSE), and finally taking the square root to obtain RMSE.
RMSE = sqrt(SSE/m)
Note that optimizing for SSE, MSE, or RMSE are all equivalent, and since SSE is the simplest, we typically use that for J(w) in our gradient descent formulation. (In fact, to simplify things even further, we often use one-half the SSE as our J(w), as this eliminates the constant 2 that we would otherwise have to drag around in our calculations.)
However, RMSE has the advantage that it is independent of the training set size (unlike SSE) and measured in the same units as the underlying data (unlike MSE and SSE). Thus, if you're trying to predict a house price in dollars, your RMSE is measured in dollars. Furthermore, RMSE is an estimate of the standard deviation of your error, and one can then use this value to obtain confidence intervals. For example, given an unbiased predictor with normally distributed errors, you'd expect 95% of your predictions to be within +/- 2*RMSE of the actual value. Thus, if your RMSE is $1,500, then you might expect 95% of your house price predictions to be within +/- $3,000 of their actual values. See discussions of standard error for a complete explanation.
Note: One would not ordinarily evaluate the error function or RMSE after every pass through the data, as this effectively doubles the computation required by gradient descent. We are doing this in order to explore the effect of the learning rate parameter on the convergence of gradient descent.
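To make the setup above concrete, here is one possible sketch of batch gradient descent for linear regression that starts from all-zero weights and records RMSE after every complete pass; the function name, the fixed learning rate lam, and the pass count are illustrative defaults, not assignment requirements:

```python
def batch_gd(X, y, lam=0.01, passes=100):
    """Batch gradient descent for linear regression with J(w) = (1/2)*SSE.

    X: list of feature vectors (prepend a constant-1 "offset" feature
    yourself if you want an intercept); y: list of targets; lam: fixed
    learning rate. Returns the learned weights and the RMSE measured
    after each pass, for plotting convergence.
    """
    m, n = len(X), len(X[0])
    w = [0.0] * n                          # initialization: all-zero weights
    rmse_history = []
    for _ in range(passes):
        # gradient of (1/2)*SSE is sum_t (h_w(x_t) - y_t) * x_t
        grad = [0.0] * n
        for xt, yt in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xt)) - yt
            for j in range(n):
                grad[j] += err * xt[j]
        for j in range(n):
            w[j] -= lam * grad[j]
        sse = sum((sum(wj * xj for wj, xj in zip(w, xt)) - yt) ** 2
                  for xt, yt in zip(X, y))
        rmse_history.append((sse / m) ** 0.5)
    return w, rmse_history
```

The stochastic variant differs only in applying the update inside the per-example loop (using that single example's gradient) rather than accumulating over the whole pass.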
The AUC can be calculated fairly simply using the trapezoidal rule for estimating the area under a curve: If our m ROC data points are
(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)
in order, where
(x_1, y_1) = (0, 0)
and
(x_m, y_m) = (1, 1)
then the AUC is calculated as follows:
AUC = (1/2) sum_{k=2}^{m} (x_k - x_{k-1}) (y_k + y_{k-1}).
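This trapezoidal-rule formula translates directly into a few lines of Python (the function name is illustrative):

```python
def trapezoidal_auc(points):
    """Area under an ROC curve by the trapezoidal rule.

    points: list of (x, y) pairs in order, starting at (0, 0) and
    ending at (1, 1). Sums (x_k - x_{k-1}) * (y_k + y_{k-1}) over
    consecutive pairs, then halves the total.
    """
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y1 + y0)
    return auc / 2.0
```

As a sanity check, a perfect classifier's curve through (0, 1) gives an AUC of 1.0, and the diagonal from (0, 0) to (1, 1) gives 0.5.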
Calculate the AUC of your classifier, and compare this value to the AUC of your classifiers from HW02.
Compare the convergence rates for batch vs. stochastic gradient descent. How many passes through the data are required to obtain a "good" predictor and/or RMSE?
Repeat Steps 3 through 6 above for logistic regression with both stochastic and batch gradient descent. Note that this should involve only a minor change in your code, as discussed in class.
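One common way this minor change works out in code (your class discussion is authoritative here; this is just a sketch with illustrative function names) is that only the hypothesis function changes: the linear weighted sum is passed through a sigmoid. When the per-example gradient keeps the (h_w(x_t) - y_t) * x_t form, as it does for logistic regression with the cross-entropy objective, the descent loop itself needs no other edits.

```python
import math

def h_linear(w, x):
    """Linear regression hypothesis: a plain weighted sum."""
    return sum(wj * xj for wj, xj in zip(w, x))

def h_logistic(w, x):
    """Logistic regression hypothesis: the same weighted sum squashed
    through the sigmoid, giving an output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-h_linear(w, x)))
```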
Here you will create a perceptron learning algorithm, as described in class, and test it on a linearly separable data set that will be supplied.
w1 x1 + w2 x2 + w3 x3 + w4 x4 = 1
(You will create the normalized weights by dividing your perceptron weights w1, w2, w3, and w4 by -w0, the weight corresponding to the special "offset" feature.)
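A minimal sketch of the perceptron update and the weight normalization described above, assuming +/-1 labels and a constant-1 "offset" feature prepended to each example (the names and stopping convention are illustrative, not the required interface):

```python
def perceptron_train(X, y, max_iters=100):
    """Classroom-style perceptron. X: feature vectors; y: labels in {-1, +1}.

    Each example is prepended with a constant 1, so w[0] plays the role
    of the offset weight w0. Stops after a full mistake-free pass.
    """
    n = len(X[0]) + 1
    w = [0.0] * n
    for _ in range(max_iters):
        mistakes = 0
        for xt, yt in zip(X, y):
            xt1 = [1.0] + list(xt)
            # misclassified (or on the boundary): nudge w toward the example
            if yt * sum(wj * xj for wj, xj in zip(w, xt1)) <= 0:
                for j in range(n):
                    w[j] += yt * xt1[j]
                mistakes += 1
        if mistakes == 0:
            break
    return w

def normalize(w):
    """Divide w1..wn by -w0, expressing the boundary as w . x = 1."""
    return [wj / -w[0] for wj in w[1:]]
```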
[jaa@jaa-laptop Perceptron]$ perceptron.pl perceptronData.txt
Iteration 1, total mistakes 152
Iteration 2, total mistakes 225
Iteration 3, total mistakes 283
Iteration 4, total mistakes 339
Iteration 5, total mistakes 341
Iteration 6, total mistakes 341
Classifier weights: -17 1.62036704608359 3.27065807088159 4.63999040888332 6.79421449422058 8.26056991916346 9.36697370729981
Normalized with threshold: 0.0953157085931524 0.192391651228329 0.272940612287254 0.399659676130622 0.485915877597851 0.550998453370577
(Note: The output above corresponds to running on a different data set than yours which has six dimensions as opposed to four. Your results will be different, but you should convey the same information as above.)
You should prepare a report describing your results above for (1) linear regression and logistic regression with stochastic and batch gradient descent and (2) the perceptron algorithm. You should also submit your code. You may hand in your report on paper or via e-mail, and you should submit your code via e-mail.