Machine Learning: Overfitting, underfitting

It's not enough for a machine learning algorithm to optimize its cost on your data set. If your algorithm works well with points in your data set, but not on new points, then the algorithm overfit the data set. And if your algorithm works poorly even with points in your data set, then the algorithm underfit the data set.

Underfitting is easy to check as long as you know what the cost function measures. In linear regression, the cost function is defined as half the mean squared error. That is, if the typical (root-mean-square) error for each point is z, then the cost will be 0.5z². So if your output ranges from, say, 100 to 1000, and your cost is 1, then the typical error would be about 1.4, which is anywhere from 1.4% to 0.14% of the output, and that may be good enough. If, however, your cost is 50, then the typical error would be 10, which is anywhere from 10% to 1% of the output, which is probably bad.
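To make that arithmetic concrete, here is a small sketch that inverts the cost to recover the typical error and expresses it relative to the output range. The function names are made up for illustration, and the 100 to 1000 range is just the example from above.

```python
import math

def rms_error_from_cost(cost):
    """Invert cost = 0.5 * z**2 to recover the typical per-point error z."""
    return math.sqrt(2.0 * cost)

def relative_error_range(cost, y_min, y_max):
    """Return the error as a fraction of the smallest and of the largest output."""
    z = rms_error_from_cost(cost)
    return z / y_min, z / y_max

for cost in (1.0, 50.0):
    z = rms_error_from_cost(cost)
    worst, best = relative_error_range(cost, 100.0, 1000.0)
    print(f"cost={cost:g}: typical error {z:.2f}, from {best:.2%} to {worst:.2%} of the output")
```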

If your cost ends up high even after many iterations, then chances are you have an underfitting problem. Or maybe your learning algorithm is just not a good match for the problem.

Underfitting is also known as high bias, since it means your algorithm has such a strong bias towards its hypothesis, that it does not fit the data well. It also means that the hypothesis space the learning algorithm explores is too small to properly represent the data.

Checking for overfitting is also fairly easy. Split the data set so that 80% of it is your training set and 20% is a cross-validation set. Train on the training set, then measure the cost on the cross-validation set. If the cross-validation cost is much higher than the training cost, then chances are you have an overfitting problem.
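Here is a rough sketch of that check. The data is synthetic, and the "memorizer" model is a hypothetical stand-in chosen because it overfits on purpose: it simply returns the y of the nearest training point, so its training cost is zero while its cross-validation cost is not.

```python
import random

def split_train_cv(points, train_fraction=0.8, seed=0):
    """Shuffle a copy of the data, then cut it into training and cross-validation sets."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def cost(predict, points):
    """Half the mean squared error of `predict` over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in points) / (2 * len(points))

def fit_memorizer(train):
    """Deliberately overfit: answer with the y of the nearest training x."""
    def predict(x):
        nearest_x, nearest_y = min(train, key=lambda p: abs(p[0] - x))
        return nearest_y
    return predict

# Made-up noisy data: y = 2x plus Gaussian noise.
rng = random.Random(1)
data = [(float(x), 2.0 * x + rng.gauss(0.0, 5.0)) for x in range(50)]

train, cv = split_train_cv(data)
model = fit_memorizer(train)
train_cost = cost(model, train)
cv_cost = cost(model, cv)
print(f"training cost: {train_cost:.3f}, cross-validation cost: {cv_cost:.3f}")
```

A large gap between the two numbers is the warning sign. How large is "much higher" depends on your problem.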

Overfitting is also known as high variance, since it means that the hypothesis space the learning algorithm explores is too large to properly constrain the possible hypotheses.


Dealing with overfitting

  • Throw features away. The hypothesis space is too large, and perhaps some features are faking the learning algorithm out. Throwing features away shrinks the hypothesis space.
  • Add regularization if there are many features. Regularization forces the magnitudes of the parameters to be smaller, thus shrinking the hypothesis space. It works like this:
First, add a new term to the cost function which penalizes the magnitudes of the parameters (except for θ0, which corresponds to the faked x0 feature):
$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$
Here m is the number of training points, n is the number of features, and h_θ(x) is the hypothesis. Again, note that the summation over the squared parameters starts at 1, not 0. λ is a parameter which adjusts the penalization, and therefore the size of the hypothesis space. Smaller values enlarge the hypothesis space, while larger values shrink it. Of course, too large a value may lead to too small a hypothesis space, which leads to underfitting.
We can start with λ=1 and then increase or decrease it logarithmically, measuring the training and cross-validation cost each time. However, when measuring, use the definition of the cost without regularization. This is because you just want to see the mean squared error, and the cost contributed by the parameters isn't an error.
Now, we need the gradients with respect to each parameter. This is simply:
$$\frac{\partial J}{\partial\theta_0} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$$
$$\frac{\partial J}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j \qquad (j \ge 1)$$
And now the learning algorithm uses this gradient instead.
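
As a sketch, the regularized cost and gradient for linear regression might look like the following. It assumes the common convention where the cost is 1/(2m) times the sum of squared errors plus λ times the sum of squared parameters (skipping θ0), and where every row of X starts with the faked feature x0 = 1. The function names, the tiny data set, and the learning rate alpha are all invented for illustration.

```python
def predict(theta, x):
    """Linear hypothesis; x[0] is the constant feature x0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))

def regularized_cost(theta, X, y, lam):
    """(1/2m) * [sum of squared errors + lam * sum of theta_j**2 for j >= 1]."""
    m = len(X)
    squared_error = sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y))
    penalty = lam * sum(t ** 2 for t in theta[1:])  # theta[0] is not penalized
    return (squared_error + penalty) / (2 * m)

def regularized_gradient(theta, X, y, lam):
    """Partial derivatives of regularized_cost with respect to each theta_j."""
    m = len(X)
    errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
    grad = []
    for j in range(len(theta)):
        g = sum(e * xi[j] for e, xi in zip(errors, X)) / m
        if j > 0:
            g += lam * theta[j] / m  # extra term from the penalty
        grad.append(g)
    return grad

def gradient_descent_step(theta, X, y, lam, alpha):
    """One update of every parameter, using the regularized gradient."""
    grad = regularized_gradient(theta, X, y, lam)
    return [t - alpha * g for t, g in zip(theta, grad)]

# Tiny made-up data set: y = 2x, with x0 = 1 prepended to each row.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]
theta = [0.0, 0.0]
initial_cost = regularized_cost(theta, X, y, lam=1.0)
for _ in range(1000):
    theta = gradient_descent_step(theta, X, y, lam=1.0, alpha=0.1)
final_cost = regularized_cost(theta, X, y, lam=1.0)
print(theta, initial_cost, final_cost)
```

To measure the plain training or cross-validation cost afterwards, call regularized_cost with lam=0.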

Dealing with underfitting

  • More data will not generally help. It will, in fact, likely increase the training error.
  • However, more features can help, because that expands the hypothesis space. This includes making new features from existing features.
  • More parameters can also help expand the hypothesis space. For linear regression, the number of parameters equals the number of features, but for other learning algorithms, the number of parameters can be greater.
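
To illustrate the second point, here is a minimal sketch where a straight line underfits data that is actually quadratic, and adding x² as a new feature fixes it. The data, learning rate, and iteration count are made up for the example.

```python
def cost(theta, X, y):
    """Half the mean squared error of the linear hypothesis theta . x."""
    m = len(X)
    return sum((sum(t * v for t, v in zip(theta, row)) - yi) ** 2
               for row, yi in zip(X, y)) / (2 * m)

def fit(X, y, alpha=0.5, iterations=2000):
    """Plain batch gradient descent on the unregularized cost."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [sum(t * v for t, v in zip(theta, row)) - yi
                  for row, yi in zip(X, y)]
        theta = [t - alpha * sum(e * row[j] for e, row in zip(errors, X)) / m
                 for j, t in enumerate(theta)]
    return theta

# Made-up data where the true relationship is y = x^2.
xs = [k / 10 for k in range(-10, 11)]
ys = [x * x for x in xs]

X_linear = [[1.0, x] for x in xs]            # features: x0, x
X_quadratic = [[1.0, x, x * x] for x in xs]  # features: x0, x, and the new x^2

cost_linear = cost(fit(X_linear, ys), X_linear, ys)
cost_quadratic = cost(fit(X_quadratic, ys), X_quadratic, ys)
print(f"training cost with x only: {cost_linear:.4f}")
print(f"training cost with x and x^2: {cost_quadratic:.2e}")
```

No amount of extra data would help the first model, because no straight line matches a parabola.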


Training curves and metaparameter selection

What values of the metaparameters (regularization penalty λ, number of features, number of parameters, number of data points) are good? Perform several runs in which you fix all metaparameters except one, plot the training and cross-validation costs against the metaparameter you are tuning, and choose the value that minimizes the cross-validation cost.
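
Here is a sketch of such a run for λ, stepping through values logarithmically and printing a table instead of plotting. To keep it short, it assumes a single feature and the cost form 1/(2m) times the sum of squared errors plus λθ1², which can be minimized in closed form. The data and function names are invented for the example.

```python
import random

def fit_ridge(points, lam):
    """Closed-form minimizer of (1/2m) * [sum of squared errors + lam * theta1**2]
    for the hypothesis theta0 + theta1 * x (theta0 is not penalized)."""
    m = len(points)
    x_mean = sum(x for x, _ in points) / m
    y_mean = sum(y for _, y in points) / m
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    theta1 = sxy / (sxx + lam)
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1

def cost(theta, points):
    """Half the mean squared error, with no regularization term."""
    theta0, theta1 = theta
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in points) / (2 * len(points))

def sweep_lambda(train, cv, lambdas):
    """Return (lambda, training cost, cross-validation cost) for each candidate."""
    rows = []
    for lam in lambdas:
        theta = fit_ridge(train, lam)
        rows.append((lam, cost(theta, train), cost(theta, cv)))
    return rows

# Made-up noisy data: y = 3x + 1 plus Gaussian noise.
rng = random.Random(0)
data = [(x / 10, 3.0 * (x / 10) + 1.0 + rng.gauss(0.0, 1.0)) for x in range(50)]
rng.shuffle(data)
train, cv = data[:40], data[40:]

lambdas = [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0]
rows = sweep_lambda(train, cv, lambdas)
best_lambda, best_train_cost, best_cv_cost = min(rows, key=lambda row: row[2])
for lam, train_cost, cv_cost in rows:
    print(f"lambda={lam:<6g} train={train_cost:.4f} cv={cv_cost:.4f}")
print("chosen lambda:", best_lambda)
```

Both costs are measured without the regularization term, so the table shows only the prediction error.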

However, when reporting the error rate or cost metric for your chosen metaparameters, you will have to set aside some portion of the data which was touched neither by the learning algorithm (during training) nor by you (during metaparameter selection). Thinking about it differently, the learning algorithm performs gradient descent on the parameters for fixed values of the metaparameters, and then you, in effect, perform your own descent on the metaparameters themselves, guided by the cross-validation cost. Since the cross-validation set has now been used for fitting too, you have to set aside yet another portion of the data as a test set!

So we end up with three subsets of data: a training set for the learning algorithm, a cross-validation set for metaparameter selection, and a test set for final results. Generally this can be split 80%, 10%, 10%, or 60%, 20%, 20% if there is enough data.
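
A minimal sketch of such a three-way split, with a made-up function name and a list of integers standing in for real data points:

```python
import random

def split_three_ways(points, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle a copy of the data and cut it into training, cross-validation and test sets."""
    train_fraction, cv_fraction, _ = fractions  # the test set gets whatever is left
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    train_end = round(len(shuffled) * train_fraction)
    cv_end = train_end + round(len(shuffled) * cv_fraction)
    return shuffled[:train_end], shuffled[train_end:cv_end], shuffled[cv_end:]

data = list(range(100))  # stand-in for real (x, y) points
train, cv, test = split_three_ways(data)
print(len(train), len(cv), len(test))
```

Shuffling before cutting matters if the data is stored in some meaningful order. Fixing the seed keeps the test set the same from run to run, so it stays untouched.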