Machine Learning: Linear Regression

I took Stanford's online Machine Learning class taught by Andrew Ng, director of SAIL, and one of the coauthors of the paper resulting from the recent Google "Cortex" project. While I'm fairly up-to-date with evolutionary computing, I felt that I could use a refresher in non-evolutionary techniques, and I'm really glad I did! Andrew's lectures were well-organized and highly intelligible, and the areas where I felt he went too slowly can be forgiven since his audience is undergrads. The videos can be played at varying speeds, and I alternated between 125% and 150%.

I'll post a series of articles on each subject that was covered, more for my own reference, but perhaps readers might find these useful as well.

Linear Regression

Suppose we did a study where each record (or point) in the data set consists of measurements for a number of different features. For example, the features in a medical study could be the numerical results of blood tests, and the features in a botanical study could be the numerical results of the sizes of various plant parts. The features in an economics study could be economic indices. In other words, each record is multivariate.

What we want to do is call one of the features the output, and call the rest of the features the inputs, and find out if there is a way to predict the output given the inputs. For example, is there a way to predict cholesterol level given blood sugar level, vitamin D levels, and white blood cell count? Specifically, if we call the inputs x1, x2, ..., xN, and the output y, then we want to find some function h ("h" stands for hypothesis) such that y ≈ h(x1, x2, ..., xN).

With linear regression, we further state that h is a linear combination of the inputs:


The thetas are the unknown coefficients of each input — that is, they are the parameters of the function — and the goal of linear regression is to find the thetas which, considered over the data set, makes hθ ≈ y. We determine how well the function, given a particular set of parameters, fits the data set by using a cost function. The higher the cost, the worse the fit. Ideally, we would like to see a zero cost, meaning a perfect fit, and there's no such thing as a negative cost.


m is the number of points in the data set, x(i) is the set of inputs for the ith data point, and y(i) is the output for the ith data point, so i goes from 1 to m. All we're doing here is taking the difference between the predicted output and the actual output, squaring it (so that it doesn't matter what direction the error is in, and zero means a perfect match), and then taking the average over all points in the data set. There's also a division by two, which makes the later step a little cleaner.

Scaling the cost by a constant, such as 1/2, doesn't matter, because we're still maintaining the property that higher costs mean worse fits, and a zero cost is a perfect fit.

What we want to do is find a set of parameters θ so that Jθ is minimized. We will probably not be able to get the cost down to zero, but we will get as close as possible.

While it is possible to calculate the parameters directly from the data set under most circumstances using an equation known as the normal equation, this isn't nearly as interesting as searching for the solution, and when the hypothesis function gets any more complicated than a linear function, there is in general no direct solution, and the parameters must be searched for.

The way we will search for the solution is to randomly choose the parameters, see what the cost is, and then move the parameters in a downhill direction. This is called gradient descent. The first thing we do is compute the gradient of the cost function with respect to each parameter. Actually, the thing we do before the first thing is to fake up an input variable which is always 1:


Here, we faked up an input feature x0 which is always 1. This makes the math more regular and work out more cleanly. Now we can take the gradient:


To change the parameters so that the cost moves downwards, we choose some rate α, multiply each gradient by that, and move each parameter in the downhill direction:


We can then calculate the new gradient using the new parameters, move again, and repeat until we're not reducing the cost any more.

Practical considerations

  • Because of the simple nature of the cost function, there are no local optima. There is only one global optimum.
  • Each feature should be scaled so that it is approximately in the range -1 to +1. It doesn't matter so much if the feature is only positive or only negative, or even if the feature goes a little bit outside this range (although -3 to +3 is a strict limit). This is so that the gradient doesn't move one parameter by a lot but the rest of the parameters by only a little just because the range of that parameter is large relative to the others.
  • You want to choose α to be as large as possible without overshooting and kicking the parameters violently uphill. Monitor the cost function, and if it increases, your rate is too high. In general, start with a low rate such as 0.03, and logarithmically increase the rate (i.e. 0.03, 0.1, 0.3, 1, and so on) if it seems that the cost is going down too slowly, or decrease the rate (i.e. 0.03, 0.01, 0.003, and so on) if the cost increases.
  • Randomly set the theta parameters to be in the range -1 to +1. Or, they can even be all zero. But it's good to get in the habit of going with a random starting point.
  • You can create more features by combining existing features by, for example, multiplication, division or squaring. Examine the data set manually to see if you can spot anything which appears nonlinear, and attempt to linearize the nonlinearity. For example, if one feature seems to be following a square law, then square root the feature before running linear regression.
  • Underfitting and overfitting will be handled in a later article.