# Machine Learning: Linear and Logistic Regression Unified

Warning: very twisty math ahead. Feel free to skip.

Linear and logistic regression use different cost functions:

\begin{align*} \text{linear:} & \begin{cases} h_\theta(x) = \sum_j \theta_jx_j \\ J_\theta = \frac{1}{2m}\sum_i \left ( h_\theta( x^{(i)} ) - y^{(i)} \right )^2 \end{cases} \\ \\ \text{logistic:} & \begin{cases} h_\theta(x) = \frac{1}{1+e^{-\sum_j \theta_jx_j}} \\ J_\theta = -\frac{1}{m}\sum_i \left [ y^{(i)} \ln h_\theta( x^{(i)} ) + \left (1-y^{(i)} \right ) \ln \left (1-h_\theta( x^{(i)} )\right ) \right ] \end{cases} \end{align*}
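To make the formulas concrete, here is a minimal pure-Python sketch of the two hypotheses and their costs (the function names are my own, not standard):

```python
import math

def h_linear(theta, x):
    # h_theta(x) = sum_j theta_j * x_j
    return sum(t * xj for t, xj in zip(theta, x))

def h_logistic(theta, x):
    # Sigmoid of the linear combination.
    return 1.0 / (1.0 + math.exp(-h_linear(theta, x)))

def cost_linear(theta, X, Y):
    # J = (1/2m) * sum_i (h(x_i) - y_i)^2
    m = len(Y)
    return sum((h_linear(theta, x) - y) ** 2 for x, y in zip(X, Y)) / (2 * m)

def cost_logistic(theta, X, Y):
    # J = -(1/m) * sum_i [y ln h + (1-y) ln(1-h)]
    m = len(Y)
    total = 0.0
    for x, y in zip(X, Y):
        h = h_logistic(theta, x)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / m
```

For example, a perfect linear fit gives zero cost, while a logistic model outputting 0.5 on a single point costs ln 2.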

The real question is, why are the cost functions so different? It turns out that we can derive the cost functions from the same principles.

We start by claiming that whatever function (that is, model of reality) we choose, the outputs in the data set are based on that function of the inputs, plus some randomness. Nearly everything in reality is probabilistic to a greater or lesser extent. This is the whole basis of the field of statistics.

So let's write out the relationship between y, the actual output, and h, the estimated output:

$y^{(i)} - h_\theta(x^{(i)}) = \varepsilon ^{(i)}$

In the above equation, ε(i) represents the error between y(i) and hθ(x(i)). ε is, therefore, a random variable. Remember that we are assuming a reality that is probabilistic, which means that given the inputs x(i) reality will generate y(i) with some probability. The goal of the model is to get rid of as much of the random variation as possible, leaving us only with some small error. The claim is that this error always has mean 0 and is Gaussian. This simply means that our model's output is centered smack in the middle of the probability distribution for reality's output, and that a real output farther away from the mean is less likely.

Now the probability distribution of ε(i) given a particular data point x(i) and a particular set of parameters θ (because, after all, it is our choice of data point and parameters that leads to the error) is, as we said, Gaussian with mean 0, and some standard deviation σ. We're going to assume that the standard deviation in the output is the same no matter where in the input space we are. The technical term for this is homoscedasticity (homo-, meaning the same, and Greek skedasis, meaning a dispersal) so now you can bring that up at parties.

We write the probability density:

$p(\varepsilon^{(i)} | x^{(i)}, \theta) = \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{(\varepsilon^{(i)})^2}{2\sigma^2}}$

Now, note that y is simply ε plus a function of x, which is just another way of saying that the output of reality is probabilistic, but specifically, given our model, it must be Gaussian:

$p(y^{(i)} | x^{(i)}, \theta) = \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{(\varepsilon^{(i)})^2}{2\sigma^2}} = \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{\left ( y^{(i)} - h_\theta(x^{(i)}) \right )^2}{2\sigma^2}}$
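In code, this density is just the Gaussian evaluated at the residual. A minimal sketch (pure Python, hypothetical names):

```python
import math

def gaussian_pdf(eps, sigma):
    # N(0, sigma^2) density evaluated at eps.
    return math.exp(-eps ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def p_y_given_x(y, x, theta, sigma):
    # Density of the observed y: Gaussian centered at h_theta(x).
    h = sum(t * xj for t, xj in zip(theta, x))
    return gaussian_pdf(y - h, sigma)
```

When the model predicts y exactly (zero residual), the density is at its peak, 1/(√(2π)σ).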

Now, let's find the probability for the entire data set. This is called the likelihood, and of course it still depends on our choice of parameters. Assuming the data points are independent, it is just the probabilities of each of the data points, multiplied together:

$p(Y | X, \theta) = \prod_i p(y^{(i)} | x^{(i)}, \theta) = \left ( \frac{1}{\sqrt{2\pi} \sigma} \right )^m e^{-\sum_i\frac{\left ( y^{(i)} - h_\theta(x^{(i)}) \right )^2}{2\sigma^2}}$

And now here is the key: this is precisely what we need to maximize. We want to maximize the probability that we get our data set's outputs given the inputs and our parameters. This is known as the principle of maximum likelihood.

Now, since the logarithm is monotonically increasing (log(x) > log(y) whenever x > y), maximizing the log of the probability is equivalent to maximizing the probability itself. This just makes the later math easier:

$\ln p(Y | X, \theta) = \ln \left [ \left ( \frac{1}{\sqrt{2\pi} \sigma} \right )^m e^{-\sum_i\frac{\left ( y^{(i)} - h_\theta(x^{(i)}) \right )^2}{2\sigma^2}} \right ] = m \ln \frac{1}{\sqrt{2\pi} \sigma} - \frac{1}{2\sigma^2}\sum_i \left ( y^{(i)} - h_\theta(x^{(i)}) \right )^2$

Since this is maximization, we can feel free to get rid of any additive and positive multiplicative constants:

\begin{align*} \arg \max_\theta \ln p(Y | X, \theta) &= \arg \max_\theta \left [ m \ln \frac{1}{\sqrt{2\pi} \sigma} - \frac{1}{2\sigma^2}\sum_i \left ( y^{(i)} - h_\theta(x^{(i)}) \right )^2 \right ]\\ &= \arg \max_\theta - \sum_i \left ( y^{(i)} - h_\theta(x^{(i)}) \right )^2 \\ &= \arg \min_\theta \sum_i \left ( y^{(i)} - h_\theta(x^{(i)}) \right )^2 \end{align*}

arg maxθ just means the value of θ that maximizes the expression. In the last step, I've simply turned the maximization problem into a minimization problem by reversing the sign. And so we see that this is exactly, up to the positive constant 1/2m, the cost function for linear regression.
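We can sanity-check this equivalence numerically. The sketch below uses made-up data and a one-parameter model h_θ(x) = θx (all names and numbers are my own, not part of the derivation), and confirms that the θ maximizing the log-likelihood is the same θ minimizing the sum of squared errors:

```python
import math

# Made-up 1-D data set for a model h_theta(x) = theta * x.
X = [1.0, 2.0, 3.0]
Y = [2.1, 3.9, 6.2]
sigma = 1.0  # assumed noise standard deviation

def sse(theta):
    # Sum of squared errors over the data set.
    return sum((theta * x - y) ** 2 for x, y in zip(X, Y))

def log_likelihood(theta):
    # m * ln(1/(sqrt(2*pi)*sigma)) - SSE / (2*sigma^2)
    m = len(Y)
    return m * math.log(1 / (math.sqrt(2 * math.pi) * sigma)) - sse(theta) / (2 * sigma ** 2)

# Grid-search theta over [1.000, 2.999].
grid = [i / 1000 for i in range(1000, 3000)]
best_ml = max(grid, key=log_likelihood)
best_ls = min(grid, key=sse)
# best_ml and best_ls pick out the same theta.
```

Because the log-likelihood is a constant minus a positive multiple of the SSE, the two criteria must agree on the winning θ.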

For logistic regression, our interpretation of h was that it is the probability that the data point is in the class. Again, we're assuming that the actual class is probabilistic, but here our model directly tells us the probability of the output being what it was in the data set. Because of that interpretation, we don't need to mess around with Gaussian errors, and we can go directly to the probability distribution:

$p(y^{(i)} = 1 | x^{(i)}, \theta) = h_\theta(x^{(i)}), \qquad p(y^{(i)} = 0 | x^{(i)}, \theta) = 1 - h_\theta(x^{(i)})$

And now, the probability that we get our data set is just the product of the probabilities that each output matches its data point (using one formulation out of many possible, chosen because it is convenient when we take logs later):

$p(Y | X, \theta) = \prod_i h_\theta(x^{(i)})^{y^{(i)}} \left ( 1-h_\theta(x^{(i)}) \right )^{1-y^{(i)}}$
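The exponent trick is worth seeing in isolation: raising to the power y(i) simply selects the right factor for each data point. A tiny sketch (hypothetical name):

```python
def bernoulli(h, y):
    # Picks out h when y == 1 and (1 - h) when y == 0,
    # since anything to the power 0 is 1.
    return h ** y * (1 - h) ** (1 - y)
```

For example, with h = 0.8 the factor is 0.8 when y = 1 and 0.2 when y = 0.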

Taking logs and maximizing/minimizing:

\begin{align*} \arg \max_\theta \ln p(Y | X, \theta) &= \arg \max_\theta \sum_i \left [ y^{(i)} \ln h_\theta(x^{(i)}) + (1-y^{(i)}) \ln \left ( 1-h_\theta(x^{(i)}) \right ) \right ]\\ &= \arg \min_\theta -\sum_i \left [ y^{(i)} \ln h_\theta(x^{(i)}) + (1-y^{(i)}) \ln \left ( 1-h_\theta(x^{(i)}) \right ) \right ] \end{align*}

And this is exactly, except for a constant, the cost function for logistic regression.
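As a quick numerical check of the log step (made-up data, one-parameter model h_θ(x) = sigmoid(θx); all names are my own), the log of the product likelihood equals minus the cross-entropy sum:

```python
import math

# Made-up classification data for a model h_theta(x) = sigmoid(theta * x).
X = [-2.0, -1.0, 1.0, 2.0]
Y = [0, 0, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(theta):
    # Product over the data set of h^y * (1-h)^(1-y).
    p = 1.0
    for x, y in zip(X, Y):
        h = sigmoid(theta * x)
        p *= h ** y * (1 - h) ** (1 - y)
    return p

def neg_log_likelihood(theta):
    # The logistic cost (times m): -sum_i [y ln h + (1-y) ln(1-h)].
    total = 0.0
    for x, y in zip(X, Y):
        h = sigmoid(theta * x)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total
```

Evaluating both at any θ shows ln(likelihood) = -neg_log_likelihood, which is exactly the step taken above.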

So in summary, what we've done is try to get the probability that we get the data set's outputs given the data set's inputs and a choice of parameters. This is what we want to maximize.

Finding this probability in turn depends on finding the individual probabilities for each data point. In logistic regression, this is a direct consequence of the definition of h, but for linear regression it is based on the assumption that the output, once we subtract out our model, is Gaussian distributed with mean 0.

Once we have the overall probability, we seek to maximize it, and taking logs and changing sign to turn it into a minimization gets us our cost function.