# Machine Learning: Feedforward backpropagation neural networks

If we take a logistic function as in logistic regression, and feed the outputs of many logistic regressions into another logistic regression, and do this for several levels, we end up with a neural network architecture. This works nicely to increase the number of parameters as well as the number of features from the basic set you have, since a neural network's hidden layers act as new features.

Each non-input neuron in a layer gets its inputs from every neuron from the previous layer, including a fixed bias neuron which acts as the x0 = 1 term we always have.

Rather than θ, we now call the parameters weights, and the outputs are now called activations. The equation for the output (activation) of neuron p in layer l is:

\begin{align*} z^{(l)}_q &= \sum_{p=0}^{s_{l-1}} w_{pq}^{(l-1)}a_p^{(l-1)}\\ a^{(l)}_q &= g\left (z^{(l)}_q \right ) \end{align*}

Breaking it down:

• w(l-1)pq is the weight from neuron p in layer l-1 to neuron q in layer l
• a(l-1)pq is the activation of neuron p in layer l-1, and of course when p=0, the activation is by definition 1.
• z(l)is the usual sum, specifically for neuron q in layer l.
• g is some function, which we can take to be the logistic function.

So we see that the output of any given neuron is a logistic function of its inputs.

We will define the cost function for the entire output, for a single data point, to be as follows:

$J^{(i)} = \frac{1}{2}\sum_{p=1}^{s_L} \left ( y_p^{(i)} - a_p^{(i, L)} \right )^2$

Note that we are using the linear regression cost, because we will want the output to be an actual output rather than a classification. The cost can be defined using the logistic cost function if the output is a classification.

Now, the algorithm proceeds as follows:

1. Compute all the activations for a single data point
2. For each output neuron q, compute:

$\delta^{(L)}_q = -(y_q^{(i)} - a_q^{(i, L)})(1 - a_q^{(i, L)})a_q^{(i, L)}$

3. For each non-output neuron p, working backwards in layers from layer L-1 to layer 1, compute:

$\delta^{(l)}_p = (1 - a_p^{(i, l)})a_p^{(i, l)}\sum_{q=1}^{s_{l+1}} w^{(l)}_{pq}\delta^{(l+1)}_q$

4. Compute the weight updates as follows:

$w^{(l)}_{pq} \leftarrow w^{(l)}_{pq} + \alpha a^{(l)}_p \delta^{(l+1)}_q$

The last step can, in fact, be delayed. Simply present multiple data points, or even the entire training set, adding up the changes to the weights, and then only update the weights afterwards.

Because it is extraordinarily easy to get the implementation wrong, I highly suggest the use of a neural network library such as the impressively expansive Encog as opposed to implementing it yourself. Also, many neural network libraries include training algorithms other than backpropagation.

### The Concrete Example

I used Encog to train a neural network on the concrete data from the earlier post. I first took the log of the output, since that seemed to represent the data better and led to less network error. Then I normalized the data, except I used the range 0-1 for both the days input and the strength output, since that seemed to make sense, and also led to less network error.

Here's the Java code I used. Compile it with the Encog core library in the classpath. The only argument to it is the path to the Concrete_Data.csv file.

The network I chose, after some experimentation checking for under- and overfitting, was an 8:20:10:1 network. I used this network to train against different sizes of training sets to see the learning curves. Each set of data was presented to the network for 10,000 iterations of an algorithm called Resilient Backpropagation, which has various advantages over backpropagation, namely that the learning rate generally doesn't have to be set.

As before, the blue line is the training cost, the mean squared error against the training set, and the red line is the cross-validation cost, the mean squared error against the cross-validation set. This is generally what I would expect for an algorithm that is neither underfitting nor overfitting. Overfitting would show a large gap between training and cross-validation, while underfitting would show high errors for both.

If we saw underfitting, then we would have to increase the parameter space, which would mean increase the number of neurons in the hidden layers. If we saw overfitting, then decreasing the parameters space would be appropriate, so decreasing the number of neurons in the hidden layers would help.

Since the range of the output is 0-1, over the entire training set we get an MSE (training) of 0.0003, which means the average error per data point is 0.017. This doesn't quite tell the whole story, because if an output is supposed to be, say, 0.01, and error of 0.017 means the output wasn't very well-fit. Instead, let's just look at the entire data set, ordered by value, after denormalization:

The majority of errors fall under 10%, which is probably good enough. If I were concerned with the data points whose error was above 10%, I might be tempted treat those data points as "difficult", try to train a classifier to train data points as "difficult" or "not difficult", and then train different regression networks on each class.

The problem with that is that I could end up overfitting my data again, this time manually. If I manually divide my points into "difficult" and "not difficult" points, then what is the difference between that and having more than two classes? How about as many classes as there are data points?

What would be nice is if I could have an automatic way to determine if there is more than one cluster in my data set. One clustering algorithm will be the subject of the next post.

# Machine Learning: Linear Regression Example: Concrete

There is a fun archive of machine learning data sets maintained by UC Irvine. For a concrete example, let's take the Concrete Compressive Strength data set and try linear regression on it. (Get it? Concrete? Ha ha ha!) There are 1030 points in the data set, eight input features, and one output feature. Here is the basic info:

Feature #

Name Range
1 Amount of Cement (kg/m3) 102 - 540
2 Amount of Blast Furnace Slag (kg/m3) 0 - 359.4
3 Amount of Fly Ash (kg/m3) 0 - 200.1
4 Amount of Water (kg/m3) 121.75 - 247
5 Amount of Superplasticizer (kg/m3) 0 - 32.2
6 Amount of Coarse Aggregate (kg/m3) 801 - 1145
7 Amount of Fine Aggregate (kg/m3) 594 - 992.6
8 Mixture Age (days) 1 - 365
Output Compressive Strength (MPa) 2.3 - 82.6

In keeping with the principle that the ranges of the features should be scaled to the range (-1, 1), we will subtract the midpoint of each range from each feature and divide by the new maximum. So, for example, the midpoint of feature 1 is 321, so subtracting brings the range to (-219, 219), and dividing by 219 brings the range to (-1, 1).

Here's a plot of feature 1 versus the output. There's a lot of variation, but it does sort of look roughly correlated.

Here's the Octave code: concrete_regression.m. You'll also need to open Concrete_Data.xls and export to CSV to Concrete_Data.csv so that Octave can read the file. Place both Octave and CSV files in the same directory, change to that directory, run Octave, and then call concrete_regression().

Here is the learning curve and the parameters found. The training cost is in blue, while the cross-validation cost is in red.

\begin{align*} h_\theta &= 0.313 + 0.362x_1 + 0.138x_2 + 0.011x_3 -0.365x_4\\ &+ 0.182x_5 -0.111x_6 -0.190x_7 +0.361x_8 \end{align*}

The training and cross-validation costs are very close to each other, which is good. It means that the learned parameters are quite representative of the entire data set, so there is no overfitting.

However, the cost appears to be quite high: about 0.037. This means that the output is, on average, off by 0.27. Which is, by nearly any standard, terrible. Clearly there is some intense underfitting going on, and the only remedy is to get more features and more parameters.

But we could combine the features in infinite ways. How are we going to find good ways to combine the features? We'll take a look at neural networks in the next article, which essentially combines the features for us, and gives us more parameters as well.

# Machine Learning: Linear and Logistic Regression Unified

Warning: very twisty math ahead. Feel free to skip.

Linear and logistic regression use different cost functions:

\begin{align*} \text{linear:} & \begin{cases} h_\theta = \sum_j \theta_jx_j \\ J_\theta = \frac{1}{2m}\sum_i \left ( h_\theta( x^{(i)} ) - y^{(i)} \right )^2 \end{cases} \\ \\ \text{logistic:} & \begin{cases} h_\theta = \frac{1}{1+e^{-\sum_j \theta_jx_j}} \\ J_\theta = -\frac{1}{m}\sum_i \left [ y^{(i)} \ln h_\theta( x^{(i)} ) + \left (1-y^{(i)} \right ) \ln \left (1-h_\theta( x^{(i)} )\right ) \right ] \end{cases} \end{align*}

The real question is, why are the cost functions so different? It turns out that we can derive the cost functions from the same principles.

We start by claiming that whatever function (that is, model of reality) we choose, the outputs in the data set are based on that function of the inputs, plus some randomness. Nearly everything in reality is probabilistic to a greater or lesser extent. This is the whole basis of the field of statistics.

So let's write out the relationship between y, the actual output, and h, the estimated output:

$y^{(i)} - h_\theta(x^{(i)}) = \varepsilon ^{(i)}$

In the above equation, ε(i) represents the error between y(i) and h(i). ε is, therefore, a random variable. Remember that we are assuming a reality that is probabilistic, which means that given the inputs x(i) reality will generate y(i) with some probability. The goal of the model is to get rid of as much of the random variation as possible, leaving us only with some small error. The claim is that this error always has mean 0 and is Gaussian. This simply means that our model's output is centered smack in the middle of the probability distribution for reality's output, and that a real output farther away from the mean is less likely.

Now the probability distribution of ε(i) given a particular data point x(i) and a particular set of parameters θ (because, after all, it is our choice of data point and parameters that leads to the error) is, as we said, Gaussian with mean 0, and some standard deviation σ. We're going to assume that the standard deviation in the output is the same no matter where in the input space we are. The technical term for this is homoscedasticity (homo-, meaning the same, and Greek skedasis, meaning a dispersal) so now you can bring that up at parties.

We write the probability density:

$p(\varepsilon^{(i)} | x^{(i)}, \theta) = \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{(\varepsilon^{(i)})^2}{2\sigma^2}}$

Now, note that y is simply ε plus a function of x, which is just another way of saying that the output of reality is probabilistic, but specifically, given our model, it must be Gaussian:

$p(y^{(i)} | x^{(i)}, \theta) = \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{(\varepsilon^{(i)})^2}{2\sigma^2}} = \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{\left ( y^{(i)} - h_\theta(x^{(i)}) \right )^2}{2\sigma^2}}$

Now, let's find the probability for the entire data set. This is called the likelihood, and of course it still depends on our choice of parameters. This is just the probabilities of each of the data points, multiplied together:

$p(Y | X, \theta) = \prod_i p(y^{(i)} | x^{(i)}, \theta) = \left ( \frac{1}{\sqrt{2\pi} \sigma} \right )^m e^{-\sum_i\frac{\left ( y^{(i)} - h_\theta(x^{(i)}) \right )^2}{2\sigma^2}}$

And now here is the key: this is precisely what we need to maximize. We want to maximize the probability that we get our data set output given the inputs and our parameters. Or, we want to get, as it is known, maximum likelihood.

Now, since the logarithm maintains the property that it is monotonically increasing, that is, log(x) > log(y) if x > y, we can take the log of the probability and maximize that. This just makes the later math easier:

$\ln p(Y | X, \theta) = \ln \left ( \frac{1}{\sqrt{2\pi} \sigma} \right )^m e^{-\sum_i\frac{\left ( y^{(i)} - h_\theta(x^{(i)}) \right )^2}{2\sigma^2}} = \frac{m}{\sqrt{2\pi} \sigma} - \frac{1}{2\sigma^2}\sum_i \left ( y^{(i)} - h_\theta(x^{(i)}) \right )^2$

Since this is maximization, we can feel free to get rid of any additive and positive multiplicative constants:

\begin{align*} \arg \max_\theta \ln p(Y | X, \theta) &= \arg \max_\theta \frac{m}{\sqrt{2\pi} \sigma} - \frac{1}{2\sigma^2}\sum_i \left ( y^{(i)} - h_\theta(x^{(i)}) \right )^2\\ &= \arg \max_\theta - \sum_i \left ( y^{(i)} - h_\theta(x^{(i)}) \right )^2 \\ &= \arg \min_\theta \sum_i \left ( y^{(i)} - h_\theta(x^{(i)}) \right )^2 \end{align*}

arg maxθ just means the value of θ which maximizes. In the last step, I've simply turned the maximization problem into a minimization problem by reversing the sign. And so we see that this is exactly, except for a constant, the cost function for linear regression.

For logistic regression, our interpretation of h was that it is the probability that the data point is in the class. Again, we're assuming that the actual class is probabilistic, but here our model directly tells us the probability of the output being what it was in the data set. Because of that interpretation, we don't need to mess around with Gaussian errors, and we can go directly to the probability distribution:

$p(y^{(i)} = 1 | x^{(i)}, \theta) = h_\theta(x^{(i)})$

And now, the probability that we get our data set is just the probability that each of our outputs classifies the data points correctly is (using one particular formulation out of many possible that is nice when we take logs later):

$p(Y | X, \theta) = \prod_i h_\theta(x^{(i)})^{y^{(i)}} \left ( 1-h_\theta(x^{(i)}) \right )^{1-y^{(i)}}$

Taking logs and maximizing/minimizing:

\begin{align*} \arg \max_\theta \ln p(Y | X, \theta) &= \arg \max_\theta \sum_i \left [ y^{(i)} \ln h_\theta(x^{(i)}) + (1-y^{(i)}) \ln \left ( 1-h_\theta(x^{(i)}) \right ) \right ]\\ &= \arg \min_\theta -\sum_i \left [ y^{(i)} \ln h_\theta(x^{(i)}) + (1-y^{(i)}) \ln \left ( 1-h_\theta(x^{(i)}) \right ) \right ] \end{align*}

And this is exactly, except for a constant, the cost function for logistic regression.

So in summary, what we've done is try to get the probability that we get the data set's outputs given the data set's inputs and a choice of parameters. This is what we want to maximize.

Finding this probability in turn depends on finding the individual probabilities for each data point. In logistic regression, this is a direct consequence of the definition of h, but for linear regression it is based in the assumption that the output, once we subtract out our model, is Gaussian distributed with mean 0.

Once we have the overall probability, we seek to maximize it, and taking logs and changing sign to turn it into a minimization gets us our cost function.