neural networks

Machine Learning: Autoencoders

An autoencoding algorithm is an unsupervised learning algorithm which seeks to recreate its input after processing the output. The layer or layers between input and output then become a representation of the input, which may have fewer or more dimensions than the input.

If the internal layer is further restricted so that only a very few of its components are active for any given input, then it is a sparse autoencoder. Generally, if the dimension of the internal layer is less than that of the input layer, then the autoencoder is performing, appropriately enough, dimension reduction. If, however, the number of dimensions is greater, then we enter the realm of feature detection, which, to me anyway, is a much more interesting application of autoencoding. In addition, feature detection appears to be how the brain handles input.

One of the challenges of feature detection is to ensure the internal layers don't degenerate to a trivial representation of the input, that is, simply repeating the input so that each feature is simply an input feature.

I'll start by talking about autoencoding via backpropagation. Before we tackle this, I'd like to rehash the mathematics of backpropagation, but this time in matrix form, which will be much easier to handle. So feel free to skip if you're not really interested.


Backpropagation, a more thorough derivation

We start with the same diagram as before:

Neural network

This time, however, we'll use matrix notation. The equation for the vector of activations for layer l is as follows:


  • a(l) is a column vector of sl elements (i.e. an sl x 1 matrix), the activations of the neurons in layer l,
  • b(l) is a column vector of sl elements (i.e. an sl x 1 matrix), the biases for layer l, equivalent to a fixed input 1 multiplied by a bias weight, separated out so we don't have to deal with a separate and somewhat confusing input augmentation step,
  • W(l-1) is an sl x sl-1 matrix for the weights between layer l-1 and layer l, and
  • g is a squashing function, which we can take to be the logistic function (for range 0 to 1) or the tanh function (for range -1 to 1). Or really any differentiable function.

A quick sanity check for z = Wa + b: W is sl x sl-1, a is sl-1 x 1, so multiplying W x a cancels out the middle, yielding sl x 1, which is consistent with the definitions for z and b.

Now, the cost function for a single data point x(i),y(i) is as follows:

|| a - y || is simply the Euclidean distance between a and y, otherwise known as the L2 norm. Note also that it is a scalar, and not a vector or matrix.

The cost over all data points, and adding a regularization term, is:

That last term simply means to take every weight between every neuron and every other neuron in every layer, square it, and add. We don't take any of the bias terms into the regularization term, as usual.

Now, first, we want to determine how gradient descent moves W(L-1) and b(L):

This just says that we move W downhill in "J-space" with respect to W, and the same with b. Note that since W(L-1) is an sL x sL-1 matrix, then so too must the derivative of J with respect to W(L-1) be. And now let's compute those derivatives. First, the derivative with respect to the weights in the last layer:

Note that we just called the derivative of g with respect to its argument, g'. For the logistic and tanh functions, these are nice, compact derivatives:

Since the argument of g (being z(L,i)) is an sL x 1 matrix, so too is its derivative. a(L-1,i) is an sL-1 x 1 matrix, its transpose is a 1 x sL-1 matrix, and thus g' x a is an sL x sL-1 matrix, which is consistent with what we wanted the size of the derivative of J with respect to W(L-1) to be. 

And now with respect to the bias on the last layer:

Let us define:

Note that this is an sL x 1 matrix. It is the contribution to the weight or bias gradient due to an "error" in output. We can now define our derivatives more compactly:

Now, what about the derivatives with respect to the previous layer weights and bias? The key insight in backpropagation is that we can generalize these derivatives as follows. For l from L to 2 (we start from L because these are recursive equations) we have:

A rigorous mathematical treatment for this is so completely outside the scope of this article as to be invisible :) But the general argument is that delta represents the contribution of a layer to the gradient based on the error between desired output and generated output. For the final layer, this is straightforward, and we can directly calculate it. However, for an internal layer, it is as if the errors from the next layer have propagated backwards through the weights, and so we can calculate, from output to input, the contributions of each layer.


Backpropagation, the algorithm

First, zero out an accumulator for each layer. The accumulators have the same dimensions as the weight and bias matrices. So for l from 2 to L:

Second, compute all the forward activations a for a single data point. So, for l from 2 to L, we have:

Compute the delta terms for l from L to 2, and add to the accumulators:

Next, after doing the above two steps for each data point, we compute the gradients for l from 2 to L:

Finally, we use these gradients to go downhill, for l from 2 to L:

That is one round of updates. We start from zeroing out the accumulators to do the next iteration, and continue until it doesn't look like the cost is getting any lower.

Instead of the above, we could provide a function which, given W and b, computes the cost and the derivatives. Then we give that function to a library which does minimization. Sometimes minimization libraries do a better job at minimizing than manually doing gradient descent, and some of the libraries don't need a learning parameter (alpha).


Adding a sparseness criterion

The whole reason for going through the derivation and not going straight to the algorithm was so that we could add a sparseness measure in the cost function, and see how that affects the algorithm.

First, if we have d dimensions in the input, then an autoencoder will be a d:1:d network.

We will first determine the average activation of layer 2 over all data points:

Note that this is an s2 x 1 matrix. To be sparse, we want the values of each element to be very low. If we're using a logistic function, this means near to zero. If we're using the tanh function, near to -1, but we will rescale the average activation to lie between 0 and 1 by adding 1 and dividing by 2.

Let us denote our target sparsity for each element as ρ, so that we want our measured sparsity to be close to that. Clearly we don't want ρ=0, because that would give us a trival solution: zero weights everywhere.

For a sparsity cost, we will use the following measure, known as the Kullback-Leibler divergence:

Note that the sum applies element-by-element to the measured sparsity vector, and so the cost is a scalar. This cost is zero when each measured sparsity element is equal to the desired sparsity, and rises otherwise.

We add this cost to our main cost function as follows:

where β is just a parameter whereby we can tune the importance of the sparsity cost.

Without going through the derivation, we use the following altered delta for layer 2 during backpropagation:

That scalar term added due to sparsity is computed only after all data points have been fed forwards through the network, because that is the only way to determine the average activation of layer 2. It is independent, therefore, of i. So the modification to backpropagation would require this:

  1. Going through the data set, compute all the way up to layer 2, and accumulate the sum of the activations for each neuron in layer 2.
  2. Divide the sums by m, the number of points in the data set.
  3. Perform one iteration of backpropagation.
  4. Go back to step 1.


Why I don't like this

It is said that backpropagation is not biologically plausible, that is, it cannot be the algorithm used by the brain. There are several reasons for this, chief among which is that errors do not propagate backwards in the brain.

A sparse backpropagating autoencoder is doubly implausible, because not only does it rely on backpropagation, but it also requires that we wait until all data points are presented before determining the average activation. It would be much nicer if we had something more biologically plausible, if only because I have the suspicion that any algorithm that is not biologically plausible cannot lead to human-level intelligence.

So in the next article, I'll talk about a biologically plausible algorithm called the reverse Boltzmann machine.

Machine Learning: Feedforward backpropagation neural networks

If we take a logistic function as in logistic regression, and feed the outputs of many logistic regressions into another logistic regression, and do this for several levels, we end up with a neural network architecture. This works nicely to increase the number of parameters as well as the number of features from the basic set you have, since a neural network's hidden layers act as new features.

Neural network

Each non-input neuron in a layer gets its inputs from every neuron from the previous layer, including a fixed bias neuron which acts as the x0 = 1 term we always have.

Rather than θ, we now call the parameters weights, and the outputs are now called activations. The equation for the output (activation) of neuron p in layer l is:

Breaking it down:

  • w(l-1)pq is the weight from neuron p in layer l-1 to neuron q in layer l
  • a(l-1)pq is the activation of neuron p in layer l-1, and of course when p=0, the activation is by definition 1.
  • z(l)is the usual sum, specifically for neuron q in layer l.
  • g is some function, which we can take to be the logistic function.

So we see that the output of any given neuron is a logistic function of its inputs.

We will define the cost function for the entire output, for a single data point, to be as follows:

Note that we are using the linear regression cost, because we will want the output to be an actual output rather than a classification. The cost can be defined using the logistic cost function if the output is a classification.

Now, the algorithm proceeds as follows:

  1. Compute all the activations for a single data point
  2. For each output neuron q, compute:

  3. For each non-output neuron p, working backwards in layers from layer L-1 to layer 1, compute:

  4. Compute the weight updates as follows:


The last step can, in fact, be delayed. Simply present multiple data points, or even the entire training set, adding up the changes to the weights, and then only update the weights afterwards.

Because it is extraordinarily easy to get the implementation wrong, I highly suggest the use of a neural network library such as the impressively expansive Encog as opposed to implementing it yourself. Also, many neural network libraries include training algorithms other than backpropagation.


The Concrete Example

I used Encog to train a neural network on the concrete data from the earlier post. I first took the log of the output, since that seemed to represent the data better and led to less network error. Then I normalized the data, except I used the range 0-1 for both the days input and the strength output, since that seemed to make sense, and also led to less network error.

Here's the Java code I used. Compile it with the Encog core library in the classpath. The only argument to it is the path to the Concrete_Data.csv file.

The network I chose, after some experimentation checking for under- and overfitting, was an 8:20:10:1 network. I used this network to train against different sizes of training sets to see the learning curves. Each set of data was presented to the network for 10,000 iterations of an algorithm called Resilient Backpropagation, which has various advantages over backpropagation, namely that the learning rate generally doesn't have to be set.

Learning curve neural

As before, the blue line is the training cost, the mean squared error against the training set, and the red line is the cross-validation cost, the mean squared error against the cross-validation set. This is generally what I would expect for an algorithm that is neither underfitting nor overfitting. Overfitting would show a large gap between training and cross-validation, while underfitting would show high errors for both. 

If we saw underfitting, then we would have to increase the parameter space, which would mean increase the number of neurons in the hidden layers. If we saw overfitting, then decreasing the parameters space would be appropriate, so decreasing the number of neurons in the hidden layers would help.

Since the range of the output is 0-1, over the entire training set we get an MSE (training) of 0.0003, which means the average error per data point is 0.017. This doesn't quite tell the whole story, because if an output is supposed to be, say, 0.01, and error of 0.017 means the output wasn't very well-fit. Instead, let's just look at the entire data set, ordered by value, after denormalization:

Errors training neural

Errors cv neural

The majority of errors fall under 10%, which is probably good enough. If I were concerned with the data points whose error was above 10%, I might be tempted treat those data points as "difficult", try to train a classifier to train data points as "difficult" or "not difficult", and then train different regression networks on each class.

The problem with that is that I could end up overfitting my data again, this time manually. If I manually divide my points into "difficult" and "not difficult" points, then what is the difference between that and having more than two classes? How about as many classes as there are data points?

What would be nice is if I could have an automatic way to determine if there is more than one cluster in my data set. One clustering algorithm will be the subject of the next post.