# Machine Learning: Restricted Boltzmann Machine

The Restricted Boltzmann Machine is an autoencoder which is trained with a biologically plausible algorithm. It uses a kind of Hebbian learning, embodying the idea that "neurons that fire together, wire together".

Suppose we have an input layer of dimension Nin and an output layer of dimension Nout, with no hidden layer in between. Every input unit is connected to every output unit, but input units are not connected to each other, and neither are output units. This restriction is where the "Restricted" in the name comes from.

Let the input units be binary, so either 0 or 1. Further, each output unit uses the logistic function, so that the output of unit j is:

\begin{align*} z_j &= b_j + \sum_{i=1}^{N_{in}} w_{ij}x_i \\ p_j &= \frac{1}{1+e^{-z_j}} \\ y_j &= \begin{cases} 1 & \text{ if } \mathrm{uniform}(0,1) < p_j \\ 0 & \text{ otherwise } \end{cases} \end{align*}

bj is the bias term for output unit j, and wij is the weight from input unit i to output unit j. Note that we treat the logistic value as a probability: rather than emitting pj itself, output unit j turns on (outputs 1) with probability pj. That is, the output unit is stochastic.
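As a sketch in Python (the article's example code is in Octave; the function and variable names here are my own), one stochastic upward pass looks like this:

```python
import math
import random

def upward_pass(x, W, b, rng=None):
    """One stochastic upward pass: x (list of 0/1 inputs) -> binary outputs.

    W[i][j] is the weight from input unit i to output unit j; b[j] is the
    bias of output unit j.  Returns (p, y): the activation probabilities
    and the sampled binary outputs.
    """
    rng = rng or random.Random(0)
    p, y = [], []
    for j in range(len(b)):
        z = b[j] + sum(W[i][j] * x[i] for i in range(len(x)))
        pj = 1.0 / (1.0 + math.exp(-z))          # logistic "probability"
        p.append(pj)
        y.append(1 if rng.random() < pj else 0)  # stochastic binary unit
    return p, y
```

With zero weights and biases, each output is on with probability exactly 0.5, which is a handy sanity check.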

Now, to make this an autoencoder, we want to feed the outputs backwards to the inputs, which works as follows:

\begin{align*} z_i &= c_i + \sum_{j=1}^{N_{out}} w_{ij}y_j \\ p_i &= \frac{1}{1+e^{-z_i}} \\ x'_i &= \begin{cases} 1 & \text{ if } \mathrm{uniform}(0,1) < p_i \\ 0 & \text{ otherwise } \end{cases} \end{align*}

We're just doing the same thing in reverse, except that there is a bias ci now associated with each input unit. There is some evidence that this kind of feedback happens biologically.

After this downward pass, we perform one more upward pass, but this time using the probability input rather than the reconstructed input:

\begin{align*} z'_j &= b_j + \sum_{i=1}^{N_{in}} w_{ij}p_i \\ p'_j &= \frac{1}{1+e^{-z'_j}} \\ y'_j &= \begin{cases} 1 & \text{ if } \mathrm{uniform}(0,1) < p'_j \\ 0 & \text{ otherwise } \end{cases} \end{align*}

We do this for every input sample, saving all the values for each sample. When all m samples have been presented (or, in mini-batch learning, when a batch of m samples has been presented), we update the biases and weights as follows:

\begin{align*} \Delta w_{ij} &= \eta\frac{1}{m}\sum_{k=1}^{m} x_i^{(k)} p_j^{(k)} - \eta\frac{1}{m}\sum_{k=1}^{m} p_i^{(k)} {p'}_j^{(k)} \\ &= \eta \left ( \langle x_i p_j \rangle - \langle p_i p'_j \rangle \right ) \\ \Delta b_j &= \eta \left ( \langle p_j \rangle - \langle p'_j \rangle \right )\\ \Delta c_i &= \eta \left ( \langle x_i \rangle - \langle p_i \rangle \right )\\ \end{align*}
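Putting the three passes together, here is a sketch in Python of the statistics gathered for one sample (the article's code is in Octave; `cd1_statistics` and the other names are mine). Averaging `pos` and `neg` over the batch and taking η times their difference gives Δw:

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def cd1_statistics(x, W, b, c, rng=None):
    """Up, down, and up again for one sample x.  Returns the positive
    statistics x_i*p_j, the negative statistics p_i*p'_j, and the
    probabilities needed for the bias updates.
    W[i][j]: weight i->j; b[j]: output biases; c[i]: input biases."""
    rng = rng or random.Random(0)
    n_in, n_out = len(c), len(b)
    # upward pass: probabilities p_j and sampled binary outputs y_j
    p = [logistic(b[j] + sum(W[i][j] * x[i] for i in range(n_in))) for j in range(n_out)]
    y = [1 if rng.random() < pj else 0 for pj in p]
    # downward pass: reconstruction probabilities p_i
    pr = [logistic(c[i] + sum(W[i][j] * y[j] for j in range(n_out))) for i in range(n_in)]
    # second upward pass, driven by the probabilities p_i (not x'_i)
    p2 = [logistic(b[j] + sum(W[i][j] * pr[i] for i in range(n_in))) for j in range(n_out)]
    pos = [[x[i] * p[j] for j in range(n_out)] for i in range(n_in)]
    neg = [[pr[i] * p2[j] for j in range(n_out)] for i in range(n_in)]
    return pos, neg, p, p2, pr
```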

And that's the Restricted Boltzmann Machine. There are other refinements such as momentum and regularization which I won't cover, but which are implemented in the example Octave file. There are also many extremely helpful hints as to parameter settings in Hinton's paper *A practical guide to training restricted Boltzmann machines*.

As an example (Octave code here), I set up a 9x9 array of inputs which correspond to pixels (0=off, 1=on), so my input dimension is 81. I generate a bunch of samples by placing some random vertical and horizontal lines in, but with more lines being less likely. According to the paper, a good initial guess as to the number of output units is based on the number of bits a "good" model of the input data would need, multiplied by 10% of the number of training cases, divided by the number of weights per output unit.

Each sample has zero horizontal lines with probability 0.7, one horizontal line with probability 0.21, and two horizontal lines with probability 0.09, with a vanishingly small probability of three; likewise for vertical lines. Each line can be at any of nine positions, so a "good enough" model of the input would need something like 9 bits for the horizontal lines and 9 bits for the vertical lines, or 18 bits. With 3000 or so samples, 300 × 18 / 81 ≈ 67 output units.

I also used 2000 epochs, averaging weights through time, a momentum term which starts at 0.5, then switches to 0.9 halfway through, a learning rate which starts at 1, then switches to 3 halfway through, and a regularization parameter of 0.001.

I set aside 10% of the samples as cross-validation, and measure the training and cross-validation costs. The cost (per sample per pixel) is defined as:

$J = -\frac{1}{mN_{in}}\sum_{i=1}^{N_{in}} \sum_{k=1}^m \left ( x_i^{(k)} \ln{p_i^{(k)}} + \left ( 1- x_i^{(k)} \right ) \ln{\left (1-p_i^{(k)} \right )} \right )$

In other words, this is the usual logistic cost function, except that instead of the output, we're using the probability during the downward pass that an input pixel is 1.
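As a sketch of this cost in Python (names mine), where `down_probs[k][i]` is the downward-pass probability p_i for sample k:

```python
import math

def reconstruction_cost(samples, down_probs):
    """Mean cross-entropy per sample per pixel, using the downward-pass
    probability that each input pixel is 1 (not the sampled pixel)."""
    m = len(samples)
    n = len(samples[0])
    total = 0.0
    for x, p in zip(samples, down_probs):
        for xi, pi in zip(x, p):
            total -= xi * math.log(pi) + (1 - xi) * math.log(1 - pi)
    return total / (m * n)
```

For a single on-pixel reconstructed with probability 0.9, this gives -ln 0.9, the 0.105 figure discussed below.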

Here are 16 example input samples:

And here are the trajectories of the training and cross-validation costs over the 2000 epochs. Strictly speaking, the cost is a better measure than the error in the reconstructed input, since it is not as affected by chance pixel flips during reconstruction (output to input) as a direct pixel-to-pixel comparison is. For example, if an input pixel should be on, and it is turned on only 90% of the time during reconstruction, then the cost for that pixel is -ln 0.9 ≈ 0.105. If we were to actually reconstruct that pixel, then 10% of the time the error would be 1, and 90% of the time the error would be 0; but we only do a single evaluation. So the cost gives us a better idea of how likely a pixel is to be correct.

Blue is the training cost, while red is the cross-validation cost. We can see that the cross-validation cost is only a little higher than the training cost, indicating that there is likely no overfitting or underfitting going on. The sudden acceleration at epoch 1000 is due to changing the momentum and learning rate at that point. The end training cost is about 0.004, while the cross-validation cost is about 0.005.

Here is a view of what the weights for each of the 67 output units encode:

Interestingly, only one output unit seems to encode a single line. The others all seem to encode linear combinations of lines, some much more strongly than others. The data shows that on average, 38 of the 67 output units are active (although not all the same ones), while at most 51 are active (again, not all the same ones).

Varying the number of output units affects the final cost, apparently with order less than log N. The end cost for 200 units is about 0.002, as is the cross-validation cost. The average number of activated outputs appears to be a little over half.

We can learn a second layer of output units by running all the input samples through to the first output layer, and then using their binary outputs as inputs to another RBM layer. We would be using probabilistic binary outputs, so it is important to have enough samples that the next layer gets a good idea of what the input distribution is. We can use the probability outputs directly, but I've found, at least with this toy problem, that this doesn't seem to lead to significantly better results.
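The greedy layer-wise scheme can be sketched as follows; `train_rbm` and `sample_hidden` are hypothetical stand-ins for the single-layer training and upward-sampling machinery described above:

```python
def train_stack(samples, layer_sizes, train_rbm, sample_hidden):
    """Greedy layer-wise training: each RBM is trained on the sampled
    binary outputs of the layer below.
    train_rbm(data, n_out) -> trained layer parameters;
    sample_hidden(params, x) -> binary hidden vector for input x."""
    layers = []
    data = samples
    for n_out in layer_sizes:
        params = train_rbm(data, n_out)
        layers.append(params)
        # the binary outputs of this layer become the next layer's inputs
        data = [sample_hidden(params, x) for x in data]
    return layers
```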

To try this, I'll use 100 units in the first layer, which could be overkill, and a variable number of units in the second layer, from 10 to 200. To get the cost, I can run the output all the way back to the input layer. Here's the result in log cost per pixel:

So this isn't so good: the cost for the second layer is much higher than that for the first layer. One possibility is that the first layer is so good that the second layer is not necessary at all. But even then, I would have thought we would get at least the same cost.

That's where I'll leave this article right now. Possible future investigations would be more complex inputs, and why layer 2 refuses to be as good as layer 1.

# Machine Learning: Autoencoders

An autoencoding algorithm is an unsupervised learning algorithm which seeks to recreate its input at its output. The layer or layers between input and output then become a representation of the input, which may have fewer or more dimensions than the input.

If the internal layer is further restricted so that only a very few of its components are active for any given input, then it is a sparse autoencoder. Generally, if the dimension of the internal layer is less than that of the input layer, then the autoencoder is performing, appropriately enough, dimension reduction. If, however, the number of dimensions is greater, then we enter the realm of feature detection, which, to me anyway, is a much more interesting application of autoencoding. In addition, feature detection appears to be how the brain handles input.

One of the challenges of feature detection is to ensure the internal layers don't degenerate into a trivial representation of the input, that is, simply repeating the input so that each detected feature is just one input feature.

I'll start by talking about autoencoding via backpropagation. Before we tackle this, I'd like to rehash the mathematics of backpropagation, but this time in matrix form, which will be much easier to handle. So feel free to skip if you're not really interested.

### Backpropagation, a more thorough derivation

This time, however, we'll use matrix notation. The equation for the vector of activations for layer l is as follows:

\begin{align*} z^{(l)} &= W^{(l-1)}a^{(l-1)} + b^{(l)}\\ a^{(l)} &= g \left ( z^{(l)} \right ) \end{align*}

where:

• a(l) is a column vector of sl elements (i.e. an sl x 1 matrix), the activations of the neurons in layer l,
• b(l) is a column vector of sl elements (i.e. an sl x 1 matrix), the biases for layer l, equivalent to a fixed input 1 multiplied by a bias weight, separated out so we don't have to deal with a separate and somewhat confusing input augmentation step,
• W(l-1) is an sl x sl-1 matrix for the weights between layer l-1 and layer l, and
• g is a squashing function, which we can take to be the logistic function (for range 0 to 1) or the tanh function (for range -1 to 1). Or really any differentiable function.

A quick sanity check for z = Wa + b: W is sl x sl-1, a is sl-1 x 1, so multiplying W x a cancels out the middle, yielding sl x 1, which is consistent with the definitions for z and b.
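The layer equations above can be sketched directly in Python (list-based, no matrix library; the names are mine):

```python
import math

def matvec(W, a):
    """(s_l x s_{l-1}) weight matrix times an s_{l-1} activation vector."""
    return [sum(w * x for w, x in zip(row, a)) for row in W]

def forward(a1, weights, biases, g=lambda z: 1.0 / (1.0 + math.exp(-z))):
    """Compute z^(l) = W^(l-1) a^(l-1) + b^(l) and a^(l) = g(z^(l))
    layer by layer.  weights[0] maps layer 1 to layer 2, and so on.
    Returns the list of z vectors and the list of activation vectors."""
    zs, activations = [], [a1]
    a = a1
    for W, b in zip(weights, biases):
        z = [zi + bi for zi, bi in zip(matvec(W, a), b)]
        a = [g(zi) for zi in z]
        zs.append(z)
        activations.append(a)
    return zs, activations
```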

Now, the cost function for a single data point x(i),y(i) is as follows:

$J^{(i)}(W, b) = \frac{1}{2}\left \| a^{(L,i)} - y^{(i)} \right \|^2$

|| a - y || is simply the Euclidean distance between a and y, otherwise known as the L2 norm. Note also that it is a scalar, and not a vector or matrix.

The cost over all data points, and adding a regularization term, is:

$J(W, b) = \frac{1}{m}\sum_i^m \frac{1}{2}\left \| a^{(L,i)} - y^{(i)} \right \|^2 + \frac{\lambda}{2}\sum_{l,u,v} \left ( W_{uv}^{(l)} \right )^2$

That last term simply means to take every weight between every pair of connected neurons in every layer, square it, and add them all up. As usual, we don't include any of the bias terms in the regularization.

Now, first, we want to determine how gradient descent moves W(L-1) and b(L):

\begin{align*} W^{(L-1)} &\leftarrow W^{(L-1)} - \alpha \frac{\partial J(W,b)}{\partial W^{(L-1)}} \\ b^{(L)} &\leftarrow b^{(L)} - \alpha \frac{\partial J(W,b)}{\partial b^{(L)}} \end{align*}

This just says that we move W downhill in "J-space" with respect to W, and the same with b. Note that since W(L-1) is an sL x sL-1 matrix, then so too must the derivative of J with respect to W(L-1) be. And now let's compute those derivatives. First, the derivative with respect to the weights in the last layer:

\begin{align*} \frac{\partial J(W,b)}{\partial W^{(L-1)}} &= \frac{1}{m} \sum_i^m \left ( \left ( a^{(L,i)} - y^{(i)} \right ) \circ \frac{\partial a^{(L,i)}}{\partial W^{(L-1)}} \right ) + \lambda W^{(L-1)} \\ \frac{\partial a^{(L,i)}}{\partial W^{(L-1)}} &= \frac{\partial g\left (z^{(L,i)} \right )}{\partial W^{(L-1)}} \\ &= \frac{\partial g\left (z^{(L,i)} \right )}{\partial z^{(L,i)}} \left ( \frac{\partial z^{(L,i)} }{\partial W^{(L-1)}}\right )^\top \\ \frac{\partial g\left (z^{(L,i)} \right )}{\partial z^{(L,i)}} &= g'\left (z^{(L,i)} \right ) \\ \frac{\partial z^{(L,i)} }{\partial W^{(L-1)}} &= a^{(L-1, i)} \\ \therefore \frac{\partial J(W,b)}{\partial W^{(L-1)}} &= \frac{1}{m} \sum_i^m \left ( \left ( a^{(L,i)} - y^{(i)} \right ) \circ g'\left (z^{(L,i)} \right ) \right ) (a^{(L-1, i)})^\top + \lambda W^{(L-1)} \end{align*}

The factor (a - y) comes from differentiating the squared norm, and ∘ denotes the element-wise product. Note that we just called the derivative of g with respect to its argument, g'. For the logistic and tanh functions, these are nice, compact derivatives:

\begin{align*} g'_{\text{logistic}}(z) &= g(z) \left ( 1-g(z) \right ) \\ g'_{\text{tanh}}(z) &= 1-g^2(z) \end{align*}

Since the argument of g (being z(L,i)) is an sL x 1 matrix, so too is its derivative. a(L-1,i) is an sL-1 x 1 matrix, its transpose is a 1 x sL-1 matrix, and thus g' x a is an sL x sL-1 matrix, which is consistent with what we wanted the size of the derivative of J with respect to W(L-1) to be.
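Those compact derivative identities are easy to check numerically with a central difference; here is a quick Python sanity check (names mine):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def numeric_deriv(f, z, h=1e-6):
    """Central-difference approximation to f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)

# g'_logistic(z) = g(z)(1 - g(z)); g'_tanh(z) = 1 - tanh(z)^2
for z in (-2.0, 0.0, 1.5):
    assert abs(numeric_deriv(logistic, z) - logistic(z) * (1 - logistic(z))) < 1e-8
    assert abs(numeric_deriv(math.tanh, z) - (1 - math.tanh(z) ** 2)) < 1e-8
```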

And now with respect to the bias on the last layer:

\begin{align*} \frac{\partial J(W,b)}{\partial b^{(L)}} &= \frac{1}{m} \sum_i^m \left ( \left ( a^{(L,i)} - y^{(i)} \right ) \circ \frac{\partial a^{(L,i)}}{\partial b^{(L)}} \right ) \\ \frac{\partial a^{(L,i)}}{\partial b^{(L)}} &= \frac{\partial g\left (z^{(L,i)} \right )}{\partial b^{(L)}} \\ &= \frac{\partial g\left (z^{(L,i)} \right )}{\partial z^{(L,i)}} \left ( \frac{\partial z^{(L,i)} }{\partial b^{(L)}}\right )^\top \\ \frac{\partial g\left (z^{(L,i)} \right )}{\partial z^{(L,i)}} &= g'\left (z^{(L,i)} \right ) \\ \frac{\partial z^{(L,i)} }{\partial b^{(L)}} &= 1 \\ \therefore \frac{\partial J(W,b)}{\partial b^{(L)}} &= \frac{1}{m} \sum_i^m \left ( \left ( a^{(L,i)} - y^{(i)} \right ) \circ g'\left (z^{(L,i)} \right ) \right ) \end{align*}

Let us define:

$\delta^{(L,i)} = \left ( a^{(L,i)} - y^{(i)} \right ) \circ g'\left (z^{(L,i)} \right )$

Note that this is an sL x 1 matrix. It is the contribution to the weight or bias gradient due to an "error" in output. We can now define our derivatives more compactly:

\begin{align*} \frac{\partial J}{\partial W^{(L-1)}} &= \frac{1}{m}\sum_i^m \delta^{(L,i)} (a^{(L-1,i)})^\top + \lambda W^{(L-1)}\\ \frac{\partial J}{\partial b^{(L)}} &= \frac{1}{m}\sum_i^m \delta^{(L,i)} \end{align*}

Now, what about the derivatives with respect to the previous layer weights and bias? The key insight in backpropagation is that we can generalize these derivatives as follows. For l from L to 2 (we start from L because these are recursive equations) we have:

\begin{align*} \frac{\partial J}{\partial W^{(l-1)}} &= \frac{1}{m}\sum_i^m \delta^{(l,i)} (a^{(l-1,i)})^\top + \lambda W^{(l-1)}\\ \frac{\partial J}{\partial b^{(l)}} &= \frac{1}{m}\sum_i^m \delta^{(l,i)} \\ \delta^{(l,i)} &= \begin{cases} \left ( a^{(L,i)} - y^{(i)} \right ) \circ g' \left ( z^{(L,i)} \right ) & \text{ if } l=L \\ \left ( (W^{(l)})^\top\delta^{(l+1,i)} \right ) \circ g' \left ( z^{(l,i)} \right ) & \text{ if } l<L \end{cases} \end{align*}

A rigorous mathematical treatment for this is so completely outside the scope of this article as to be invisible :) But the general argument is that delta represents the contribution of a layer to the gradient based on the error between desired output and generated output. For the final layer, this is straightforward, and we can directly calculate it. However, for an internal layer, it is as if the errors from the next layer have propagated backwards through the weights, and so we can calculate, from output to input, the contributions of each layer.

### Backpropagation, the algorithm

First, zero out an accumulator for each layer. The accumulators have the same dimensions as the weight and bias matrices. So for l from 2 to L:

\begin{align*} D_w^{(l-1)} &\leftarrow 0 \\ D_b^{(l)} &\leftarrow 0 \\ \end{align*}

Second, compute all the forward activations a for a single data point. So, for l from 2 to L, we have:

\begin{align*} z^{(l)} &= W^{(l-1)}a^{(l-1)} + b^{(l)}\\ a^{(l)} &= g \left ( z^{(l)} \right ) \end{align*}

Compute the delta terms for l from L to 2, and add to the accumulators:

\begin{align*} \delta^{(l,i)} &= \begin{cases} \left ( a^{(l,i)} - y^{(i)} \right ) \circ g' \left ( z^{(l,i)} \right ) & \text{ if } l=L \\ \left ( (W^{(l)})^\top \delta^{(l+1,i)}\right ) \circ g' \left ( z^{(l,i)} \right ) & \text{ if } l<L \end{cases} \\ D_w^{(l-1)} &\leftarrow D_w^{(l-1)} + \delta^{(l,i)} (a^{(l-1,i)})^\top \\ D_b^{(l)} &\leftarrow D_b^{(l)} + \delta^{(l,i)} \end{align*}

Next, after doing the above two steps for each data point, we compute the gradients for l from 2 to L:

\begin{align*} \frac{\partial J}{\partial W^{(l-1)}} &= \frac{1}{m} D_w^{(l-1)} + \lambda W^{(l-1)} \\ \frac{\partial J}{\partial b^{(l)}} &= \frac{1}{m} D_b^{(l)} \end{align*}

Finally, we use these gradients to go downhill, for l from 2 to L:

\begin{align*} W^{(l-1)} &\leftarrow W^{(l-1)} - \alpha \frac{\partial J}{\partial W^{(l-1)}} \\ b^{(l)} &\leftarrow b^{(l)} - \alpha \frac{\partial J}{\partial b^{(l)}} \\ \end{align*}

That is one round of updates. We start from zeroing out the accumulators to do the next iteration, and continue until it doesn't look like the cost is getting any lower.
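The whole round can be sketched in Python (no matrix library; names mine, with zero-based indexing where the text's is one-based, and the regularized gradient step from above folded in):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def dlogistic(z):
    g = logistic(z)
    return g * (1 - g)

def backprop_epoch(X, Y, weights, biases, alpha=0.5, lam=0.0):
    """One full round: accumulate gradients over all (x, y) pairs,
    then take one regularized gradient-descent step, in place.
    weights[l] has shape (units in layer l+2) x (units in layer l+1)."""
    L, m = len(weights), len(X)
    Dw = [[[0.0] * len(W[0]) for _ in W] for W in weights]
    Db = [[0.0] * len(b) for b in biases]
    for x, y in zip(X, Y):
        # forward pass: z = W a + b, a = g(z)
        zs, acts, a = [], [x], x
        for W, b in zip(weights, biases):
            z = [sum(w * ai for w, ai in zip(row, a)) + bj for row, bj in zip(W, b)]
            a = [logistic(zi) for zi in z]
            zs.append(z)
            acts.append(a)
        # backward pass: delta at the output layer, then propagate back
        delta = [(ai - yi) * dlogistic(zi) for ai, yi, zi in zip(acts[-1], y, zs[-1])]
        for l in range(L - 1, -1, -1):
            for j, dj in enumerate(delta):
                Db[l][j] += dj
                for i, ai in enumerate(acts[l]):
                    Dw[l][j][i] += dj * ai
            if l > 0:
                W = weights[l]
                delta = [sum(W[j][i] * delta[j] for j in range(len(delta)))
                         * dlogistic(zs[l - 1][i]) for i in range(len(acts[l]))]
    # gradient step, with weight-decay regularization
    for l in range(L):
        for j in range(len(weights[l])):
            biases[l][j] -= alpha * Db[l][j] / m
            for i in range(len(weights[l][j])):
                weights[l][j][i] -= alpha * (Dw[l][j][i] / m + lam * weights[l][j][i])
```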

Instead of the above, we could provide a function which, given W and b, computes the cost and the derivatives. Then we give that function to a library which does minimization. Sometimes minimization libraries do a better job at minimizing than manually doing gradient descent, and some of the libraries don't need a learning parameter (alpha).

The whole reason for going through the derivation and not going straight to the algorithm was so that we could add a sparseness measure in the cost function, and see how that affects the algorithm.

First, if we have d dimensions in the input, then the autoencoder will be a d:s2:d network, that is, one hidden layer (layer 2) of s2 units between the d-dimensional input and output layers.

We will first determine the average activation of layer 2 over all data points:

$\hat{\rho} = \frac{1}{m}\sum_i^m a^{(2)} ( x^{(i)} )$

Note that this is an s2 x 1 matrix. To be sparse, we want the values of each element to be very low. If we're using a logistic function, this means near to zero. If we're using the tanh function, near to -1, but we will rescale the average activation to lie between 0 and 1 by adding 1 and dividing by 2.

Let us denote our target sparsity for each element as ρ, so that we want our measured sparsity to be close to that. Clearly we don't want ρ=0, because that would give us a trivial solution: zero weights everywhere.

For a sparsity cost, we will use the following measure, known as the Kullback-Leibler divergence:

$J_{\hat{\rho}} = \sum_k^{s_2} \rho \ln \frac{\rho}{\hat{\rho}_k} + (1-\rho) \ln \frac{1-\rho}{1-\hat{\rho}_k}$

Note that the sum applies element-by-element to the measured sparsity vector, and so the cost is a scalar. This cost is zero when each measured sparsity element is equal to the desired sparsity, and rises otherwise.

We add this cost to our main cost function as follows:

$J_\text{sparse}(W,b) = J(W,b) + \beta J_{\hat{\rho}}$

where β is just a parameter whereby we can tune the importance of the sparsity cost.

Without going through the derivation, we use the following altered delta for layer 2 during backpropagation:

$\delta^{(2,i)} = \left ( (W^{(2)})^\top \delta^{(3,i)} + \beta \left ( -\frac{\rho}{\hat{\rho}} + \frac{1-\rho}{1-\hat{\rho}} \right )\right ) \circ g'(z^{(2,i)})$

That sparsity term (an s2 x 1 vector, with the divisions applied element-wise) is computed only after all data points have been fed forwards through the network, because that is the only way to determine the average activation of layer 2. It is therefore independent of i. So the modification to backpropagation would require this:

1. Going through the data set, compute all the way up to layer 2, and accumulate the sum of the activations for each neuron in layer 2.
2. Divide the sums by m, the number of points in the data set.
3. Perform one iteration of backpropagation.
4. Go back to step 1.
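The sparsity cost and the extra delta term can be sketched as follows (Python; `rho_hat` is the vector of average activations from step 2; the names are mine):

```python
import math

def sparsity_terms(rho, rho_hat, beta):
    """KL sparsity penalty and the per-unit term added to delta^(2).
    rho: target sparsity; rho_hat: measured mean activation of each
    hidden unit; beta: weight of the sparsity cost.
    Returns (beta * J_rho_hat, the vector added inside delta^(2))."""
    cost = sum(rho * math.log(rho / rh)
               + (1 - rho) * math.log((1 - rho) / (1 - rh))
               for rh in rho_hat)
    grad = [beta * (-rho / rh + (1 - rho) / (1 - rh)) for rh in rho_hat]
    return beta * cost, grad
```

When every measured activation equals the target, both the cost and the gradient term vanish, as the text says.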

### Why I don't like this

It is said that backpropagation is not biologically plausible, that is, it cannot be the algorithm used by the brain. There are several reasons for this, chief among which is that errors do not propagate backwards in the brain.

A sparse backpropagating autoencoder is doubly implausible, because not only does it rely on backpropagation, but it also requires that we wait until all data points are presented before determining the average activation. It would be much nicer if we had something more biologically plausible, if only because I have the suspicion that any algorithm that is not biologically plausible cannot lead to human-level intelligence.

So in the next article, I'll talk about a biologically plausible algorithm called the restricted Boltzmann machine.

# Machine Learning: K-Means Clustering

K-means clustering is the first unsupervised learning algorithm in this series. Unsupervised means that the answer is not available to the learning algorithm beforehand, just the cost of a potential solution. To me, unsupervised learning algorithms are more exciting than supervised learning algorithms because they seem to transcend human intelligence in a way. An unsupervised learning algorithm will seek out patterns in data without any (or with few) hints. This seems especially important when, as the human, we don't know what the hints could possibly be.

The Google "Visual Cortex" project shows how powerful unsupervised learning algorithms can be: from millions of unlabeled images, the algorithm found generalized categories such as human faces and cat faces. It is easy to see that if the same thing could be done with an audio stream or a text stream, the streams could be combined at a high enough level for association to produce sounds and text for images, images for text and sounds, and at a high enough level, reasoning.

The K-means clustering algorithm treats data as if it were in clusters centered around some number of points k, one cluster per point. Conceptually, the algorithm picks k centroid points, assigns each point in the data to the cluster whose centroid is nearest, moves each centroid to the center of its cluster, and repeats. The result is a set of centroids which minimizes the distances between each point and its associated cluster's centroid.

The cost function is:

$J = \frac{1}{m}\sum_i^k \sum_{j \in C_i} \left \| x_j - \mu_i \right \|^2$

where Ci is cluster i, and μi is the centroid for cluster i.

There are a few methods for picking the initial centroids. One method, the Forgy method, involves picking k random points from the data set to be the initial centroids. Another method, the Random Partition method, assigns each data point to a random cluster, then produces the initial centroids for each cluster. Regardless of the initial method, the algorithm proceeds by repeating the following two steps:

First, produce the clusters by assigning each data point to one cluster. This means comparing the distance of a point to each centroid, and assigning the point to the cluster whose centroid yields the lowest distance.

Second, calculate the centroid of each resulting cluster.

Repeat these steps until the total cost does not change.
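Those two steps, with Forgy initialization, can be sketched in Python (my implementation below is a sketch; the article's actual code is in Java, and the names are mine):

```python
import random

def kmeans(points, k, rng=None, iters=100):
    """K-means with Forgy initialization: pick k random data points as
    the initial centroids, then alternate assignment and update steps
    until the assignments stop changing."""
    rng = rng or random.Random(0)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(iters):
        # assignment step: each point goes to the nearest centroid
        def nearest(p):
            return min(range(k),
                       key=lambda c: sum((x - mu) ** 2
                                         for x, mu in zip(p, centroids[c])))
        new_assign = [nearest(p) for p in points]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        # update step: move each centroid to the mean of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign
```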

### A Concrete Example

I implemented the above algorithm in Java and ran it on the usual concrete strength data set. As usual, I set aside 20% of the data set as a cross-validation. But a problem quickly became apparent: how many clusters should I use? Clearly the more clusters, the less the overall cost would be simply because there would be more centroids.

One solution is to try different numbers of centroids and ask if there is an obvious point where there is not a lot of improvement in the cost. Here is what I found from k=2 through 10:

Blue is the training cost, while red is the cross-validation cost. Interestingly, the cross-validation cost was always below the training cost, indicating that the cross-validation points are well represented by the training points. There is clearly no overfitting, because there is no large gap between the costs. However, there is no obvious point at which increasing the number of clusters stops helping much.

The other solution to the number of clusters relies on evaluating different numbers of clusters only after later processing. If downstream processing works better with a certain number of clusters, then that number of clusters should be chosen. So, for example, if I put each cluster's data through a neural network, how good is the error for each number of clusters?

I trained an 8:10:10:1 neural network on each cluster of points, so k=2 had 2 networks, and k=10 had 10 networks. I used fewer hidden neurons than before, on the theory that each cluster has less data, meaning that I can probably get away with a smaller parameter space. Here are the results:

Clearly, the more clusters and the more networks, the better the output. Perhaps this is because more networks mean smaller clusters, which in turn mean less variation to account for. Interestingly, 8 clusters works about as well as 5 clusters, and it's only with 9 and 10 clusters that more advantage is found. In any case, choosing k=10, here are the errors:

Compared to training a single 8:20:20:1 network on the entire data set, clustering has definitely reduced the errors. Most errors in the training set are now under 5% (down from 10% before), and even the one troubling point from before (error 100%) has been knocked down to an error of 83%. The low errors in the cross-validation points -- which, remember, the network has never seen -- all lead us to believe that the networks trained are not overfit.

I would still want to look at those high-error points, perhaps even asking for the experimental data for those points to be rechecked or even rerun. But for now, I would be happy with this artificially intelligent concrete master.

For the next article, I'm going to go off the syllabus of the Machine Learning course, and talk about one of my favorite unsupervised learning algorithms, the autoencoder.