Machine Learning: Restricted Boltzmann Machine

The Restricted Boltzmann Machine is an autoencoder which uses a biologically plausible algorithm. It uses a kind of Hebbian learning, which is the biologically plausible idea that "neurons that fire together, wire together".

Suppose we have an input layer of dimension Nin and an output layer of dimension Nout, with no hidden layer in between. All input units are connected to all output units, but input units are not connected to each other, and output units are not connected to each other, which is where the Restricted comes in the name.

Let the input units be binary, so either 0 or 1. Further, each output unit uses the logistic function, so that the output of unit j is:

\begin{align*} z_j &= b_j + \sum_{i=1}^{N_{in}} w_{ij}x_i \\ p_j &= \frac{1}{1+e^{-z_j}} \\ y_j &= \begin{cases} 1 & \text{ if } uniformrandom(0,1) < p_j \\ 0 & \text{ otherwise } \end{cases} \end{align*}

bj is the bias term for output unit j, and wij is the weight from input unit i to output unit j. Note that we're treating the logistic function as a probability, but this time the outputs are binary rather than the probability, so that an output unit j is on with probability pj. That is, the output unit is stochastic.

Now, to make this an autoencoder, we want to feed the outputs backwards to the inputs, which works as follows:

\begin{align*} z_i &= c_i + \sum_{j=1}^{N_{out}} w_{ij}y_j \\ p_i &= \frac{1}{1+e^{-z_i}} \\ x'_i &= \begin{cases} 1 & \text{ if } uniformrandom(0,1) < p_i \\ 0 & \text{ otherwise } \end{cases} \end{align*}

We're just doing the same thing in reverse, except that there is a bias ci now associated with each input unit. There is some evidence that this kind of feedback happens biologically.

After this downward pass, we perform one more upward pass, but this time using the probability input rather than the reconstructed input:

\begin{align*} z'_j &= b_j + \sum_{i=1}^{N_{in}} w_{ij}p_i \\ p'_j &= \frac{1}{1+e^{-z'_j}} \\ y'_j &= \begin{cases} 1 & \text{ if } uniformrandom(0,1) < p'_j \\ 0 & \text{ otherwise } \end{cases} \end{align*}

We do this for every input sample, saving all the values for each sample. When all the m samples have been presented (or, in batch learning, when a certain proportion m of the samples have been presented) we update the biases and weights as follows:

\begin{align*} \Delta w_{ij} &= \eta\frac{1}{m}\sum_{k=1}^{m} x_i p_j - \eta\frac{1}{m}\sum_{k=1}^{m} p_i p'_j \\ &= \eta \left ( \langle x_i p_j \rangle - \langle p_i p'_j \rangle \right ) \\ \Delta b_j &= \eta \left ( \langle p_j \rangle - \langle p'_j \rangle \right )\\ \Delta c_i &= \eta \left ( \langle x_i \rangle - \langle p_i \rangle \right )\\ \end{align*}

And that's the Restricted Boltzmann Machine. There are other refinements such as momentum and regularization which I won't cover, but which are implemented in the example Octave file. There are also many extremely helpful hints as to parameter settings in Hinton's paper A practical guide to training restricted Boltzmann machines.

As an example (Octave code here), I set up a 9x9 array of inputs which correspond to pixels (0=off, 1=on), so my input dimension is 81. I generate a bunch of samples by placing some random vertical and horizontal lines in, but with more lines being less likely. According to the paper, a good initial guess as to the number of output units is based on the number of bits a "good" model of the input data would need, multiplied by 10% of the number of training cases, divided by the number of weights per output unit.

Each sample has zero horizontal lines with probability 0.7, one horizontal line with probability 0.21, and two horizontal lines with probability 0.09, with a vanishingly small probability of three lines, and likewise with vertical lines, and each line can be at any of nine positions, which means that a "good enough" model of the input would need something like 9 bits for the horizontal lines and 9 bits for the vertical lines, or 18 bits. With 3000 or so samples, 300 x 18 / 81 = 67 output units.

I also used 2000 epochs, averaging weights through time, a momentum term which starts at 0.5, then switches to 0.9 halfway through, a learning rate which starts at 1, then switches to 3 halfway through, and a regularization parameter of 0.001.

I set aside 10% of the samples as cross-validation, and measure the training and cross-validation costs. The cost (per sample per pixel) is defined as:

$J = -\frac{1}{mN_{in}}\sum_{i=1}^{N_{in}} \sum_{k=1}^m \left ( x_i^{(k)} \ln{p'_i^{(k)}} + \left ( 1- x_i^{(k)} \right ) \ln{\left (1-p'_i^{(k)} \right )} \right )$

In other words, this is the usual logistic cost function, except that instead of the output, we're using the probability during the downward pass that an input pixel is 1.

Here are 16 example input samples:

And here are the trajectories of the training and cross-validation costs over the 2000 epochs. Strictly speaking, the cost is a better measure than error in the reconstructed input since it is not as affected by chance pixel flips during reconstruction (output to input) as a direct pixel-to-pixel comparison. For example, if an input pixel should be on, and it is only turned on 90% of the time during reconstruction, then the cost for that pixel is -ln 0.9 =  0.105. If we were to actually reconstruct that pixel, then 10% of the time the error would be 1, and 90% of the time the error would be 0; but we only do a single evaluation. So the cost gives us a better idea of how likely a pixel is to be correct.

Blue is the training cost, while red is the cross-validation cost. We can see that the cross-validation cost is a little higher than the training cost, indicating that there is likely no over fitting or under fitting going on. The sudden acceleration at epoch 1000 is due to changing the momentum at that point. The end cost is about 0.004 while the cross-validation cost is about 0.005.

Here is a view of what the weights for each of the 67 output units encode:

Interestingly, only one output unit seems to encode a single line. The others all seem to encode linear combinations of lines, some much more strongly than others. The data shows that on average, 38 of the 67 output units are active (although not all the same ones), while at most 51 are active (again, not all the same ones).

Varying the number of output units affects the final cost, apparently with order less than log N. The end cost for 200 units is about 0.002, as is the cross validation cost. The average number of activated outputs appears to be a little over half.

We can learn a second layer of output units by running all the input samples through to the first output layer, and then using their binary outputs as inputs to another RBM layer. We would be using probabilistic binary outputs, so it is important to have enough samples that the next layer gets a good idea of what the input distribution is. We can use the probability outputs directly, but I've found, at least with this toy problem, that this doesn't seem to lead to significantly better results.

To try this, I'll use 100 units in the first layer, which could be overkill, and a variable number of units in the second layer, from 10 to 200. To get the cost, I can run the output all the way back to the input layer. Here's the result in log cost per pixel:

So this isn't so good: the error rate for the second layer is much higher than that for the first layer. One possibility is that the first layer is so good that the second layer is not necessary at all. But then I would have thought we would get at least the same error rate.

That's where I'll leave this article right now. Possible future investigations would be more complex inputs, and why layer 2 refuses to be as good as layer 1.