# Monroe Calculating Machine Company at the Internet Archive

I've recently come into a large cache of original documents from the Monroe Calculating Machine Company, spanning from 1919 to the 1970s. Most of these are Machine Service Bulletins, which were sent out approximately weekly to Monroe field offices around the country. These MSBs ran the gamut from one-pagers describing how to fix something, to 100-page manuals showing in detail how to disassemble, reassemble, and adjust an entire model of calculator.

There are also some course books, instruction manuals, ads, a newsletter or two, reference cards.

I've scanned all of these, and placed them on the Internet Archive for anyone to read!

Of particular interest are:

• The 1919 instruction book for the Monroe Calculating Machine [link]. The machine depicted is a model D, but 1919 would put it in the model G era.
• Machine Service Bulletin 34 [link], a full disassembly, assembly, and adjustment guide to the K-0 and KA-0 calculators.
• Machine Service Bulletin 40 [link], a full disassembly, assembly, and adjustment guide to the K-1AA and K-2AA calculators.
• Machine Service Bulletin 45 [link], showing what a bad day they were having.
• Machine Service Bulletin 95C [link], illustrated parts guide for various L and LA calculators with prices for each part.
• Machine Service Bulletin 116 [link] (with its supplement 116A [link]), a full disassembly, assembly, and adjustment guide to the MA-213 calculator.
• Book 1 in the Service Training Course [link].

Enjoy! And if you have any Monroe publications that aren't in the Archive, please let me know, I'd like to add them.

# Elektronika MS-1103 (Электроника МС-1103) Manual

Probably because of my RPN Nixie calculator project, I got interested in Soviet-era calculators. There are some MC-1103 calculators on eBay, so I bought two. The first one came without a manual, so I decided to open it and reverse-engineer it.

As you can see, it works. Interestingly, it truncated pi. If it rounded, it should end in a 3. I plugged it in to a 110-to-220 converter. However, US AC voltage is 120 which means that I actually get 240 VAC out. This would result in increased AC voltages inside, but the DC voltages should not be affected if they were regulated.

The above image shows the power supply removed from the case. There's a big transformer on right, and what appears to be four series pass transistors. Each such section also had a multi-turn potentiometer, presumably for fine adjust of voltage, and a chip. There were actually two types of chips: кпен2г (KPEN2G) and кп142ен1г (KP142EN1G).

From cpu-ukraine's article on Soviet chip identification, I learned that Soviet chips were all marked in a nation-wide standardized way. KP meant commercial use, plastic package. 142 was the chip family, EN was the type, and 1G was the model.

I also learned, by piecing together statements on various forums and Russian chip resellers that KPEN2G was shorthand for KP142EN2G (so they are the same chip, different models), that they were linear adjustable voltage regulators, that 1 meant Vin max 20v, Vout 3-12v and 2 meant Vin max 40v, Vout 12-30v, and G meant 0.2% regulation. I also found that they were equivalent to the LM723 regulator, but with a different pinout.

The pass transistors were КТ817А and КТ817Б (KT817B). These are NPN power transistors, A meaning 25v max and B meaning 45v max, and they were equivalent to the BD175 transistor.

Putting all this together, I was able to draw a schematic. The power supply provides regulated 5v, 15v, -15v, and -27v.

Anyway, as I was poking around with my multimeter with the power on, I heard a hissing sound and quickly cut the power. One of the capacitors had sprung a leak! Further investigation showed that it had taken its voltage regulator and series pass transistor with it. Maybe I shorted something? In any case, I had to replace these, and I used an LM317, a three-terminal adjustable voltage regulator which combined the chip and the pass transistor.

After a really messy install, the power supply worked again.

Then I got the second calculator, and it came with a manual. And the manual included schematics. Aside from the missing connection between the transformer's terminal 12 and the other end of the VD21, I got most of it right.

Anyway, here's the scan of the user manual I put up on the Internet Archive. Enjoy!

# Raise3D N2: Vibrations (and solution)

Suddenly, a new feature appears :( The y-axis is nice and smooth but the x-axis is vibrating when it moves. It vibrates so hard that it seems to lose steps. Here's a print where the next layer is shifted over to the left a bit.

Sent these to Raise3D. Let's see if there are some adjustments I can make to get rid of the vibration.

Update: After a day of working with Raise3D, no progress with anything we tried. Everything seems good, nothing loose, broken, wobbly, shaky, or out of alignment except one of the screws on one of the bushings is stripped and wouldn't tighten. Replaced, still no good.

Thinking back, when my friend helped me set this printer up, and we tried the first print, he did notice a vibration in the x-axis, but I didn't pay attention because the print seemed to be going okay. I thought maybe it was normal. Clearly not.

The next day, I got some grease. Not SuperLube (PTFE grease), which is THE grease to have, but unfortunately because it's so good, nobody stocks it.

Instead I got some silicone grease. It seems thick, but in any case I liberally coated the rods and packed the bearings where they meet the rods, and... no vibrations!  The printer is really quiet now.

Trying a small print now, and will follow up with a larger 15-hour job to see how this stuff holds up. Will also replace this grease with SuperLube once I get it in.

About 14 hours in.... This is the Vanderbilt mansion [Thingiverse 10485], model scaled to 50%, 0.15 mm layer height, 10% infill.

# Raise3D N2: second impression

The next print failed.

I started this up before I went to bed, and in the morning I found the head merrily moving around a few inches above the prints. What probably happened is what Raise3D suggested previously when I said that the purge cycle didn't purge anything: the "third-party" filament clearly was not being fed.

I looked closer at the feed point.

Now, this image was from after I unloaded the filament, cut off the faulty bit, and was about to load it back in. Here's the faulty bit:

There are a few features to note here. First, on the right, we have the toothmarks from the feeder gear. This is normal. Then, we have a circular depression where the feeder gear ground into the filament. This is not normal, and is what happens when the filament gets stuck. Finally, on the left side you can see a bend in the filament. Also, the filament has changed  from semitransparent white to opaque white. This is the result of the filament being bent.

For example:

Bending the filament like this produced the exact same result. Now, here's a look at the whole feed path.

Note that there is nothing preventing the filament from being bent around that sharp radius. The printer is relying on the roll of filament being easy to move as the head moves around, and I am not convinced that is true for a 1kg roll of 1.75mm ABS. I used LD Product's brand of natural ABS.

My Ultimaker2 has a tube that runs all the way from the feeder gear in the back up to the extruder. No sharp radius possible. I don't know if the method Raise3D is using is new or not, but I'm a little suspicious.

At this point I am not ready to blame the printer. It could be the filament. I have some Toy Builder Labs 1.75mm ABS filament coming. Raise3D only sells PLA filament, not ABS. So can the printer handle ABS? It's supposed to. Raise3D has also acquired Toy Builder Labs, so I can only hope that the filament Toybuilder sells will (eventually) be compatible with the printer.

While I'm waiting for the Toy Builder Labs filament to come in, I'm going to redo the print using the PLA filament that came with the printer.

Now, when I removed the ABS roll, I found that the filament had come off the roll and was wound around the filament holder. That would absolutely cause an increase in drag.

This is fairly typical. Sometimes filament comes off the roll, and then you're in trouble. The solution is to add a plate which holds the filament on the roll.

I'm going to see if the same problem happens with the PLA. It might not because PLA is not as stiff as ABS and may not come off the roll. After that print, I'm going to print up a plate out of PLA which will fit behind the roll and hopefully help this situation.

I haven't lost hope!

# Raise3D N2: first impressions

Got my Raise3D N2 today. The good news is, it worked right out of the box. A few standouts were that I didn't have to level the bed, the touch screen is awesome, and the whole thing looks great. It's also very very quiet.

51 kg of fun

All the things in the box

All the things inside the thing in the box, except for the drill which is shown for size.

Set up and almost ready to go

However, there are a few things that need fixing. Some are pretty minor. Some are major. I emailed Raise3D all of these, and they got back to me within half an hour! In no particular order (Raise3D's response in italics):

1. The quick-start guide (section C step 3) says to unscrew the 2 nuts holding the Z-axis in place. I'd use the word clamps instead of nuts. Also I'd say to unscrew the 2 screws on each of the two clamps holding the Z-axis in place. It's just clearer that way.

2. The quick-start guide (section G step 4) says that the USB storage included with the printer comes loaded with already sliced models. It did not.

3. Sorry, there's Chinglish in the software, which is sad because the printer looks so good. If you make these changes, it will look much better:

[Note: Chinglish really bothers me. It's not hard to find a native speaker of English that knows Chinese. Or heck, even a native speaker of English that doesn't know Chinese, who can do a second pass on the wording like I just did. I don't think there's any excuse for manuals not to be in good, grammatical English.]

a. "Task Finished! Please take the models off from the bed." -> "Task finished! Please take the models off the bed." (don't capitalize the "f" in "finished", and use "off" instead of "off from").

b. Under the extruder loading screen: "When the extruder gear started rotating, feed the filament into right extruder;" -> "When the extruder gear starts rotating, feed the filament into right extruder." (started -> starting, don't use a semicolon at the end of the sentence; use a period. There's a reason for the semicolon and separating list items isn't it :)  ).

c. Also under the extruder loading screen: "When hot end reached target temperature, press "Load" button to next operation." -> "When hot end reaches target temperature, press "Load" button to continue." (reached -> reaches, and you need a verb after "to")

yes, we will change that for the next update.

4. The BuildTak has bubbles in it. This is very very bad, since the part will not be flat.

We have complained about that to BuildTak a lot also. For now, please use a needle to poke each bubble to release the air and flat it.

5. Under Utilities, the strange symbol in the upper left (which I now recognize as a stepper motor) disables the motors. But there doesn't seem to be a way to re-enable the motors. Pressing that button again just goes right back to saying it disabled the motors.

It is just a one motion button--to disable the motors so that you can push the extruder around. As soon as you use the touchscreen to move the extruder, it will go back on(enable).

6. When I pressed the disable motor button, cancelled the dialog, pressed it again, cancelled the dialog, and did it a few more times, eventually the response became slow and then the touch screen did not respond to touches any more. It did eventually go back to the home screen, and I saw the temperature monitors were still working, but the touch screen remained unresponsive. The only solution was to turn off the printer and turn it on again.

That is an issue we are working on right now. Also do not hold the button for too long otherwise it may freeze and you have to turn it off and on again.

7. Speaking of turning the printer off, the power switch is located in a very inconvenient place. The thing is nearly 60 cm deep, and it's hard to reach around to get to the switch. I suggest that the next iteration have the switch on the front or on the side near the front.

Ya, we noticed this also. We will think about this and figure out a better way.

8. No ground. The power cable that comes with the printer is two-prong with some kind of small green wire hanging off it. This is very very bad. Power cables should use all three prongs, and the case should be grounded using the ground wire. This is a shock hazard and you could be barred from importing this printer into the US unless you ground the machine properly.

[Note to all: I admit I'm probably wrong about this thing being barred from the US for not being grounded. Still, it's a good idea, no?]

We will switch to a standard 3-prong plug.

9. The purge cycle at the start of a print worked fine, but then the purged stuff didn't get wiped off onto the BuildTak. It actually was being dragged around during the print. I had to reach in there and pull it out.

Sometimes it wipes and sometimes it does not. We will think about a way to solve this.

10. After the print is done, the bed does not move down to the bottom. I have to manually move it. But moving it seems to be difficult: I went to Utilities and hit the Z-down button. It moves at most 10mm at a time, and it doesn't move continuously. I have no idea how I'm supposed to move the bed down. On the other printers I've used, the bed moves all the way down after the print is done so you can remove the print easily.

Ya, we are thinking about the Z movement. We might add a button to tell Z axis to move the bottom instead of 10mm a step.

11. When stopping a print, the print should stop immediately and the heaters should turn off. I found that if I wanted to cancel a print just after I started a print, the "Stopping print" message came up, and then did nothing. The nozzle and bed seemed to continue to heat up. I would expect that stopping the print would completely stop the process.

That is a Marlin thing. It has to heat up to desired Temperature before it can stop. We will think about a way to get around this.

12. When heating up, the nozzle temp gauge goes up but the bed temp gauge stays at "25/0" until the nozzle temp turns green, then the bed temp suddenly changes to "75/100". It's like the bed is heating up, but the gauge does not show its temp until the nozzle has completed heating up.

Again, this is also a Marlin thing. We have already fixed this. Please update the firmware and it should be fine. You can find it on our website under download.

13. When I started my second print, the nozzle and bed heated up, then the bed calibrated itself, then the purge cycle began, but nothing came out. Then the print started, and still nothing came out. I verified that the feeder gears were turning. It looks like the purge cycle isn't purging enough to fully load the extruder.

It seems that the filament is stripped due to gear teeth grinding. You can take off the cover and check whether the filament is actually going in or not. If the filament is not going in, then you need to unload the filament and re-load it again. Also using 3rd party filament may cause clogging issue.

A little tip is to always load a little bit filament before printing. If there is something wrong, you can spot it sooner.

Right now I have the printer printing a 17 hour job as a kind of torture test. We'll see how it goes, but from Raise3D's responsiveness so far, and the quality of the one print I finished, after all the usual initial bugs are worked out, this looks like it will be an excellent printer!

# Cleaning a TRS-80 model III keyboard

I have a TRS-80 model III.

It has a keyboard. It doesn't work so well. Some keys don't work, others are flaky.

So I opened it up and popped the keycaps off.

I removed the keyboard assembly. It's an Alps 12KE010C.

Well, that explains some of the keys not working at all. The only thing protecting the keyboard PCB from the plastic post in the base of the computer is... nothing. Oh Radio Shack. Were you already beginning the long slow slide into oblivion?

Still doesn't explain flaky keys. So I desoldered a key, since that's the only way to get them out.

I measured the resistance when pressing the key.

160kohms?! That's just wrong! Let's open this thing up. It's easy: you just pry on these tabs with your fingernails and they just pop apart.

The inside of the bottom contains the contacts.

Those little dimples are supposed to press into the rubber conductive material inside the dome. And press they did. I suspect these keys weren't meant to stand up to much use. 640k keypresses ought to be enough for anyone.

Lacking true contact cleaner, I just scratched up the contacts and dimples with a screwdriver to maybe get rid of oxidation. I also rotated the dome 90 degrees so the dimples can press into fresh rubber. I put the key together and...

87 ohms! That's more like it.

63 more keys to go :(

# Driving Nixie tubes with a MAX6922 VFD driver

The MAX6922 is a driver for VFDs. It takes standard 3.3v or 5v logic inputs via something close to SPI, and controls 32 separate outputs. Each output can be pulled low, sinking 15mA, or pulled high to a separate high voltage supply (up to 76 VDC), sourcing 45mA.

You can also daisy chain the drivers to add as many outputs as you want.

It's alive! Pro Trinket on left, transistors "stored" on the breadboard, MAX6922 on SchmartBoard PLCC adapter.

It seems like a powerful enough chip to replace the standard 74141/K155ID1 chips, with a more convenient input as well. Here's a circuit I put together showing the MAX6922 driving three IN-12B tubes. I left out the decimal points, since they would require 33 outputs, not 32. Since the drivers can be daisy chained, this wouldn't be a limitation.

Is it worth it? You can get them at Digikey, quantity 1, for USD 7.45 each, with price breaks at 10, 25, and so on, with a bit of shipping. For the same circuitry, you'd need three K155ID1s (about USD 1.50 each, plus shipping from Ukraine, Russia, or some other former Soviet Bloc country). So that would ordinarily deal the K155ID1 a win, but then, do you want to drive that decimal point? You need more parts. Do you not want to use up pins on your microcontroller? Then you need a GPIO expander running on I2C or SPI. Do you want the driver within the next few days? Then you probably shouldn't get a K155ID1.

I'm not going to argue about availability. There are probably a bazillion K155ID1s still out there, and likely MAX6922s will never be produced in such quantities. Nevertheless, the MAX6922 is worth a hard look if you need to drive a few Nixies.

# Project: Nixie tube calculator, Part 3

CORDIC, or Coordinate Rotation Digital Computer, is a venerable algorithm used by calculators to calculate trigonometric and logarithmic functions and their inverses: sin, cos, tan, sinh, cosh, tanh, ln, exp.

Rather than give an explanation here, I know of no better explanation than that given by Richard Parris.

The CORDIC algorithm uses some number of constants, and then successive additions or subtractions and shifts, followed by one multiplication at the end to give the answer. Since it consists of two of the simplest possible mathematical operations, it is fast enough and simple enough to be implemented by quite limited processors.

Now, since my calculator is meant to be able to represent the magnitude of any number up to 10100 to an accuracy of 10-100, I decided to use 200 digits (or so) to represent any number using the 100.100 format, or 100 digits for the integer part, and 100 digits for the fractional part.

This meant that I needed all the CORDIC constants, to at least 200 digits, in this format. I wrote a Mathematica notebook which does this for me, which you can see in PDF format.

# Project: Nixie tube RPN calculator, part 2

If I go to my HP48 calculator (or the Android or iOS versions thereof) and enter 1 3 ÷, the answer you'll get is .333333333333, 12 digits. Now, if I enter 3 x I get .999999999999. And this is correct, since the calculator makes the assumption that all digits past the displayed answer are 0.

If, however, I do this with 32-bit floating point calculations, we see that the closest floating point representation of 1/3 is 0.3333333432674407958984375 (exactly). How should that be displayed? If I display it as 0.3333333 then I'm lying, because the closest representation of that is actually 0.333333313465118408203125. But if I display it as 0.33333334 then I'm inaccurate because there is no way that is 1/3.

Two things come to mind.

So really, we should be using multiple-precision decimal arithmetic, but even that has problems. To solve the 1 3 ÷ 3 x problem, we could add one guard digit and round, which means .999999999999(9) will be displayed as 1. Even so, there will always be errors, sometimes fatal ones. There is a whole field around this issue.

But this is a calculator. We can provide an answer and assume that the human knows what context their calculation has, so even if they use the HP48 and see .999999999999, they'll know when this is 1 and when it is not 1.

If, however, we want to use scientific notation, then no matter how many digits of precision we use, we run into all the floating point problems again because N digits of precision plus an exponent is the definition of floating point.

Maybe we shouldn't let the point float? What if we want to be able to represent all numbers between 1x102 (100) and 1x10-2 (0.01) to an accuracy of 1x10-2? Then we need 4 digits. If our 4-digit representation is ABCD, then the actual number represented is ABCDx10-2, always. If ABCD = 0001 the number is 1x10-2 (0.01). and if ABCD = 9999 then the number is 99.99, or 1x102 - 1x10-2 (close enough, there's always going to be that upper bound missing). The exponent never changes, which means the point does not float. Add a sign bit and you've got arithmetic.

So where do we find a multiple-precision library? The Gnu Multiple Precision library (GMP) holds out some hope. Let's play with it for a bit.

Let's try 1+2:

#include <gmp.h>

int main() {
mpz_t x, y, z;

mpz_init(x);
mpz_init(y);
mpz_init(z);

mpz_set_ui(x, 1);
mpz_set_ui(y, 2);
mpz_add(z, x, y);

gmp_printf("%Zd\n", z);
}


This will simply print 3. So far, so good. But when we use mpz_tdiv_q instead of mpz_add, the answer is zero because 1/2 with the result rounded to an integer towards zero is zero.

This is where our assumed exponent comes in. When we divide Xx10-2 by Yx10-2 we get [X/Y]x100 because mantissas divide while exponents subtract. I use [X/Y] to indicate the integer result of X/Y rounded towards zero. But we want an answer that is a multiple of 10-2. The only way to do this is add more precision to the numerator: Xx102 x10-2. Now we get the correct answer, [X x102/Y] x10-2.

#include <gmp.h>

int main() {
mpz_t x, y, z;

mpz_init(x);
mpz_init(y);
mpz_init(z);

mpz_set_ui(x, 10000);
mpz_set_ui(y, 200);
mpz_tdiv_q(z, x, y);

mpz_t intPart, fracPart;
mpz_t hundred;

mpz_init(intPart);
mpz_init(fracPart);
mpz_init_set_ui(hundred, 100);

mpz_tdiv_qr(intPart, fracPart, z, hundred);

gmp_printf("%Zd x 10^-2\n", z);
gmp_printf("%Zd.%02Zd\n", intPart, fracPart);
}


The output of this program is:

50 x 10^-2
0.50

And that is correct. For multiplication, we will end up with twice as many digits as we need, so we just round off.

So much for the four functions. What about exponentiation and roots? Trigonometric functions? For that we'll need the incredible magical CORDIC algorithm, which I'll talk about in part 3.

# Project: Nixie tube RPN calculator

Nixie tubes are neat. They're nostalgic, and not too hard to get working. Russia still has bazillions of them.

Here's an actual calculator that was sold, for money, made with Nixie tubes:

The 1972 Casio model AS-C. Image copyright Nigel Tout of vintagecalculators.com.

Everyone's making clocks with them.

Nixie clock built by (and image copyright) Tomasz Watorowski, 2014.

Some are rethinking calculators with them.

Nixie calculator project (and image copyright) Spike Tsasmali, 2010.

So I thought it would be fun to go bigger, and interactive, but not too crazy. So that's where the Reverse Polish Notation Nixie calculator comes in. Like the venerable HP-48, it should have four visible stack levels. I would use IN-12B tubes, since they are top-read and have a decimal separator: a comma, which makes it compatible with most of South America, Europe and Asia.

## Reverse Polish Notation

Polish Notation was invented by the Polish logician Jan Łukasiewicz in 1924. He was stuck in Poland under German occupation during World War II, and after the war he emigrated to Ireland. It wasn't until 1954 that Reverse Polish Notation came into being, but it wasn't used in calculators and computers until the early 1960s. RPN was certainly inspired by Łukasiewicz's work.

## Alternate History

In my alternate history, Łukasiewicz did not leave Poland after the war. He was deported from Poland to the USSR as part of Operation Osoaviakhim to work on mathematics and help other captured scientists. However, the Soviets quickly realized the utility of Polish Notation, except reversed, in calculations, and by the 1950s were building large RPN desk-calculators with Nixie tube displays.

## Features

Here are the features I want to support:

• 4-level stack display
• scientific notation
• IEEE 754 double-precision floating point (binary64, which is 15-17 decimals of precision)
• Functions 10x , e x , log 10 , ln
• Functions x 2 , x y , √x, y √x
• Trigonometric functions (arc) sin, cos, tan (h)
• π, e
• Stack operations drop, dup, swap, rot3
• Auxilliary memory store and recall

## Display

I want to have enough digits on each stack level to display a double-precision number to 17 digits with 3 digits of exponent. To represent scientific notation, instead of the usual practice of using "E" or a space to separate the mantissa from the exponent, I'd use a light shining through a panel overlay between the 3rd and 4th digits, showing either " + "or " - ". So O.778+O43 would be 0.778x1043 . Negative numbers would also be represented by having a " - " panel light on the far left side.

The interesting thing about this representation is that numbers not needing an exponent, which are those having an exponent of 0, could either be displayed with that exponent ( +OOO ) or, probably better, by leaving off the exponent and leaving the last three digits blank. That would make exponentiated and non-exponentiated numbers line up.

Infinity could be represented by OO and NaN by lighting all the commas.

## Insides

As for electronics, as I said above, I'd use IN-12B tubes. These require a supply of 170VDC. Russia also has bazillions of K155ID1 chips, which drive these tubes, except for the decimal separator, for which I'd just use a simple transistor driver. This driver chip is pin-for-pin compatible with (and sadly much more available than) the 74141. One could use any 4-to-16 decoder and couple it with some high-voltage tolerant circuitry, but then you'd have essentially built a K155ID1.

I'd run the whole thing off of (of course) an Arduino. Now, each stack level has 20 digits times 5 bits each (4 for the BCD digit, and 1 for the separator), plus at least three indicators (sign, exponent signs). That's 103 bits per level, times 4 levels is 412 bits.

Conveniently, there are chips which take I2C and can latch many outputs (or read many inputs). There are two nice ones:

Each can be programmed with a 7-bit address, so there is no cramped address space to deal with.

If I want to be modular, each stack level can be powered using three of the NXP chips, or two of the Cypress chips. The NXP chips seem more approachable to me.

## Keyboard

I figure the keyboard will be a 4x8 matrix, which should be able to handle all the functions plus digits as long as I have a shift key.

Keys without shift functions (18):

• 0-9
• .
• SH (shift)
• +, -, x, ÷
• CHS (change sign)
• EE (enter exponent)
• HYP (hyperbolic, for trigonometric functions)

Keys with shift functions (12):

• sin / asin
• cos / acos
• tan / atan
• log / 10x
• ln / ex
• x2 / √x
• xy / y√x
• 1/x / x!
• π / e
• DROP / DUP
• ROT2 / ROT3
• STO / RCL

These 30 keys would be placed in a 4x8 matrix, not necessarily physically but electrically. Each row of eight keys would be read via I2C, with a full keyboard scan happening many times per second.

# Setting up a Huanyang VFD for a CNC router spindle

I recently bought a CNCRouterParts Benchtop Pro. It is a 24"x24" router. The thing is, it's a kit. It comes in six boxes if you get the plug-and-play electronics package with it, and you have to put it together. But it doesn't come with everything you need.

• You need a Windows computer to drive it (with Mach3), and of course a monitor, keyboard, and mouse to run it.
• Then you need a raised surface to put the whole thing on -- you're not going to leave this on your garage floor. I got a shelving unit from McMaster (48x36 2-shelf), then I bought some wheels off of eBay -- because everything in a workshop must be on wheels. The "feet" on the shelving unit were 1/4-20 bolts, so locking wheels with 1/4-20 threads worked perfectly. Then I hacked up some cross-bracing for stability.
• Also, you need to buy a spindle and a VFD to control it. And a spindle cable to connect the VFD to the spindle. And then you need a 220V circuit for the VFD, and a power cable to connect that circuit to the VFD. The Huanyang VFD seems to be the one you find all over eBay. Everyone's got one.
• Preferably the spindle has an ER20 collect chuck. You will need a set of ER20 collets.
• Dust collection is absolutely necessary. You do not want to breathe in the fine particulates that the router generates. So, you can either get a complete dust collector, or you can do what most people do: make one out of a ShopVac with a HEPA filter (otherwise you're not getting the harmful fines), an Oneida Dust Deputy DIY cyclonic dust separator, a bucket and a Gamma Seal lid with some hoses to connect them all.
• Don't forget end mills to actually cut with.
• How about toolpathing? A G-Code file tells your router what path it should take, but you have to get from your design to G-Code. This is where V-Carve comes in. I decided to get V-Carve Pro. You get a discount from CNCRouterParts if you buy a router from them. However, you can probably get away with Fusion 360. It is free, runs on Windows and OSX, does parametric design (parametric or go home) and can output G-Code.

The kit is obviously not meant to be turn-key. It is not for the impatient or the easily frustrated. But it is cheaper than most solutions, and many of those solutions still need most of the above extra things.

All the things

## The 220V circuit (in the US)

What amperage circuit is needed? Well, the most common spindle is 2.2kW, which means 10 amps (2200 divided by 220). You'll need a little extra to compensate for power loss in the cable, so figure 15 amps.

But wait! The age of gasoline is nearing an end, and wouldn't you like your house to be ready for an electric car? A car charger typically runs off 220V, and the higher amperage the better. So I opted to get a 50A circuit -- 40A for a Tesla charger, plus a bit for cable loss.

The plug gives you four wires: ground, a neutral, and two hots. While each slot in a breaker box is 110V, consecutive slots are 180 degrees out of phase with each other, which is why 220V breakers take up two slots. That's how you get 220V. The voltage between the neutral and any hot is 110V, because the neutral is the conductor between the two slots and the hots are on either side.

I'll be putting a 15A breaker in the circuit between the breaker box and the VFD, because I don't trust the VFD to have its own fuse.

## The spindle cable

For some reason, this was the hardest thing to obtain. It needs to be four-conductor shielded cable. It must be shielded because without a shield, such a cable will spew RF all over the place, which is bad for nearby electronics. Like, the stuff that is controlling your motors. The shield must be connected to ground on both ends.

I went to surplus stores, and by extreme chance found some, but the wires were too thick to fit in the connector that came with the spindle. You can get some from Soigeneris, but only if they have it in stock.

For some reason, there seem to be only two makers of this kind of cable: Alpha and Lapp.

I got some from Element14: Lapp Kabel Ölflex servo cable, 4 conductor, 1.5mm wires (24M9570). They sell it by the meter. I ordered 10 units, but because I'm a stupid American, I thought it was by the foot. I ended up getting more than I needed, but better too much than too little.

I wish someone sold this cable with the connector already on it.

## The VFD

It turns out that setting up the VFD was the hardest part of the whole project. For one thing, the VFD is very programmable. There are lots and lots of parameters you can set for all sorts of custom circumstances.

But mainly it was difficult because the instruction manual, nominally in English, is horribly written. You can search the webs and find lots of pages on how to set up the parameters in the VFD for your particular motor, but I've found some of the information to be wrong. So here is yet another page on how to set up a VFD, for a particular spindle. I'll try to explain what the parameters really mean.

Scary temporary testing. The shield has not yet been hooked up to ground.

## But first, some myths

Here are a few myths I've found which just make no sense, and I really need to put these first. If you ever see these on a VFD page, don't pay attention to that section.

Myth #1: You need to set up the parameters in a particular order.

No. You don't. If you set PD013 to 8, that's factory reset. So of course, you would do that first. But you can set the other parameters in any order you like.

Myth #2: The max RPM of your motor should be divided by the value of parameter PD010, and that should be entered into PD144.

No. For some settings it is just coincidence that PD010 x PD144 = Max RPM. In reality, they have absolutely nothing to do with each other.

## The spindle parameters

First, gather your spindle's operating parameters. If you bought one off eBay from China, you only get this data: power (kW), voltage (V), air or water cooled, max RPM. The spindle I bought is 2.2kW, 220V, air cooled, 24000 rpm max.

You're also supposed to know the spindle's maximum operating frequency. This is often 400Hz for the ones you get off eBay.

## The VFD parameters

First, reset the VFD to factory settings. You don't know where that thing's been. On the front panel, hit PROG (or PRGM), and then the up and down buttons until you reach PD013. Hit SET. Change the value to 8 using the up and down buttons. Hit SET again. Now your VFD is reset.

For the next parameters, I've renamed them to make some kind of sense. For setting multi-digit values, use up and down to increase and decrease the value, and the >> key to move one digit to the right.

PD001: Command source. Set to 0. 0 means you're controlling the spindle via the front panel controls. 1 means you're using controls that you've wired up to the screw terminals. 2 means you're going to control it using RS-485.

PD002: Speed control source. Set to 1. 0 means you're controlling the speed through the up and down front panel buttons. 1 means you're going to control the speed with either the knob on the front or an external potentiometer. 2 means RS-485.

When PD002 is set to 1, there is also a jumper next to the screw terminals that you have to set. If the jumper is on the right pair, the control is the front panel knob. If the jumper is on the left pair, the control is via an external potentiometer connected to the screw terminals. Make sure the jumper is on the right-side pair.

By the way, I found setting 0 pretty weird. You only get to see the speed as a frequency, not as RPM.

PD003: Default frequency. If PD002 was set to 0, this is the frequency the motor will start running at. The frequency is directly related to the speed. Since we set PD002 to 1, we can leave this alone. But you can set it to something like 200 Hz to start at mid-range.

PD004: Rated frequency: Apparently this is for motors with a fixed frequency. Since the spindle is variable frequency, this setting can be ignored.

PD005 through PD010 set three points on a voltage/frequency curve. As the motor ramps up to your desired speed, it follows this curve. The manual usefully shows three types of curve: constant torque, low torque, and high torque. I've set mine to the values for the constant torque graph (why not).

I think that if you get a VFD with a spindle, the particular model of VFD comes with different factory settings for these depending on the spindle. Which is nice.

PD005: High-end frequency: 400 Hz

PD006: Middle frequency: 2.5 Hz

PD007: Low-end frequency: 0.5 Hz

PD008: High-end voltage: 220 V

PD009: Middle voltage: 15 V

PD010: Low-end voltage: 8 V

PD011: Minimum allowed frequency. Set to 120 Hz. Air-cooled spindles are not meant to stay at low speeds, otherwise they overheat. I understand that water-cooled spindles can go as slow as you want.

Leave the next parameters alone, and skip to...

PD070: Speed control input: Set to 1. This means that the speed will be controlled by an input voltage between 0 and 5V. This is what the front panel knob delivers. 0 means 0-10V. 2 means the control is by an input current between 0 and 20mA. 3 means 4-20mA. 4 is a combination of voltage and current.

PD071: Speed control responsiveness: Leave at the factory setting of 20.

PD072: High-end frequency: Set to 400. This sets the frequency represented by the top end of the speed control.

PD073: Low-end frequency: Set to 120. This sets the frequency represented by the bottom end of the speed control.

Now skip straight to...

PD141: Rated motor voltage: Set to 220V.

PD142: Rated motor current: Set to 11A. Why not 10? Because there will always be some loss in the spindle cable. This compensates for that. But feel free to set it to 10A. The worst that can happen is that your motor loses power at the top end.

PD143: Number of motor poles: Set to 4. This is the number of magnetic poles in the motor. It should be either 2 or 4, and is 4 for the 2.2kW spindle.

PD144: RPM at 50Hz: Set to 3000. Since the max RPM is 24000 at 400Hz, this means that the RPM at 50Hz will be 3000.

That's it!

## Testing

Now twist the knob all the way counterclockwise so that you'll start at the lowest speed setting. You can now hit the RUN button and your spindle should start rotating clockwise if you're looking at it from above. If it rotates counterclockwise, press STOP, shut off the power, unplug the VFD, and swap any two of the motor wires. Then try again.

The display may now be showing a frequency rather than a rotational speed -- that is, the HZ light above the display may be lit. Hit >> until the ROTT light above the display is lit. That's RPM.

Now slowly turn the knob clockwise. You should get all the way to 24000 RPM.

Hitting >> until A is lit shows you the current being used by the motor. With no load, mine ran at 1.1A.

# How I didn't crack the Voynich Manuscript

I’ve been interested on and off in the Voynich Manuscript, mostly because it appeals to my appreciation of old occult books. Looking at the book, it’s clear there is a lot of structure in its writing. It’s clearly not, say, solely a work of art such as the Codex Seraphinianus. The history of the Voynich and the cracking attempts against it can be found all over the place. I particularly enjoyed Nick Pelling’s book, Curse of the Voynich. It might be fun, I thought, to take a look at it myself.

First step was assigning letters to each glyph. There is a standard format called EVA that has been used to transcribe the Voynich, but I found it too cumbersome. Although EVA can be transformed to any other format due to its expressiveness, that transformation would also rely on an already-completed transcription. I really wanted to start with a fresh eye.

Warning! Yes, I am aware of other researchers’ theories. The idea is that I’ll use my own theory and see where it takes me. Maybe I’ll make a bad decision. In any case, I reserve the right to modify my theories!

Using the images of the Voynich as my starting point, I looked through the first few pages and settled on a basic alphabet of glyphs, closely modeled on EVA:

Again, this is just my initial take: 25 glyphs. I’m not taking into account other, more rare glyphs at this point. Major differences between my notation and EVA are that my ‘g’ is EVA ‘m’, my ‘v’ is EVA ’n’, my ‘q’ is EVA ‘qo’, and my ‘x’ is EVA ‘ch'. I also specifically include an ‘m' and an ’n' symbol for EVA ‘iin’ and ‘in’, respectively.

The one ‘q’ I found with an accent above it, I just decided to render as ‘Q’. Likewise, the ‘x’ with an accent above I render as ‘X’. The tabled gallows characters ‘f’, ‘k’, ‘p’, and ’t’ are represented as capital letters. This doesn’t mean that they are the same character. This is just a mapping of glyphs to ASCII.

Next, I transcribed some pages using this mapping. I decided to use ‘^’ for a character indicating the beginning of a “paragraph”, including the beginning of any obviously disconnected text areas. Similarly, ‘\$’ marks the end of a “paragraph”. A period is a space between “words”. Occasionally I had to make a judgement call as to whether there was a space or not. Furthermore, lines always end in either a period or, if the line is the last in a paragraph, an end-paragraph symbol.

Finally, any character that I could not figure out or that doesn’t fit in the mapping is replaced by an asterisk.

Thus, the first paragraph on the first page is transcribed as follows:

Again, there are many caveats: I had to make judgement calls on some glyphs and spacing. I’m also aware that Nick Pelling has a theory that how far the swoop on the v goes is significant, and clearly I’m ignoring that.

In any case, after transcribing a few pages, I became aware of certain patterns, such as -am nearly always appearing at the end of a word, and three-letter words being common. I did a character frequency analysis, and found that ‘o’ was the most common character, with ‘y’, ‘a’, and ‘x’ being about 50% of the frequency of ‘o’. Way at the bottom were 'F', 'f', and 'Q'.

Then I did another frequency analysis, this time of trigrams. Most common were ‘xol’, ‘dam’, and ‘xor’. Then I asked, how often do the various trigrams appear at the beginning of a word and at the end of a word? The top five beginning trigrams were ‘xol’, ‘dam’, ‘xor’, ‘Xol’, and ‘xod’, while the top five ending trigrams were ‘xol’, ‘xor’, ‘dam’, ‘Xol’, and ‘ody'.

Are these basic lexemes? I began to suspect that lexemes could encode letters or syllables.

I turned to page f70v2, which is the Pisces page. There were 30 labels, each next to a lady. The labels were things like ‘otolal’, ‘otalar’, ‘otalag’, ‘dolarag’… Most looked like the words were composed of three two-letter lexemes: ‘ot’, ‘ol’, ‘al’, ‘ar’, ‘dol’, ‘ag’, and so on.

So I decided to look for more lexemes by picking the most common trigram, ‘xol’, and looking for words containing it. These were: ‘xol’, ‘otxol’, ‘o*xol’, ‘ypxol’, ‘dxol’, ‘xolam’, ‘xolo’, ‘xololy’, ‘xolTog’, ‘opxol’, ‘btxol’, ‘xoldy’, ‘xolols’. That would make as the lexemes ‘ot-’, ‘yp-‘, ‘d-‘, ‘-am’, ‘-o’, ‘-oly’, ‘-Tog’, ‘op-‘, ‘bt-‘, ‘-dy’, and ‘-ols’.

Similarly, for ‘xor’, we get lexemes ‘d-‘, ‘xeop-‘, ‘k-‘, ‘ot-‘, ‘bp-‘, ‘ok-‘, ’t-‘, ‘-am’, and ‘Xk-‘. At least in the first few pages.

So certain lexemes seemed to come up often, namely ‘ot-‘, ‘ok-‘, ‘op-‘, ‘d-‘, ‘-am’, ‘-dy’.

One possibility that no doubt has already been discounted decades ago is that each lexeme corresponds to a letter. So something common like ‘xol’ could be ‘e’. Because there are so many lexemes, clearly a letter could be encoded by more than one lexeme. Or maybe each lexeme encodes a syllable, so ‘xol’ could be ‘us’. I tend to doubt the polyalphabetic hypothesis, since then I would expect to see perhaps more uniform statistics.

Maybe the labels in Pisces encode numbers. If that’s the case, then by Benford’s Law, ‘ot-‘ would probably encode the numeral 1, since that appears in the first position 16 out of 30 times, and ‘ok-‘ could be 2, appearing 8 times.

Anyway, that’s as far as I’ve gotten.

# Machine Learning: Sparse RBMs

In the previous article on Restricted Boltzmann Machines, I did a variety of experiments on a simple data set. The results for a single layer were not very meaningful, and a second layer did not seem to add anything interesting.

In this article, I'll work with adding sparsity to the RBM algorithm. The idea is that without somehow restricting the number of output neurons that fire, any random representation will work to recover the inputs, even if that representation has no organizational power. That is, the representation learned will likely not be conducive to learning higher-level representations. Sparsity adds the constraint that we want only a fraction of the output neurons to fire.

The way to do this is by driving the bias of an output neuron more negative if it fires too often over the training set. Or, if doesn't fire enough, increase the bias. Octave code here. The specific function I added is lateral_inhibition.

I used the same data set based on horizontal and vertical lines, 5000 patterns. I settled on 67 output neurons, 500 epochs, with momentum, changing halfway through. I decided that a fraction of 0.05 would be interesting, meaning that on average I would want 67 x 0.05 = 3.35 output neurons activated over all 5000 patterns. In order to compensate for the tendency for biases to be very negative due to the sparsity constraint, I set the penalty for weight magnitudes to zero so that weights can become stronger to overcome the effect of the bias.

The change in bias is controlled by a sparsity parameter. I ran the experiment with various sparsity parameters from 0 (no sparsity) to 100, and here are the costs and activations:

The magnitude of the sparsity parameter doesn't seem to have much effect. Although you can't tell from the graph, there is in fact a small downward trend in the average active outputs. Right around 10, the average active outputs reaches the desired 3.35, where it stays up to about 80, and then it starts dropping again. So 10 seems like a good setting for this parameter.

Here are the patterns that each neuron responds to, with differing sparsity parameters:

Sparsity 0:

Sparsity 1:

Sparsity 5:

Sparsity 10:

Sparsity 50:

Sparsity 100:

Sparsity 200:

With no sparsity, we get the expected near-random plaid patterns, and nearly all neurons have something to say about any given pattern. With even a little sparsity, however, the patterns do clean themselves up, although not by much, and by sparsity 200, the network learns nothing at all.

One possibility that the patterns really don't look that sparse is that we wanted the average neurons activated over the entire data set to be 5%. But how much of the data set actually contains a non-empty image? In fact, about 49% of the data set is empty.

What if we require that no data instance be empty? This time a sparsity of 0.05 ends up with a relatively terrible log J of -1.8, compared to the previous result of about -2.4. However, increasing the sparsity to 0.07 gives us a log J of -2.6, which is better than before. This is also expected, since more neurons will be able to represent patterns more closely. And yet, we get better representation anyway:

Sparsity 10:

Sparsity 20:

Sparsity 50:

Sparsity 100 has very poor results.

The visualization is a bit misleading, because although there are pixels that are other than full white, those pixels don't imply that the neuron will be activated with high probability for those other pixels. The maximum weight turns out to be 11.3, with the minimum being -4.7. The visualization routine clips the values of the weights to [-1,+1] meaning that anything -1 or lower is black, while anything +1 or higher is white. However, by using visualize(max(W+c', -0.5)), we can take into account some of the threshold represented by the (reverse) bias from output to input. We also clip at -0.5 so that we can at least see the outlines of each neuron.

So here is another run with sparsity 50:

We can see that, in fact, each neuron does respond to a different line, and that just about 18 lines are represented, as expected.

# Reverse engineered part

After fiddling around with the part from the previous article, I think I might have a reverse engineered technical diagram. I still don't know enough about early 20th century mechanical design techniques to know if this is what they would have done, but it should be enough to at least remanufacture this part.

I also realized that I haven't actually described the part! There are two registers on the typical Monroe calculator, an upper register which indicates operation count (useful for multiplication and division) and a lower register which indicates total. There's a crank which, when turned one way, zeroes out the upper register, and when turned the other way, zeroes out the lower register. The part that I reverse engineered is shown in the original 1920 US patent 1,396,612 by Nelson White, "Zero setting mechanism" in Figure 5. In the patent, the part, 32, is described as follows:

The shaft 60 is normally locked or held against rotation by a rigid arm 32, pivoted upon the shaft 84, and at its free end engaging a peripheral notch 33, of a plate or disk 34, secured to the gear 12...

So the next step might be to make an OpenSCAD file for the part, and put it on Thingiverse so that anyone can recreate the part. It probably can't be 3D-printed at this point, since it really needs to be a metal part. Even Shapeways, which can 3D print metal parts from stainless steel combined with bronze, can only achieve a 1mm detail, and this thing is much more detailed than that.

Full-sized files in various formats: AI | PDF | SVG | PNG

UPDATE: See the thing on Thingiverse.

# Reverse engineering mechanical parts

Or,

Numerology that sorta kinda works!

One of my half-baked projects is to take apart an early 20th century Monroe mechanical calculator and reverse engineer it so that I have a full set of engineering diagrams of every part. This would enable anyone to recreate broken parts and fix their calculator.

Reverse engineering the design of an early 20th century mechanical part has a lot in common with numerology. If the numbers coincidentally fit, then they're probably right. If they almost, but not quite fit, then they're probably right anyway.

Here's a part that I scanned on an Epson Perfection V700 scanner. This scanner is based on a CCD, not LiDe, which means that it has non-zero depth of field. That means that you can scan a part that has height and it won't end up too blurry. I scanned the part at 1400 dpi so that I could optically measure it. The thing sticking out at the top is just a screwdriver that I used to hold the part horizontal.

I could pop this into Illustrator and use the pen tool to trace around the part, but all this would get me is an outline of this particular part with no insight into why it had that particular outline. This part was designed, not evolved. It was designed to work with other parts. So clearly its measurements and the relationships between one bit of the part and another are not arbitrary.

For example, take the hole at the top. It fits over a shaft. Now, "the ancients" probably didn't use shafts of arbitrary diameter. They were standard, and since this was an American design, that meant fractions of an inch, specifically inverse powers of two: 1/2, 1/4, 1/8, and so on. The hole at the top measures between 0.187" and 0.188" on my calipers. But 3/16" is 0.1875", so it makes sense that the engineers designed this hole to be exactly 3/16". This fits around the shaft that is 0.001" under 3/16", which I suppose is a standard undersized shaft.

The 1910 Cyclopedia of Mechanical Engineering, edited by Howard Raymond, has this to say on page 129 in the section on mechanical drawing: "Keep dimensions in even figures, if possible. This means that small fractions should be avoided… Even figures constitute one of the trade-marks of an expert draftsman. Of course a few small fractions, and sometimes decimals, will be necessary. Remember, however, that fractions must in every case be according to the common scale; that is, in sixteenths, thirty-seconds, sixty-fourths, etc.; never in thirds, fifths, sevenths, or such as do not occur on the common machinist's scale."

In Illustrator, I pulled up the image and drew a circle of diameter 3/16", placing it so that it fit exactly into the hole in the image. Now I had the center of that hole, and I could draw more concentric circles. Because I could measure these diameters directly on the part, I used those diameters: 3/8" and 7/16".

The measurements in the image were done using VectorScribe.

Note that while the inner circle and middle circle (diameter 3/8") fit exactly, the outer circle (7/16") does not. The outer circle does not seem to be quite concentric, but numerology: if it's nearly right, it probably is. By moving the outer circle a few thous, I was able to get a good registration. Under high magnification, I was able to tell that the inner subpart was welded onto the sheet metal subpart, so all this indicates that there were several steps involved in manufacturing this part: first, turn the small subpart on a lathe. Then create the larger subpart from sheet metal. Then weld the two together. Welding the two together was apparently not an extremely exact procedure.

After moving the large circle to its new center, the centers no longer coincide.

Now for the rest of the part. Using SubScribe, I drew a circular arc on a circular-looking feature. Then I measured its radius.

0.283" x 2 = 0.566" is close enough to a diameter of 9/16" to say that this was the intent of the original engineer. I drew the circle, and then measured the distance between the centers.

0.568" is again close enough to 9/16". And not coincidentally, this second center coincides precisely with the location of another shaft on the machine. That certainly nails down the intent of the engineer.

I can now draw the inner tangent line between the two circles (done again using SubScribe):

The length and angle of this line in fact do not matter, since there is one and only one inner tangent line connecting these two circles in the right direction. Certainly 0.269 is close to 7/64, but that was not the design constraint. The line had to be tangent to the two circles, and drawing inner and outer tangent lines were geometric constructions that were familiar to the ancients.

We can now draw another concentric circle corresponding to the outer outline of the part, and draw an outside tangent line. Again, knowledge of design intent lets us set the outer circle's diameter at 7/8", which seems to fit precisely onto the part.

The more excitable among you may have noticed by now that the inner surface of the inner tab appears to be a circular arc, and you would be right. Drawing the circle freehand gives us a diameter of 0.439", which is close enough to 7/16" as to fix that measurement. But right now I won't analyze the tab, since I want to get the larger part done.

Near the bottom of the part, we can draw some tangent lines.

I did this in SubScribe by picking a point on the straight section of the part, then drawing a line tangent to the circle. Then I extended the line outwards. Now those lines could have started anywhere on the circle. Why these particular points? Let's draw some lines intersecting the centers. I'll also rotate the diagram so that the inter-center segment is horizontal.

The angle that the rightmost tangent line forms with the horizontal, 67.35 degrees, is irrelevant, since the constraint for that intersection point was based on an outer tangent line. But consider the angle formed between that angle and the next intersection: 125.71 - 67.35 = 58.36 degrees. This is close to 60 degrees, a nice round angle. For the intersection in the inner circle, the angle is 171.76 - 67.35 = 104.41, which is very close to 105 degrees, which is 60 + 45, more round angles. So the design intent seems clear: the outer intersection is 60 degrees from the outer tangent line, while the inner intersection is 45 degrees away from that. Let's move the intersections and construct the tangent lines so these relations become exact.

As mentioned above, this pawl fits into a hole on a gear located on a shaft. There are three shafts so far, let's call them A, B, and C. The pawl fits on shaft A, goes around shaft B, and the gear it locks is on shaft C. We know that the distance between shaft A and shaft B is 9/16". I also know from direct measurement that the distance between shaft C and shaft B is also 9/16". However, the distance between shaft A and shaft C is irregular: 0.977", not close to any fraction at all. This may be due to some constraint that we do not yet know about.

However, let's pretend that 0.977" is eventually determined through some constraint, and place the location of shaft C on the diagram.

(Update, 17 Dec 2012: It turns out that a line drawn from B perpendicular to A-C has a length of very close to 9/32", which makes A-C tangent to the 9/16" diameter circle around B. Maybe that's why the shafts are where they are: A-B is 9/16", B-C is 9/16", and A-C is tangent to the 9/16" diameter circle around B.)

I also put a circle of diameter 3/16" (shaft C's diameter), and another of diameter 3/8" around shaft C's center, which corresponds to the size of shaft C's bushing where the hole is. You can imagine the pawl fitting into a hole on the bushing by looking at the diagram.

It seems fairly clear that the pawl's end is designed to fit into the hole. The end also isn't square; it is tapered. Remember that inner tab? There is a cam which that inner tab rides on. The large diameter for the cam measures 17/32", and the small diameter measures 29/64". When the cam is rotated so that the large diameter pushes the inner tab, the pawl lifts out of the hole. When the cam small diameter is against the tab, the pawl is inside the hole. I can add the two cam diameters and then rotate the image of the pawl to simulate the two states.

In the hole (locked state):

Out of the hole (unlocked state)

Clearly one design criteria we can deduce is that the width of the pawl's end at the outside of the hole when the pawl is in the locked state must be equal to the width of the hole, and the pawl must thereafter taper. Measuring the width of the pawl in the locked state at the hole gives 0.081". Perhaps not surprisingly, this is the diameter of the hole as measured with calipers. In fractions, this is near enough to 13/16", a drill size that any mechanical engineer would have had.

Here I've drawn the outline of the hole along with its centerline. I've made the depth just deep enough for the pawl's end. We can see that the tapered pawl end does indeed fit in the hole.

Resetting the part to its design position, I found that I could draw a line along the outer outline of the pawl tangent to shaft C's outline:

The angle formed between the C-B line and the beginning of the construction line is 129 degrees. It is not a very round angle, and if the angle were not too important, it would make more sense to have it be round, perhaps divisible by 5.

Another possibility is to look at the angle formed by a radial line with the intersection:

Relative to our reference angle at the right, this is 91.57 degrees. Too far away from 90 degrees; a line placed at 90 degrees intersects the upper outline nowhere near the right place. The radial line is also 31.57 degrees from the 60-degree line. This could be significant, since 31.5 is exactly 7/10 of 45 degrees, and placing a radial line at 31.5 degrees produces an intersection very nearly at the drawn intersection.

I don't know enough about early 20th century mechanical design techniques to know if this would be reasonable: angles measured in tenths of 45 degrees.

If you'd like to have a try at figuring this out, here's the Illustrator file.

# Machine Learning: Restricted Boltzmann Machine

The Restricted Boltzmann Machine is an autoencoder which uses a biologically plausible algorithm. It uses a kind of Hebbian learning, which is the biologically plausible idea that "neurons that fire together, wire together".

Suppose we have an input layer of dimension Nin and an output layer of dimension Nout, with no hidden layer in between. All input units are connected to all output units, but input units are not connected to each other, and output units are not connected to each other, which is where the Restricted comes in the name.

Let the input units be binary, so either 0 or 1. Further, each output unit uses the logistic function, so that the output of unit j is:

\begin{align*} z_j &= b_j + \sum_{i=1}^{N_{in}} w_{ij}x_i \\ p_j &= \frac{1}{1+e^{-z_j}} \\ y_j &= \begin{cases} 1 & \text{ if } uniformrandom(0,1) < p_j \\ 0 & \text{ otherwise } \end{cases} \end{align*}

bj is the bias term for output unit j, and wij is the weight from input unit i to output unit j. Note that we're treating the logistic function as a probability, but this time the outputs are binary rather than the probability, so that an output unit j is on with probability pj. That is, the output unit is stochastic.

Now, to make this an autoencoder, we want to feed the outputs backwards to the inputs, which works as follows:

\begin{align*} z_i &= c_i + \sum_{j=1}^{N_{out}} w_{ij}y_j \\ p_i &= \frac{1}{1+e^{-z_i}} \\ x'_i &= \begin{cases} 1 & \text{ if } uniformrandom(0,1) < p_i \\ 0 & \text{ otherwise } \end{cases} \end{align*}

We're just doing the same thing in reverse, except that there is a bias ci now associated with each input unit. There is some evidence that this kind of feedback happens biologically.

After this downward pass, we perform one more upward pass, but this time using the probability input rather than the reconstructed input:

\begin{align*} z'_j &= b_j + \sum_{i=1}^{N_{in}} w_{ij}p_i \\ p'_j &= \frac{1}{1+e^{-z'_j}} \\ y'_j &= \begin{cases} 1 & \text{ if } uniformrandom(0,1) < p'_j \\ 0 & \text{ otherwise } \end{cases} \end{align*}

We do this for every input sample, saving all the values for each sample. When all the m samples have been presented (or, in batch learning, when a certain proportion m of the samples have been presented) we update the biases and weights as follows:

\begin{align*} \Delta w_{ij} &= \eta\frac{1}{m}\sum_{k=1}^{m} x_i p_j - \eta\frac{1}{m}\sum_{k=1}^{m} p_i p'_j \\ &= \eta \left ( \langle x_i p_j \rangle - \langle p_i p'_j \rangle \right ) \\ \Delta b_j &= \eta \left ( \langle p_j \rangle - \langle p'_j \rangle \right )\\ \Delta c_i &= \eta \left ( \langle x_i \rangle - \langle p_i \rangle \right )\\ \end{align*}

And that's the Restricted Boltzmann Machine. There are other refinements such as momentum and regularization which I won't cover, but which are implemented in the example Octave file. There are also many extremely helpful hints as to parameter settings in Hinton's paper A practical guide to training restricted Boltzmann machines.

As an example (Octave code here), I set up a 9x9 array of inputs which correspond to pixels (0=off, 1=on), so my input dimension is 81. I generate a bunch of samples by placing some random vertical and horizontal lines in, but with more lines being less likely. According to the paper, a good initial guess as to the number of output units is based on the number of bits a "good" model of the input data would need, multiplied by 10% of the number of training cases, divided by the number of weights per output unit.

Each sample has zero horizontal lines with probability 0.7, one horizontal line with probability 0.21, and two horizontal lines with probability 0.09, with a vanishingly small probability of three lines, and likewise with vertical lines, and each line can be at any of nine positions, which means that a "good enough" model of the input would need something like 9 bits for the horizontal lines and 9 bits for the vertical lines, or 18 bits. With 3000 or so samples, 300 x 18 / 81 = 67 output units.

I also used 2000 epochs, averaging weights through time, a momentum term which starts at 0.5, then switches to 0.9 halfway through, a learning rate which starts at 1, then switches to 3 halfway through, and a regularization parameter of 0.001.

I set aside 10% of the samples as cross-validation, and measure the training and cross-validation costs. The cost (per sample per pixel) is defined as:

$J = -\frac{1}{mN_{in}}\sum_{i=1}^{N_{in}} \sum_{k=1}^m \left ( x_i^{(k)} \ln{p'_i^{(k)}} + \left ( 1- x_i^{(k)} \right ) \ln{\left (1-p'_i^{(k)} \right )} \right )$

In other words, this is the usual logistic cost function, except that instead of the output, we're using the probability during the downward pass that an input pixel is 1.

Here are 16 example input samples:

And here are the trajectories of the training and cross-validation costs over the 2000 epochs. Strictly speaking, the cost is a better measure than error in the reconstructed input since it is not as affected by chance pixel flips during reconstruction (output to input) as a direct pixel-to-pixel comparison. For example, if an input pixel should be on, and it is only turned on 90% of the time during reconstruction, then the cost for that pixel is -ln 0.9 =  0.105. If we were to actually reconstruct that pixel, then 10% of the time the error would be 1, and 90% of the time the error would be 0; but we only do a single evaluation. So the cost gives us a better idea of how likely a pixel is to be correct.

Blue is the training cost, while red is the cross-validation cost. We can see that the cross-validation cost is a little higher than the training cost, indicating that there is likely no over fitting or under fitting going on. The sudden acceleration at epoch 1000 is due to changing the momentum at that point. The end cost is about 0.004 while the cross-validation cost is about 0.005.

Here is a view of what the weights for each of the 67 output units encode:

Interestingly, only one output unit seems to encode a single line. The others all seem to encode linear combinations of lines, some much more strongly than others. The data shows that on average, 38 of the 67 output units are active (although not all the same ones), while at most 51 are active (again, not all the same ones).

Varying the number of output units affects the final cost, apparently with order less than log N. The end cost for 200 units is about 0.002, as is the cross validation cost. The average number of activated outputs appears to be a little over half.

We can learn a second layer of output units by running all the input samples through to the first output layer, and then using their binary outputs as inputs to another RBM layer. We would be using probabilistic binary outputs, so it is important to have enough samples that the next layer gets a good idea of what the input distribution is. We can use the probability outputs directly, but I've found, at least with this toy problem, that this doesn't seem to lead to significantly better results.

To try this, I'll use 100 units in the first layer, which could be overkill, and a variable number of units in the second layer, from 10 to 200. To get the cost, I can run the output all the way back to the input layer. Here's the result in log cost per pixel:

So this isn't so good: the error rate for the second layer is much higher than that for the first layer. One possibility is that the first layer is so good that the second layer is not necessary at all. But then I would have thought we would get at least the same error rate.

That's where I'll leave this article right now. Possible future investigations would be more complex inputs, and why layer 2 refuses to be as good as layer 1.

# My Cabinet of Obsolete Technologies

Over the years, I've bagged some technological items from before I was born, and some nostalgic items from the 80s.

A Radio Shack 40-155 "Personal Stereo Speaker System",
a 1985 Sony WM-F12 Walkman with headphones,
and a 1982 Radio Shack PC-2 Pocket Computer in its case.

1980 Sound Gizmo and 1980s Merlin

Arithma Addiator (with its case and stylus), a film sprocket thing,
a Burroughs punched card printing plate for CanTabCo, and small wooden slide rule

Modern fakes, but in the corner is a roll of J. L. Hammett aluminum foil
for a mimeograph

Vacuum tubes, CRTs, voltmeter

Dymo label machine, photomultiplier tube,
some individually wrapped screws for the Air Force

# Machine Learning: Autoencoders

An autoencoding algorithm is an unsupervised learning algorithm which seeks to recreate its input after processing the output. The layer or layers between input and output then become a representation of the input, which may have fewer or more dimensions than the input.

If the internal layer is further restricted so that only a very few of its components are active for any given input, then it is a sparse autoencoder. Generally, if the dimension of the internal layer is less than that of the input layer, then the autoencoder is performing, appropriately enough, dimension reduction. If, however, the number of dimensions is greater, then we enter the realm of feature detection, which, to me anyway, is a much more interesting application of autoencoding. In addition, feature detection appears to be how the brain handles input.

One of the challenges of feature detection is to ensure the internal layers don't degenerate to a trivial representation of the input, that is, simply repeating the input so that each feature is simply an input feature.

I'll start by talking about autoencoding via backpropagation. Before we tackle this, I'd like to rehash the mathematics of backpropagation, but this time in matrix form, which will be much easier to handle. So feel free to skip if you're not really interested.

### Backpropagation, a more thorough derivation

We start with the same diagram as before:

This time, however, we'll use matrix notation. The equation for the vector of activations for layer l is as follows:

\begin{align*} z^{(l)} &= W^{(l-1)}a^{(l-1)} + b^{(l)}\\ a^{(l)} &= g \left ( z^{(l)} \right ) \end{align*}

where:

• a(l) is a column vector of sl elements (i.e. an sl x 1 matrix), the activations of the neurons in layer l,
• b(l) is a column vector of sl elements (i.e. an sl x 1 matrix), the biases for layer l, equivalent to a fixed input 1 multiplied by a bias weight, separated out so we don't have to deal with a separate and somewhat confusing input augmentation step,
• W(l-1) is an sl x sl-1 matrix for the weights between layer l-1 and layer l, and
• g is a squashing function, which we can take to be the logistic function (for range 0 to 1) or the tanh function (for range -1 to 1). Or really any differentiable function.

A quick sanity check for z = Wa + b: W is sl x sl-1, a is sl-1 x 1, so multiplying W x a cancels out the middle, yielding sl x 1, which is consistent with the definitions for z and b.

Now, the cost function for a single data point x(i),y(i) is as follows:

$J^{(i)}(W, b) = \frac{1}{2}\left \| a^{(L,i)} - y^{(i)} \right \|^2$

|| a - y || is simply the Euclidean distance between a and y, otherwise known as the L2 norm. Note also that it is a scalar, and not a vector or matrix.

The cost over all data points, and adding a regularization term, is:

$J(W, b) = \frac{1}{m}\sum_i^m \frac{1}{2}\left \| a^{(L,i)} - y^{(i)} \right \|^2 + \frac{\lambda}{2}\sum_{l,u,v} \left ( W_{uv}^{(l)} \right )^2$

That last term simply means to take every weight between every neuron and every other neuron in every layer, square it, and add. We don't take any of the bias terms into the regularization term, as usual.

Now, first, we want to determine how gradient descent moves W(L-1) and b(L):

\begin{align*} W^{(L-1)} &\leftarrow W^{(L-1)} - \alpha \frac{\partial J(W,b)}{\partial W^{(L-1)}} \\ b^{(L)} &\leftarrow b^{(L)} - \alpha \frac{\partial J(W,b)}{\partial b^{(L)}} \end{align*}

This just says that we move W downhill in "J-space" with respect to W, and the same with b. Note that since W(L-1) is an sL x sL-1 matrix, then so too must the derivative of J with respect to W(L-1) be. And now let's compute those derivatives. First, the derivative with respect to the weights in the last layer:

\begin{align*} \frac{\partial J(W,b)}{\partial W^{(L-1)}} &= \frac{1}{m} \sum_i^m \left ( \left \| a^{(L,i)} - y^{(i)} \right \| \frac{\partial a^{(L,i)}}{\partial W^{(L-1)}} \right ) + \lambda W^{(L-1)} \\ \frac{\partial a^{(L,i)}}{\partial W^{(L-1)}} &= \frac{\partial g\left (z^{(L,i)} \right )}{\partial W^{(L-1)}} \\ &= \frac{\partial g\left (z^{(L,i)} \right )}{\partial z^{(L,i)}} \left ( \frac{\partial z^{(L,i)} }{\partial W^{(L-1)}}\right )^\top \\ \frac{\partial g\left (z^{(L,i)} \right )}{\partial z^{(L,i)}} &= g'\left (z^{(L,i)} \right ) \\ \frac{\partial z^{(L,i)} }{\partial W^{(L-1)}} &= a^{(L-1, i)} \\ \therefore \frac{\partial J(W,b)}{\partial W^{(L-1)}} &= \frac{1}{m} \sum_i^m \left ( \left \| a^{(L,i)} - y^{(i)} \right \| g'\left (z^{(L,i)} \right )(a^{(L-1, i)})^\top \right ) + \lambda W^{(L-1)} \end{align*}

Note that we just called the derivative of g with respect to its argument, g'. For the logistic and tanh functions, these are nice, compact derivatives:

\begin{align*} g'_{\text{logistic}}(z) &= g(z) \left ( 1-g(z) \right ) \\ g'_{\text{tanh}}(z) &= 1-g^2(z) \end{align*}

Since the argument of g (being z(L,i)) is an sL x 1 matrix, so too is its derivative. a(L-1,i) is an sL-1 x 1 matrix, its transpose is a 1 x sL-1 matrix, and thus g' x a is an sL x sL-1 matrix, which is consistent with what we wanted the size of the derivative of J with respect to W(L-1) to be.

And now with respect to the bias on the last layer:

\begin{align*} \frac{\partial J(W,b)}{\partial b^{(L)}} &= \frac{1}{m} \sum_i^m \left ( \left \| a^{(L,i)} - y^{(i)} \right \| \frac{\partial a^{(L,i)}}{\partial b^{(L)}} \right ) \\ \frac{\partial a^{(L,i)}}{\partial b^{(L)}} &= \frac{\partial g\left (z^{(L,i)} \right )}{\partial b^{(L)}} \\ &= \frac{\partial g\left (z^{(L,i)} \right )}{\partial z^{(L,i)}} \left ( \frac{\partial z^{(L,i)} }{\partial b^{(L)}}\right )^\top \\ \frac{\partial g\left (z^{(L,i)} \right )}{\partial z^{(L,i)}} &= g'\left (z^{(L,i)} \right ) \\ \frac{\partial z^{(L,i)} }{\partial b^{(L)}} &= 1 \\ \therefore \frac{\partial J(W,b)}{\partial b^{(L)}} &= \frac{1}{m} \sum_i^m \left ( \left \| a^{(L,i)} - y^{(i)} \right \| g'\left (z^{(L,i)} \right ) \right ) \end{align*}

Let us define:

$\delta^{(L,i)} = \left \| a^{(L,i)} - y^{(i)} \right \| g'\left (z^{(L,i)} \right )$

Note that this is an sL x 1 matrix. It is the contribution to the weight or bias gradient due to an "error" in output. We can now define our derivatives more compactly:

\begin{align*} \frac{\partial J}{\partial W^{(L-1)}} &= \frac{1}{m}\sum_i^m \delta^{(L,i)} (a^{(L-1,i)})^\top + \lambda W^{(L-1)}\\ \frac{\partial J}{\partial b^{(L)}} &= \frac{1}{m}\sum_i^m \delta^{(L,i)} \end{align*}

Now, what about the derivatives with respect to the previous layer weights and bias? The key insight in backpropagation is that we can generalize these derivatives as follows. For l from L to 2 (we start from L because these are recursive equations) we have:

\begin{align*} \frac{\partial J}{\partial W^{(l-1)}} &= \frac{1}{m}\sum_i^m \delta^{(l,i)} (a^{(l-1,i)})^\top + \lambda W^{(l-1)}\\ \frac{\partial J}{\partial b^{(l)}} &= \frac{1}{m}\sum_i^m \delta^{(l,i)} \\ \delta^{(l,i)} &= \begin{cases} & \left \| a^{(L,i)} - y^{(i)} \right \| g' \left ( z^{(L,i)} \right )\text{ if } l=L \\ & \left ( (W^{(l)})^\top\delta^{(l+1,i)} \right ) g' \left ( z^{(l,i)} \right ) \text{ if } l

A rigorous mathematical treatment for this is so completely outside the scope of this article as to be invisible :) But the general argument is that delta represents the contribution of a layer to the gradient based on the error between desired output and generated output. For the final layer, this is straightforward, and we can directly calculate it. However, for an internal layer, it is as if the errors from the next layer have propagated backwards through the weights, and so we can calculate, from output to input, the contributions of each layer.

### Backpropagation, the algorithm

First, zero out an accumulator for each layer. The accumulators have the same dimensions as the weight and bias matrices. So for l from 2 to L:

\begin{align*} D_w^{(l-1)} &\leftarrow 0 \\ D_b^{(l)} &\leftarrow 0 \\ \end{align*}

Second, compute all the forward activations a for a single data point. So, for l from 2 to L, we have:

\begin{align*} z^{(l)} &= W^{(l-1)}a^{(l-1)} + b^{(l)}\\ a^{(l)} &= g \left ( z^{(l)} \right ) \end{align*}

Compute the delta terms for l from L to 2, and add to the accumulators:

\begin{align*} \delta^{(l,i)} &= \begin{cases} \left \| a^{(l,i)} - y^{(i)} \right \| g' \left ( z^{(l,i)} \right ) & \text{ if } l=L \\ \left ( (W^{l})^\top \delta^{(l+1,i)}\right ) g' \left ( z^{(l,i)} \right ) & \text{ if } l

Next, after doing the above two steps for each data point, we compute the gradients for l from 2 to L:

\begin{align*} \frac{\partial J}{\partial W^{(l-1)}} &= \frac{1}{m} D_w^{(l-1)} + \lambda W^{(l-1)} \\ \frac{\partial J}{\partial b^{(l)}} &= \frac{1}{m} D_b^{(l)} \end{align*}

Finally, we use these gradients to go downhill, for l from 2 to L:

\begin{align*} W^{(l-1)} &\leftarrow W^{(l-1)} - \alpha \frac{\partial J}{\partial W^{(l-1)}} \\ b^{(l)} &\leftarrow b^{(l)} - \alpha \frac{\partial J}{\partial b^{(l)}} \\ \end{align*}

That is one round of updates. We start from zeroing out the accumulators to do the next iteration, and continue until it doesn't look like the cost is getting any lower.

Instead of the above, we could provide a function which, given W and b, computes the cost and the derivatives. Then we give that function to a library which does minimization. Sometimes minimization libraries do a better job at minimizing than manually doing gradient descent, and some of the libraries don't need a learning parameter (alpha).

### Adding a sparseness criterion

The whole reason for going through the derivation and not going straight to the algorithm was so that we could add a sparseness measure in the cost function, and see how that affects the algorithm.

First, if we have d dimensions in the input, then an autoencoder will be a d:1:d network.

We will first determine the average activation of layer 2 over all data points:

$\hat{\rho} = \frac{1}{m}\sum_i a^{(2)} ( x^{(i)} )$

Note that this is an s2 x 1 matrix. To be sparse, we want the values of each element to be very low. If we're using a logistic function, this means near to zero. If we're using the tanh function, near to -1, but we will rescale the average activation to lie between 0 and 1 by adding 1 and dividing by 2.

Let us denote our target sparsity for each element as ρ, so that we want our measured sparsity to be close to that. Clearly we don't want ρ=0, because that would give us a trival solution: zero weights everywhere.

For a sparsity cost, we will use the following measure, known as the Kullback-Leibler divergence:

$J_{\hat{\rho}} = \sum_k^{s_2} \rho \ln \frac{\rho}{\hat{\rho}_k} + (1-\rho) \ln \frac{1-\rho}{1-\hat{\rho}_k}$

Note that the sum applies element-by-element to the measured sparsity vector, and so the cost is a scalar. This cost is zero when each measured sparsity element is equal to the desired sparsity, and rises otherwise.

We add this cost to our main cost function as follows:

$J_\text{sparse}(W,b) = J(W,b) + \beta J_{\hat{\rho}}$

where β is just a parameter whereby we can tune the importance of the sparsity cost.

Without going through the derivation, we use the following altered delta for layer 2 during backpropagation:

$\delta^{(2,i)} = \left ( (W^{(2)})^\top \delta^{(3,i)} + \beta \left ( -\frac{\rho}{\hat{\rho}} + \frac{1-\rho}{1-\hat{\rho}} \right )\right ) g'(z^{(2,i)})$

That scalar term added due to sparsity is computed only after all data points have been fed forwards through the network, because that is the only way to determine the average activation of layer 2. It is independent, therefore, of i. So the modification to backpropagation would require this:

1. Going through the data set, compute all the way up to layer 2, and accumulate the sum of the activations for each neuron in layer 2.
2. Divide the sums by m, the number of points in the data set.
3. Perform one iteration of backpropagation.
4. Go back to step 1.

### Why I don't like this

It is said that backpropagation is not biologically plausible, that is, it cannot be the algorithm used by the brain. There are several reasons for this, chief among which is that errors do not propagate backwards in the brain.

A sparse backpropagating autoencoder is doubly implausible, because not only does it rely on backpropagation, but it also requires that we wait until all data points are presented before determining the average activation. It would be much nicer if we had something more biologically plausible, if only because I have the suspicion that any algorithm that is not biologically plausible cannot lead to human-level intelligence.

So in the next article, I'll talk about a biologically plausible algorithm called the reverse Boltzmann machine.

# Machine Learning: K-Means Clustering

K-means clustering is the first unsupervised learning algorithm in this series. Unsupervised means that the answer is not available to the learning algorithm beforehand, just the cost of a potential solution. To me, unsupervised learning algorithms are more exciting than supervised learning algorithms because they seem to transcend human intelligence in a way. An unsupervised learning algorithm will seek out patterns in data without any (or with few) hints. This seems especially important when, as the human, we don't know what the hints could possibly be.

The Google "Visual Cortex" project shows how powerful unsupervised learning algorithms can be: from millions of unlabeled images, the algorithm found generalized categories such as human faces and cat faces. It is easy to see that if the same thing could be done with an audio stream or a text stream, the streams could be combined at a high enough level for association to produce sounds and text for images, images for text and sounds, and at a high enough level, reasoning.

The K-means clustering algorithm treats data as if it were in clusters centered around some number of points k, one cluster per point. Conceptually, the algorithm picks k centroid points, assigns each point in the data to a cluster based on how close it is to which cluster's centroid, moves each centroid to the center of its cluster, and repeats. The result is a set of centroids which minimizes the distances between each point and its associated cluster's centroid.

The cost function is:

$J = \frac{1}{m}\sum_i^k \sum_{j \in C_i} (x_j - \mu_i)^2$

where Ci is cluster i, and μi is the centroid for cluster i.

There are a few methods for picking the initial centroids. One method, the Forgy method, involves picking k random points from the data set to be the initial centroids. Another method, the Random Partition method, assigns each data point to a random cluster, then produces the initial centroids for each cluster. Regardless of the initial method, the algorithm proceeds by repeating the following two steps:

First, produce the clusters by assigning each data point to one cluster. This means comparing the distance of a point to each centroid, and assigning the point to the cluster whose centroid yields the lowest distance.

Second, calculate the centroid of each resulting cluster.

Repeat these steps until the total cost does not change.

### A Concrete Example

I implemented the above algorithm in Java and ran it on the usual concrete strength data set. As usual, I set aside 20% of the data set as a cross-validation. But a problem quickly became apparent: how many clusters should I use? Clearly the more clusters, the less the overall cost would be simply because there would be more centroids.

One solution is to try different numbers of centroids and ask if there is an obvious point where there is not a lot of improvement in the cost. Here is what I found from k=2 through 10:

Blue is the training cost, while red is the cross-validation cost. Interestingly, the cross-validation cost was always below the training cost, indicating that the cross-validation points represent well the training points. There is clearly no overfitting because there is no large gap between costs. However, there is no obvious point at which increasing the number of clusters doesn't help much.

The other solution to the number of clusters relies on evaluating different numbers of clusters only after later processing. If downstream processing works better with a certain number of clusters, then that number of clusters should be chosen. So, for example, if I put each cluster's data through a neural network, how good is the error for each number of clusters?

I trained an 8:10:10:1 neural network on each cluster of points, so k=2 had 2 networks, and k=10 had 10 networks. I used fewer hidden neurons than before, on the theory that each cluster has less data, meaning that I can probably get away with a smaller parameter space. Here are the results:

Clearly the more clusters and the more networks, the better the output. Perhaps because more networks means smaller clusters, which in turn means less variation to account for. Interestingly, 8 clusters works about as well as 5 clusters, and it's only with 9 and 10 clusters that more advantage is found. In any case, choosing k=10, here are the errors:

Compared to training a single 8:20:20:1 network on the entire data set, clustering has definitely reduced the errors. Most errors in the training set are now under 5% (down from 10% before), and even the one troubling point from before (error 100%) has been knocked down to an error of 83%. The low errors in the cross-validation points -- which, remember, the network has never seen -- all lead us to believe that the networks trained are not overfit.

I would still want to look at those high-error points, perhaps even asking for the experimental data for those points to be rechecked or even rerun. But for now, I would be happy with this artificially intelligent concrete master.

For the next article, I'm going to go off the syllabus of the Machine Learning course, and talk about one of my favorite unsupervised learning algorithms, the autoencoder.

# Machine Learning: Feedforward backpropagation neural networks

If we take a logistic function as in logistic regression, and feed the outputs of many logistic regressions into another logistic regression, and do this for several levels, we end up with a neural network architecture. This works nicely to increase the number of parameters as well as the number of features from the basic set you have, since a neural network's hidden layers act as new features.

Each non-input neuron in a layer gets its inputs from every neuron from the previous layer, including a fixed bias neuron which acts as the x0 = 1 term we always have.

Rather than θ, we now call the parameters weights, and the outputs are now called activations. The equation for the output (activation) of neuron p in layer l is:

\begin{align*} z^{(l)}_q &= \sum_{p=0}^{s_{l-1}} w_{pq}^{(l-1)}a_p^{(l-1)}\\ a^{(l)}_q &= g\left (z^{(l)}_q \right ) \end{align*}

Breaking it down:

• w(l-1)pq is the weight from neuron p in layer l-1 to neuron q in layer l
• a(l-1)pq is the activation of neuron p in layer l-1, and of course when p=0, the activation is by definition 1.
• z(l)is the usual sum, specifically for neuron q in layer l.
• g is some function, which we can take to be the logistic function.

So we see that the output of any given neuron is a logistic function of its inputs.

We will define the cost function for the entire output, for a single data point, to be as follows:

$J^{(i)} = \frac{1}{2}\sum_{p=1}^{s_L} \left ( y_p^{(i)} - a_p^{(i, L)} \right )^2$

Note that we are using the linear regression cost, because we will want the output to be an actual output rather than a classification. The cost can be defined using the logistic cost function if the output is a classification.

Now, the algorithm proceeds as follows:

1. Compute all the activations for a single data point
2. For each output neuron q, compute:

$\delta^{(L)}_q = -(y_q^{(i)} - a_q^{(i, L)})(1 - a_q^{(i, L)})a_q^{(i, L)}$

3. For each non-output neuron p, working backwards in layers from layer L-1 to layer 1, compute:

$\delta^{(l)}_p = (1 - a_p^{(i, l)})a_p^{(i, l)}\sum_{q=1}^{s_{l+1}} w^{(l)}_{pq}\delta^{(l+1)}_q$

4. Compute the weight updates as follows:

$w^{(l)}_{pq} \leftarrow w^{(l)}_{pq} + \alpha a^{(l)}_p \delta^{(l+1)}_q$

The last step can, in fact, be delayed. Simply present multiple data points, or even the entire training set, adding up the changes to the weights, and then only update the weights afterwards.

Because it is extraordinarily easy to get the implementation wrong, I highly suggest the use of a neural network library such as the impressively expansive Encog as opposed to implementing it yourself. Also, many neural network libraries include training algorithms other than backpropagation.

### The Concrete Example

I used Encog to train a neural network on the concrete data from the earlier post. I first took the log of the output, since that seemed to represent the data better and led to less network error. Then I normalized the data, except I used the range 0-1 for both the days input and the strength output, since that seemed to make sense, and also led to less network error.

Here's the Java code I used. Compile it with the Encog core library in the classpath. The only argument to it is the path to the Concrete_Data.csv file.

The network I chose, after some experimentation checking for under- and overfitting, was an 8:20:10:1 network. I used this network to train against different sizes of training sets to see the learning curves. Each set of data was presented to the network for 10,000 iterations of an algorithm called Resilient Backpropagation, which has various advantages over backpropagation, namely that the learning rate generally doesn't have to be set.

As before, the blue line is the training cost, the mean squared error against the training set, and the red line is the cross-validation cost, the mean squared error against the cross-validation set. This is generally what I would expect for an algorithm that is neither underfitting nor overfitting. Overfitting would show a large gap between training and cross-validation, while underfitting would show high errors for both.

If we saw underfitting, then we would have to increase the parameter space, which would mean increase the number of neurons in the hidden layers. If we saw overfitting, then decreasing the parameters space would be appropriate, so decreasing the number of neurons in the hidden layers would help.

Since the range of the output is 0-1, over the entire training set we get an MSE (training) of 0.0003, which means the average error per data point is 0.017. This doesn't quite tell the whole story, because if an output is supposed to be, say, 0.01, and error of 0.017 means the output wasn't very well-fit. Instead, let's just look at the entire data set, ordered by value, after denormalization:

The majority of errors fall under 10%, which is probably good enough. If I were concerned with the data points whose error was above 10%, I might be tempted treat those data points as "difficult", try to train a classifier to train data points as "difficult" or "not difficult", and then train different regression networks on each class.

The problem with that is that I could end up overfitting my data again, this time manually. If I manually divide my points into "difficult" and "not difficult" points, then what is the difference between that and having more than two classes? How about as many classes as there are data points?

What would be nice is if I could have an automatic way to determine if there is more than one cluster in my data set. One clustering algorithm will be the subject of the next post.