In the first article, we designed an 8-bit ALU. We still have to add the flags C, V, N, and Z.
The C flag is the carry out flag. This is set whenever an addition results in a carry or when a subtraction does not result in a borrow. We already have this flag: it is the C8 output from the lookahead carry unit. However, shifting a 1 in the MSB left should also result in a carry out.
The V flag is the overflow flag. It is set only on arithmetic operations (shift and logical operations clear it) whenever the result of an addition or subtraction using signed operands is less than -128 or greater than 127 (because such numbers cannot be represented as 8-bit 2's complement numbers). This happens when the carry in to the MSB (i.e. C6) is different from the carry out from the MSB (i.e. C7).
The N flag is the negative flag, and is set on all operations whenever the MSB is set.
Finally, the Z flag is the zero flag, and is set only if all X bits are zero (i.e. the result of an operation is zero).
Because the Z flag is the only flag that is a nontrivial function of X, there is a delay of 1 after the ALU output becomes valid in order that the Z flag is valid. It is possible to compute Z in parallel with X, but this would require 34 inputs (the F3 and F4 bits, plus eight bits from each of the four multiplexer inputs), and it would still result in a delay of 1 because we took the M1 calculation out of the multiplexer and put it in X so that we wouldn't have to wait for the lookahead carry unit.
Let us now move on to registers. Since the operands may come from anywhere, and the result may go anywhere, we will add registers to the inputs and outputs. Thus, we now have an A and B register, a C (conditions, or flags) register, an X register, and an F (function) register. All registers except X must be writable by the user. C and X must also be readable.
The carry bit in the C register must be duplicated into a separate one-bit memory cell because the carry out and carry in flag are one and the same, and the constraints of reversible logic prohibit reading and writing the same memory cell within the same chain of logic. When the C register is written by the user, the extra carry bit may be written simultaneously. However, when the C register is written by the ALU, the extra carry bit is copied from the C register only after the ALU (and hence the read of the extra carry bit) is reversed (enabling the extra carry bit to be written).
All registers are eight bits wide. This means there are four read/write bits available in the C register for the user, and three unused bits in F for expansion.
It is a definite advantage to have X be copied into A. This way, an extra instruction to explicitly copy X into A need not be present in a program which performs several functions in a row on the ALU (for example, adding B several times to A, shifting A several times, etc). Thus, the A register becomes a true accumulator.
3. Physical layout
Note: I seem to be using layer and level interchangeably. Let them both mean a physical plane of rods side by side.
Let's take the first two sets of equations of the ALU bit slice. Recall that the equations for the first set are:
and the next set are:
except that M1 was removed from this set and placed in the next set (for X).
The first set of equations requires 7 inputs, and has a total of 9 AND terms and 2 outputs. The next set has 7 inputs (2 of which are the outputs from the previous set), 11 terms, and 3 outputs. The rods could be physically laid out as follows (side view):
This shows a box that has a size of 7 x 11 x 5: a maximum of 7 rods along one axis, 11 rods on the perpendicular axis, and 5 layers of logic.
Note that F0, F1, and F2 are repeated, meaning we have to add a way to get these inputs from the same place (the F register) but send them to different levels.
One possibility is to carry F0, F1, and F2 through the layers until they are no longer needed:
This has the effect of increasing the size of the box to 7 x 12 x 5, and adding 3 rods. The 5 outputs in the middle now include F0, F1, and F2.
Because rods have two sides, another possibility is that we can make the F0, F1, and F2 input rods in the middle "double-sided", so that the 9 terms use those rods as inputs for F0, F1, and F2 instead of the input rods at the top:
This keeps the size of the box at 7 x 11 x 5, and does not require adding any more rods. The F inputs are still on different levels, but at least they are not repeated anymore.
Here is yet another possibility:
In this layout, all the inputs are on the first level. There are two sections side by side at the second level: the first being 9 terms, and the second being 11 terms. At the third level are also two sections side by side, the first being the outputs for the first set of equations, and the second section being the equations for M. This box is 9 x 20 x 3, which bigger in two axes but smaller in one. It has the same minimum number of rods as the previous possibility, but all the inputs are on the same level. Each section has to be driven at the right time, but the drive mechanism previously described is capable of doing that.
With all of these layouts, All the F and A inputs need to feed more than one box. F needs to feed all boxes at the same location relative to each box, while three consecutive but different A bits feed different boxes, except for the LSB, where An-1 is different, and the MSB where An+1 is different.
So the big question is: is it possible for one output to be many inputs at the same time? Certainly having an output rod and any number of identity rods on the next level would satisfy the inputs being the output, but it would also add a delay. Is it possible to have no delay?
One possibility to consider is to use Babbage's design for getting an output from one place to another, used several times here:
The interesting mechanism here is the push rod, pushed by the studs on the barrel, which rotates a shaft via an arm. The shaft rotates another arm on its other end, which results in a rod being pushed elsewhere. Energy is lost wherever a rod attaches to a shaft since the translation from linear to rotational motion and v.v. is not 100% efficient, and hence more force than usual needs to be used to push a rod that pushes another rod elsewhere.
To make an output rod feed N input rods, it is possible to have the output rod rotate a long shaft, off of which are attached other arms pushing other rods. Or possibly pulling other rods: it may be advantageous to pull a rod from its opposite end because the drive mechanism pushing adjacent rods may be in the way.
Let's go ahead and lay out the rest of the bit slice, assuming the 9 x 20 x 3 box. The last equation is for X, and has one new input, Cn-1, 7 terms, and one output:
Since we have eight bits, we just need to have eight of these boxes. They could be laid out side by side, or on a 3 x 3 grid, which could be better since then the transmission of data to and from the registers wouldn't require long parts.
Let's now look at the design of the LCU. Recall the equations:
And the modification to the P equation:
So first we have the two G and P equations for each bit, which taken together requires 2 inputs, and 2 outputs. Next, we have the eight carry equations (C0-C7), together requiring 17 inputs (8 G, 8 P, plus Cin), 44 intermediate terms, and 8 outputs.
We could build one single box as follows:
If we laid out our X boxes in a 3 x 3 grid, then the grid would be 27 rods wide. The LCU would stick out way beyond that.
Instead, we could split the LCU into three boxes, one on top of the other. The top box contains the carries for X0, X1 and X2 (which are Cin, C0, and C1), the middle box contains the carries for X3, X4, and X5 (which are C2, C3, and C4), while the bottom box contains the rest of the carries (C5 and C6 for X6 and X7, and C7 being the carry out):
This way, the LCU boxes are stacked on top of each other, and the maximum width is 24 instead of 44, which compares favorably with the 27 width of the X boxes.
What's next? Figuring out how to implement this in brass!