So now that I've had a chance to think long and hard about the design of the arithmetic logic unit, or Logical Mill as Babbage would have called it, I realized that I had several errors in the previous discussion. I also reorganized the Mill to make it possible to increase the number of bits more easily. I have also tested it thoroughly in simulation. I think I am also able to more coherently discuss the design, so here goes.

## Overview

First, here is the list of operations that the Logical Mill can perform:

The Mill is based on the bit-slice architecture, which means that regardless of how many bits the Mill has, the output for a given bit is based on the inputs for that specific bit, and on no other bit. There may, however, be logic outside the bit slices to provide signals that the bit slices need. For example, in order to add two bits, we need a carry in which comes from a separate section of logic.

## The Bit Slice

Here is the first section of the bit slice. It is a kind of preprocessor of the bits:

A and B are the inputs to the bit slice, and F is a 5-bit function selector. The preprocessor simply inverts A if F_{0} is 1. This is useful for logic functions, for example simply inverting A bitwise. It is also useful arithmetically, since inverting A is equivalent to multiplying by -1 (and subtracting 1, but that's another story).

What the postprocessor does to B is slightly more complicated. The multiplexer selects either B or 0 for forwarding to the optional inverter. 0 is selected if F is 010xx, which corresponds to the four arithmetic additions without B. Thus, if F is 0100x the second operand to the additions is 0, and if F is 0101x, the second operand is 1111...1111 (i.e. -1). Otherwise, if F is 011xx, these are the four arithmetic operations which include B, inverted or not depending on F_{1}.

Here are the equations corresponding to the above diagram, in AND-OR form, which is the easiest to translate to rod logic:

The final part of the bit slice is a multiplexer, which computes all the functions (shift, add, xor, and so on) and selects one of them based on the three highest bits of F.

In the above diagram, the function for F=111xx is logical or, not addition. Addition is handled by F=010xx and 011xx, as described above. S_{R} is a function which computes the right-shift, S_{L} is a function which computes the left-shift, and U is a function which computes the unary logical operations (A, B, or inversions of same). These are the specific equations for S_{R}, S_{L}, and U:

In the above equation, M is a bit indicating whether the bit slice is for the most significant bit, which is important for right shift to determine whether the carry should be shifted in (logical shift) or whether the most significant bit should remain the same (arithmetic shift).

Likewise, L is a bit indicating the same. Note that for both shift right and shift left, the carry is only shifted in when A isn't inverted. Also, the carry isn't shifted in on arithmetic shift right, because the highest bit is kept. I decided that it didn't make much sense to shift carry in when shifting the invert of A.

So, the bit slice in block form looks like this:

## The Lookahead Carry Unit

P and G are special outputs called Propagate and Generate. These are used in the external function which feeds each bit slice with its carry in, and this external function is the Lookahead Carry Unit. The naive way to perform addition is to compute the output and the carry out of the lowest bit, then use the carry out as the carry in of the next bit, compute its output and carry out, and so on. This is a ripple carry, and leads to delays.

A lookahead carry unit, on the other hand, looks at all of the bits and computes carry in for every bit. Consider two bits, A and B, and a carry in. Regardless of the carry in, if A and B are both 1 then a carry out will be generated. If A or B are 1, then a carry in will result in a carry out: the carry in is propagated:

The carry in for bit 0 is simply the carry in to the Mill. The carry in for bit 1 is set if bit 0 generates a carry, or if bit 0 propagates a carry and carry in is set. Similarly, the carry in for bit 2 is set if bit 1 generates a carry, or bit 0 generates a carry and bit 1 propagates a carry, or if bits 0 and 1 propagate a carry and carry in is set. And so on for each carry.

In this way, all the carries can be computed at once, and thus the delay suffered by the last bit is the same as the delay suffered by the first bit, and this delay is quite small.

The equations also show a group propagate and a group generate. The group propagate indicates whether the LCU propagates a carry in, and the group generate indicates whether the LCU generates a carry out. This is useful when we want to increase the number of bits. Simply multiply the hardware by four (thus 16 bits and 4 LCUs) and add an LCU to read in the group propagate and generate signals from each LCU from the first stage. The second stage LCU then generates the carries that feed the first stage LCUs, and the first stage LCUs generate the carries feeding the bit slices:

The advantage is that the same LCU architecture used in the first stage is also used in the second stage. This can be repeated ad infinitum, with every doubling of bit width resulting in a doubling of hardware, but only a linear increase in delay.

## Flags

Finally, we have the flags. I want Z (zero), N (negative), V (overflow), and C (carry out) flags. The equations for these flags are as follows:

The reason carry out is not simply C_{4} from the LCU is that for shift left and shift right, the carry out is the most significant and least significant bits, respectively. In addition, the carry out should always be zero for the logical operations.

## Rod Logic Diagram

I've constructed a convenient method for describing rod logic. Here is an example of an identity gate (top) and an inverter (bottom):

On the left are the rod logic diagrams, and on the right are the corresponding physical realizations.

The line with feathers indicates an input, while the line with an arrowhead indicates an output. The order of pushing the rods is determined by following the directions of the arrows.

A filled-in circle indicates identity, while an open circle is an inverter.

So, here is what an XOR gate looks like:

Let's start with the LCU, since it is fairly simple:

The reason that C_{in} is an inverted input is that all the carry outs are also inverted. Also, since the group Generate signal comes out inverted, the Generate inputs are also inverted.

Here is the bit slice itself. It is much more complicated, and has a few oddities which I will describe below.

First, notice the G and P outputs (these are output from level 3, where level 0 is the input level). Note that G is not inverted. In fact, it cannot be inverted unless we add a delay, and we don't want to do that. Thus, there will in fact be two versions of the LCU: one with G inputs negative as in the diagram above, and one with G inputs positive. I welcome any suggestions to have G be output negative as long as they do not add too many additional rods or any additional delay.

Note also that levels 7 and 8 do nothing. They are, in fact, present to provide a delay of 2. The LCU itself outputs the carry outs after a delay of 2, and this additional delay is here to show what we would need if we had an additional level of LCUs, thus making a delay of 4 from the P and G outputs.

The delay of this bit slice is 9.

The diagram for the flags module comes next:

Here the V and C outputs are negative, and this is necessary to prevent additional delay. Thus, the total delay of the Mill from input to flags is 11.

## Conclusion

I've designed the core of the Logical Mill so that it can be extended to any number of bits. I will start by constructing a 4-bit Logical Mill, and then provided the cost is not too much, scale up to a 16-bit Mill.

The registers surrounding the Mill still have to be designed, but I can start without them.