"At the moment when 9 passes to 0, a lever will be moved, thus recording the necessity of a carriage to the figure above; the carriage is made subsequently, and for the Analytical Engine a method of performing the carriages all simultaneously was invented by my father which he called 'Anticipating' Carriage."—Babbage, Henry P., The Analytical Engine, in Proceedings of the British Association, 1888.

Since I've successfully built a rod logic 1-bit full adder — does that actually count as a completed project? — I think it's time to design an ALU, the **Babbage-Boole Digital Arithmetical and Logical Mill**. My design is based on bit slicing. Each bit position in the ALU operands is considered separate, and each such bit slice is combined to form the full width ALU.

The general design of a bit slice computes all required functions of a particular bit in the operands in parallel, and then uses a multiplexer to choose between the functions:

Here is the ideal list of functions I want the ALU to perform. A and B are the operands, while X is the output.

- X = A + B (with carry)
- X = A - B (with borrow)
- X = A & B (bitwise and, Boolean equation X = AB)
- X = A | B (bitwise or, Boolean equation X = A + B)
- X = A ^ B (bitwise xor)
- X = ¬A (bitwise invert)
- X = A << 1 (shift left)
- X = A >> 1 (shift right, logical and arithmetic)
- X = A + 1 (increment)
- X = A - 1 (decrement)
- X = -A (arithmetic negate)

In addition to these functions, the ALU should have a carry-in, and flags for carry-out (C), overflow (V), zero (Z), and negative (N).

While Charles Babbage did not include any logical operations in the Analytical Engine, the Engine being an arithmetic device, he did include arithmetic shift operations, which he called Stepping Up (multiplying by 10) and Stepping Down (dividing by 10).

By representing numbers in two's complement, we can use the add operation in subtraction. Consider that the two's complement of a positive number P is P, while the two's complement of a negative number N = -P is (¬P)+1 (where ¬P is the bitwise inverse, or negation, of P). Thus, A-B = A+(-B) = A+(¬B)+1. Since we have defined A+B as including a carry-in bit, we can set carry to 1, bitwise negate B, and then add A.

If we do not set carry before subtraction, this would compute A-B-1, which is appropriate if we had to borrow 1 for a previous subtraction operation. Thus, we see that the carry-in bit is an inverted borrow-in bit.

By requiring the user to set carry (i.e. reset borrow) before subtraction, we can use a function bit to specify that B gets bitwise negated before being operated on. This also allows us to increase the number of logical operations, so we now have A & ¬B, A | ¬B, and A ^ ¬B. By adding another function bit to bitwise negate A as well, we can include all possible (non-identity) bitwise operations (NOT, AND, OR, NAND, NOR, XOR, XNOR). By adding that function bit, we end up allowing B-A as well, since that is just (¬A)+1+B.

We could also choose to include an add without carry and a subtract without borrow operation, but for now we'll leave that out unless it turns out to be cheap to add it. The user can easily implement these in software by setting or resetting the carry bit before performing the add with carry or substrat with borrow operations.

Incrementing and decrementing by one also becomes easy. An increment is simply A+0 with carry set, and a decrement is A-0 with carry reset (borrow set). Similarly, arithmetic negate is 0-A with carry set (borrow reset). So we can have yet another function bit to ignore B (instead of requiring the user to zero B).

With all this in mind, let's see what a bit slice now looks like:

Here we have the promised two function bits to invert A and B, plus a bit to zero B. The multiplexer has three function bits, and so chooses between eight inputs: previous bit of A, sum, AND, OR, XOR, A, B, or next bit of A. The previous and next bits of A are used for the shift operations. The bit less than the LSB is always 0, as is the bit higher than the MSB, which fulfills the requirements for logical shift.

Note that C_{inN-1} is the carry from the sum of the previous bit, or carry-in if N=0. We will be using a carry lookahead unit to generate the carries for each bit, rather than use a ripple-carry which would add a lot of delay. The carry lookahead unit also generates the carry-out bit.

The only operation that is not handled by this architecture is arithmetic shift right. This operation is the same as logical shift right, except the MSB remains the same. We will see later if we can squeeze this operation in.

Let's enumerate the 64 possible functions of the ALU based on this bit slice:

Now, unless I counted wrong, there are only 26 unique operations listed, and two of them (0 and 1) are silly, leaving 24 operations. This means that we could encode the different functions with only five function bits instead of six.

The plan to do this:

- Change the A
_{N-1}input to be A_{N+1}whenever F2 is set. - Change the OR input to be XOR whenever F2 is reset.
- Change the AND input to A, B, ¬A, or ¬B based on the combination of F2, F1, and F0.
- Set B to zero only if F4 = 0 and F3 = 1 and F2 = 0
- Remove the now-redundant four inputs and extra function bit from the multiplexer.

This is much better. Let's take a look at the Boolean equations that drive this thing. First, the equations for A and B after they are optionally inverted:

First, these are AND-OR equations. Rod logic requires AND-NOR equations. However, since inversions to logical inputs are free, requiring only the movement of a nub, we can treat these as AND-NOR equations, remembering that the output is inverted.

We characterize each AND-OR equation in terms of the number of inputs and the number of terms. The number of inputs determines the number of input rods, and the number of terms determine the number of intermediate rods. There is always a single output rod. Also, each AND-OR equation requires two steps to carry out after the inputs are set: push the intermediate rods, then push the output rod. So the delay of an AND-OR equation is always 2.

For the A equation, there are 2 inputs and 2 terms, and for the B equation there are 5 inputs and 7 terms. There are no inputs or terms in common, so this first stage requires a total of 18 rods with a delay of 2.

Next, the four multiplexer inputs:

Taken together, there are 8 inputs and 15 terms, plus the 4 output rods making a total of 27 rods and a delay of 2.

Finally, the output:

This equation has 6 inputs, 4 terms, 1 output, making a total of 11 rods and a delay of 2. Although we end up with an inverted output, this can be re-inverted in later stages.

So, from inputs to output is a total delay of 6.

Getting back to arithmetic shift right, we see that it should be trivial to add this function. Shift right is done when F2 = 0. We can say that arithmetic shift right is done when F1 = 0, and logical shift right when F1 = 1. The equation for M0 remains the same for all bits except the most significant bit, where it is replaced by this equation:

All we have to do is add two more inputs and one more term (3 more rods) and we have added arithmetic shift right to the ALU.

Next, let us handle the lookahead carry. Using a lookahead carry unit (LCU) is not inauthentic. Indeed, Charles Babbage utilized lookahead carries through the part of the Analytical Engine he called an Anticipating Carriage.

The LCU will use the standard generate/propagate equations:

The first stage computes the generate and propagate bits. The generate and propagate bits taken together require 2 inputs, 3 terms, plus the 2 output rods, for a total of 7 rods per bit, and a delay of 2.

We can cheaply reduce the delay and number of rods by inverting the propagate bits:

This changes the number of rods to 4 per bit, and a delay of 1, since the equations are now only AND equations.

The carry equations, taken together, require 2W+1 inputs, W(W+1)/2 + W intermediate terms, and W outputs, and a delay of 2 where W is the number of bits wide the ALU is. Also note that C_{W-1} is the carry out.

So the total delay for the lookahead carry unit (the LCU) is 3.

The delay of the LCU is a problem, because the output of the LCU is required for the M1 equation, and the inputs to the LCU requires the A' and B' equations. This means that M1 has to be delayed by 3 to allow enough time for the LCU to generate its output, meaning that the entire ALU now has a delay of 9 instead of 6.

We can reduce the delay by moving the calculation of M1 into the calculation for X as follows:

Now we don't have to calculate M1 (saving 6 rods: 1 input, 4 terms, and 1 output), but we have to add rods to X (3 inputs and 4 terms as opposed to 1 input and 1 term). This results in a net loss of 1 rod, and a net loss of 2 delays, since the computation of the LCU can occur in parallel with the computations of M0, M2, and M3 (plus one delay to allow the LCU to be ready for the calculation of X).

So now we have an ALU with a total delay of 7 before output. Sadly, there is no win by moving the generate/propagate calculations into the carry equations, so this is the best we can do.

Note that we never specified how wide the ALU is. Because of the bit slice architecture, the number of rods (with the exception of the LCU) is linear in the width of the ALU. Unfortunately, the number of rods in the LCU is quadratic in the width of the ALU, so a reasonable width has to be chosen. I will choose eight bits as the minimum reasonable size.

The next article will deal with the physical layout of the ALU, condition flags, and register considerations.