Building LMARV-1: a tangible RISC-V processor, part 1

A new project, hooray!

Background

RISC-V is not a processor in the sense of an ARM processor or x86. It is, in fact, an open specification for an Instruction Set Architecture (ISA). This means that the instruction set is standardized by the [RISC-V Foundation], and then anyone can implement [that instruction set] free of any licenses, royalties, or legal requirements.

RISC-V is the result of much study of existing ISAs. The result is regular, which makes an instruction decoder easy to create. It's also modular. There is only one subset of the ISA that is required to be implemented, the 32-bit integer set. This is called the 32I extension, meaning that all instructions and registers are 32 bits. Everything else is optional. For example, the M extension include multiplication and division instructions. A is for atomic operations. F and D are for single- and double-precision floating point. The G extension means "General", and means that you implement all of IMAF and D.

In general, if your implementation doesn't implement a particular G extension, you will have to provide libraries that do the same thing in software. Otherwise a compiler wouldn't be able to compile certain standard expressions.

There's also a 64-bit extension, 64I, and a 128-bit extension, 128I. There is a 32E extension, which is for smaller (e.g. embedded) processors which halves the number of required integer registers. Quad-precision floating points are in the Q extension. There's even an extension, C, for compressed instructions, which allows 16-bit and variable-length instructions. 

Some other interesting extensions, which have not yet been frozen (i.e. fully developed and agreed upon), are V (vector instructions), L (decimal floating point, as in calculators), and B (bit manipulation).

Speaking of extensions, RISC-V is extensible. If you have some piece of specialized hardware that you want custom instructions for, there is a whole range of opcodes reserved for that. Of course, your compiler would have to support those custom opcodes, or you could just wrap the assembly language.

And speaking of instructions, it should be pretty clear that the ISA is a reduced-instruction set. There are fewer than 50 instructions in the G extension! All the instructions are very simple. There aren't even any condition codes or flags such as carry, zero, or overload (these can be handled by other instructions).

RISC-V specifies four levels of privilege. The highest level is the machine level, which is the only required level. At this level all instructions have access to all of memory and all peripherals. The next level down is the hypervisor level, which is for things like virtual machines. Then there is the supervisor level below that, for operating systems and kernels. Finally, the lowest privilege level is the user level, which is for applications.

The plan

Most, if not all, projects implementing a RISC-V processor use an FPGA. However, I want to build a RISC-V processor that you can see and touch the insides of. So that you can learn about how a RISC-V processor actually works by observing it. I plan to use MSI and LSI chips, so things like buffers, flip flops, and so on. As few programmable chips as possible.

This first part is going to be about building the registers for the processor, which I call the LMARV-1 (Learn Me A RISC-V, level 1). It is level 1 because I only plan on implementing the 32I extension. Later levels add more features.

An instruction

There are several [formats of instructions], but the most interesting one specifies two source registers and a destination register:

xxxxxxx2222211111xxxdddddxxxxxxx

Here, d stands for a destination register, 1 for one source register, and 2 for a second source register. For example, adding two registers and storing the result in a third would use this instruction format.

Another format only specifies a destination and one source register. And a third format only has a destination register, for example for storing an immediate value. But the interesting thing is that the destination and source registers are all in the same position in the instruction. The instruction set is therefore regular. 

Registers

There are 32 registers called x0-x31, in addition to a program counter register, pc. In the 32I extension, these are all 32-bit registers. Interestingly, x0 is a fixed value, zero. Writing to it does nothing, and reading from it always yields zero. Compilers often set aside one register for a fixed zero value, and this just formalizes that.

The RISC-V spec also declares which registers are for what use. This is called the ABI spec, or Application Binary Interface. Each register has an alias name in the ABI. For example the name for x0 is "zero".

I am not too concerned with the ABI, since I'm not writing a compiler. I'm building the hardware that the compiler will write programs for.

So, here's my idea of what a single register should look like:

For this setup, I'm using a 74LVT16374, which is a 16-bit D flip-flop. LV means it's a low-voltage (3.3 volt) part, and T is the technology used, called ABT, or [BiCMOS]. This is like TTL but also low-power like CMOS.

IMG_20180208_193016.jpg

As a destination register, we can clock data from a destination bus into this register. We can also have two source buses, and use the 74LVTH162541 16-bit tristate buffer. This controls which source bus the register outputs on: one, the other, both, or none.

There are LEDs which can show the state of the register, which is important for a visible processor.

Now of course, these are all 16-bit parts. There are no equivalent 32-bit parts. So we would have to multiply the number of chips per register by two, making six. And then there are 31 registers (aside from the zero register), so 96 chips.

Here's an alternate setup:

IMG_20180208_193023.jpg

Here, I store the same value in three places: two sources and a display (the unlabeled box at the bottom, which is also a '16374). This has several advantages: cost, and drive capability. The '16374 can drive 32mA, while the '162541 can only drive 12mA.

And here it is!

IMG_20180206_163640.jpg

The LEDs on it are specifically 3mm flangeless LEDs. The reason they are flangeless is that the flange on the LED adds to the 3mm width, which would require the board to be about 10% larger.

The two card edges go into two PCIe x4 slots, for a total of 196 signals. Some of these signals are used for power and ground between bits on the bus, for signal integrity.

You can see a [video version] of this post on my YouTube channel.

And, all of [my schematics] (KiCAD) are on GitHub and are [Open Source Hardware].

 

Retroassembler to C

Yesterday I drove from California's Bay Area to the Central Coast, a four-hour drive, to pick up some vintage HP equipment so that I could save on shipping and having another damn crate in my garage that I'd have to cut up and stick in the garbage a piece at a time. Anyway, during those eight hours, while listening to the [Retro Computing Roundtable] podcast, I had a stray thought.

Probably not a powerful enough stray thought to do this, though:

 Honestly, I need another project like I need a hole in the head.

Honestly, I need another project like I need a hole in the head.

The thought went like this. Emulators exist for retro processors such as the [6502] and [Z80]. Actually, emulators exist for whole systems: [MAME] (Multi Arcade Machine Emulator / Multi Emulator Super System), for one. [Dosbox] runs [on the Internet Archive], so you can play old MS-DOS games in your browser.

Anyway, these emulators faithfully reproduce the processor down to the instruction level. Now, when I read an assembly listing, sometimes I translate the instructions to C in order to help me understand what's going on. What if you could do that for the whole program and then compile the resulting C code?

Before continuing, anyone interested should check out Jamulator, a project which takes 6502 binaries, generates assembly language from it, and uses that as the input language to LLVM, which then compiles it for a native processor. It's brilliant.

There's a whole host of problems with this idea, though. For one emulators are plenty fast enough. Also, there's the whole flow analysis thing that [IDA] does so well but still needs human input for. And, LLVM is pretty tough to develop for (I've tried). But it's getting easier.

Consider the task of adding two 16-bit numbers on an 8-bit processor. On a 6502, it would go something like this:

LDA SRC1      ; A <- low byte of SRC1
ADD SRC2      ; A <- A + low byte of SRC2
STA DEST      ; low byte of DEST <- A
LDA SRC1+1    ; A <- high byte of SRC1
ADC SRC2+1    ; A <- A + high byte of SRC2 + carry
STA DEST+1    ; high byte of DEST <- A

Now, a naive translation to C would result in something like this:

#include <stdint.h>

void add16(uint8_t* src1, uint8_t* src2, uint8_t* dest)
{
    uint8_t a = *src1;
    a += *src2;
    uint8_t c = a < *src1;
    *dest = a;
    a = *(src1+1);
    a += *(src2+1) + c;
    *(dest+1) = a;
}

Using the Compiler Explorer, I was able to compile this into X86 assembly using clang (at optimization level 3):

add16: # @add16
  mov al, byte ptr [rsi]
  add al, byte ptr [rdi]
  mov byte ptr [rdx], al
  mov al, byte ptr [rsi + 1]
  adc al, byte ptr [rdi + 1]
  mov byte ptr [rdx + 1], al
  ret

That's as close as we're going to get to optimized code. Clang recognized the carry trick.

Note that clang didn't go the extra step of recognizing this as a 16-bit add. For one, this would be an unaligned add. For another, compilers generally excel in optimizing code from the top down, not from the bottom up like we're trying to do. Nevertheless, clang did a great job here.

Not quite with gcc.

add16:
  movzx eax, BYTE PTR [rdi]
  add al, BYTE PTR [rsi]
  setc cl
  mov BYTE PTR [rdx], al
  movzx eax, BYTE PTR [rsi+1]
  add al, BYTE PTR [rdi+1]
  add eax, ecx
  mov BYTE PTR [rdx+1], al
  ret

Here gcc did recognize the carry (hence the SETC instruction), but failed to take advantage of that in the later add, instead having to add twice. gcc did not recognize that we only used one bit in the uint8_t c, and so was forced to add the entire carry uint8_t.

I even tried hinting that c was a single bit:

#include <stdint.h>

typedef struct flags
{
    unsigned c:1;
} flags;

void add16(uint8_t* src1, uint8_t* src2, uint8_t* dest)
{
    flags f;

    uint8_t a = *src1;
    a += *src2;
    f.c = a < *src1;
    *dest = a;
    a = *(src1+1);
    a += *(src2+1) + f.c;
    *(dest+1) = a;
}

Nope, same output. clang still generated the correct output. Note that clang 4.0.1 did not fare very well, and it was only by clang 5.0.0 that the output became nicely optimized.

Anyway, that's all I wanted to say. I don't intend this to go any further, but it sure would make an interesting project for someone else!

Reverse-engineering the 1977 Unisonic 21 calculator/game (part 5.1)

Here's the next part of the segment driver circuit.

segment_sch_driver_left2.png

The output on the left goes to the latch-type output circuit from the previous post. On the right we have three inputs to this circuit. IN_A and X2 are common to all segments. IN_21C is an input specific to the segment at pin 21.

We can already guess that C3 is likely a bootstrap capacitor, so plays no logical role. It is also likely that C2 is a boosting capacitor for Q1. That means that X2 is the boosting signal.

Since C2 and C3 are boosting capacitors, we can ignore their effect in the steady state and concentrate on how the thing works logically.

X2 gates IN_21C at Q6. So the gate of Q5 is low only when Q6 is on and IN_21C is low.

Now, when Q5 is on, IN_A gets passed through to the gate of Q4. So when IN_A is low, that turns Q4 on, pulling the output low.

As we saw in part 5, pulling this output low will cause the output pin to latch high. The output can only go low when it is reset by that CE/CG combination.

So we've deduced that the output pin gets latched high only when IN_A goes low, IN_21C goes low, and X2 goes low.

Now, let's try simulating this. For this simulation, I am going to use the [MIC94030], because it is pretty much the only four-terminal PMOS left in existence, and I plan (hopefully) to use it for a dis-integrated version of the Unisonic 21.

segment_sch_driver_left2_sim.png

It's a little busy, but I want to call your attention to a few things. First, I removed all the capacitors, and added pull-up resistors at all the gates.

Second, even though there is no existing SPICE model for this PMOS, I derived three parameters from the datasheet (Vt0 from the nominal threshold, and kp and lambda from the saturation graphs). So I have no idea if the simulation is accurate.

I've labeled successive gates in the circuit as nodes g1, g2, g3 and g4. Note that g1 goes low whenever IN_B and IN_21C are low, and that g2, g3, and g4 go low when all three inputs are low, as predicted. The output goes high when all three inputs are low.

What is important is the voltage at each gate when it is low. Going from g1 to g4, we appear to be losing a volt after every transistor. This is expected: without being boosted, you lose a threshold. This was explored in [the previous post about bootstraps].

It seems this loss of four volts is not enough to ruin the output of the last transistor which, after all, would have to have a gate voltage below 11v to turn on. Still, it would be nice to fix the issue. We can do that by adding bootstrap capacitors.

Think of bootstrap capacitors as adding feedback from the output back to the gate. If the output goes a little low, that just makes the gate go lower.

Here's one bootstrap capacitor. I set its capacitance so that you can just see the sag in the output.

segment_sch_driver_left2_sim_bootstrap1.png

We can see now that thanks to the bootstrap capacitor, g3 is able to reach a lower voltage than before. This gets translated to a lower voltage at g4. The capacitor, however, is not large enough to maintain its voltage over the period of the pulse.

Here I've increased the bootstrap capacitance from 1n to 10n:

segment_sch_driver_left2_sim_bootstrap10n.png

Interestingly, adding a capacitor across the transistor M4 screws things up:

segment_sch_driver_left2_sim_bootstrap10n_screwup.png

It seems that this capacitor is causing trouble. But adding one across M2 works fine, reducing the low voltage at g4 as expected:

segment_sch_driver_left2_sim_bootstrap10n_noscrewup.png

I suspect bootstrap capacitors only work if one terminal of the transistor is grounded.

Anyway, this is another apparently successful simulation, meaning that it can probably be made physical.