Schematics: Libre Chip Execution Unit — Discussion & Suggestions

In this thread I will post schematic drawings of the Libre Chip Execution Unit and proposals for improvements, with discussion.

The proposals may address either the schematic drawings themselves or the design of the Execution Unit.

So far I have a drawing of the "Physical Registers" unit, which covers everything that reads from the unit's Broadcast Bus.

These drawings are intended as improvements over the original schematics, which can be found here: [Register Renaming]

Currently there are two drawings:

  • The first part of the schematics covers the interface to the ALU units and the L2-registers unit.
  • The second part covers the Reg-Read unit, which outputs to the Store Unit.

If there are other units on the Broadcast Bus, please point them out so they can be added to the second schematic — I couldn’t think of any others offhand.

Please post comments about errors in the drawings or suggested changes.

There is another thread that discusses an alternative Register Unit design, here: [ Idea for changing the register file and muxing to make it faster ]

The drawings are A4-paper friendly.
Here are the schematics:


I forgot to post this nice estimation.

  • it is in NAND2 gate equivalent area.
  • 1 "gate" = 1 NAND2 area = 4 transistors area (tightly packed).

These gate counts include the empty space lost to wiring inefficiency.
For the adders, roughly half of the quoted area is such lost space, so to get actual gate and transistor counts, halve the numbers in the table.

To convert to LUT4 equivalents, divide the counts by 2.8.

            25 FO4    64-bit inc = 1200 gates    64-bit adder =  3000 gates
            20 FO4    64-bit inc = 1560 gates    64-bit adder =  4500 gates
            15 FO4    64-bit inc = 2100 gates    64-bit adder =  7000 gates
            11 FO4    64-bit inc = 3100 gates    64-bit adder = 10500 gates
            10 FO4    64-bit inc = 3400 gates    64-bit adder = 12500 gates
             9 FO4    64-bit inc = 3800 gates    64-bit adder = 18000 gates
             8 FO4    64-bit inc = 4400 gates
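
For convenience, here is a small Python sketch that applies the conversions stated above (the halve-for-wiring rule for adders, the 4-transistors-per-NAND2 figure, and the divide-by-2.8 LUT4 conversion); the example value is the 15 FO4 adder from the table:

    # Rough converter for the NAND2-equivalent area numbers in the table above.
    # Assumptions, all taken from the notes above: 1 gate = 1 NAND2 = 4 transistors,
    # LUT4 equivalents = quoted gates / 2.8, and for adders about half of the
    # quoted area is wiring overhead, so actual gates ~= quoted / 2.

    TRANSISTORS_PER_GATE = 4
    GATES_PER_LUT4 = 2.8

    def convert(quoted_gates, adder=True):
        """Return (actual_gates, transistors, lut4) for a quoted NAND2-equivalent count."""
        actual = quoted_gates / 2 if adder else quoted_gates   # halve for wiring overhead
        return actual, actual * TRANSISTORS_PER_GATE, quoted_gates / GATES_PER_LUT4

    # Example: the 15 FO4 64-bit adder quoted at 7000 NAND2 equivalents
    print(convert(7000))   # roughly (3500.0, 14000.0, 2500.0)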

Normalization timing data (latencies):

            FO1 NOT          = 0.4 FO4
            FO4 NOT          = 1.0 FO4
            FO1 NAND2        ≈ 0.692 FO4
            FO1 NAND3        ≈ 0.934 FO4
            FO1 half-adder   ≈ 1.0 FO4  (from carry to carry)
            FO1 full-adder   ≈ 2.0 FO4  (from carry to carry)

Also forgot to mention: there is a nice fast and compact 64-bit adder:

  • 9.9 FO4, 12 000 NAND2 gate equivalent area (at 7nm FF, static complementary pull-up pull-down CMOS).

I addressed this some more further down.

The GPL requires providing the corresponding source, the preferred form for editing (here that would be the SVG source code, assuming that's how you edit it), for whatever files you are redistributing, so the GPL does actually qualify as "free documentation".

your latest diagram [1] seems pretty good to me, though I would leave out the amplifiers since it's supposed to show the high-level logical layout, not the individual components needed. Also, FPGAs don't have separate amplifiers that you can wire up to run an SRAM; the SRAM is an indivisible block.

One thing we could do that you would have a hard time showing in that diagram is: for each of the replica physical registers that are only wired to an ALU input that only reads some of the bits (e.g. an ALU input that's only ever used for carry-in, or a memory-store pipeline that only ever reads the 64 data bits), you can store only those bits that would be read, so the registers on a memory-store pipeline's inputs could be only 64 bits wide, or if there were an input that only ever read flags, it could store just those flag bits that are read, so it could be only about 5 bits wide.

assuming accesses to the L2 register file are sufficiently rare, it's fine if the CPU is much slower in pathological cases where it needs to move stuff in/out of the L2 registers (e.g. 100 div instructions in a row will almost certainly fill up the div unit's registers, but that's extremely uncommon code)

increasing the number of physical registers is a trade-off, the general idea I'm following is that we have a standard register-renaming CPU except that because the μOps have a lot of architectural registers (128 for now [2], most of the time only a few of them (16-64) are in use because PowerISA and x86 don't actually have all that many integer/fp registers), we add the L2 register file so we can actually store all of the architectural registers without wasting a huge number of the physical registers that would be rarely used (e.g. if the architectural register is used for storing the rarely accessed parts of XER or FPSCR or EFLAGS).

If we're building a design that has more than around 10 or so units, we'd likely want to reduce how much the units are interconnected, so the register renaming stage can just insert a copy instruction (which takes one or more cycles to move the data to the other side of the CPU) when there isn't a direct connection between the source and destination unit. This scheme is actually much more general than just having separate integer and floating-point physical register files, since you can do things like splitting into 3 groups of units for floating-point multiply, load/store and bitwise and/or/xor, and integer ops, or any arbitrary strongly-connected graph you please, as long as you have enough copy units connecting them to handle moving values from any unit to any unit.


  1. ↩︎

  2. some of the architectural and physical registers are hardwired as constants, e.g. the zero register on RISC-V ↩︎

I saw the reply; I’ll respond to it later.

Better REG-READ unit

While I was redrawing the REG-READ schematic, I realized a way to both simplify the scheduler and improve the speed.

Problems with the earlier schematic:

  • The integer unit had no straightforward path to communicate with the FPU.
  • The REG-READ unit relied on a truncated physical register file, which complicates forwarding and scheduling.

Proposed changes:

  • Replace the truncated physical register file in the REG-READ unit with a full replica of the physical register set. This simplifies the scheduler, since all replicas are then uniform.
  • Add forwarding from REG-READ directly to the FPU so the integer unit has a way of sending results to the FPU.
  • Merge REG-READ and L2RF into a single unit (see the "MAYBE" box on the new schematic).

I haven’t drawn the combined schematic yet.

Why this is better:

  • Higher data-store bandwidth (reads are possible from any physical register).
  • No additional physical-register replicas, and no non-uniform replicas.
  • Reduced load on the Broadcast Bus (fewer readers means lower load and therefore slightly higher speed).

I’m still refining this; if the idea is accepted I’ll produce the combined REG-READ/L2RF schematic next.

Here is the proposal:

Here is the drawing of the new REG-READ unit merged with the L2RF unit.
Ah, I forgot to add an amplifier before the Broadcast Mux ... never mind. I'll add it in some future version of the schematic.

The GPL requires providing the corresponding source, the preferred form for editing (here that would be the SVG source code, assuming that's how you edit it), for whatever files you are redistributing, so the GPL does actually qualify as "free documentation".

Well, you learn something new every day.

That sounds reasonable. Give me a few days to check that and apply the GPL to the SVGs; I’ll post them to the forum once I get the time.

your latest diagram seems pretty good to me, though I would leave out the amplifiers since it's supposed to show the high-level logical layout, not the individual components needed. Also, FPGAs don't have separate amplifiers that you can wire up to run an SRAM; the SRAM is an indivisible block.

I'm drawing this primarily for 14 nm to 5 nm CMOS custom silicon, which is my main interest. I don't know much about FPGAs, but I'll try to make the drawings reasonably applicable to them as well.

Key differences I noted:

  • On FPGAs, physical GPRs are usually implemented in BRAM, so you don't need the large "OPERAND MUX" structure required on custom silicon.
  • LUT-based muxes are slow on FPGAs.
  • FPGAs can place a lot of registers into BRAM.

One thing we could do that you would have a hard time showing in that diagram is: for each of the replica physical registers that are only wired to an ALU input that only reads some of the bits, you can store only those bits that would be read, so the registers on a memory-store pipeline's inputs could be only 64 bits wide, or if there were an input that only ever read flags, it could store just those flag bits that are read, so it could be only about 5 bits wide.

That could fit in the "register unit" diagram, but to show it accurately I need a map that assigns operations to ALUs — like the example I provided here:

[ Idea for changing the register file and muxing to make it faster - #9 by SecurityEnthusiast ]

Without an ALU-operation map I can't determine which ALUs can use truncated inputs, so I recommend not adding truncated-register detail until we have that mapping.
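
To make the dependency concrete, here is a tiny sketch of how truncated replica widths would fall out of such a map; the units, ports and widths below are hypothetical placeholders, not a proposed assignment:

    # Hypothetical example only: the units, ports and widths below are placeholders,
    # not an agreed op-to-ALU mapping. Given such a map, the width of each replica
    # register follows directly from which bits its consumer can ever read.
    port_reads = {
        ("ALU0",  "src_a"):    64,  # full data word
        ("ALU0",  "carry_in"):  1,  # only ever reads the carry bit
        ("STORE", "data"):     64,  # memory-store pipeline: data bits only
        ("ALU1",  "flags_in"):  5,  # only ever reads flag bits
    }

    FULL_WIDTH = 64 + 5 + 1   # assumed full-replica layout: data + flags + carry

    for (unit, port), bits in port_reads.items():
        print(f"{unit}.{port}: {bits}-bit replica, saves {FULL_WIDTH - bits} flops per entry")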

assuming accesses to the L2 register file are sufficiently rare, it's fine if the CPU is much slower in pathological cases where it needs to move stuff in/out of the L2 registers (e.g. 100 div instructions in a row will almost certainly fill up the div unit's registers, but that's extremely uncommon code)

The key word is “assuming.” For custom silicon that assumption doesn’t hold.

That isn’t a rare pathological case on custom silicon — it happens whenever the sustained average ALU-result rate exceeds about one result per cycle for a window (e.g., ~100 instructions). That will overwhelm L2REGS write bandwidth, and if you’ve also exhausted free physical registers (quite likely), the CPU will slow down.
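
To put rough numbers on that, here is a sketch of the pressure argument; the ALU count, window length, free-register count and L2REGS write bandwidth are all illustrative assumptions (as is the assumption that the excess results eventually have to be written to L2REGS), not agreed parameters:

    # Illustrative only: every parameter below is an assumption.
    results_per_cycle = 4     # sustained ALU results per cycle (assumed)
    l2_write_ports    = 1     # L2REGS write bandwidth, results per cycle (assumed)
    window            = 100   # length of the burst, in cycles (assumed)
    free_phys_regs    = 32    # free physical registers at the start (assumed)

    # Results that cannot be drained to L2REGS pile up in physical registers.
    backlog = (results_per_cycle - l2_write_ports) * window
    if backlog > free_phys_regs:
        stall_cycles = (backlog - free_phys_regs) / l2_write_ports
        print(f"backlog of {backlog} values exceeds {free_phys_regs} free physical "
              f"registers; roughly {stall_cycles:.0f} cycles of stalling to drain it")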

I think I understand what you meant for FPGAs: BRAM blocks typically have hundreds of entries (e.g., ≥256), so you can provision many more physical registers and avoid spilling into L2REGS.

In that case, why are you even bothering with L2REGS on FPGAs? They make sense only on custom silicon.
On FPGAs, just increase the number of physical registers to some huge amount.

Or tell me if I again misunderstood something about FPGAs.

increasing the number of physical registers is a trade-off, the general idea I'm following is that we have a standard register-renaming CPU except that because the μOps have a lot of architectural registers (128 for now, most of the time only a few of them (16-64) are in use because PowerISA and x86 don't actually have all that many integer/fp registers),

There are 32 integer GPRs in Power ISA and in RISC-V, and about 16 in x86-64. Where do the 128 registers come from?

we add the L2 register file so we can actually store all of the architectural registers without wasting a huge number of the physical registers that would be rarely used (e.g. if the architectural register is used for storing the rarely accessed parts of XER or FPSCR or EFLAGS).

I lost you here.

I think you mixed terms there (it happens). The L2 register file stores values for RENAMED (physical) registers, not for architectural registers.

If we're building a design that has more than around 10 or so units,

"10 units" is problematic for both custom silicon and FPGAs:

  • Not a power of two — wastes multiplexer resources
  • Requires a long Broadcast Bus, which reduces speed
  • I don't see a reason for that many ALUs ... (7+1) or (3+1) configurations are more sensible. I recommend (7+1).
  • You may run out of BRAM blocks on FPGAs: each ALU requires 4 BRAM blocks on its output (for a total of 72 bits, dual-port), and the total BRAM count grows with the square of N units (see the sketch after this list).
    Example: with 10 units, each replica needs 10 × 4 = 40 BRAM blocks, and you need one replica per unit, so that's 400 BRAMs. Not likely.
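
Here is the quick sanity check of that scaling; the 4-BRAMs-per-output figure is the one from the list above, and the unit counts are just examples:

    # BRAM cost of full all-to-all replication: every unit keeps a replica of
    # every unit's output registers, so the count grows with N^2.
    BRAMS_PER_OUTPUT = 4   # 72-bit, dual-port output register (figure from above)

    def brams_needed(n_units):
        per_replica = n_units * BRAMS_PER_OUTPUT   # one replica covers all N outputs
        return n_units * per_replica               # and each unit needs its own replica

    for n in (4, 8, 10):
        print(n, "units ->", brams_needed(n), "BRAMs")
    # 4 -> 64, 8 -> 256, 10 -> 400, matching the 400-BRAM estimate above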

It is also problematic on custom silicon because of broadcast-bus length and increased routing/area.

I recommend:

  • On custom silicon: split L2REGS into two banks to improve write bandwidth.
  • On FPGAs: remove L2REGS; instead provision many more physical registers in BRAM.
  • Keep flags separate from the physical GPR file (don’t merge them).
  • Keep integer and FPU units separate.

If we're building a design that has more than around 10 or so units, we'd likely want to reduce how much the units are interconnected, so the register renaming stage can just insert a copy instruction (which takes one or more cycles to move the data to the other side of the CPU) when there isn't a direct connection between the source and destination unit. This scheme is actually much more general than just having separate integer and floating-point physical register files, since you can do things like splitting into 3 groups of units for floating-point multiply, load/store and bitwise and/or/xor, and integer ops, or any arbitrary strongly-connected graph you please, as long as you have enough copy units connecting them to handle moving values from any unit to any unit.

I’ll try to decode that passage later — it doesn’t make much sense to me; I can barely follow what you meant.

Ah — I think I get it now ... uh...

That idea basically comes down to: separate FPU and integer units (and possibly other units like SIMD/Vector).

If you meant splitting the INTEGER unit into smaller sub-units to reduce connectivity, that won’t work well with the ISAs you plan on supporting.

My ISA (posted somewhere on this forum) does support partitioning the integer unit that way.
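
If I've decoded it correctly, the requirement is just that the graph of unit groups plus copy units lets a value reach any group from any group. Here is a minimal reachability sketch of that condition; the three groups are the ones named in the quoted text, and the copy-unit wiring is an arbitrary illustration:

    # Sketch: check that every unit group can reach every other group, directly
    # or through copy units. The grouping and the copy links are illustrative only.
    from collections import deque

    links = {                       # directed "can forward a result to" edges
        "fp-mul":     ["copy-A"],
        "ldst-logic": ["copy-A", "copy-B"],   # load/store and bitwise and/or/xor
        "int":        ["copy-B"],
        "copy-A":     ["fp-mul", "ldst-logic", "int"],
        "copy-B":     ["fp-mul", "ldst-logic", "int"],
    }

    def reachable(src):
        seen, todo = {src}, deque([src])
        while todo:
            for nxt in links.get(todo.popleft(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        return seen

    groups = ["fp-mul", "ldst-logic", "int"]
    print(all(set(groups) <= reachable(g) for g in groups))   # True for this wiring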

I wondered: why dedicate an entire unit to secondary functionality? Instead, why not give every unit an ALU and let one special unit (REG-READ) provide the extra capabilities?

So the REG-READ unit now includes its own ALU. It may be smaller than the other ALUs, but it still adds meaningful compute capacity.

I’m also considering adding a very small ALU after the L2REGS file. There appears to be an idle timing pocket there, so we could use it to perform simple, common operations.

Here’s the updated diagram:

This version is optimized for slightly higher speed:

An even better variant:

ALU Timings on 7 nm FinFET CMOS

Here is my quick estimation of timings for some important operations in ALUs, and for ALU result multiplexers:

  Timing unit:      1 TU  =  1.0 FO4 inverter delay

 MUX2, fast data (A/B)          0.8 TU
 MUX2, equalized inputs         1.2 TU
 MUX2, fast select              1.0 TU

 adder            64/32-bit    10.2 TU
 add / sub / cmp  64/32-bit    11.0 TU
 shift / rot      64/32-bit     9.0 TU
 mul long   64 × 64 -> 128     38   TU  (depends a lot on gate count)

 adder           32-bit         9.4 TU
 inc/dec/abs  64/32-bit         8.0 TU

 Total available ALU latency, for 1-cycle operations:  
      11.0 TU    (effectively the same as for "add/sub/cmp 64/32-bit")
     + 0.7 TU    (for operand gating, i.e. power reduction)
    = 11.7 TU   TOTAL: ALU latency on 7nm CMOS

Conclusion:

  • Since "shift/rot" is faster than the "total available ALU latency", other operations should be muxed with shift/rot so that the final ALU delay is equalized for all ALUs (to 11 TU); a worked check follows below.
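
For example, with the numbers above, the shift/rot path has 11.0 − 9.0 = 2.0 TU of slack, which is enough for roughly two MUX2 levels on the fast data input (2 × 0.8 TU = 1.6 TU) before it reaches the 11 TU target.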

I've got another table of delays for you, an approximation for a 7 nm FinFET high-performance library. It comes in handy.

  • Cin (PU) = input capacitance per input, normalized so the INV-upsized input = 1.00 PU.
  • g = typical logical effort per input (unitless).
  • p = typical parasitic delay, normalized so the intrinsic inverter has p = 1.
  • delay computed with H = 1 (fanout of 1): f = g × H, delay = (f + p) × 0.5 TFO1; conversion: 1 TU = 2.5 TFO1 (see the sketch after the table).
+--------------------------------+----------+------+-------+------------+
| Gate / Input                   | Cin (PU) |  g   |   p   | delay (TU) |
+--------------------------------+----------+------+-------+------------+
| Inverter (INV)                 | 0.87     | 1.00 | 1.00  | 0.40       |
| INV-upsized                    | 1.00     | 1.00 | 1.00  | 0.40       |
| INV-mid                        | 1.13     | 1.00 | 1.00  | 0.40       |
| NAND2                          | 0.96     | 1.33 | 2.00  | 0.67       |
| NOR2                           | 0.96     | 1.50 | 2.20  | 0.74       |
| NAND3                          | 1.00     | 1.67 | 3.00  | 0.93       |
| NAND4                          | 1.04     | 1.90 | 3.80  | 1.14       |
| NOR3                           | 1.04     | 2.00 | 3.00  | 1.00       |
| AOI21 — A, B                   | 1.00     | 1.75 | 2.20  | 0.79       |
| AOI21 — C                      | 0.96     | 1.40 | 2.20  | 0.72       |
| MUX2-D — A/B                   | 0.96     | 1.50 | 2.50  | 0.80       |
| MUX2-D — S with NS             | 1.00     | 1.90 | 4.20  | 1.22       |
| MUX2-FS-SS — A/B               | 1.00     | 2.40 | 3.85  | 1.25       |
| MUX2-FS-SS — S                 | 1.04     | 1.80 | 3.20  | 1.00       |
| MUX2-E-SS — A/B/S              | 1.00     | 2.10 | 3.25  | 1.10       |
+--------------------------------+----------+------+-------+------------+
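
The delay column follows directly from the model in the bullet list above; here is a small sketch that reproduces it from g and p (the pairs are copied from the table), which also comes in handy for estimating other gates or fanouts H > 1:

    # Reproduce the delay column from g and p using the model above:
    # f = g * H, delay = (f + p) * 0.5 TFO1, and 1 TU = 2.5 TFO1.
    def delay_tu(g, p, H=1.0):
        return (g * H + p) * 0.5 / 2.5

    gates = {                      # (g, p) pairs copied from the table
        "INV":          (1.00, 1.00),
        "NAND2":        (1.33, 2.00),
        "NOR2":         (1.50, 2.20),
        "NAND3":        (1.67, 3.00),
        "NAND4":        (1.90, 3.80),
        "NOR3":         (2.00, 3.00),
        "AOI21 - A,B":  (1.75, 2.20),
        "MUX2-D - A/B": (1.50, 2.50),
    }

    for name, (g, p) in gates.items():
        print(f"{name:14s} {delay_tu(g, p):.2f} TU")   # matches the table (rounded)

    # Sanity check: a fanout-of-4 inverter is 1.0 TU (= 1 FO4) by construction
    assert abs(delay_tu(1.00, 1.00, H=4) - 1.0) < 1e-9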