The GPL requires providing the corresponding source, the preferred form for editing (here that would be the SVG source code, assuming that's how you edit it), for whatever files you are redistributing, so the GPL does actually qualify as "free documentation".
Well, you learn something new every day.
That sounds reasonable. Give me a few days to check that and apply the GPL to the SVGs; I’ll post them to the forum once I get the time.
your latest diagram seems pretty good to me, though I would leave out the amplifiers since it's supposed to show the high-level logical layout, not the individual components needed. Also, FPGAs don't have separate amplifiers that you can wire to run a SRAM, the SRAM is an indivisible block.
I'm drawing this primarily for 14 nm to 5 nm CMOS custom silicon, which is my main interest. I don't know much about FPGAs, but I'll try to make the drawings reasonably applicable to them as well.
Key differences I noted:
- On FPGAs, physical GPRs are usually implemented in BRAM, so you don't need the large "OPERAND MUX" structure required on custom silicon.
- LUT-based muxes are slow on FPGAs.
- FPGAs can place a lot of registers into BRAM.
One thing we could do that you would have a hard time showing in that diagram is: for each of the replica physical registers that are only wired to an ALU input that only reads some of the bits, you can store only those bits that would be read, so the registers on a memory-store pipeline's inputs could only be 64-bits wide, or if there was a input that only ever read flags, it could be just those flag bits that are read so it could be only like 5 bits wide.
That could fit in the "register unit" diagram, but to show it accurately I need a map that assigns operations to ALUs — like the example I provided here:
[ Idea for changing the register file and muxing to make it faster - #9 by SecurityEnthusiast ]
Without an ALU-operation map I can't determine which ALUs can use truncated inputs, so I recommend not adding truncated-register detail until we have that mapping.
assuming accesses to the L2 register file is sufficiently rare, it's fine if the CPU is much slower in pathological cases where it needs to move stuff in/out of the L2 registers (e.g. 100 div instructions in a row will almost certainly fill up the div unit's registers, but that's extremely uncommon code)
The key word is “assuming.” For custom silicon that assumption doesn’t hold.
That isn’t a rare pathological case on custom silicon — it happens whenever the sustained average ALU-result rate exceeds about one result per cycle for a window (e.g., ~100 instructions). That will overwhelm L2REGS write bandwidth, and if you’ve also exhausted free physical registers (quite likely), the CPU will slow down.
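To put rough numbers on that, here is a toy back-of-the-envelope model. It is only a sketch: the function name, the single L2REGS write per cycle, and the 32 free registers are my assumptions for illustration, not figures from an actual design.

```python
import math


def cycles_until_stall(result_rate, l2_write_bw, free_phys_regs):
    """Cycles until the free physical registers run out, or None if never.

    Toy model (assumed): each cycle, `result_rate` new results each claim a
    physical register, and at most `l2_write_bw` old values are spilled to
    L2REGS, each spill freeing one physical register.
    """
    net_growth = result_rate - l2_write_bw
    if net_growth <= 0:
        return None  # L2REGS write bandwidth keeps up, so no stall
    return math.ceil(free_phys_regs / net_growth)


# Hypothetical numbers: 1 L2REGS write per cycle, 32 free physical registers.
for rate in (1.0, 1.5, 2.0):
    print(rate, cycles_until_stall(rate, 1.0, 32))
```

In this model any sustained rate above the write bandwidth eventually drains the free list; a higher rate only makes it happen sooner.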
I think I understand what you meant for FPGAs: BRAM blocks typically have hundreds of entries (e.g., ≥256), so you can provision many more physical registers and avoid spilling into L2REGS.
In that case, why are you even bothering with L2REGS on FPGAs? They make sense only on custom silicon.
On FPGAs, just increase the number of physical registers to some huge amount.
Or tell me if I've misunderstood something about FPGAs again.
increasing the number of physical registers is a trade-off, the general idea I'm following is that we have a standard register-renaming CPU except that because the μOps have a lot of architectural registers (128 for now, most of the time only a few of them (16-64) are in use because PowerISA and x86 don't actually have all that many integer/fp registers),
Power ISA and RISC-V each have 32 integer GPRs; x86-64 has 16. Which 128 registers do you mean?
we add the L2 register file so we can actually store all of the architectural registers without wasting a huge number of the physical registers that would be rarely used (e.g. if the architectural register is used for storing the rarely accessed parts of XER or FPSCR or EFLAGS).
I lost you here.
I think you mixed terms there (it happens). The L2 register file stores values for RENAMED (physical) registers, not for architectural registers.
If we're building a design that has more than around 10 or so units,
"10 units" is problematic for both custom silicon and FPGAs:
- Not a power of two — wastes multiplexer resources
- Requires a long Broadcast Bus, which reduces speed
- I don't see a reason for that many ALUs ... (7+1) or (3+1) configurations are more sensible. I recommend (7+1).
- You may run out of BRAM blocks on FPGAs: each ALU requires, on its output, 4 BRAM blocks (72 bits total, dual-port). Each replica needs 4 blocks per unit, and you need one replica per unit, so for N units the total grows as N².
Example: for 10 units, each replica needs 10 × 4 = 40 BRAM blocks, and you need one replica per unit. That's 400 BRAMs. Not likely.
It is also problematic on custom silicon because of broadcast-bus length and increased routing/area.
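The BRAM arithmetic above, as a quick sketch. The function name is mine; the 4 blocks per ALU output is the figure from my estimate above, and real numbers depend on the FPGA family.

```python
def bram_blocks(units, blocks_per_alu_output=4):
    """Total BRAM blocks when every unit keeps a replica of all ALU outputs."""
    per_replica = units * blocks_per_alu_output  # one replica covers all units
    return per_replica * units                   # one replica per unit


# (3+1), (7+1) and the proposed 10-unit configuration.
for n in (4, 8, 10):
    print(n, bram_blocks(n))
```

That gives 64 blocks for (3+1), 256 for (7+1), and 400 for 10 units.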
I recommend:
- On custom silicon: split L2REGS into two banks to improve write bandwidth.
- On FPGAs: remove L2REGS; instead provision many more physical registers in BRAM.
- Keep flags separate from the physical GPR file (don’t merge them).
- Keep integer and FPU units separate.
If we're building a design that has more than around 10 or so units, we'd likely want to reduce how much the units are interconnected, so the register renaming stage can just insert a copy instruction (which takes one or more cycles to move the data to the other side of the CPU) when there isn't a direct connection between the source and destination unit. This scheme is actually much more general than just having separate integer and floating-point physical register files, since you can do things like splitting into 3 groups of units for floating-point multiply, load/store and bitwise and/or/xor, and integer ops, or any arbitrary strongly-connected graph you please, as long as you have enough copy units connecting them to handle moving values from any unit to any unit.
I’ll try to decode that passage later — it doesn’t make much sense to me; I can barely follow what you meant.
Ah — I think I get it now ... uh...
That idea basically amounts to separate FPU and integer units (and possibly other units like SIMD/Vector).
If you meant splitting the INTEGER unit into smaller sub-units to reduce connectivity, that won’t work well with the ISAs you plan on supporting.
My ISA (posted somewhere on this forum) does support partitioning the integer unit that way.
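For reference, here is how I currently read the quoted copy-instruction scheme, as a minimal sketch. The group names, the ring layout, and the function name are hypothetical, and I'm assuming each copy link costs exactly one inserted copy instruction.

```python
from collections import deque


def copies_needed(copy_links, src, dst):
    """Fewest copy instructions to move a value from group src to group dst.

    copy_links maps a unit group to the groups it has a copy unit towards.
    Returns None if dst is unreachable (the graph is not strongly connected).
    """
    if src == dst:
        return 0  # direct connection inside a group, no copy required
    seen = {src}
    queue = deque([(src, 0)])
    while queue:  # breadth-first search, so the first hit is the shortest path
        group, hops = queue.popleft()
        for nxt in copy_links.get(group, ()):
            if nxt == dst:
                return hops + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


# Hypothetical 3-group split from the quote, wired as a one-way ring.
RING = {"fp_mul": ["mem_logic"], "mem_logic": ["int"], "int": ["fp_mul"]}
print(copies_needed(RING, "fp_mul", "int"))  # two hops around the ring
```

The renaming stage would insert that many copy μOps before the consumer, so every extra hop adds latency to cross-group dependencies.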