Idea for maybe allowing much wider fetch/rename/etc. widths in a CPU

Some background:
Modern CPUs that use register renaming are running into a limitation: fetching/decoding/register-renaming/etc. a large number of instructions per clock cycle is hard, because it's really complex to cram all of the register renaming into a single clock cycle (the next batch of instructions to be renamed needs the results of the current batch).

It just occurred to me that if we can come up with a register renaming algorithm with less than O(n) latency, for example something shaped like a parallel prefix sum, then we can eliminate the latency bottleneck in register renaming: let renaming take 2-3 cycles, but rename more than 2-3x as many instructions per batch. The CPU would e.g. fetch and decode 16 instructions each cycle, then register-rename and do execution unit assignments in batches of 32 instructions where each batch takes 2 cycles (taking advantage of the O(log n) latency of parallel-prefix-sum-style algorithms), and then send those instructions to the execution units, split back into 16 instructions per cycle. That way we can build a CPU with crazy wide throughput without having to cram all the complex management logic into one cycle, allowing us to still have high clock frequencies.
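To make the prefix-sum shape concrete, here's a rough software model (all names and the tag scheme are mine, purely illustrative). Within a batch, each source register must be renamed to the newest earlier in-batch writer, or to the existing mapping if there is none. The "which instruction most recently wrote each register" information composes with an associative merge, so hardware could compute it with a Kogge-Stone-style scan in O(log n) depth; this Python model just runs the scan sequentially.

```python
# Hedged sketch: batch register renaming as a parallel prefix scan.
# The combine operator merges "register -> newest writer tag" maps and is
# associative, which is what would give O(log n) depth in hardware.

def combine(older, newer):
    # Later writes override earlier ones.
    merged = dict(older)
    merged.update(newer)
    return merged

def exclusive_scan(maps):
    # Sequential model of a parallel prefix scan (exclusive: entry i does
    # not include instruction i's own write).
    prefix = []
    acc = {}
    for m in maps:
        prefix.append(dict(acc))
        acc = combine(acc, m)
    return prefix

def rename_batch(batch, next_tag):
    # batch: list of (dest_reg, [src_regs]); tags are just sequential ints.
    writes = [{dest: next_tag + i} for i, (dest, _srcs) in enumerate(batch)]
    before = exclusive_scan(writes)
    renamed = []
    for i, (_dest, srcs) in enumerate(batch):
        # A source renames to the newest in-batch writer before it, else
        # it falls back to the architectural mapping (marked 'arch' here).
        renamed.append((next_tag + i,
                        [before[i].get(s, ('arch', s)) for s in srcs]))
    return renamed
```

For example, a batch where instruction 2 reads registers written by instructions 0 and 1 picks up their freshly allocated tags, without any sequential chain through the whole batch.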

I meant O(n) latency. edited.

To be clear, I'm not proposing we implement this in our CPU design right away, but if it works it could be a big improvement later.

I’m a bit confused by your usage of the term "register renaming" here. I think you are talking about what I would call "instruction issue ordering".

Every time I tried to figure it out, it appeared too complex for me.

My suggestion: Do something simple to start with. Like: dual-issue, with some minimal instruction re-ordering. Prove that it is SPECTRE-resistant. Then, later, you can add more complexity to it, after the basic implementation problems have been solved.

I was also thinking about some numbers:
Let’s say you have a 12-stage pipeline, and a CPU that executes 1.5 instructions per cycle. Then there is some kind of “average“ of 12 × 1.5 = 18 “future“ instructions in the pipeline. Of those 18 instructions, about 3 will be branches on average. Technically, every security/privilege verification is also a branch, and that will add more branches.
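As a quick check of that arithmetic (the one-branch-per-6-instructions figure is my assumption, a common rule of thumb for integer code):

```python
# Sanity check of the in-flight numbers above.
pipeline_depth = 12
ipc = 1.5
instructions_per_branch = 6   # assumed average for integer code

in_flight = pipeline_depth * ipc                          # 18 instructions
in_flight_branches = in_flight / instructions_per_branch  # ~3 branches
```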

As a consequence, you end up needing an extremely high number of decoders, because they need to prefetch and decode instructions along many of those branch paths.

So, the pipeline should have a shape of a funnel: wide at the decoding stage, and then slowly getting narrower and narrower, in terms of the number of simultaneous streams/branches processed by the pipeline stage.

For each stage of the pipeline, you can define a “branch width” number, which tells how many branches/streams the pipeline stage can process simultaneously.

A cool trick would be to have a scheduler that actually produces two instructions streams: one stream for the case that the latest pending branch is taken, and one for the not-taken case. That reduces the need for speculative execution. In other words, a scheduler that allows the issue stage to have a “branch width” of 2.
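A minimal sketch of that two-stream idea (names and structure are mine, purely illustrative): the front end builds both candidate issue streams past the oldest pending branch, and the issue stage picks one at the last possible moment, once the branch outcome is known.

```python
# Hedged model of an issue stage with "branch width" 2.

def build_streams(pre_branch, taken_path, not_taken_path):
    # Both streams share the instructions older than the pending branch.
    return (pre_branch + taken_path, pre_branch + not_taken_path)

def issue_stage(streams, branch_taken):
    taken_stream, not_taken_stream = streams
    # The losing stream is discarded without ever being issued, so nothing
    # from it needs speculation resources past this point.
    return taken_stream if branch_taken else not_taken_stream
```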

Here is another suggestion regarding the “register renaming“ image (on the Register Renaming page).

I agree that it is a good idea to have each unit have its own local registers.
However, there is a problem with the image on that page.

In my opinion, the unit-local registers should store the unit's output values.

In the image, the registers are marked as "out regs", but drawn as input regs.
Ah, silly me, I get the image now... it actually draws outputs of each unit as being available on the bus for any other unit to access... therefore requiring quite a wide and long bus.

So, why is it better to have outputs stored in local registers? Because you may not know immediately where their values should be forwarded: to which other unit, or to L2 register file, or somewhere else.

Then you can use multiplexers to reduce the bus width, because in each cycle it is likely that only several values need to be transferred between units.

Anyway, I think that the number of local registers per unit should be low... probably 4 output registers, but maybe even as low as 2. The expectation is that those registers are mostly free, because their values would be quickly copied to the L2 registers whenever a write port to L2 is free.
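Here's a toy model of that drain behavior (my model, not anything from the actual design): each unit holds a handful of pending output values, and per cycle at most as many of them as there are free L2 write ports get copied out.

```python
# Hedged sketch: draining unit-local output registers into the L2
# register file, limited by the number of L2 write ports per cycle.

def drain_cycle(units, l2, write_ports=1):
    """units: per-unit lists of pending (l2_index, value) results."""
    used = 0
    for pending in units:
        while pending and used < write_ports:
            idx, val = pending.pop(0)   # oldest result first
            l2[idx] = val
            used += 1
    return used  # write ports consumed this cycle
```

With only 2-4 output registers per unit, a unit only stalls if it produces results faster than the ports can drain them for several cycles in a row.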

If there are a lot of local registers per unit, that would just slow down the clock rate, because more registers = more multiplexing required.

Or you can simply keep the number of "units" low... therefore, max. 4 units.

Or a combination of the two approaches. For example, the current image shows 4 units with 4 bus "tracks". You could change that into 6 units with 4 bus tracks, and make it so that each unit can select one of two output tracks. Or similar.

Regarding my previous post, just to make it more clear: each of those two “issue streams” may contain speculative instructions. Only instructions from one of the “issue streams“ are issued; the other stream gets discarded.
The point is to reduce the latency (and speculation resources) of a branch instruction by allowing the “which of the two streams“ decision at the very last moment, i.e. at the issue stage.

I spent some time searching for a good description of register renaming that doesn't spend most of its time talking about Tomasulo's algorithm (which we're not using); this PDF is probably the best one I've seen so far. Here is a register renaming demo that runs in your browser using Wasm, which I wrote several years ago; you can modify the input to change the pipeline's properties: https://web.archive.org/web/20240716103445/https://ftp.libre-soc.org/power-cpu-sim/

I want to concentrate on building a high-performance CPU with register renaming, wide superscalar execution, and speculative execution. Based on previous experience, building something simpler first doesn't actually help much, since the architecture would be quite a bit different. I want to avoid falling into the trap of building a simple CPU and then never actually getting the high-performance version implemented.

actually, modern CPUs go in the other direction: they use branch prediction to guess/learn which direction branches go, so they can just follow that and fetch, decode, and then speculatively execute hundreds of instructions -- this allows them to hide a lot of the latency of cache misses by running all the following instructions that don't depend on whatever load missed the cache, thereby improving performance substantially.

Yes, i know. And I think that my dual-branch-stream suggestion is probably too complicated for this project.
However:

  • "deep speculation" doesn't actually work well on benchmarks like 7zip, because it has a lot of unpredictable branches. You can easily see how badly modern CPUs perform on that benchmark.
  • Libre-chip won't be able to speculate as deeply as those other modern CPUs, because SPECTRE resistance brings a lot of additional limits to speculative execution.

But anyway, I agree that my "two-streams" idea is probably not suitable for this project.

not as much as you might think. I'm expecting the design to have somewhere around the following (we may need to scale this down to fit in our FPGAs; the design will be parameterized such that you can easily change how many we have of each part of the CPU):
a fetch width of 128 bits/cycle; decoding of 4 simple PowerPC instructions per cycle into internal μOps; renaming of 4 instructions per cycle; 5-6 execution units, each with a queue of around 16 instructions and around 8 or 16 local registers for inputs from each execution unit (so that'd be at most 16 × 6 = 96 input registers per execution unit, assuming we use the distributed register file idea); an L2 register file with >128 registers (since the μOps have 128 addressable registers); and a retire queue with probably around 100 entries.
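Collected into one place, the parameterization might look something like this (all names are mine, not from any actual project code; just a sketch of what "easily change how many we have of each part" could mean):

```python
# Hypothetical parameter bundle for the core described above.
from dataclasses import dataclass

@dataclass
class CoreParams:
    fetch_bits_per_cycle: int = 128
    decode_width: int = 4          # simple PowerPC insns -> internal uOps
    rename_width: int = 4
    execution_units: int = 6
    unit_queue_entries: int = 16
    unit_local_regs: int = 16      # per-unit output registers visible on bus
    l2_registers: int = 128        # at least 128 (uOps address 128 regs)
    retire_queue_entries: int = 100

    def max_input_regs_per_unit(self):
        # Each unit can read the local registers of every unit: 16 * 6 = 96.
        return self.unit_local_regs * self.execution_units
```

Scaling down for an FPGA would then just mean constructing `CoreParams` with smaller values.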

hmm, looking at non-hyperthreaded CPUs to simplify things:

7-Zip Compression 24.05 (insns/clock):

  • Intel 285K: 125 Ginsns/s / 24 cores / 5.1 GHz = 1.02 insns/clock
  • Snapdragon X1E78100: 57.4 Ginsns/s / 12 cores / 3.4 GHz = 1.41 insns/clock
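The per-core, per-clock arithmetic above, for reference:

```python
# Reproducing the insns/clock numbers from the bullets above.
def insns_per_clock(ginsns_per_s, cores, ghz):
    # Aggregate instruction rate divided evenly over cores, then by clock.
    return ginsns_per_s / cores / ghz
```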

actually, comparing the 9700X vs. 9800X3D will show how much of it is cache misses rather than bad branch prediction:

  • 9700X: 113 Ginsns/s
  • 9800X3D: 125 Ginsns/s

so I think most of the difference is cache misses rather than branch prediction since 7-Zip uses a large compression window (afaict 16MB by default for LZMA2) that doesn't fit in the L1/L2 caches.

I don't see how the benchmarks you posted are relevant:
9700X vs. 9800X3D shows an 11% difference when a large cache is added, which is not much of a difference.

  • The 7zip benchmark actually tests several dictionary sizes, from 4 MB to 32 MB (corresponding to powers of two, 2^22 to 2^25). OpenBenchmarking lumps those together.

  • Only the single-threaded case is relevant for our discussion, so the test should be performed with "7zip b -mmt1", which it probably wasn't.

  • An old version of 7zip should be used, like 16.02, because newer versions contain different code paths for different CPUs.

  • To find out gains afforded by deep speculation, the best way is to compare a pre-2010 CPU with a modern CPU. If you do that, you'll find that the speed difference basically equals the difference in the clock rate.

Here is some data of mine for 7zip 16.xx:

7zip 16.02 results at:
https://github.com/ThomasKaiser/sbc-bench/blob/master/Sorted-Results.md#7-zip-mips-single-threaded

Other 7-zip benchmark across architectures:
https://www.7-cpu.com/
                                     Intel Pentium      @  100 MHz:           64            62
                                     Intel Pentium MMX  @  200 MHz:          130           120
             Intel Pentium II (250 nm, Deschutes) (P6)  @  350 MHz:          290           300
                                      AMD K6-2          @  500 MHz:          260           440
                  Intel Celeron (130 nm, Tualatin) (P6) @ 1200 MHz:          760           980
            Intel Pentium III-S (130 nm, Tualatin) (P6) @ 1400 MHz:          980          1250
                                   AMD  E-350  (Bobcat) @ 1600 MHz:         1120          1480
                                   AMD A8-6410 (Jaguar) @ 1800 MHz:         1450          1650


                Raspberry Pi 5 Model B Rev 1.0 = Cortex-A76 @ 2400 MHz:     2996   2994   3333   3332
                Raspberry Pi 3 Model B Rev 1.2 = Cortex-A53 @ 1200 MHz:      704    703   1250   1250
                            Amlogic S905, 32-bit Cortex-A53 @ 1536 MHz:      880          1600
                            Amlogic S905, 64-bit Cortex-A53 @ 1536 MHz:      860          1420
                arm64, gcc-6 -O2, Snapdragon 855 Cortex-A55 @ 1780 MHz:     1120          1830
                arm64, gcc-6 -O2, Snapdragon 855 Cortex-A76 @ 2840 MHz:     2830          3360
                arm64, clang-4.0 -O3,    (Twister) Apple A9 @ 1850 MHz:     2480          2430
                                      AMD Athlon 64 X2 (K8) @ 2000 MHz:     1800          2080
                             --- Intel i7 3770 (Ivy Bridge) @ 3400 MHz:     4200          3760
    (7-zip v16-04, Win7, 64-bit) AMD Phenom II X4 955 (K10) @ 3200 MHz:     3121   3101   3345   3344
              RISC-V 64 / SpacemiT k1-x deb1 / SpacemiT X60 @ 1600 MHz:      714    713   1245   1243
              RISC-V 64 / DC-ROMA            / SpacemiT X60 @ 2000 MHz:
              RISC-V 64 / SiFive FU740       /          U74 @ 1200 MHz:      844          1108

    Turbo Boost disabled, Nehalem-C Intel i5-650 (Westmere) @ 2670 MHz:     3100          2600
              Turbo Boost disabled, Intel i7 4770 (Haswell) @ 3400 MHz:     4000          4000
THPon, v16.02, gcc-6 -O3, Intel Xeon E5-2699 v4 (Broadwell)@ 3600 MHz:     5100          3900
          Linux, "Kaby Lake-S" (14 nm), Intel Pentium G4600 @ 3600 MHz:     5572          3692

                              Linux, v16.02, Ryzen 3 2200G @      MHz:     3511          3422
               Linux (THP off), gcc-6 -m64 -O3, Ryzen 1700X @ 3900 MHz:     4210          3900
                                    p7zip 16.02, Apple A12Z @ 2500 MHz:     4410          3640

for CPU performance, that's actually a pretty large difference

All of those are bigger than the per-core L2 cache size (1MB), so they are limited by the L3 cache's speed.

yes, but since those benchmarks aren't testing in single-threaded mode, you can get an ok-ish value by dividing by the number of cores and by the clock speed, when using only single-thread-per-core CPUs.

the exact same program also takes a different number of instructions on Arm vs. x86. also, the C library (which doesn't change when you use a different 7-zip version) does a bunch of processor-detection and selecting different implementations for things like memcpy or memchr. So, you can only compare MIPS numbers across different [micro]architectures as a ballpark estimate.

pre-2010 x86 CPUs also have pretty deep speculation, e.g. Intel's Nehalem microarchitecture has a 128-entry reorder buffer, which allows it to speculate to around that depth. For comparison, Zen 3 has a 256-entry buffer.

for CPU performance, that's actually a pretty large difference

Perhaps, but I'm in fact completely disputing that result because:
- a) it was not performed with the "-mmt1" switch (single-threaded)
- b) you are assuming that the difference is due to cache, which is contradicted by most of the other results that I have. I would say that the difference is due to different thermals between the models, because the CPUs were stressed with a full 8 threads in that benchmark. Therefore, I consider the benchmark result you posted to be completely irrelevant.

All of those are bigger than the per-core L2 cache size (1MB), so they are limited by the L3 cache's speed.

That assumption of yours is easily contradicted by the K10 (Phenom II / Athlon II) results. The only difference between those models is that the Athlons lack the (6 MB) L3 cache completely, but that doesn't affect the 7zip benchmark by much.

yes, but since those benchmarks aren't testing in single-threaded mode, you can get an ok-ish value by dividing by the number of cores and by the clock speed, when using only single-thread-per-core CPUs.

No, all the benchmarks I posted use a single-threaded mode.
Furthermore, you can easily see (link below, for many CPUs) that running the benchmark in the multithreaded mode affects the results in an unpredictable way.
So, your assumption of "ok-ish value" is invalid.

[ sbc-bench/Sorted-Results.md at master · ThomasKaiser/sbc-bench · GitHub ]

the exact same program also takes a different number of instructions on Arm vs. x86. also, the C library (which doesn't change when you use a different 7-zip version) does a bunch of processor-detection and selecting different implementations for things like memcpy or memchr. So, you can only compare MIPS numbers across different [micro]architectures as a ballpark estimate.

OK, so let's take just x86-64.
The Athlon II X3 420e (2600 MHz) scores 2566 (2574 / 2558).
Find me an x86 CPU that scores more than 7000 (i.e. disproportionate to the clock speed when compared to a 15-year-old Athlon II).

BTW, Athlon II X3 420e has only 512 KB L2 per core, and lacks L3 completely. You are free to pick a modern CPU with any L3 cache size.

[ sbc-bench/results/4eOo.txt at master · ThomasKaiser/sbc-bench · GitHub ]

pre-2010 x86 CPUs also have pretty deep speculation, e.g. Intel's Nehalem microarchitecture has a 128-entry reorder buffer, which allows it to speculate to around that depth. For comparison, Zen 3 has a 256-entry buffer.

Compared to modern CPUs, the Athlon II can't speculate very deeply (and it doesn't need to). Its usual throughput is about 1 instruction per cycle. Modern CPUs can do about 2 instructions per cycle on average. That's why modern CPUs are faster (since the clock rate has not increased by much).