Some background:
Modern CPUs that use register renaming are running into limitations where they have trouble fetching/decoding/register-renaming/etc. a large number of instructions per clock cycle because it's really complex to cram all of the register renaming into a single clock cycle (since the next batch of instructions to be renamed needs the results of the current batch).
It just occurred to me that if we can come up with a register renaming algorithm that has less than O(n) latency, for example something shaped like a parallel prefix sum, then we can eliminate the latency bottleneck in register renaming by instead making it take 2-3 cycles and just rename more than 2-3x as many instructions, that way the CPU will e.g. fetch and decode 16 instructions each cycle, and then register rename and do execution unit assignments in batches of 32 instructions where each batch takes 2 cycles (taking advantage of the O(log N) latency of parallel-prefix-sum-style algorithms), and then sending those instructions to the execution units split back into 16 instructions per cycle. That way we can build a CPU with crazy wide throughput without having to cram all the complex management logic into one cycle, allowing us to still have high clock frequencies.