Responding to NLnet's questions for grant proposal 2024-12-324

The proposed reply is below:

On Sat, May 10, 2025 at 10:21 AM Michiel Leenaars michiel@nlnet.nl wrote:

Could you provide a slightly more breakdown of the main tasks, and the associated effort?

I'm removing project management and hardware costs from the budget as suggested, as well as adjusting the budget for the work we've completed so far.

The new budget:

Adding Rocq output in Fayalite

€ 6000
This is for adding the code that translates Fayalite HDL to Rocq, as well as determining exactly how we'll describe HDL in Rocq. I expect the translation code to be comparable in size to the compiler portion of the simulator, so somewhere around 5000 lines of code. (The simulator is broken into three main parts: a compiler to an IR optimized for interpretation, the interpreter itself, and the code for reading/writing simulator I/O and handling time simulation.)

Adding supporting code for generating FPGA bitstreams from Fayalite

€ 4000
This is for adding the tooling that runs all the right programs to generate FPGA bitstreams, as well as code to handle connecting I/O ports to the FPGA pins. I expect the FPGA pin code to be on the order of 1000-2000 lines of code, plus a few hundred more for running the programs in sequence.

Register Renaming, Execution, and Instruction Retire

€ 10000

This covers getting register renaming working, as well as scheduling, executing simple ALU and Branch instructions, and properly handling instruction retire. (Some of that work is already done.)

A lot of this is the work of coming up with a detailed low-level plan for the CPU, so I don't have a precise idea of how complex it is, though I expect it to be roughly 40% of the CPU's overall complexity.

Instruction Fetch/Decode

€ 8000

This covers instruction fetch, decoding, and caching. For the decoder, unless OpenPower has gotten around to releasing the LaTeX source code, I'm expecting to use a parser I wrote that extracts the instruction descriptions from the PowerISA v3.1C PDF and writes them out as XML.

Load/Store Instructions

€ 10000

This covers implementing the load/store hierarchy, including an L1 cache. For now, the CPU will only target on-FPGA memory blocks and simple I/O devices. (Support for DRAM can be added later, outside of this grant.)

Work Towards the Formal Proof of No Spectre Bugs

€ 12000

This covers working on the formal proof of no Spectre bugs, as well as improvements to other software/hardware needed for it. A major portion of this is figuring out exactly what properties we need to formally prove for each part of the CPU, so that we can put those parts together into a working proof for the entire CPU. I'm hoping this will be enough work to get nearly all of the proof written out, even if we discover some flaws whose fixes we'll have to put off to a later grant.

We plan on working with the OpenPower Foundation for any custom ISA extensions we may need.

Maybe we missed this while reading the proposal, but reading between the lines are we correct in concluding that the core instruction set of is to be based on the OpenPower ISA?

Yes, sort of. The CPU will have its own internal instruction set, and there will be a dedicated hardware decoder that translates PowerISA to that internal instruction set. The internal ISA is designed to be more general than just PowerISA: as a later addition outside of this grant proposal, I'd like the CPU to also support running x86_64, RISC-V, etc. by translating to the internal ISA using an FPGA-style programmable decoder and a µOp cache.

More details: Proposal for Libre-Chip's First CPU Architecture

Jacob Lifshay formulated a way to formally prove that a CPU with speculative execution doesn't suffer from any speculative-execution data leaks

Could you provide a bit more detail, so we can understand where this claim comes from?

It's an idea I came up with a few years ago, but didn't pursue at the time.

I have thought through it in enough detail to convince myself it can work, but haven't tried to actually implement it yet.

Is this proof method already published or implemented somewhere?

No, though I did come up with a high-level plan, which I'll write out in some more detail here:

Basically, the idea is to track throughout the CPU which instructions are speculative, and to construct a near-identical copy of the CPU that is a mathematical model. Because it is a mathematical model, it can look into the future: when instructions first enter the CPU, it already knows whether they will eventually be canceled, and it forces the speculative results of those doomed instructions to zero (or some other appropriate value that obviously can't leak information). Then we write a proof, module by module, that this model is equivalent to the original version of the CPU without the time travel. At the whole-CPU level, equivalence is defined as: for every clock cycle (past, present, and future), all externally visible signals from the CPU must have identical values between the two models.
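To make the time-travel idea a bit more concrete, here is a toy sketch in plain Rust (not Fayalite, and not the actual proof). It models a "real" pipeline that speculatively executes everything and squashes canceled instructions at retire, next to an "oracle" model that knows up front which instructions will be canceled and forces their results to zero before execution. All names and the stand-in ALU operation are made up for illustration; the only point is the equivalence check on externally visible (retired) values.

```rust
// Toy sketch of the proof obligation, not the actual CPU or proof.

#[derive(Clone, Copy)]
struct Inst {
    operand: u32,
    // In real hardware this is only known later; the oracle model is
    // allowed to "look into the future" and read it immediately.
    will_cancel: bool,
}

// Stand-in ALU operation, purely for illustration.
fn execute(operand: u32) -> u32 {
    operand.wrapping_mul(3).wrapping_add(1)
}

// Real model: compute every result speculatively, squash at retire.
// The externally visible signal per cycle is the retired value (or None).
fn real_model(prog: &[Inst]) -> Vec<Option<u32>> {
    prog.iter()
        .map(|i| {
            let speculative = execute(i.operand); // computed even if doomed
            if i.will_cancel { None } else { Some(speculative) }
        })
        .collect()
}

// Oracle model: force doomed instructions' results to zero before execute,
// so transient results obviously cannot carry information.
fn oracle_model(prog: &[Inst]) -> Vec<Option<u32>> {
    prog.iter()
        .map(|i| {
            let result = if i.will_cancel { 0 } else { execute(i.operand) };
            if i.will_cancel { None } else { Some(result) }
        })
        .collect()
}

fn main() {
    let prog = [
        Inst { operand: 7, will_cancel: false },
        Inst { operand: 9, will_cancel: true }, // e.g. a mispredicted path
        Inst { operand: 2, will_cancel: false },
    ];
    // The equivalence to prove: cycle by cycle, all externally visible
    // signals match between the two models.
    assert_eq!(real_model(&prog), oracle_model(&prog));
    println!("models agree on all externally visible signals");
}
```

In the real proof this equivalence would be stated per module over actual hardware signals and proved in Rocq, rather than checked on one test program.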

How does it compare to methodologies such as [1], next to the two papers mentioned by yourself?

[1] https://dial.uclouvain.be/pr/boreal/object/boreal:264203

That website is inaccessible for both Cesar and me; I was able to find the abstract on web.archive.org, but not the actual paper. It seems to be about masking the analog aspects of digital circuits, such as timing within a clock cycle and glitches. Our proof doesn't address those, since it assumes that all output signals go through flip-flops before becoming visible to external communications, so only timing at the resolution of a clock cycle matters.

hence why the goal is making substantial progress instead of having the proof 100 % complete and working.

That caveat does makes the goals of this effort fairly vague. Could you at least describe the high level approach you will be taking (perhaps this automatically comes with the answer to the previous question)?

(Note BTW that Coq was renamed to Rocq Prover two years ago, the way the name was pronounced made some people take the project less serious than it deserved..)

Ah, I wasn't aware of that. The rename seems to have only happened in their latest major release, which came out after I submitted the grant proposal.

Assume that to leak data, you must be able to observe some change in the digital signals going from/to the CPU core that depends on the results (defined to include any side-effects that change any state, such as cache state changes or branch predictor changes) of some transient instructions (instructions that were speculatively executed and then canceled).

If this assumption is broken, what guarantees does this proof method still give?

Then it doesn't guarantee a whole lot, though I expect that assumption to hold for basically all intentional external communication (over some defined digital bus) between a computer and the outside world.

This excludes things like the chip's internal signals generating radio waves, or power measurement, since those require a completely different design process that intentionally takes them into account at every design level. They are also not generally visible through remote communication protocols (e.g. over a network); they require direct physical access, at which point an attacker could usually just take out the SSD and put whatever code they wanted on it.

What about long-term effects of transient instructions, i.e. not immediately returning a result but instead having another effect such as power consumption or delayed synchronization?

Power consumption is out of scope for the proof. Synchronization here amounts to timing, so it should be covered by the proof.

There are also other ways to leak data through side-channels, for example through timing or synchronization, voltage drops, or by (in)directly reading out the values of unspecified wires.

All timing at clock-cycle resolution is covered by the proof. If you're reading the CPU's internal wires (those not intended for communication with the external world), or measuring power (voltage drops), then the proof doesn't cover that.

Is side-channel resistance something you are taking as a design requirement?

Timing side-channels -- yes, since those are visible over a network. Power, radio waves, electron-microscope images of the chip while it's running, etc. -- no, since those require physical access.

and construct a theoretical equivalent that has the additional feature of looking into the future to determine which instructions will be canceled and preemptively forces their results to be zeros

Is this modeled using prophecy variables (see e.g. [2])?
[2] https://link.springer.com/content/pdf/10.1007/s10703-021-00377-1.pdf

Not intentionally. I'd have to spend a lot more time researching them to know for sure, but I think probably not.

How does this compare to non-interference properties for speculative execution, such as described by Jhala1 and McMillan [3]
[3] https://www.mcmil.net/pubs/CAV01.pdf

As far as I can tell, our proof is more general in that it proves that speculative execution that gets canceled can't change externally visible timing or signal values, rather than only proving that it doesn't produce wrong results.

Will there be any collaborators to the project from Europe, which would give a clearer EU dimension to this project?

Not at this moment. There were some others who were interested, but nothing came of it (probably because I waited too long to apply for this grant).

Is Fayalite as expressive as the dynamically typed system for composing circuits you previously worked with, such as nmigen2 (as well as its reincarnation Amaranth,

As far as I can tell, Fayalite is more expressive in some ways (it includes hardware-level enums, i.e. tagged unions) and less expressive in others: it doesn't automatically thread signals across module boundaries when you refer to a child module's internals from outside it. I think that's actually a benefit in most cases, since it encourages better modularization and respecting privacy boundaries.

and alternatives like Py2HWSW [4]? Or are some circuits not (easily) expressible due to the strict(er) typing discipline?

Fayalite can express everything you can write in standard Verilog and more (it also lets you use external modules written in Verilog, if you run into something you need that for or just want to use an existing Verilog project). I think it has better ease of use, but that's a matter of opinion.

[4] NLnet; Py2HWSW

I don't think Py2HWSW is suitable for our purposes, since it appears to be a fancy wrapper around a bunch of Verilog snippets, so it comes with the restrictions Verilog imposes, and it seems rather verbose.

Is there a strategy to get other people to use Fayalite (and at some time contribute), and build enough critical mass?

I haven't thought about that all that much, but I think Fayalite fills a compelling niche: a language embedded in Rust that doesn't have all the annoying restrictions of Verilog (which, last I checked, the only other Rust HDL library I'm aware of still carries over). Plus, if we develop a working CPU using Fayalite, I think that will help it be seen as a useful language, similar to Chisel and the first RISC-V CPUs.

There are meanwhile quite some nice HDL's in play - next to Chisel which you already mention, there is SpinalHDL [5], Spade [6], Clash [7], etc. Obviously, there is room for different options but there is a high cost of developing and maintaining 'picks and shovels'

I'll mention that Fayalite is essentially ready for writing the CPU design in (in fact, the CPU is already maybe 10-15% complete). Fayalite supports formal proofs using SymbiYosys, has a working simulator (implemented since we submitted the grant proposal), and has working Verilog generation. The main thing Fayalite still needs is generating a Rocq translation of the HDL, which is required for the grant's main proof, since, from my past experience using them, SymbiYosys-style automated theorem provers aren't capable enough.

over and over again - where you actually want to build chips. Could you reflect on these HDL's and explain the reason(s) for developing your own rather than contribute to any of those larger efforts?

[5] NLnet; SpinalHDL, VexRiscv, SaxonSoc

We decided not to use Chisel or SpinalHDL because:

  • We found those quite difficult to set up due to being written in Scala.
  • The wide variety of custom operators makes Chisel much harder to understand.
  • I am much more fluent in Rust than in Scala or Haskell, which will save a lot of time when implementing the translator to Rocq.

[6] NLnet; Spade

We decided not to use Spade because it doesn't seem to let you do arbitrary computation while generating HDL (e.g. parsing instruction descriptions from the PowerISA PDF or XML and translating them to HDL). Also, Fayalite lets you use Rust's powerful generics and traits system (Spade doesn't have those yet, though they're working on it), as well as all existing Rust code (or code in other languages, such as C, through FFI).

[7] https://clash-lang.org

I think Clash is not suitable for our project because Fayalite is syntactically much closer to traditional hardware description languages and imperative programming languages, which makes it much easier to understand for most hardware designers and programmers than Haskell (this is particularly true for Jacob, who has not programmed in Haskell).

Thank you very much for your timely reply. Could you get the answers to us by Monday next week?

The reason we decided not to use Chisel or SpinalHDL is because we found those to be quite difficult to set up due to being written in Scala. Additionally the wide variety of custom operators make Chisel much harder to understand. By contrast, Rust is generally quite easy to set up (one command on Linux using rustup).

I feel that "Scala being difficult to set up compared to Rust" is maybe a weak argument to justify implementing a whole new HDL from scratch. Maybe, in addition, point out that you are more fluent in Rust than in Scala and Haskell, which is going to save time and effort when implementing the necessary extensions to the language to emit Rocq?

Also, since NLnet does fund those projects, maybe clarify that we do not question their merits, but we find they are not such a good fit to this project?

that website is inaccessible for me, I was able to find the abstract on web.archive.org, but not the actual paper

I can't access it either. Agreed that power and glitches are out of scope.

I'd avoid the remark of your dad losing his job over job cuts (sorry), unless you feel it makes a strong point.

I assume you will proof read the final text to have capitalized first words in all sentences, and maybe avoid Internet jargons like "afaict".