As the article mentions, there was a major change between 30 and 31, where the SETEIENUM/SETEIPNUM and CLREIENUM/CLREIPNUM CSRs were removed. The GitHub issue says they were removed because the architecture review board asked for them to be removed.
A manufacturer can use a draft spec, but it may be rendered outdated sooner rather than later.
I don't think it's as cut and dried as "execute sfence.vma every time you write to SATP," as you say, for the following reasons.
1. The sfence.vma is much like a TLB flush. Just like a normal fence, it ensures that prior writes to the page tables are ordered before later implicit address-translation reads; the privileged spec says this in section 4.2.1. In fact, a parenthetical tells us that "The SFENCE.VMA is used to flush any local hardware caches related to address translation."
2. The sfence.vma takes two register operands, rs1 (a virtual address) and rs2 (an ASID), and either can be x0: sfence.vma x0, x0; sfence.vma x? (!= 0), x0; sfence.vma x0, x? (!= 0); or both non-zero. This lets us flush a single page, a single address space, or everything from the translation lookaside buffer (TLB), so that stale translations are dropped and the newly updated page tables take effect (see the sketch after this list).
3. You don't have to execute sfence.vma every time you write to SATP; doing so would thrash the TLB on every context switch. Please see page 66 of the privileged spec (20190608 and more recent versions).
4. If you look at the original tutorial this blog post is based on, you can see that I use sfence.vma with the ASID, which is the process ID. That way, the address space and process ID are equivalent, so when I transfer the process to another hart, I can execute sfence.vma x0, x? where x? contains the ASID. If I add a mapping, such as through sbrk or mmap, I can execute sfence.vma x?, x0 (or x? paired with the ASID) where x? contains a virtual address in the specific page I want updated.
5. If we use PMP (physical memory protection), the specification tells us to execute sfence.vma x0, x0 after changing PMP settings, since the hart may cache or speculate on PMP checks before we use those particular addresses (see section 3.6.2 in the privileged spec).
6. The code I show updating the SATP is the same code we used in the original tutorial: https://osblog.stephenmarz.com. In this case, I use sfence.vma when I update a mapping, transfer a process to another hart, or when I create a new process.
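To make the forms above concrete, here is a minimal sketch in Rust (the language of the original tutorial), assuming a bare-metal riscv64 target where core::arch::asm! is available; the function names are mine for illustration, not taken from the spec or the tutorial:

    use core::arch::asm;

    /// sfence.vma x0, x0: order all prior page-table writes and flush
    /// every cached translation, for every address space.
    #[inline]
    pub fn sfence_vma_all() {
        unsafe { asm!("sfence.vma zero, zero") };
    }

    /// sfence.vma x0, x?: flush only the address space identified by `asid`
    /// (in my case, the process ID). Useful when a process moves to another hart.
    #[inline]
    pub fn sfence_vma_asid(asid: usize) {
        unsafe { asm!("sfence.vma zero, {0}", in(reg) asid) };
    }

    /// sfence.vma x?, x?: flush one page in one address space. `vaddr` is any
    /// virtual address inside the page whose PTE was just changed (mmap/sbrk).
    #[inline]
    pub fn sfence_vma_page(vaddr: usize, asid: usize) {
        unsafe { asm!("sfence.vma {0}, {1}", in(reg) vaddr, in(reg) asid) };
    }

So after an mmap or sbrk adds a page, something like sfence_vma_page(vaddr, pid) is enough, and none of this has to run on an ordinary context switch.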
Maybe I'm wrong, but my reading of the SFENCE.VMA instruction, combined with the fact that it has multiple forms and can shoot down either an entire address space or a single page from the TLB, tells me that the rule isn't "execute sfence.vma every time SATP is written."
In conclusion, I don't use sfence.vma every time I write to SATP. Instead, I use it when there is a memory ordering issue, such as starting a new process, updating a mapping, or transferring a process to another hart. In particular, we wouldn't do this in the code that switches us to a userspace process, since that code runs on every context switch.
Here's the specification's recommended use of SFENCE.VMA, which is just a fancier version of what I wrote above (a sketch of case 2 follows the quote):
Page 66 and 67:
The following common situations typically require executing an SFENCE.VMA instruction:
1. When software recycles an ASID (i.e., reassociates it with a different page table), it should first change satp to point to the new page table using the recycled ASID, then execute SFENCE.VMA with rs1=x0 and rs2 set to the recycled ASID. Alternatively, software can execute the same SFENCE.VMA instruction while a different ASID is loaded into satp, provided the next time satp is loaded with the recycled ASID, it is simultaneously loaded with the new page table.
2. If the implementation does not provide ASIDs, or software chooses to always use ASID 0, then after every satp write, software should execute SFENCE.VMA with rs1=x0. In the common case that no global translations have been modified, rs2 should be set to a register other than x0 but which contains the value zero, so that global translations are not flushed.
3. If software modifies a non-leaf PTE, it should execute SFENCE.VMA with rs1=x0. If any PTE along the traversal path had its G bit set, rs2 must be x0; otherwise, rs2 should be set to the ASID for which the translation is being modified.
4. If software modifies a leaf PTE, it should execute SFENCE.VMA with rs1 set to a virtual address within the page. If any PTE along the traversal path had its G bit set, rs2 must be x0; otherwise, rs2 should be set to the ASID for which the translation is being modified.
5. For the special cases of increasing the permissions on a leaf PTE and changing an invalid PTE to a valid leaf, software may choose to execute the SFENCE.VMA lazily. After modifying the PTE but before executing SFENCE.VMA, either the new or old permissions will be used. In the latter case, a page fault exception might occur, at which point software should execute SFENCE.VMA in accordance with the previous bullet point.
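For case 2 above (no ASIDs, or ASID 0 everywhere), here is a minimal sketch of what that looks like, under the same bare-metal Rust assumptions as the earlier sketch; note that the zero is deliberately placed in a general register rather than passed as x0, so global (G-bit) translations survive:

    use core::arch::asm;

    /// Case 2: no ASIDs, so fence after every satp write. rs2 is a real
    /// register holding zero, NOT x0, so global translations are kept.
    pub fn write_satp_without_asids(satp_value: usize) {
        unsafe {
            asm!(
                "csrw satp, {satp}",
                "sfence.vma zero, {z}",
                satp = in(reg) satp_value,
                z = in(reg) 0usize,
            );
        }
    }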
"A consequence of this specification is that an implementation may use any translation for an address that was valid at any time since the most recent SFENCE.VMA that subsumes that address. In particular, if a leaf PTE is modified but a subsuming SFENCE.VMA is not executed, either the old translation or the new translation will be used, but the choice is unpredictable. The behavior is otherwise well-defined."
This is not the "particular" case mentioned. Essentially, what is "subsumed" for satp is the entire address space (you have to think of satp as simply being a non-leaf PTE).
To be fair, I think there are issues in this area that may force a double pipeline flush in some implementations.
I don't understand how you get that SATP is a non-leaf PTE. SATP is a register, and a fence is for memory ordering. A register is a particular kind of memory, but in this context, I believe the authors are talking about the page tables in RAM, not register state.
The SFENCE.VMA instruction is used to enforce in-memory ordering, meaning that prior stores to the page tables become visible to address translation and any stale cached translations are invalidated so the MMU knows not to rely on its cache, as seen in the spec here: "The supervisor memory-management fence instruction SFENCE.VMA is used to synchronize updates to in-memory memory-management data structures with current execution." The keyword that sticks out in my mind is "in-memory". The SATP register is a register, and is not in-memory.
If a write to the SATP register requires a fence, then it should do so, much like how writing to the CR3 register on Intel/AMD x86-64 forces a flush. However, this is specifically not the behavior the authors of the specification went for, to avoid one of the biggest problems with flushing every time: TLB thrashing. Frequent context switches, such as with Linux's 1000 Hz tick, would mean that a larger TLB is no help, since every context switch--even into the kernel--would force a TLB flush. Furthermore, that would nullify the five cases the specification lays out as "typically" requiring an SFENCE.VMA.
Additionally, the specification makes clear that the reason they chose this was to improve context-switch performance: "We store the ASID and the page table base address in the same CSR to allow the pair to be changed atomically on a context switch. Swapping them non-atomically could pollute the old virtual address space with new translations, or vice-versa. This approach also slightly reduces the cost of a context switch." To me this means that a write to the SATP "register" itself takes effect immediately, whereas the page tables it points to (via the PPN) are ordinary memory and do not. Otherwise, this couldn't possibly be the case.
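To make that concrete, here is a sketch (same bare-metal Rust assumptions as above) of a context switch building and writing satp in a single CSR write; the field layout is the RV64 Sv39 one, with MODE in bits 63:60, ASID in 59:44, and the root page table's PPN in 43:0:

    use core::arch::asm;

    const SATP_MODE_SV39: usize = 8; // MODE field value selecting Sv39 on RV64

    /// Change address spaces by writing MODE, ASID, and the root page-table
    /// PPN in one csrw, so the pair changes atomically as the spec describes.
    /// No sfence.vma here: distinct ASIDs keep old and new translations apart.
    pub fn switch_address_space(root_table_paddr: usize, asid: usize) {
        let satp = (SATP_MODE_SV39 << 60)
            | ((asid & 0xffff) << 44)
            | (root_table_paddr >> 12); // PPN = physical address / 4096
        unsafe { asm!("csrw satp, {0}", in(reg) satp) };
    }

The fence only comes back into the picture in the situations the next quote describes: the new address space's page tables were modified, or an ASID is being reused.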
The spec goes on to state that "If the new address space's page tables have been modified, or if an ASID is reused, it may be necessary to execute an SFENCE.VMA instruction (see Section 4.2.1) after writing satp." There is nothing I can find in the specification that states that writing to the SATP register alone necessitates an SFENCE.VMA. I can also gather from context that this is on purpose, to preserve the TLB across context switches, which as far as I can tell is the main reason to use ASIDs in the first place.
I might be wrong, and there are a ton of issues in the GitHub repository for this specification asking for clarification on a number of other things. I'm not sure we can divine the authors' intent more than we've done here.
To my knowledge, and looking at http://riscv.org, this is supposed to be an open ISA (instruction set architecture). The specification allows chip manufacturers to add their own extensions using the "custom" opcodes.
The Kendryte K210 is a RISC-V-compliant CPU. It has off-core components, such as what they're calling a KPU. The ML and GPU cores are controlled via I/O, not by the CPU directly. These are called platform-level components. In general, this uses MMIO, with a hard-wired memory address used for control. You can see the KPU (their ML accelerator) here: https://s3.cn-north-1.amazonaws.com.cn/dl.kendryte.com/docum...
See section 3.2.
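For the MMIO point, here is a rough sketch of what driving such a platform-level component looks like in Rust; the base address and control-register offset below are made-up placeholders for illustration, and the real ones are in the K210 datasheet linked above:

    use core::ptr::{read_volatile, write_volatile};

    // Placeholder values for illustration only; the actual KPU base address
    // and register layout come from the K210 datasheet.
    const ACCEL_BASE: usize = 0x4000_0000;
    const ACCEL_CTRL: usize = 0x00;

    /// Platform-level components are driven through MMIO: volatile loads and
    /// stores to hard-wired physical addresses, not special CPU instructions.
    unsafe fn accel_write_ctrl(value: u32) {
        write_volatile((ACCEL_BASE + ACCEL_CTRL) as *mut u32, value);
    }

    unsafe fn accel_read_ctrl() -> u32 {
        read_volatile((ACCEL_BASE + ACCEL_CTRL) as *const u32)
    }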
I think the extensions are meant to be modular. Right now, not many embedded devices allow for H mode, and hypervisor support is still in development. Currently, I know of Machine mode, Supervisor mode, and User mode, but since the change from 2018 to 2019, they have really started to ramp up virtualization support in the ISA.
So RISC-V is open, but the extensions are often proprietary?
And what does virtualization even mean when extensions proliferate? Will you need a distinct physical machine for each combination of extensions that someone might want to virtualize?
Proprietary = custom... it just means that it's particular to a single chip manufacturer. They have set aside opcode space for doing just that. RISC-V can't forecast everything someone will want to do with a RISC-V chip. Instead, they promise not to use those opcodes so that they won't conflict with a chip manufacturer's "custom" instructions.
In terms of virtualization, most of the extensions can be emulated. This happens a lot with the hodgepodge of extensions bolted onto Intel/AMD, such as SSE, SSSE3, AVX, and so forth. Just because the underlying physical machine doesn't support something doesn't mean the guest can't have it. At the operating system level, the OS (running in, or assisted by, machine mode) can read the misa (Machine ISA) register to see which extensions are supported, and emulate those that are not. I don't think RISC-V solves the problems that virtualizing Intel/AMD also suffers from--and I don't think that's really the goal.
In terms of the host, if there is an extension that cannot be emulated, then yes, I would think you'd need the physical machine to be able to support it.
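Here is a minimal sketch of that misa check, with the same bare-metal Rust assumptions as before and the code running in machine mode, since misa is an M-mode CSR (it may also legally read as zero on some implementations):

    use core::arch::asm;

    /// Read the Machine ISA register. Bits 0 through 25 map to extensions
    /// 'A' through 'Z': bit 0 = A (atomics), bit 2 = C (compressed),
    /// bit 12 = M (multiply/divide), and so on.
    fn read_misa() -> usize {
        let misa: usize;
        unsafe { asm!("csrr {0}, misa", out(reg) misa) };
        misa
    }

    /// True if the hart reports single-letter extension `letter` (e.g. 'C').
    fn has_extension(letter: char) -> bool {
        let bit = (letter.to_ascii_uppercase() as u8 - b'A') as u32;
        read_misa() & (1usize << bit) != 0
    }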
The SDKs for the non-big-name (Nvidia, Google, AMD) accelerators have very poor developer experiences. (Granted, CUDA/ROCm isn't much better.) A number of the embedded SDKs require you to use their custom framework and won't support TensorFlow/Torch models out of the box. ONNX and similar conversion frameworks target only the big-name chips, not off-the-shelf generic AI accelerator chips. Embedded systems need more software engineering.
Do you mean something like a hardware board for educational purposes? I use the Sipeed Maixduino (or Maix Bit) for my classes. The problem with these boards is that there are a lot of esoteric, undocumented items.
For a well-documented hardware board, I would look at the SiFive HiFive1 (Rev B).
Since RISC-V is still relatively new, I'm sure more and more boards will start making it to market rather soon.
I think the easiest way to get started with RISC-V is to look at QEMU, which is an emulator. It can emulate the virtio bus, including graphics. I used this in my OSblog: http://osblog.stephenmarz.com which uses an emulated, 64-bit RISC-V CPU.
I'm writing this from the OS's perspective. Linux abstracts this away, but we have to do the work ourselves to get the same abstraction. You can see, under the user space portion, how simple it is to grab a framebuffer, which is then mapped into user space. Then we have full access to the RGBA values.
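As a small illustration of that last step, here is a sketch of what userspace does once it has the framebuffer; the slice, dimensions, and RGBA packing are assumptions for the example, not the exact interface from the osblog:

    /// Fill a rectangle in a 32-bit-per-pixel framebuffer that the kernel
    /// has already mapped into this process. `width` is the full scanline
    /// width in pixels; `rgba` is one packed pixel value.
    fn fill_rect(fb: &mut [u32], width: usize,
                 x: usize, y: usize, w: usize, h: usize, rgba: u32) {
        for row in y..y + h {
            for col in x..x + w {
                fb[row * width + col] = rgba;
            }
        }
    }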
I think until they get into the mass-production market, RISC-V will be fairly expensive compared to the RPi. Take a look at the full-stack HiFive Unleashed ($1000) or the cheaper HiFive1 (Rev B): https://www.sifive.com/boards
Well, the HiFive Unleashed is a dev board with tons of features for developers, a market where $1000 is a pretty okay price. It has DDR4 ECC RAM, for one.
The RPi is another story: it's a single-board computer with exposed GPIO, and it used to ship with an ancient GPU.
The HiFive1 is more of a competitor to the Arduino, if you can call it that. They even have the same pinout for expansion.
They are quite pricey, like you said, due to the early-adopter fee. However, you have to understand that SiFive develops its own cores and, unlike Arduino or the RPi, can't just slap on something that is already produced and ready to be mounted on a board.
I've been trying to use the Sipeed Maixduino. It's much like the HiFive1 except it uses the Kendryte K210, which is a fully supported dual-core RV64GC. It even supports supervisor and user modes. It's also relatively cheap. The problem is that the documentation is lacking, so I'm having to reverse engineer their BSP: https://maixduino.sipeed.com/en/
Wonder if this spec will make it easy for embedded systems to catch up. It always seems like they lag behind what's cutting edge. Maybe that's a cost/benefit analysis.
I have a brand-new design with DDR2. I can power the memory from an existing 1.8 V rail, with no need for more voltage regulators. And 400 MHz is totally OK for me, since I get the whole memory bandwidth to myself: no operating system, etc. And my application is very much cutting edge in its domain.
I loved this series. I even made a Rust version of it with very little compiler knowledge (the original is written in Python): http://github.com/sgmarz/ttrust