Hacker News | camel-cdr's comments

The link should probably have been to the actual Ubuntu blog: https://ubuntu.com//blog/canonical-and-ubuntu-risc-v-a-2025-...

Agreed; however, I'm quite sure 25,000 lines translated in "multiple months" is very "slow" for a naive translation between languages as similar as C++ and Rust.


The 1024-bit RVV cores in the K3 are mostly that size to feed a matmul engine. While the vector registers are 1024-bit, the two execution units are only 256-bit wide.

The main cores in the K3 have 256-bit vectors with two 128-bit wide execution units, and two separate 128-bit wide vector load/store units.

See also: https://forum.spacemit.com/uploads/short-url/60aJ8cYNmrFWqHn...

But yes, RVV already has more diverse vector width hardware than SVE.


It's a low-clocked (2.1 GHz) dual-issue in-order core, so obviously nowhere near the real-world performance of e.g. Zen 5, which can retire multiple 256-bit or even 512-bit vector instructions per cycle at 5+ GHz.

But I find the RVV ISA just really fascinating. Grouping 8 1024-bit registers together gives us 8192-bit, or 1-kilobyte, registers! That's a tremendous amount of work that can be done using a single instruction.
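As a scalar sketch (not real RVV intrinsics; `vlmax` and `stripmine_copy` are illustrative names), this is the strip-mining pattern a single LMUL=8 loop follows, where each trip covers up to VLMAX = VLEN*LMUL/SEW elements, so 1024 bytes per instruction at VLEN=1024, LMUL=8, SEW=8:

```c
#include <assert.h>
#include <stddef.h>

/* Elements covered by one vector instruction: VLMAX = VLEN * LMUL / SEW. */
static size_t vlmax(size_t vlen_bits, size_t lmul, size_t sew_bits) {
    return vlen_bits * lmul / sew_bits;
}

/* Scalar model of an RVV strip-mined byte copy: each trip processes
 * vl = min(remaining, VLMAX) elements, as vsetvli would grant.
 * Returns the number of trips, i.e. loop iterations the hardware sees. */
static size_t stripmine_copy(unsigned char *dst, const unsigned char *src,
                             size_t n, size_t vlen_bits, size_t lmul) {
    size_t trips = 0;
    size_t max = vlmax(vlen_bits, lmul, 8);
    while (n > 0) {
        size_t vl = n < max ? n : max;   /* hardware-chosen vector length */
        for (size_t i = 0; i < vl; i++)  /* stands in for one vle8/vse8 pair */
            dst[i] = src[i];
        dst += vl; src += vl; n -= vl;
        trips++;
    }
    return trips;
}
```

At VLEN=1024 and LMUL=8, copying 2500 bytes takes only three trips through the loop.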

Feels like the Lanz Bulldog of CPUs. Not sure how practical it will be in the end, but it's certainly interesting.


The problem with SVE is that ARM vendors need to make NEON as fast as possible to stay competitive, so there is little incentive to implement SVE with wider vectors.

Graviton3 has 256-bit SVE vector registers but only four 128-bit SIMD execution units, because NEON needs to be fast.

Intel previously was in such a dominant market position that they could require all performance-critical software to be rewritten thrice.


> SVE was supposed to be the next step for ARM SIMD, but they went all-in on runtime variable width vectors and that paradigm is still really struggling to get any traction on the software side.

You can treat both SVE and RVV as a regular fixed-width SIMD ISA.

"runtime variable width vectors" doesn't capture well how SVE and RVV work. An RVV and SVE implementation has 32 SIMD registers of a single fixed power-of-two size >=128. They also have good predication support (like AVX-512), which allows them to masked of elements after certain point.

If you want to emulate AVX2 with SVE or RVV, you can require that the hardware has a native vector length >=256, and then always mask off the bits beyond 256, so the same code works on any native vector length >=256.
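A scalar model of that idea (the names and predicate logic here are illustrative, not an SVE/RVV API): operate on the full native register, but keep only the first 32 byte lanes (256 bits) active, so the result is identical on any implementation with a native width >=256:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of running fixed 256-bit SIMD code on scalable hardware:
 * the native register holds native_bits/8 byte lanes, but a predicate
 * keeps only the first 32 lanes (256 bits) active, so results are
 * identical on any implementation with native_bits >= 256. */
static void add_fixed256(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                         size_t native_bits) {
    size_t lanes = native_bits / 8;      /* native register width in bytes */
    for (size_t i = 0; i < lanes; i++) {
        int active = i < 32;             /* mask off lanes beyond 256 bits */
        if (active)
            dst[i] = (uint8_t)(a[i] + b[i]);
        /* inactive lanes: destination left untouched */
    }
}
```

Running this with native_bits of 256 and 512 produces byte-identical results in the active lanes, which is exactly the portability property being described.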


> You can treat both SVE and RVV as a regular fixed-width SIMD ISA.

Kind of, but the part which looks particularly annoying is that you can't put variable-width vectors on the stack or pass them around as values in most languages, because they aren't equipped to handle types with unknown size at compile time.

ARM seems to be proposing a C language extension which does require compilers to support variably sized types, but it's not clear to me how the implementation of that is going, and equivalent support in other languages like Rust seems basically non-existent for now.


> Kind of, but the part which looks particularly annoying is that you can't put variable-width vectors on the stack or pass them around as values in most languages, because they aren't equipped to handle types with unknown size at compile time

Yes, you can't, which is annoying, but you can if you compile for a specific vector length.

This is mostly a library structure problem. E.g. simdjson has a generic backend that assumes a fixed vector length. I've written fixed-width RVV support for it. A vector-length-agnostic backend is also possible, but requires writing a full new backend. I'm planning to write it in the future (I already have a few json::minify implementations), but it will be more work. If the generic backend used a SIMD abstraction that supports scalable vectors, like Highway, this wouldn't be a problem.

Toolchain support should also be improved. E.g. you could make all vregs take 512 bits on the stack, but have the codegen only utilize the lower 128 bits if you have 128-bit vregs, 256 bits if you have 256-bit vregs, and 512 bits if you have >=512-bit vregs.
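A sketch of that spill-slot scheme (the names are hypothetical, not an actual ABI): every vector register gets a fixed 512-bit stack slot, but spill/fill only touch the low native part, so frame offsets don't depend on the hardware's actual vector length (for VLEN <= 512):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical VLEN-independent spill-slot scheme: each vreg gets a
 * fixed 64-byte (512-bit) slot in the frame, but spill/fill only
 * touch the low native_bytes of it. The frame layout is therefore the
 * same on 128-, 256- and 512-bit hardware; only the copied size varies. */
enum { SLOT_BYTES = 64 };

static void spill(uint8_t slot[SLOT_BYTES], const uint8_t *vreg,
                  size_t native_bytes) {
    memcpy(slot, vreg, native_bytes);   /* only the low part is live */
}

static void fill(uint8_t *vreg, const uint8_t slot[SLOT_BYTES],
                 size_t native_bytes) {
    memcpy(vreg, slot, native_bytes);
}
```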


> Toolchain support should also be improved. E.g. you could make all vregs take 512 bits on the stack, but have the codegen only utilize the lower 128 bits if you have 128-bit vregs, 256 bits if you have 256-bit vregs, and 512 bits if you have >=512-bit vregs.

SVE theoretically supports hardware up to 2048-bit, so conservatively reserving the worst-case size at compile time would be pretty wasteful. That's 16x overhead in the base case of 128-bit hardware.


Surely you could have compiler types for 128, 256, 512, etc., and then choose the correct codepath with a simple if statement at runtime?
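Something like this sketch (the names are hypothetical, and the kernels are stand-ins for per-width SIMD codepaths; on real hardware the width would come from e.g. reading vlenb on RISC-V or svcntb() on SVE):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical runtime dispatch over per-width kernels: compile one
 * codepath per supported vector width, query the width once at
 * startup, then branch. The kernels below are trivial stand-ins. */
typedef void (*kernel_fn)(uint8_t *dst, const uint8_t *src, size_t n);

static void kernel_128(uint8_t *d, const uint8_t *s, size_t n)
{ for (size_t i = 0; i < n; i++) d[i] = s[i]; }
static void kernel_256(uint8_t *d, const uint8_t *s, size_t n)
{ for (size_t i = 0; i < n; i++) d[i] = s[i]; }
static void kernel_512(uint8_t *d, const uint8_t *s, size_t n)
{ for (size_t i = 0; i < n; i++) d[i] = s[i]; }

/* Pick the widest kernel the hardware can run. */
static kernel_fn select_kernel(size_t vlen_bits) {
    if (vlen_bits >= 512) return kernel_512;
    if (vlen_bits >= 256) return kernel_256;
    return kernel_128;
}
```

A 1024-bit machine would get the 512-bit codepath, a 384-bit one the 256-bit codepath, and so on.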

You can definitely put SVE vectors on the stack; there are special instructions to load and store with variable offsets. What you can't do is put them into structs, which need to have concretely sized types (i.e. each subsequent element needs to have a known byte offset).

I like this document, but it seems to be written with a very specific implementation in mind.

You can implement both regular SIMD ISAs and scalable SIMD/Vector ISAs in a "vector processor" style, and both in a regular SIMD style.


It _is_ the RISC-V Vector extension, so there's a very specific ISA in mind at the very least. There's another extension (not ratified, I think) called Packed SIMD for RISC-V, but this isn't about that.


The name, yes, but going by name is a bad idea, as the V in AVX also stands for Vector. BTW, you'll be disappointed if you think of the P extension as something like SSE/AVX. Its target is way lower power/perf, like a stripped-down MMX.

My point was about the underlying hardware implementation, specifically:

> "As shown in Figure 1-3, array processors scale performance spatially by replicating processing elements, while vector processors scale performance temporally by streaming data through pipelined functional units"

applies to the hardware implementation, not the ISA, which the text doesn't make clear.

You can implement AVX-512 with a smaller data path than the register width and "scale performance temporally by streaming data through pipelined functional units". Zen 4 is a simple example of this, but there is nothing stopping you from implementing AVX-512 on top of heavily temporally pipelined 64-bit wide execution units.

Similarly, you can implement RVV with a smaller data path than VLEN, but you can also implement it as a bog-standard SIMD processor. The only thing that slightly complicates the comparison is LMUL, but it is fundamentally equivalent to unrolling.

The substantial difference between Vector and SIMD ISAs is imo only the existence of a vl-based predication mechanism. Whether a SIMD ISA has a fixed register width or not, i.e. whether it allows you to write vector-length-agnostic code, is an independent dimension of ISA design.

E.g. the Cray-1 was without a doubt a Vector processor, but the vector registers on all compatible platforms had the exact same length. It did, however, have the mentioned vl-based predication mechanism. Conversely, you could take AVX10/128, AVX10/256 and AVX10/512, overlap their instruction encodings, and end up with a scalable SIMD ISA for which you can write vector-length-agnostic code, but that doesn't make it a Vector ISA any more than it was before.


> The name, yes, but going by name is a bad idea as the V in AVX also stands for Vector.

Now I get your point after reading more of the linked page. Yes. It is very implementation specific.

One of the things about RVV (and in general any vector ISA) is that the data path can be different enough between different implementations such that specific rules of thumb for hand tuning most probably won’t carry over. As you say it is true of even sufficiently advanced SIMD architectures like AVX.


Stripped down MMX? What's left then I wonder? :-D


That was a bit overblown, due to my lack of knowledge about MMX. It has a lot more things than MMX. But the core idea behind the P extension was to reuse the GPRs to do SIMD operations at little additional implementation cost.

The spec is currently all over the place; the best reference is probably the WIP intrinsics documentation: https://github.com/topperc/p-ext-intrinsics/blob/main/source...

P is not meant to compete with or be an alternative to RVV. It's meant for hardware targets you can't scale RVV down to.


> But the core idea behind the P extension was to reuse the GPRs to do SIMD operations with little additional implementation cost.

I think ARMv6 had something similar, before they went with proper SIMD in v7.


As sibling said, stripped down in the sense it doesn’t have dedicated registers. In terms of supported functions it’s somewhere close to MMX.

I don’t personally like it, because it still ends up with all the headache of building most of a vector subsystem (data path, functional units, …) while only really saving the one dedicated vector register file.


No, the 2.5 GHz figure is for SFX4. Atlantis is on TSMC 12nm and (as I learned yesterday) will run at about 1.5 GHz: https://cdn.discordapp.com/attachments/1061659786023813170/1...

So Ascalon should have M1 IPC, at half the frequency.


It really doesn't matter much. The Titan and K3 are Core 2 performance, the K1 and JH7110 are more like Pentium III.

A 1.5 GHz Ascalon is still going to be ... I don't know ... Skylake level? More than enough for a usable modern desktop machine and a huge leap over even machines we'll start to have delivered 3 or 4 months from now.

Hopefully it will be affordable. As in Megrez or Titan prices, not Pioneer.


The K3 is launched now.

Single core performance is about what you say. But multi-core performance is much better. The K3 scores higher than a 2017 Macbook Air for multi-core on Geekbench 6.

And the K3 can take 32 GB of DDR5 and run a decent-sized LLM, which is not something you are doing on a 5-10 year old laptop. In addition to the vector instructions, the built-in video codec acceleration and hypervisor support make for quite a modern feature set.

The K3 is still too slow to be a desktop system for most people but there are some of us who would already be ok with it.

As for pricing, it is hard to find info. But it seems like around $200 may be possible for the Jupiter2.

https://milkv.io/jupiter2

The Framework 13 K3 mainboard will be more:

https://deepcomputing.io/dc-roma-risc-v-mainboard-iii-unveil...


Yes, I've been using a K3 for a few weeks now. It's quite pleasant, and if I use all 16 cores (8x X100 and 8x A100) then it builds a Linux kernel almost 3x faster than my one year old Milk-V Megrez and almost 5x faster than K1.

    14m25.56s  SpacemiT K3, 8 X100 cores + 8 A100 cores
    16m55.637s SpacemiT K3, 8 X100 cores @2.4 GHz
    19m12.787s i9-13900HX, 24C/32T @5.4 GHz, riscv64/ubuntu docker
    39m23.187s SpacemiT K3, 8 A100 cores @2.0 GHz
    42m12.414s Milk-V Megrez, 4 P550 cores @1.8 GHz
    67m35.189s VisionFive 2, 4 U74 cores @1.5 GHz
    70m57.001s LicheePi 3A, 8 X60 cores @1.6 GHz
It's also great that it's now faster than a recent high end x86 with a lot of cores running QEMU.

Note that the all-cores K3 result is running a distccd on each cluster, which adds quite a bit of overhead compared to a simple `make` on local cores. All the same, it shaves 2.5 minutes off. In theory, doing the Amdahl calculation on the X100 and A100 times, it might be possible to get close to 11m50s with a more efficient way of using the heterogeneous cores, but distcc was easy to do.
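For reference, that 11m50s is just the ideal parallel combination of the two cluster-only times from the table (a sketch of the arithmetic, assuming both clusters build independently at their measured rates):

```c
#include <assert.h>

/* Ideal parallel combination of two independent build rates:
 * t = 1 / (1/t1 + 1/t2). With t1 = 16m55.637s = 1015.637 s (X100
 * cluster) and t2 = 39m23.187s = 2363.187 s (A100 cluster) from the
 * table above, this gives roughly 710 s, i.e. about 11m50s. */
static double combined_time(double t1, double t2) {
    return 1.0 / (1.0 / t1 + 1.0 / t2);
}
```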

RISC-V SBC single-core performance has been better than x86+QEMU since the VisionFive 2 (or HiFive Unmatched) but we didn't have enough cores unless you spent $2500 for a Pioneer.


>BXM-4-64

Is that among the few known to work with open pvr drivers?


can you share the code?


