snvzz · on May 25, 2024

>I think it's somewhat unfair to ask for real world examples when there really aren't many people writing optimized SVE code right now. Probably because there are hardly any devices with the extension.

Ironically, on the RISC-V side, RVV 1.0 hardware is readily available and cheap. The Banana Pi BPI-F3 (SpacemiT K1) is RVA22+RVV, as are some C908-based MCUs.

brigade · on May 24, 2024

CPUs with SVE have been generally available for two years now. SME and AVX-512 got benchmarks written showing them off before the CPUs were even available. Seems fair to me.

simdjson specifically benefitted from Intel's hardware decision to implement a 512b permute from 2x 512b registers with a throughput of 1/cycle. That's area-expensive, which is (probably) why ARM has historically skimped on tbl performance, only changing as of the Cortex-X4.

Anyway simdjson is an argument for 256b/512b vector permute, not 128b SVE.

Having written a lot of NEON and investigated SVE... I disagree that SVE is a nicer ISA. The set of what's 2-operand destructive, which instructions have maskable forms vs. needing movprfx that's only fused on A64FX, and dealing with the intrinsics issues that come from sizeless types are all unneeded headaches. Plus I prefer NEON's variable shift to SVE's variable shifts.

janwas · on May 25, 2024

Fair point about movprfx, I understand they were short on encoding space. This can be mitigated by using *_x versions of intrinsics where masks are not used.
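
To make the `_x` point concrete, here is a minimal sketch in C using the ACLE SVE intrinsics. The function name `add_arrays` and the scalar fallback are illustrative assumptions, not code from Highway or from anyone in this thread. The `_x` suffix tells the compiler that inactive lanes are "don't care", so it does not have to preserve or zero them, which is the case where a `movprfx` can typically be avoided.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(float *dst, const float *a, const float *b, size_t n) {
#if defined(__ARM_FEATURE_SVE)
    /* svcntw() is the number of 32-bit lanes per vector, known only at run time. */
    for (size_t i = 0; i < n; i += svcntw()) {
        /* Predicate covering the lanes that are still in bounds. */
        svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        /* _x form: inactive lanes are unspecified, so no merge or zeroing is required.
           svadd_f32_m (merge) or svadd_f32_z (zero) would constrain the destination. */
        svfloat32_t vr = svadd_f32_x(pg, va, vb);
        svst1_f32(pg, dst + i, vr);
    }
#else
    /* Portable fallback so the sketch also builds on targets without SVE. */
    for (size_t i = 0; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
#endif
}
```

Whether a `movprfx` is actually emitted still depends on the compiler and on register allocation; the `_x` form only removes the requirement.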

The sizeless headache is anyway there if you want to support RISC-V V, which we do.

One other data point in favor of SVE: its backend in Highway is only 6KLOC vs NEON's 10K, with a similar ratio of #if (indicating less fragmentation, more orthogonal).

skavi · on May 24, 2024

It’s been a while since I looked, but I remember SVE2 being much more usable than SVE. A64FX was SVE IIRC. I think SVE did not do a great job of fully replacing NEON.

neonsunset · on May 24, 2024

This.

AVX512 is all around a nice addition as JIT-based runtimes like .NET (8+) can use it for most common operations: text search, zeroing, copying, floating point conversion, more efficient forms of V256 idioms with AVX512VL (select-like patterns replaced with vpternlog).
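
As a rough illustration of the select-to-vpternlog rewrite, here is a portable scalar sketch in C. The helper names (`ternlog64`, `select_three_ops`, `select_ternlog`) are made up for this example and are not .NET or Intel APIs. It models a ternary-logic op as an 8-entry truth table indexed by the three input bits, and shows that a bitwise select `(mask & x) | (~mask & y)` collapses to a single such op with the immediate 0xCA.

```c
#include <stdint.h>

/* Scalar model of a three-input ternary-logic op: at each bit position the
   bits of a, b, c form the index (a<<2 | b<<1 | c) into the truth table imm8. */
uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8) {
    uint64_t r = 0;
    for (int bit = 0; bit < 64; ++bit) {
        unsigned idx = (unsigned)((((a >> bit) & 1) << 2) |
                                  (((b >> bit) & 1) << 1) |
                                  ((c >> bit) & 1));
        r |= (uint64_t)((imm8 >> idx) & 1) << bit;
    }
    return r;
}

/* Classic select idiom: three logical operations (and, andnot, or). */
uint64_t select_three_ops(uint64_t mask, uint64_t x, uint64_t y) {
    return (mask & x) | (~mask & y);
}

/* Same select as one ternary-logic op. 0xCA is the truth table of
   "a ? b : c": entries 0..3 (a = 0) copy c, entries 4..7 (a = 1) copy b. */
uint64_t select_ternlog(uint64_t mask, uint64_t x, uint64_t y) {
    return ternlog64(mask, x, y, 0xCA);
}
```

On real hardware the loop disappears: the same truth-table immediate is passed to an intrinsic such as `_mm512_ternarylogic_epi32`, or to the 128/256-bit forms that AVX512VL enables.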

SVE2 will follow the same route.