> I agree with you we do not only want "very wide SIMD", and it seems to me that 2x512-bit (Intel) or 4x256 (AMD) are actually a good middle ground.
I'd already classify this as "very wide". And the story is far from being that simple. Intel's 512-bit implementation is very area- and power-hungry, so much so that Intel is dropping 512-bit SIMD altogether. AMD has four add units, but only two of them are capable of multiplication. So if your code mostly does FP addition, you get good performance; if your workloads are more complex, not so much.
The thing is that on many real-world SIMD workloads, Apple's 4x128-bit either matches or outperforms both Intel's and AMD's implementations, and that on a core that runs at a lower clock and has less L1D bandwidth. Flexibility and symmetric ALU capabilities seem to be king here.
Ah, that is what you meant. Thank you for linking the post! My comment would be that this is not about 128b or 256b SIMD per se but about implementation details.
There is nothing stopping ARM from designing a core with more mask write ports; apparently, they felt this was not worth the cost. I'd say this is similar to AMD shipping only two FMA units instead of four. Other vendors might feel differently.
For very wide, I'm thinking of Semidynamics' 2048-bit HW, which with LMUL=8 gives 2048-byte vectors (a 2048-bit register is 256 bytes, and LMUL=8 groups eight of them), or the NEC vector machines.
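For reference, here is a minimal sketch of what LMUL=8 looks like with the standard RVV intrinsics (the function and variable names are mine, and VLEN=2048 is an assumption): a single e8/m8 operation then covers up to 8 x 256 = 2048 bytes.

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Byte-wise addition with LMUL=8: each strip-mined iteration operates on
// a group of 8 vector registers, i.e. up to 8 * VLEN/8 bytes at once
// (2048 bytes when VLEN=2048, as assumed here).
void add_bytes(uint8_t* dst, const uint8_t* a, const uint8_t* b, size_t n) {
  while (n > 0) {
    size_t vl = __riscv_vsetvl_e8m8(n);  // up to 2048 elements if VLEN=2048
    vuint8m8_t va = __riscv_vle8_v_u8m8(a, vl);
    vuint8m8_t vb = __riscv_vle8_v_u8m8(b, vl);
    __riscv_vse8_v_u8m8(dst, __riscv_vadd_vv_u8m8(va, vb, vl), vl);
    a += vl; b += vl; dst += vl; n -= vl;
  }
}
```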
AFAIK it has not been publicly disclosed why Intel did not get AVX-512 into their e-cores, and I have heard surprise and anger over this decision. AMD's version of them (Zen4c) is proof that it is achievable.
I am personally happy with the performance of AMD Genoa e.g. for Gemma.cpp; f32 multipliers are not a bottleneck.
> The thing is that on many real-world SIMD workloads, Apple's 4x128-bit either matches or outperforms both Intel's and AMD's implementations
Perhaps, though on VQSort it was more like 50% of the performance. And if so, it's more likely due to the astonishingly anemic memory BW of current x86 servers. Bolting more cores onto ever more imbalanced systems does not sound like progress to me, except for poorly optimized, branch-heavy code.
> Perhaps, though on VQSort it was more like 50% of the performance.
I looked at the paper, and my interpretation is that the performance delta between the M1 (Neon) and the Xeon (AVX2) can be fully explained by the difference in clock (3.7 vs. 3.3 GHz) and the difference in L1D bandwidth (48 bytes/cycle vs. 128 bytes/cycle). I don't see any evidence here that narrow SIMD is less efficient.
The AVX-512 version is much faster, but that is because it has hardware features (most importantly, compact) that are central to the algorithm. On AVX2 and Neon these are emulated with slower sequences.
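For illustration, such an emulation of compact/compress on AVX2 for 8 u32 lanes typically looks something like this (a sketch with made-up names, not VQSort's actual code): derive an 8-bit mask, then pack the selected lanes to the front with one table-driven cross-lane shuffle.

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical 256-entry table: for each 8-bit mask, lane indices that
// move the selected lanes to the front of the vector.
extern const uint32_t kCompressTable[256][8];

// Stores the lanes of `keys` that are < pivot (signed compare),
// packed to the front; returns how many lanes were selected.
static int CompressStore(uint32_t* dst, __m256i keys, __m256i pivot) {
  __m256i lt = _mm256_cmpgt_epi32(pivot, keys);
  int mask = _mm256_movemask_ps(_mm256_castsi256_ps(lt));
  __m256i idx = _mm256_loadu_si256((const __m256i*)kCompressTable[mask]);
  __m256i packed = _mm256_permutevar8x32_epi32(keys, idx);
  _mm256_storeu_si256((__m256i*)dst, packed);
  return _mm_popcnt_u32(mask);
}
```

On Neon the idea is the same, except the shuffle is a tbl instruction with byte indices; AVX-512 does all of this in a single compress instruction.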
Note that compact/compress are not actually the key enablers: also with AVX-512 we use table lookups for u64 keys, because this allows us to actually partition a vector and write it both to the left and right sides, as opposed to compressing twice and writing those individually.
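Roughly like this for an 8-lane u64 vector (again a sketch under assumptions, with a hypothetical kPartitionTable; not the actual VQSort code). One shuffle rearranges the vector into [keys < pivot | keys >= pivot], and the same vector is then stored at both ends:

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical 256-entry table: for each comparison mask, a permutation
// that moves the lanes with the bit set (keys < pivot) to the front.
extern const uint64_t kPartitionTable[256][8];

static void PartitionVector(__m512i keys, __m512i pivot,
                            uint64_t** left, uint64_t** right) {
  __mmask8 lt = _mm512_cmplt_epu64_mask(keys, pivot);
  int num_lt = _mm_popcnt_u32(lt);
  // A single shuffle partitions the whole vector.
  __m512i idx = _mm512_loadu_si512(kPartitionTable[lt]);
  __m512i parted = _mm512_permutexvar_epi64(idx, keys);
  // Store the same vector twice: the small keys land at *left, the large
  // keys end exactly at *right. The extra lanes written by each store are
  // harmless only because that data has already been loaded, which the
  // surrounding partition loop ensures.
  _mm512_storeu_si512((__m512i*)*left, parted);
  *left += num_lt;
  _mm512_storeu_si512((__m512i*)(*right - 8), parted);
  *right -= 8 - num_lt;
}
```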
Isn't the L1D bandwidth tied to the SIMD width, i.e., would it be unachievable on Skylake when using only 128-bit vectors there?
> Note that compact/compress are not actually the key enablers: also with AVX-512 we use table lookups for u64 keys, because this allows us to actually partition a vector and write it both to the left and right sides, as opposed to compressing twice and writing those individually.
That is interesting! So do I understand you correctly that the 512b vectors allow you to implement the algorithm more efficiently? That would indeed be a nice argument for longer SIMD.
> Isn't the L1D bandwidth tied to the SIMD width, i.e., would it be unachievable on Skylake when using only 128-bit vectors there?
It's a hardware detail. Intel does tie it to the SIMD width, but it doesn't have to be that way. For example, Apple has 4x128b units but can only load up to 48 bytes per cycle (I am not sure about the granularity of the loads).
Right, longer vectors let us write more elements at a time.
I agree that the number of L1 load ports (or issue width) is also a parameter: that times the SIMD width gives us the bandwidth (e.g., Skylake's 2 ports x 64 bytes = 128 bytes/cycle vs. Apple's 3 ports x 16 bytes = 48 bytes/cycle). It will be interesting to see what AMD Zen5 brings to the table here.
Sure, it's https://news.ycombinator.com/item?id=40465090.