> but if the code contains independent SVE streams, you will be stalled. Can you...

ribit · on May 26, 2024

If you do streaming-type operations on long arrays, yes. If your data sizes are small, however, four smaller units might be more flexible. As a naive example, let's take the popular SIMD acceleration of hash tables. Since the key is likely to be found close to its optimal location, most lanes of a long vector end up comparing slots that never needed checking, so long SIMD wastes compute. With small SIMD, however, you could do multiple independent lookups in parallel, courtesy of OoO execution.
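To put rough numbers on the hash table point, here is a minimal sketch in plain C. It is not real SIMD code: the inner loop only stands in for a single vector compare of `width` lanes, and `TABLE_SLOTS`, `probe` and `lanes_to_find_nearby_key` are names I made up for illustration. It assumes a linear-probing table with one 8-bit tag per slot and 0 meaning empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table layout: one 8-bit tag per slot, 0 means "empty". */
enum { TABLE_SLOTS = 256 };

/* Probe groups of `width` slots starting at the key's home slot.
   The inner loop stands in for one SIMD compare of `width` lanes.
   Returns the matching slot or -1, and adds the number of lanes
   compared to *lanes_compared. */
int probe(const uint8_t *tags, size_t home, uint8_t tag,
          size_t width, size_t *lanes_compared)
{
    assert(width > 0 && TABLE_SLOTS % width == 0);
    for (size_t base = 0; base < TABLE_SLOTS; base += width) {
        int hit = -1;
        int saw_empty = 0;
        for (size_t lane = 0; lane < width; lane++) {
            size_t slot = (home + base + lane) % TABLE_SLOTS;
            if (hit < 0 && tags[slot] == tag) hit = (int)slot;
            if (tags[slot] == 0) saw_empty = 1;
        }
        *lanes_compared += width;   /* every lane is paid for, hit or not */
        if (hit >= 0) return hit;
        if (saw_empty) return -1;   /* an empty slot ends the probe chain */
    }
    return -1;
}

/* Builds a tiny table where the key sits two slots past its home slot
   and returns how many lanes a probe of the given width compares. */
size_t lanes_to_find_nearby_key(size_t width)
{
    uint8_t tags[TABLE_SLOTS] = {0};
    tags[10] = 7;   /* unrelated keys that landed at the home slot first */
    tags[11] = 9;
    tags[12] = 5;   /* the key we are looking for */
    size_t lanes = 0;
    int slot = probe(tags, 10, 5, width, &lanes);
    assert(slot == 12);
    return lanes;
}
```

In this toy setup the key is found in the first group at every width, so the work per lookup is simply `width` lanes: 4, 16 or 64 compares to find a key that is two slots from home. Separate lookups have no data dependency on each other, which is what would let an OoO core with several narrow units overlap them.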

This is why I like the ARM/Apple design with "regular SIMD" and "streaming SIMD". The regular SIMD is latency-optimized and offers versatile functionality for more flexible data swizzling, while the streaming SIMD uses long vectors and is optimized for throughput.