It's great you bring up cmp, helps to understand why 4x128 is not necessarily as good as 1x512. Quicksort, hardly a 'weird kernel', does comparisons followed by compaction. Because comparisons return a predicate, which have only a single write port, we can only do 128 bits of comparisons per cycle. Ouch.
However, masking can still help our VQSort [1], for example when writing the rightmost partition right to left without stomping on subsequent elements, or in a sorting network, only updating every second element.
However, masking can still help our VQSort [1], for example when writing the rightmost partition right to left without stomping on subsequent elements, or in a sorting network, only updating every second element.
[1] https://github.com/google/highway/tree/master/hwy/contrib/so...