In AVX-512, there's an instruction vpcompressd, which literally does the equivalent of 16 iterations of that loop (the "store a subset of items to memory contiguously" part at least). The compiler would have to figure out how the conditional write & increment work together, but it's definitely possible as ICC manages to: https://godbolt.org/z/n68ca4aYe
For non-AVX-512, it gets more complicated to do the compression part, but you can still do it with a shuffle & lookup table of indexes from the mask.
For non-AVX-512, it gets more complicated to do the compression part, but you can still do it with a shuffle & lookup table of indexes from the mask.