Edit: this might also be implementable purely with bitwise vector operations. I would need to check the throughput of those.