If this paper holds up, I'd expect that's where custom accelerators will be heading.
edit: this might also be implementable purely with bitwise vector operations. I'd need to check the throughput of those.