Golang Duff's devices (luciotato.svbtle.com)
61 points by lucio on Feb 3, 2015 | hide | past | favorite | 18 comments


This is probably a bad idea and I would bet (and probably win) that less aggressive unrolling would be better in pretty much all cases except a micro-benchmark.

This kind of trick worked well on CPUs in the 1980s (as jvoorhis points out in another comment). It works wonders for micro-benchmarks. It is a net loss for any CPU and workload where the instruction cache miss rate can severely impact performance. That is to say, pretty much everything made after the 1990s and doing more than just shuffling memory around.

The problems are: 1) this is a large fraction of I-Cache size, 2) unrolling doesn't actually save much (or sometimes anything) on out-of-order multi-issue cores.

First off: extremely aggressive unrolling results in increased I-cache thrashing. This is usually not apparent in micro-benchmarks, because the entire working code set fits in the I-cache. It becomes apparent if the unrolling is stupid enough to be larger than the I-cache (or usually just close enough to it), because then you're spending just as much (usually more) time re-fetching instructions from memory as you are moving data around. That's why picking the best result from a micro-benchmark is misleading. In real use cases, you find that functions like this one tend to cause a "global slowdown", with no specific function accounting for it, because they're greatly reducing the effectiveness of the I-cache.

Also, unrolling doesn't really do much these days. You can do better than REP MOVSD and friends - go read Intel's documentation for their recommended sequences (which has a nasty habit of changing every damn CPU generation). Modern out-of-order, multi-issue cores can very easily saturate their load/store units, while also executing all those pesky loop counters and branches. In fact, that's the whole point of being out-of-order and multi-issue: keeping units busy.

So, can everybody please stop using Duff's device, unless: 1) you're targeting a CPU which has no I-cache (e.g. tightly coupled ROM/RAMs), 2) you really do have a working set where this fits well enough to be better, 3) you're trying to cheat on benchmarks, or 4) you're trying to obfuscate your binaries against reverse engineering.


REP MOVSD is recommended by Intel only for large clears/copies. Indeed, the commit message states:

REP MOVSQ and REP STOSQ have a really high startup overhead. Use a Duff's device to do the repetition instead.


Thanks for the post - I've never read any of this stuff before. I've only been a dev for 18 years ¬_¬


Kids these days.. ;-)


This is just aggressive loop unrolling. Duff's device is specifically the syntactic curiosity in C/C++ that allows implementing unrolls with a switch interleaved with a do-while loop.
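For reference, the classic idiom looks roughly like this - a sketch of Duff's original structure with a zero-count guard added (his original wrote to a fixed memory-mapped output register, so `to` was not incremented):

```c
#include <stddef.h>

/* Classic Duff's device: the switch jumps into the middle of the
 * do-while, so the first pass through the loop body handles the
 * count % 8 leftover copies, and later passes do full groups of 8. */
void duff_copy(char *to, const char *from, size_t count)
{
    if (count == 0)
        return;
    size_t n = (count + 7) / 8;  /* number of passes through the loop */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

The case labels are legal inside the do-while body because C defines a switch as a computed jump into its statement, not as a container of separate blocks.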


The authors of this code seem to be using "Duff's device" to mean the unroll, plus the mechanism of jumping into the unrolled code. I agree that neither is Duff-specific in assembly. And anyway, it's not as impressive without the case...do interleaving. It still amazes me that Duff's version works.


Duff's version will always work for as long as compilers remain compatible with ANSI C. It's a clever trick that takes advantage of the precise definition of a switch in the C standard (as opposed to taking advantage of compiler-specific implementation details).


I meant "it still amazes me it works", not "it amazes me it still works". That is, even knowing why and how it works, it still looks like it is malformed.


Is loop unrolling truly helpful in either case? CPUs have changed a lot since the 1980s, when Duff's device was invented.



It's been a long time since I've done any speed optimisation stuff / assembly coding but one of the quirks of REP STOS/MOVS I recall is that it is very sensitive to memory alignment. I.e., if you try to do a MOVSD without using a 4-byte aligned base address, it will be very slow; similarly a MOVSQ must be 8-byte aligned. I wonder if this is just a case of someone having benchmarked a naive implementation without taking alignment into account? As other commenters have noted, loop unrolling isn't a great idea on a modern CPU and it certainly wouldn't have got anywhere close to the speed of REP on an older CPU when I was doing this kind of thing.

The right way to fix alignment is to have some MOVSB/MOVSW instructions at the start and jump over the correct number based on the difference between your actual base address and the offset you want, then do the rest with REP. I haven't tried to pick apart the source in detail but it looks like they are just doing a naive REP.
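A C sketch of that head-fixup idea (hypothetical function name; `memcpy` stands in for the aligned REP MOVSQ bulk copy):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical sketch: copy single bytes until the destination
 * reaches 8-byte alignment, then hand the aligned remainder off
 * to a bulk copy (memcpy here, REP MOVSQ in the assembly version). */
void aligned_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: at most 7 byte-wise copies to fix up alignment. */
    while (n > 0 && ((uintptr_t)d & 7) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Aligned bulk copy of whatever remains. */
    memcpy(d, s, n);
}
```

In the assembly version the head would be the MOVSB/MOVSW sequence jumped into at the right offset, exactly as described above.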

Of course it's possible that this is only used in arrays that are known to be 4/8 byte aligned anyway, in which case this comment is irrelevant. However it's a very surprising result that the in-built and widely used instruction provided by the CPU for doing exactly this operation is slower than the dumb loop unrolling approach, even with all the instruction cache and memory bandwidth problems it introduces. Odd results like that usually point to programmer error rather than surprising CPU behaviour (although if that were always the case, Michael Abrash's books would be a lot less interesting).


I like the hundreds of repeated lines. I wonder when they are going to rediscover macro assemblers.
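For what it's worth, even the C preprocessor can play the macro-assembler role here. A hypothetical sketch that keeps the eight-way unroll in one macro instead of hundreds of pasted lines:

```c
#include <stddef.h>

/* Hypothetical sketch: one macro expands to eight copy statements,
 * so the repetition is written once and expanded by the tool. */
#define COPY8(to, from)                              \
    do {                                             \
        (to)[0] = (from)[0]; (to)[1] = (from)[1];    \
        (to)[2] = (from)[2]; (to)[3] = (from)[3];    \
        (to)[4] = (from)[4]; (to)[5] = (from)[5];    \
        (to)[6] = (from)[6]; (to)[7] = (from)[7];    \
    } while (0)

/* Copy n bytes, 8 at a time, with a byte-wise tail. */
void unrolled_copy(char *to, const char *from, size_t n)
{
    while (n >= 8) {
        COPY8(to, from);
        to += 8;
        from += 8;
        n -= 8;
    }
    while (n-- > 0)
        *to++ = *from++;
}
```

A macro assembler would do the same expansion at the assembly level, which is presumably the rediscovery being wished for.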


The Go build system uses the Plan 9 assembler, which is explicitly simplistic to discourage the use of assembly except when there is a very compelling case for it.


Or because they couldn't be bothered, and the "discourage" part is just the excuse...


Some of the authors seem to have been somewhat involved in the creation of Unix itself, so I guess, just guess, they know about macro assemblers... ;)


:yy99p using vi


This illustrates one of the issues with "high-level" languages: when you need to do something that the language does not allow, you have to resort to horrible hacks.

I thought the C version of Duff's device was bad, but this is atrocious. This is non-portable copy-pasta on steroids.

This is one of the reasons why people still use C and C++. When you need to tune something, you can write it using the same language and nicely encapsulate it.


Except that this is C code (well at least some dialect of...) with ASM embedded...



