It might help to try to squeeze more work in between the load and the store, e.g. do two of these in parallel (load/load/mul/mul/store/store). Think about latency vs. throughput: the loads and stores may have high latency but also high throughput, so you need to be issuing more of them concurrently (without any dependencies between them). Just off the top of my head.
Actually, in most cases instruction ordering isn't really a factor on modern Intel processors. They operate "out of order" (often abbreviated OoO), which means that they execute instructions (µops really) from a "reorder buffer" as soon as dependencies are satisfied. Since the loads have no dependencies, they will already execute several (possibly many) iterations ahead of the multiplies and stores.
The latency of the store doesn't matter to us, since we aren't immediately using the result. That doesn't mean you won't see a benefit from your suggestion, but if you do, it will probably come from the reduced loop overhead rather than the improved concurrency.
I'm aware of out-of-order execution, but my experience is that, at least with SSE/AVX, instruction ordering does matter... The loads do have a dependency on the address, for example. Anyways, some experimentation will help; even loop unrolling doesn't always behave the way you'd think.
The loads do have a dependency on the address for example.
Sort of. They depend on the address, but there is nothing that prevents the address from being calculated ahead of time. So what happens is that both the loads and the address addition get executed multiple iterations ahead of the multiplication. While we tend to think of one iteration completing before the next begins, from the processor's point of view it's just a continuous stream of instructions.
Loops (especially FP loops) are often limited by dependency chains, which prevents OoO execution from helping. Unrolling (and using multiple accumulators) helps create multiple independent chains that can be executed in parallel.
Edit: for this specific loop, the only dependency is on the iteration variables, which is not an issue here, as the loop should only be limited by load/store bandwidth, assuming proper scheduling and induction-variable elimination from the compiler.