It might help to try to squeeze more work in between the load and the store, e.g. do two of these in parallel (load/load/mul/mul/store/store). Think about latency vs. throughput: the loads and stores may have high latency but also high throughput, so you need to be issuing more of them concurrently (without any dependencies between them). Just off the top of my head.
Actually, in most cases instruction ordering isn't really a factor on modern Intel processors. They operate "out of order" (often abbreviated OoO), which means that they execute instructions (µops really) from a "reorder buffer" as soon as dependencies are satisfied. Since the loads have no dependencies, they will already execute several (possibly many) iterations ahead of the multiplies and stores.
The latency of the store doesn't matter to us, since we aren't immediately using the result. That doesn't mean you won't see a benefit from your suggestion, but if you do, it will probably come from the reduced loop overhead rather than the improved concurrency.
I'm aware of out-of-order execution, but my experience is that, at least with SSE/AVX, instruction ordering does matter... The loads do have a dependency on the address, for example. Anyways, some experimentation will help; even loop unrolling doesn't always behave the way you'd think.
The loads do have a dependency on the address for example.
Sort of. They depend on the address, but there is nothing that prevents the address from being calculated ahead of time. So what happens is that both the loads and the address addition get executed multiple iterations ahead of the multiplication. While we tend to think of one iteration completing before the next begins, from the processor's point of view it's just a continuous stream of instructions.
Loops (especially FP loops) are often limited by dependency chains, which prevents OoO execution from helping. Unrolling (and using multiple accumulators) helps create multiple independent chains that can be executed in parallel.
Edit: for this specific loop, the only dependency is on the iteration variables, which is not an issue here, as the loop should only be limited by load/store bandwidth, assuming proper scheduling and induction-variable elimination from the compiler.