dfbrown on Oct 14, 2016 \| on: Intel will add deep-learning instructions to its p...

I would try unrolling 2-4 iterations of the loop. Multiple sequential loads aren't much slower than a single load, so batching your loads and stores together lets you do more arithmetic operations each time you hit memory.
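To make the suggestion concrete, here is a rough sketch of a 4x unroll in C. The loop in question isn't shown in this thread, so the body (`dst[i] = a[i] * k + b[i]`) and the names `scale_add` and `scale_add_unrolled4` are made up for illustration. The `restrict` qualifiers assume the arrays don't overlap, which is what lets the loads be grouped ahead of the stores.

```c
#include <stddef.h>

/* Baseline: each iteration does two loads, one multiply-add, one store. */
void scale_add(float *dst, const float *a, const float *b, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * k + b[i];
}

/* Unrolled by 4 (hypothetical example): loads are batched first, then the
 * arithmetic, then the stores. restrict promises the arrays do not alias. */
void scale_add_unrolled4(float *restrict dst, const float *restrict a,
                         const float *restrict b, float k, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Batch the sequential loads. */
        float a0 = a[i], a1 = a[i + 1], a2 = a[i + 2], a3 = a[i + 3];
        float b0 = b[i], b1 = b[i + 1], b2 = b[i + 2], b3 = b[i + 3];

        /* Four independent multiply-adds, no memory traffic in between. */
        float r0 = a0 * k + b0;
        float r1 = a1 * k + b1;
        float r2 = a2 * k + b2;
        float r3 = a3 * k + b3;

        /* Batch the stores. */
        dst[i]     = r0;
        dst[i + 1] = r1;
        dst[i + 2] = r2;
        dst[i + 3] = r3;
    }

    /* Remainder: handle the last n % 4 elements one at a time. */
    for (; i < n; i++)
        dst[i] = a[i] * k + b[i];
}
```

An optimizing compiler may already unroll or vectorize the baseline on its own, so it's worth checking the generated assembly and benchmarking both versions before keeping the manual unroll.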