Hacker News
dfbrown
on Oct 14, 2016
| on:
Intel will add deep-learning instructions to its p...
I would try unrolling 2-4 iterations of the loop. Multiple sequential loads aren't much slower than a single load, so batching your loads and stores together lets you do more arithmetic operations each time you hit memory.
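To make that concrete, here is a rough sketch of what unrolling by 4 could look like. The loop being discussed isn't shown in this comment, so `saxpy_simple` and `saxpy_unrolled4` are hypothetical stand-ins (a plain `y = a*x + y` kernel), and the sketch assumes `x` and `y` don't overlap. Whether it actually helps depends on the compiler and the CPU, so measure before keeping it.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: every iteration does two loads, one multiply-add, one store. */
void saxpy_simple(float a, const float *x, float *y, size_t n) {
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Unrolled by 4 (hypothetical example, assumes x and y do not overlap):
 * the loads are grouped, then the arithmetic, then the stores. */
void saxpy_unrolled4(float a, const float *x, float *y, size_t n) {
    assert(n == 0 || (x != NULL && y != NULL));
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Batch the loads. */
        float x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        float y0 = y[i], y1 = y[i + 1], y2 = y[i + 2], y3 = y[i + 3];

        /* Four independent multiply-adds with no memory traffic between them. */
        y0 = a * x0 + y0;
        y1 = a * x1 + y1;
        y2 = a * x2 + y2;
        y3 = a * x3 + y3;

        /* Batch the stores. */
        y[i]     = y0;
        y[i + 1] = y1;
        y[i + 2] = y2;
        y[i + 3] = y3;
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

An optimizing compiler may already do this for you at higher optimization levels, so it's worth checking the generated assembly before and after unrolling by hand.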