
A for() loop does the same thing at the cost of about 3 instructions. 4x128b gives you the flexibility that you don't need 512b-wide operations on the same data to keep the ALUs fed. If you have 512b-wide operations being split into 4x128b instructions, great; otherwise the massive OoOE window of modern chips can decode the next few loop iterations to keep the ALUs fed, or even pull in instructions from a completely different kernel.
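
To make the for() loop point concrete, here is a minimal sketch in portable C of one 512b-wide float add written as a loop over four 128b-wide adds. The names `add_128`, `add_512_as_4x128`, `LANES_128`, and `CHUNKS_PER_512` are made up for illustration, and the inner scalar loop only stands in for a single 128b SIMD instruction; actual instruction counts depend on the compiler and target.

```c
#include <stddef.h>

/* Hypothetical constants: four 32-bit floats per 128b chunk,
   four 128b chunks per 512b of data. */
enum { LANES_128 = 4, CHUNKS_PER_512 = 4 };

/* Stand-in for one 128b SIMD add over four float lanes. */
static void add_128(const float *a, const float *b, float *out) {
    for (size_t i = 0; i < LANES_128; i++) {
        out[i] = a[i] + b[i];
    }
}

/* One "512b" add expressed as a loop over four independent 128b adds.
   The loop overhead is a counter update, a compare, and a branch. */
static void add_512_as_4x128(const float *a, const float *b, float *out) {
    for (size_t c = 0; c < CHUNKS_PER_512; c++) {
        size_t off = c * LANES_128;
        add_128(a + off, b + off, out + off);
    }
}
```

The four chunk adds have no data dependencies on each other, which is what would let an out-of-order core issue them across several 128b ALUs at once, or interleave them with unrelated work.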

