Hacker News

Both.

Some operations split and join data in non-deterministic ways; in particular, the order of operations can vary between runs, which leads to different floating-point rounding. If you shard across multiple machines, for example, the weight accumulation order will depend on network latency.
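A minimal sketch of why order matters: floating-point addition is not associative, so a reduction that sums the same values in a different order can round to a different result. (The values here are just a textbook illustration, not anything specific to training.)

```python
# Floating-point addition is not associative: summing the same
# three values in two different orders gives different results,
# because the intermediate rounding differs.
a, b, c = 1e16, -1e16, 1.0

print((a + b) + c)  # -> 1.0  (a + b cancels exactly, then + 1.0)
print(a + (b + c))  # -> 0.0  (b + c rounds back to -1e16, which cancels a)
```

This is exactly what happens at scale when a distributed reduction accumulates partial sums in whatever order they arrive over the network.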

Also, GPUs aren't anywhere near as reliable as CPUs when it comes to being able to run for hours without any random bit flips/errors.




> ...split and join data in non-deterministic ways ... to different floating point rounding

Ah, of course! A very timely reminder, thanks!

> GPUs aren't anywhere near as reliable as CPUs when it comes to being able to run for hours without any random bit flips/errors.

Now that's worrying. A bit flip can't be expected to be skewed towards any particular bit within a float, so it could easily land in the exponent, skewing a single value by orders of magnitude in either direction. Combine that with the rest of your otherwise 'good' results and yuck. Thanks for the warning.
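To put numbers on that worry, here's a small sketch (the `flip_bit` helper is just illustrative) of how much damage a single flipped bit does in an IEEE-754 double, depending on where it lands:

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    # Reinterpret the 64-bit IEEE-754 pattern of x, flip one bit,
    # and reinterpret the result as a float again.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y

# bits 0-51 are the significand, 52-62 the exponent, 63 the sign
print(flip_bit(2.0, 0))   # -> 2.0000000000000004 (lowest significand bit: tiny error)
print(flip_bit(2.0, 52))  # -> 4.0                (lowest exponent bit: value doubles)
print(flip_bit(2.0, 61))  # -> 2**513, ~2.7e154   (high exponent bit: astronomically wrong)
```

So a flip in the significand is survivable noise, while a flip in the upper exponent bits turns one weight into something that can poison every value it touches downstream.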


> A bit flip can't be expected to be skewed towards any particular bit within a float,

Actually, I think they are: for example, the exponent path through an adder/multiplier is typically shorter, so when operated close to clock speed limits, the exponent is more likely to be correct.

(I've not actually verified the above on real hardware)





