
Important to remember the half-precision tensor core misrepresentations, where the claimed 8x improvement over fp32 on ImageNet with tensor cores (V100) was actually only 1.2-2x [1,2]. Furthermore, there are major precision issues with network architectures like variational autoencoders and many others.

We use V100s for Richardson-Lucy-like deconvolutions, for example, where we have near-exact photon counts of up to 10,000 per pixel. fp32 is sufficient; tf32 is not.
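
A rough way to see the precision gap (my own sketch, assuming numpy, and using truncation of the low mantissa bits rather than the hardware's actual round-to-nearest conversion): tf32 keeps fp32's 8-bit exponent but only 10 explicit mantissa bits, so integers above 2^11 = 2048 are no longer all exactly representable, while fp32's 23-bit mantissa covers integers exactly up to 2^24.

    import numpy as np

    def to_tf32(x):
        # Simulate tf32 by keeping fp32's 8-bit exponent but only the top
        # 10 of the 23 mantissa bits (truncation, not round-to-nearest,
        # which is close enough to show the scale of the error).
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        bits = bits & np.uint32(0xFFFFE000)   # clear the low 13 mantissa bits
        return bits.view(np.float32)

    # Photon counts near 10,000 are exact in fp32 but generally not in tf32:
    for count in (2049, 9999, 10000, 10001):
        print(count, float(np.float32(count)), float(to_tf32(count)))
    # 2049 -> 2048, 9999 -> 9992, 10001 -> 10000 (10000 happens to be exact)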

The V100 claimed 15 teraflops of FP32; the A100 claims 19.5 teraflops. For most pytorch/tensorflow workflows out there, where FP32 dominates, that works out to roughly a 30% improvement over the last generation (19.5 / 15 ≈ 1.3), which is reasonable and typical. FP64 does get a nice boost, though.

[1] https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v...
[2] https://www.pugetsystems.com/labs/hpc/TensorFlow-Performance...




I am not that much into ML, I've just fiddled with it a bit. Is tf32 = fp16?


Not quite, but close. “tf32” is 18 bits, but with the same 10 bits of exponent that fp32 has. It’s the range of fp32 with the precision of fp16. It’s a shame to see such unoriginality in new number representations. I’d much rather see Posit hardware acceleration: https://web.stanford.edu/class/ee380/Abstracts/170201-slides...


TF32 is 19 bits, not 18 bits. There's an additional bit for sign.

https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-prec...


No, tf32 has the same size exponent field as f32 (8 bits) but the mantissa size of f16 (10 bits).
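
For reference, a quick sanity check of the "range of fp32, precision of fp16" description from the bit layouts alone (my own sketch; the max-value figure is a rough upper bound, not the exact largest finite value):

    # sign / exponent / explicit mantissa bits for each format
    formats = {"fp32": (8, 23), "tf32": (8, 10), "fp16": (5, 10)}
    for name, (exp_bits, man_bits) in formats.items():
        max_val = 2.0 ** (2 ** (exp_bits - 1))   # rough bound on the largest finite value
        roundoff = 2.0 ** -(man_bits + 1)        # unit roundoff (half an ulp at 1.0)
        print(f"{name}: 1/{exp_bits}/{man_bits} bits, max ~{max_val:.1e}, roundoff ~{roundoff:.1e}")

tf32 inherits fp32's ~1e38 range but fp16's ~5e-4 roundoff, which is where the deconvolution precision complaint above comes from.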


Why is fp32 sufficient but not tf32 for that task?



