Extending the tensor ops to FP64 is an interesting, if not surprising, design choice. Are there many applications likely to leverage this capability? Aside from HPL, of course.
I suspect this is specifically targeted at the HPC market, as the lack of FP64 has always been a hindrance to HPC deployment.
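To make the FP64 point concrete, here's a minimal sketch (my own illustration, not from the original comment) of why many HPC codes can't get by on FP32: with only ~7 decimal digits of precision, small updates to a large accumulator are simply rounded away, which is fatal for long-running iterative solvers.

```python
import numpy as np

# FP32 carries ~7 significant decimal digits, FP64 ~16.
# A tiny increment added to 1.0 is below half an FP32 ulp (~6e-8),
# so it rounds away entirely in single precision.
small = 1e-8
fp32 = np.float32(1.0) + np.float32(small)  # rounds back to exactly 1.0
fp64 = np.float64(1.0) + np.float64(small)  # increment is retained

print(fp32 == np.float32(1.0))  # True  -> update lost in FP32
print(fp64 == np.float64(1.0))  # False -> FP64 keeps it
```

This is the kind of accumulation behavior that makes FP64 non-negotiable for much of traditional HPC, and why FP64 tensor cores matter for that market beyond just HPL numbers.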
You have to remember that the HPC market is $35B today. HPE alone makes $3B a year from it, maybe more with the Cray acquisition.
So it's no surprise that NVIDIA wants to position themselves in that market.
Plus, you have to look at the long game with the Mellanox acquisition (a heavy player in HPC interconnects) and Cumulus. I wouldn't be surprised to see NVIDIA try to bypass Intel/AMD completely and offer a direct-to-interconnect device, rather than the hybrid CPU + GPU box.
Remember too that AMD and Intel won the contracts for Aurora, Frontier, and El Capitan (the three exascale machines for the DOE). I imagine Mellanox is a big part of winning the next contracts, especially seeing that a lot of these projects are I/O-bound, not compute-bound. If you can bring supercomputer-like abilities to datacenters or AI labs, that'd be a huge advantage. If you could easily split a huge model across 64 GPUs and train as if it were on a single node, that would change the space.
Lately they have also been working quite a bit with POWER; quite a few of the big HPC systems run on the POWER/Mellanox/NVIDIA combo. So that could also be something they look into more, instead of going with x86.