
Actually it looks like the double-precision rate for general GPU usage only went up about 25% (Volta did 7.8 TFLOPS, A100 does 9.7). To get the 2.5x number, you need to run FP64 through the Tensor Cores, which gets you 19.5 TFLOPS.
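(Quick arithmetic check: 7.8 TFLOPS x 1.25 ≈ 9.7 TFLOPS for the plain FP64 path, and 19.5 / 7.8 = 2.5x for the Tensor Core path.) For anyone wondering what "FP64 through the Tensor Cores" actually looks like in code, below is a minimal sketch using the CUDA WMMA API's double-precision 8x8x4 fragments (compute capability 8.0, CUDA 11+). The kernel name and the all-ones test data are just mine for illustration; in practice you'd rarely write this by hand, since cuBLAS DGEMM is supposed to route through these FP64 Tensor Core (DMMA) units automatically on A100.

    #include <cuda_runtime.h>
    #include <mma.h>
    #include <cstdio>
    using namespace nvcuda;

    // One warp computes an 8x8 double-precision tile D = A*B + C using
    // the FP64 Tensor Cores. 8x8x4 is the double fragment shape for sm_80.
    __global__ void dmma_8x8x4(const double* a, const double* b, double* c) {
        wmma::fragment<wmma::matrix_a, 8, 8, 4, double, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 8, 8, 4, double, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 8, 8, 4, double> acc;

        wmma::fill_fragment(acc, 0.0);
        wmma::load_matrix_sync(a_frag, a, 4);  // A is 8x4, row-major, ld = 4
        wmma::load_matrix_sync(b_frag, b, 4);  // B is 4x8, col-major, ld = 4
        wmma::mma_sync(acc, a_frag, b_frag, acc);
        wmma::store_matrix_sync(c, acc, 8, wmma::mem_row_major);  // C is 8x8, row-major
    }

    int main() {
        double *a, *b, *c;
        cudaMallocManaged(&a, 8 * 4 * sizeof(double));
        cudaMallocManaged(&b, 4 * 8 * sizeof(double));
        cudaMallocManaged(&c, 8 * 8 * sizeof(double));
        for (int i = 0; i < 32; ++i) { a[i] = 1.0; b[i] = 1.0; }

        dmma_8x8x4<<<1, 32>>>(a, b, c);  // a single warp issues the MMA
        cudaDeviceSynchronize();
        printf("c[0] = %f (expect 4.0: dot of a 4-wide row and column of ones)\n", c[0]);

        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

Built with something like nvcc -arch=sm_80; on pre-Ampere parts the double WMMA shape isn't available, which is exactly why the 19.5 TFLOPS number is new here.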

Considering how big the die is (826 mm^2 on TSMC 7nm) and how many transistors it packs, they really must have beefed up the Tensor Cores much more than the general compute units.


