Training and inference on GPUs significantly underutilize… the GPUs. So tuning and various tricks need to be applied to achieve dramatic performance gains. If I am not good at cooking, giving me a larger kitchen will not make me faster or better.
For example, some stats from Whisper [0] (transcribing 30 seconds of audio) show the following for the medium model (see other models in the link):
---
Device  Model   Precision      Layer      Time
GPU     medium  fp32           Linear     1.7 s
CPU     medium  fp32           nn.Linear  60.7 s
CPU     medium  qint8 (quant)  nn.Linear  23.1 s
---
So the same model runs 35.7x faster on the GPU than on the CPU (60.7 s / 1.7 s), and is still 13.6x faster than the quantized, "optimized" CPU version (23.1 s / 1.7 s).
I was expecting around an order of magnitude of improvement.
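For context, the qint8 (quant) rows in that repo come from PyTorch's dynamic quantization of the Linear layers. A minimal sketch of the idea, using a toy stack of Linear layers rather than Whisper itself, so the timings are only illustrative:

    import time
    import torch
    import torch.nn as nn

    # Toy stand-in for the Linear-heavy part of a transformer.
    model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)]).eval()

    # Dynamic quantization: nn.Linear weights are stored as int8 and
    # dequantized on the fly; activations stay in float.
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(16, 1024)
    with torch.no_grad():
        for m, name in [(model, "fp32"), (quantized, "qint8")]:
            start = time.perf_counter()
            for _ in range(50):
                m(x)
            print(f"{name}: {time.perf_counter() - start:.2f}s")

Dynamic quantization mainly cuts memory traffic on the CPU; as far as I know it is CPU-only in PyTorch, which would be why the GPU row in the table stays fp32.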
Then again, I do not know whether, in the article's case, the entire model ran on the GPU or only a fraction of it (22 layers) with the remainder on the CPU, which might explain the result. Apparently the latter is the case, but I don't know much about this stuff.
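If that's right, the gap is less surprising: with a split model, every forward pass crosses the CPU/GPU boundary and the CPU layers become the bottleneck. A hypothetical sketch of such a split (the class, layer sizes, and total layer count are made up for illustration; only the 22-layer figure comes from the discussion above):

    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Hypothetical split: the first n_gpu_layers of a layer stack run on
    # the GPU, the rest on the CPU, with activations moved in between.
    class SplitStack(nn.Module):
        def __init__(self, layers, n_gpu_layers):
            super().__init__()
            self.gpu_layers = nn.ModuleList(layers[:n_gpu_layers]).to(device)
            self.cpu_layers = nn.ModuleList(layers[n_gpu_layers:])

        def forward(self, x):
            x = x.to(device)
            for layer in self.gpu_layers:
                x = layer(x)
            x = x.to("cpu")  # transfer + CPU tail can dominate total runtime
            for layer in self.cpu_layers:
                x = layer(x)
            return x

    layers = [nn.Linear(1024, 1024) for _ in range(32)]
    model = SplitStack(layers, n_gpu_layers=22)
    out = model(torch.randn(16, 1024))

In a setup like that, Amdahl's law applies: even if the 22 GPU layers took zero time, the layers left on the CPU would bound the overall speedup.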
[0] https://github.com/MiscellaneousStuff/openai-whisper-cpu