Two years ago I bet that by now matmul would be first-class in torch on transformer-optimized hardware costing a fraction of what GPUs do, with no reason to use GPUs any more. Wrong.
The real bottleneck is memory. Optimize your matmul architecture all you like; while it's still connected to a big chunk of HBM (or whatever your chosen high-bandwidth memory is), you can only do so much.
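To make that concrete, here's a back-of-the-envelope roofline sketch in Python. The peak-throughput and bandwidth numbers are made-up round figures, not any real chip's spec: the point is just that a matmul is memory-bound whenever its arithmetic intensity (FLOPs per byte moved) falls below the chip's compute-to-bandwidth ratio, and no amount of matmul-unit cleverness changes that.

```python
# Roofline-style sketch. Hardware numbers below are illustrative
# assumptions (round figures), not any specific chip's spec sheet.

PEAK_FLOPS = 100e12   # assumed peak matmul throughput, FLOP/s
HBM_BW     = 1e12     # assumed HBM bandwidth, bytes/s
BYTES      = 2        # bytes per element (bf16)

def roofline(m, n, k):
    """Is an (m x k) @ (k x n) matmul compute- or memory-bound
    on the assumed hardware?"""
    flops = 2 * m * n * k                       # multiply-adds
    bytes_moved = BYTES * (m*k + k*n + m*n)     # read A and B, write C (ideal reuse)
    intensity = flops / bytes_moved             # FLOP per byte moved
    balance = PEAK_FLOPS / HBM_BW               # FLOP/byte the chip can sustain
    bound = "compute" if intensity > balance else "memory"
    runtime = max(flops / PEAK_FLOPS, bytes_moved / HBM_BW)
    return intensity, bound, runtime

# Skinny matmul, like a batch-1 decode step: ~1 FLOP/byte, memory-bound.
print(roofline(1, 4096, 4096))
# Big square matmul: enough reuse to be compute-bound.
print(roofline(4096, 4096, 4096))
```

The skinny case moves roughly a byte for every FLOP, so it runs at HBM speed regardless of what's doing the multiplying.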
So really, GPU vs. not-GPU (e.g. a TPU) doesn't matter a whole lot if you've got fundamentally the same memory architecture.