Maybe on a particular model/dataset, but extremely unlikely in general. Again, as another commenter pointed out: if you truly believe it isn't that hard, we would love to hire you at Meta ;)
Yes, some operations are not supported on MPS/TPU and fall back to the slower CPU. But for common architectures like transformers and convnets, it works very well regardless of the dataset.
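To make the fallback point concrete, here is a rough sketch of how device selection usually looks in PyTorch. It assumes PyTorch may or may not be installed, and `pick_device` is a hypothetical helper name, not part of any library. TPU is left out because it goes through the separate torch_xla package.

```python
import os

# Ask PyTorch to run ops that the MPS backend lacks on the CPU instead of
# raising an error. Set before importing torch, to be safe.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device() -> str:
    """Hypothetical helper: return the best available device name."""
    try:
        import torch
    except ImportError:
        # No PyTorch in this environment, so CPU is the only option.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no MPS backend, hence the getattr guard.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The model code itself stays the same across backends. Only the device string changes, and unsupported ops quietly run slower on the CPU.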
I never claimed it was easy. I meant that, in my opinion, it is on the order of tens of millions of dollars of investment, not the trillion-dollar CUDA moat that people here describe.