Unless there is a massive change in architecture, it will always be far more cost-effective to have a single cluster of GPUs running inference for many users than for each user to own hardware capable of running SOTA models that sits idle except for the ~1% of the time they are actually asking the model to do something.
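
A rough back-of-envelope calculation makes the point. The figures below (node cost, utilization rates) are illustrative assumptions, not real prices or measurements; the argument only depends on the utilization gap:

```python
# Back-of-envelope: capital efficiency of shared inference vs. per-user hardware.
# All numbers are illustrative assumptions, not real pricing.

node_cost = 30_000           # assumed cost of a SOTA-capable inference node, in dollars
local_utilization = 0.01     # dedicated hardware: busy ~1% of the time
cluster_utilization = 0.70   # shared cluster: batching many users keeps GPUs busy

# Cost per unit of *useful* compute scales inversely with utilization:
# the same capex buys ~70x more answered queries when the hardware stays busy.
local_capex_per_useful_unit = node_cost / local_utilization      # $3,000,000
cluster_capex_per_useful_unit = node_cost / cluster_utilization  # ~$42,900

print(f"dedicated : ${local_capex_per_useful_unit:,.0f} per useful-compute unit")
print(f"shared    : ${cluster_capex_per_useful_unit:,.0f} per useful-compute unit")
print(f"advantage : ~{local_capex_per_useful_unit / cluster_capex_per_useful_unit:.0f}x")
```

Under these assumptions the shared cluster comes out roughly 70x cheaper per useful unit of compute, and that ratio holds regardless of the absolute hardware price.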