Considering this thing needs 400GB of VRAM for non-quantized inference, I'd say they have struck already. My bet is on smaller expert models in some sort of MoE architecture being the way forward (which is what GPT-4 is rumored to be), along with really small models trained on a massive number of tokens for a long time, used as even more specialized experts and/or for speculative decoding (where a small model drafts a sequence and the large model looks it over and corrects it where needed) — rough sketch of that last idea below.
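
Just to make the speculative decoding part concrete, here's a toy sketch. The two "models" are made-up stand-ins (not real LLM calls), and it uses simple greedy matching; a real implementation verifies the whole draft in one batched forward pass on the big model and uses probabilistic acceptance instead of exact matches, which is where the speedup comes from.

```python
import random

# Toy stand-ins for the two models: each maps a context (tuple of tokens)
# to a next token. In practice these would be a small draft LLM and the
# large target LLM; everything here is illustrative only.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def draft_model(context):
    # Cheap, fast, slightly noisy guess at the next token.
    random.seed(hash(context) % 1000)   # deterministic toy behavior
    return random.choice(VOCAB)

def target_model(context):
    # Expensive, accurate model; here just a different deterministic toy.
    random.seed(hash(context) % 997)
    return random.choice(VOCAB)

def speculative_decode(prompt, draft_len=4, max_tokens=16):
    out = list(prompt)
    while len(out) < max_tokens:
        # 1. The small model speculates a short run of tokens.
        draft, ctx = [], list(out)
        for _ in range(draft_len):
            tok = draft_model(tuple(ctx))
            draft.append(tok)
            ctx.append(tok)
        # 2. The large model checks each drafted token; keep the matching
        #    prefix, and on the first mismatch substitute its own token
        #    and stop (so every round still makes progress).
        accepted = []
        for tok in draft:
            expected = target_model(tuple(out + accepted))
            if tok == expected:
                accepted.append(tok)        # draft was right: keep it
            else:
                accepted.append(expected)   # draft was wrong: correct it
                break
        out.extend(accepted)
    return out[:max_tokens]

print(speculative_decode(["the"]))
```

The win is that the small model is cheap to run many times, while the big model only has to score the drafted tokens (all at once in practice) rather than generate every token itself.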