I think you'd be surprised by what's possible on mobile chips these days. They aren't going to run the 70B model at usable speeds, but with enough optimization it should be possible to run the 7B and 13B models on-device interactively. With quantization, those models fit in less than 8 GB of RAM.
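Back-of-the-envelope, in Python (a rough sketch that only counts the weights and ignores the KV cache and runtime overhead):

    def model_size_gb(params_billions, bits_per_weight):
        # weights only: params * bits / 8 bytes, reported in decimal GB
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    for params in (7, 13):
        for bits in (4, 8):
            print(f"{params}B at {bits}-bit: ~{model_size_gb(params, bits):.1f} GB")
    # 7B:  ~3.5 GB (4-bit), ~7.0 GB (8-bit)
    # 13B: ~6.5 GB (4-bit), so even 13B squeezes under 8 GB at 4-bit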
The rate of token output is bottlenecked by the time it takes to stream the model weights from RAM to the CPU, not by the time it takes to do the multiplications. Even on the latest and greatest phone with 8 GB (or 12 GB) of LPDDR5 on a Snapdragon 8 Gen 2, you still only have 8.5 Gbps of memory bandwidth (max; less in actual phones running it at slower speeds). That's about 1 GB/s. So if your model is a 4-bit 7B-parameter model that's roughly 4 GB in size, it'll take at least 4 seconds per token generated. That is SLOW.
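The arithmetic, as a minimal sketch (assuming the whole set of weights has to be streamed from RAM once per generated token, which is the memory-bound worst case):

    def seconds_per_token(model_size_gb, bandwidth_gb_per_s):
        # lower bound on decode latency when generation is bandwidth-bound
        return model_size_gb / bandwidth_gb_per_s

    # numbers from the comment above: ~4 GB of 4-bit weights, ~1 GB/s effective
    spt = seconds_per_token(4.0, 1.0)
    print(f"{spt:.1f} s/token, i.e. {1/spt:.2f} tokens/s")  # 4.0 s/token, 0.25 tokens/s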
It doesn't matter that the Snapdragon 8 Gen 2 has "AI" tensor cores or any of that. Memory bandwidth is the bottleneck for LLM inference. Phones have never needed HPC-like memory bandwidth and they don't have it. If Qualcomm is actually addressing this, that would be amazing, but I highly doubt it: memory bandwidth costs $$$, burns a lot of power, and takes up volume/board space that just isn't there in a phone form factor.
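To see why the compute specs don't change the picture at batch size 1: each weight is read once per token and used in roughly two ops, so the math finishes long before the weights have been streamed in. A sketch with an illustrative, assumed accelerator throughput (not a real chip spec):

    model_bytes     = 4e9        # ~4 GB of 4-bit 7B weights
    ops_per_token   = 2 * 7e9    # ~2 ops per parameter per generated token
    npu_ops_per_s   = 10e12      # hypothetical 10 TOPS-class mobile accelerator
    ram_bytes_per_s = 1e9        # the ~1 GB/s figure from the comment above

    print(f"compute time per token: {ops_per_token / npu_ops_per_s * 1e3:.1f} ms")  # ~1.4 ms
    print(f"memory time per token:  {model_bytes / ram_bytes_per_s:.1f} s")         # ~4 s

Even with a generous compute figure, the accelerator sits idle waiting on RAM.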
Do you know of a smartphone that has more than 1 GB/s of memory bandwidth? If so, I will be surprised. Otherwise, I think it's you who will be surprised at how specialized their compute is and how slow they are at many general-purpose computing tasks (like streaming data from RAM).