I think you'd be surprised by what's possible on mobile chips these days. They aren't going to run the 70B model at usable speeds, but with enough optimization it should be possible to run the 7B and 13B models on-device interactively. With quantization, those models fit in less than 8 GB of RAM.
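Back-of-the-envelope, in Python (a rough sketch that only counts the weights and ignores the KV cache and runtime overhead):

    def model_size_gb(params_billions, bits_per_weight):
        # weights only: params * bits / 8 bytes, reported in decimal GB
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    for params in (7, 13):
        for bits in (4, 8):
            print(f"{params}B at {bits}-bit: ~{model_size_gb(params, bits):.1f} GB")
    # 7B:  ~3.5 GB (4-bit), ~7.0 GB (8-bit)
    # 13B: ~6.5 GB (4-bit), so even 13B squeezes under 8 GB at 4-bit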
The rate of token output is bottlenecked by the time it takes to stream the model weights from RAM to the CPU, not by the time it takes to do the multiplications. Even on the latest and greatest phone with 8 GB (or 12 GB) of LPDDR5 on a Snapdragon 8 Gen 2, you still only have 8.5 Gbps of memory bandwidth (max; less in actual phones running it at slower speeds). That's about 1 GB/s. So if your model is a 4-bit 7B-parameter model that's roughly 4 GB in size, it'll take at least 4 seconds per token generated. That is SLOW.
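The arithmetic, as a minimal sketch (assuming the whole set of weights has to be streamed from RAM once per generated token, which is the memory-bound worst case):

    def seconds_per_token(model_size_gb, bandwidth_gb_per_s):
        # lower bound on decode latency when generation is bandwidth-bound
        return model_size_gb / bandwidth_gb_per_s

    # numbers from the comment above: ~4 GB of 4-bit weights, ~1 GB/s effective
    spt = seconds_per_token(4.0, 1.0)
    print(f"{spt:.1f} s/token, i.e. {1/spt:.2f} tokens/s")  # 4.0 s/token, 0.25 tokens/s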
It doesn't matter that the Snapdragon 8 Gen 2 has "AI" tensor cores or any of that. Memory bandwidth is the bottleneck for LLM inference. Phones have never needed HPC-like memory bandwidth and they don't have it. If Qualcomm is actually addressing this, that would be amazing, but I highly doubt it: memory bandwidth costs $$$, burns a lot of power, and takes up volume/board space that just isn't there in a phone form factor.
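To see why the compute specs don't change the picture at batch size 1: each weight is read once per token and used in roughly two ops, so the math finishes long before the weights have been streamed in. A sketch with an illustrative, assumed accelerator throughput (not a real chip spec):

    model_bytes     = 4e9        # ~4 GB of 4-bit 7B weights
    ops_per_token   = 2 * 7e9    # ~2 ops per parameter per generated token
    npu_ops_per_s   = 10e12      # hypothetical 10 TOPS-class mobile accelerator
    ram_bytes_per_s = 1e9        # the ~1 GB/s figure from the comment above

    print(f"compute time per token: {ops_per_token / npu_ops_per_s * 1e3:.1f} ms")  # ~1.4 ms
    print(f"memory time per token:  {model_bytes / ram_bytes_per_s:.1f} s")         # ~4 s

Even with a generous compute figure, the accelerator sits idle waiting on RAM.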
Do you know of a smartphone that has more than 1 GB/s of memory bandwidth? If so, I will be surprised. Otherwise, I think it's you who will be surprised at how specialized their compute is and how slow they are at many general-purpose computing tasks (like streaming data from RAM).