Great? The community already did it with llama.cpp. Knowing the memory bandwidth bottleneck, I can't imagine phones are going to do very well. But hey, llamas (1 and 2) run on rpi4, so it'll work. Just really, unusably slow.
I think you'd be surprised by what's possible on mobile chips these days. They aren't going to be running the 70B model at usable speeds, but I think with enough optimization it should be possible to run the 7B and 13B models on-device interactively. With quantization you can fit those models in less than 8GB of RAM.
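Back-of-envelope sketch of that RAM claim (my numbers, not the parent's): a weights-only 4-bit model is roughly parameters × 0.5 bytes. Real quantization formats add per-block scale factors, and you still need room for the KV cache, so actual usage runs somewhat higher, but both models land comfortably under 8GB.

    /* Rough sketch: weights-only footprint at 4 bits per parameter.   */
    /* Ignores per-block quantization scales and the KV cache, which   */
    /* add overhead on top of this.                                    */
    #include <stdio.h>

    int main(void) {
        const double params[] = {7e9, 13e9};   /* 7B and 13B models */
        const double bits_per_weight = 4.0;    /* assumed 4-bit quantization */

        for (int i = 0; i < 2; i++) {
            double gib = params[i] * bits_per_weight / 8.0 / (1024.0 * 1024.0 * 1024.0);
            printf("%2.0fB params @ 4-bit: ~%.1f GiB of weights\n", params[i] / 1e9, gib);
        }
        return 0;
    }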
The rate of token output is bottlenecked by the time it takes to transfer the model between RAM and CPU, not the time it takes to do the multiplication operations. If you have the latest and greatest mobile phone with 8GB (or 12GB) of LPDDR5 on a Snapdragon 8 Gen 2, you still only have 8.5 Gbps memory bandwidth (max, and less in actual phones running it at slower speeds). That's roughly 1 GB/s. So if your model is a 4-bit 7B-parameter model that's ~4GB in size, it'll take at least 4 seconds per token generated. That is SLOW.
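For what it's worth, here's that estimate written out as a sketch using the parent's own figures (4GB of weights, 1 GB/s of effective bandwidth): if every generated token has to stream the full set of weights past the CPU once, time per token is just model size divided by bandwidth. Plug in your own device's numbers, since the bandwidth figure is the contested part.

    /* Sketch of the parent's estimate: token latency when generation is */
    /* purely memory-bandwidth bound. Both constants are assumptions     */
    /* taken from the comment above, not measured values.                */
    #include <stdio.h>

    int main(void) {
        const double model_bytes   = 4.0e9;  /* 4GB: 4-bit 7B model, per the comment */
        const double bandwidth_Bps = 1.0e9;  /* 1 GB/s effective memory bandwidth    */

        double s_per_token = model_bytes / bandwidth_Bps;
        printf("~%.1f s/token (~%.2f tokens/s)\n", s_per_token, 1.0 / s_per_token);
        return 0;
    }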
It doesn't matter that the Snapdragon 8 Gen 2 has "AI" tensor cores or any of that. Memory bandwidth is the bottleneck for LLM inference. Phones have never needed HPC-like memory bandwidth and they don't have it. If Qualcomm is actually addressing this issue, that'd be amazing, but I highly doubt it. Memory bandwidth costs $$$, burns massive power, and needs volume/space that isn't available in the form factor.
Do you know of a smartphone that has more than 1GB/s of memory bandwidth? If so, I will be surprised. Otherwise I think it is you who will be surprised at how specialized their compute is and how slow they are at many general-purpose computing tasks (like transferring data from RAM).
People are unreasonably attracted to things that are "minimal"; at least 3 different local LLM codebase communities will tell you _they_ are the minimal solution.[1]
It's genuinely helpful to have a static target for technical understanding. Other projects end up with a lot of rushed Python defining the borders in a primordial ecosystem with too many people too early.
[1] Lifecycle:
A lone hacker wants to gain understanding of the complicated world of LLMs. They implement some suboptimal, but code-golfed, C code over a weekend. They attract a small working group and public interest.
Once the working group is outputting tokens, it sees an optimization.
This is landed.
It is applauded.
People discuss how this shows the open source community is where innovation happens. Isn't it unbelievable the closed source people didn't see this?[2]
Repeat N times.
Y steps into this loop, a new base model is released.
The project adds support for it.
However, it reeks of the "old" ways. There are even CLI arguments for the old thing from 3 weeks ago.
A small working group, frustrated, starts building a new, more minimal solution...
[2] The closed source people did. You have their model, not their inference code.
Optimization for this workload has arguably been in progress for decades. Modern AVX instructions can be found in laptops that are a decade old now, and most big inferencing projects are built around SIMD or GPU shaders. Unless your computer ships with onboard Nvidia hardware, there's usually not much difference in inferencing performance.
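To make that concrete, this is roughly the kind of SIMD kernel those projects lean on: a minimal AVX2/FMA dot product (a sketch, not llama.cpp's actual code), since the matrix-vector products in the decode loop reduce to dot products like this.

    /* Sketch, not llama.cpp's actual kernel: an AVX2 + FMA dot product, */
    /* the inner loop behind CPU matrix-vector multiplies.               */
    /* Build with: gcc -O2 -mavx2 -mfma dot.c                            */
    #include <immintrin.h>
    #include <stdio.h>

    static float dot_avx2(const float *a, const float *b, int n) {
        __m256 acc = _mm256_setzero_ps();
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_fmadd_ps(va, vb, acc);   /* acc += va * vb, 8 floats at a time */
        }
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);             /* horizontal sum of the 8 lanes */
        float sum = 0.0f;
        for (int k = 0; k < 8; k++) sum += lanes[k];
        for (; i < n; i++) sum += a[i] * b[i];    /* scalar tail */
        return sum;
    }

    int main(void) {
        float a[16], b[16];
        for (int i = 0; i < 16; i++) { a[i] = 1.0f; b[i] = (float)i; }
        printf("dot = %.1f\n", dot_avx2(a, b, 16));  /* 0 + 1 + ... + 15 = 120 */
        return 0;
    }

The NEON version on ARM looks almost the same, which is part of why this work ports so readily to phones.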
Pretty much all of Qualcomm's SoCs are built using stock ARM core designs. ARMnn is optimized for multicore A-series chips, which covers everything from the Snapdragon 410 to the 888 (roughly 2014 to the present day).
Even on a platform where they are fast, I personally haven't found a solid real-world use case for anything other than a GPT-4-quality LLM. Am I missing something?
Non-commercial entertainment. Which makes this move by Qualcomm all the weirder. I agree: the llamas and all the other foundational models and all of their fine-tunes are not really useful for helping with real tasks that have a wrong answer.