Great? The community already did it with llama.cpp. Knowing the memory bandwidth bottleneck, I can't imagine phones are going to do very well. But hey, llamas (1 and 2) run on rpi4, so it'll work. Just really, unusably slow.
I think you'd be surprised by what's possible on mobile chips these days. They aren't going to be running the 70B model at usable speeds, but I think with enough optimization it should be possible to run the 7B and 13B models on-device interactively. With quantization you can fit those models in less than 8GB of RAM.
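Back-of-envelope sketch of that RAM claim (my numbers, not the parent's): a weights-only 4-bit model is roughly parameters × 0.5 bytes. Real quantization formats add per-block scale factors, and you still need room for the KV cache, so actual usage runs somewhat higher, but both models land comfortably under 8GB.

    /* Rough sketch: weights-only footprint at 4 bits per parameter.   */
    /* Ignores per-block quantization scales and the KV cache, which   */
    /* add overhead on top of this.                                    */
    #include <stdio.h>

    int main(void) {
        const double params[] = {7e9, 13e9};   /* 7B and 13B models */
        const double bits_per_weight = 4.0;    /* assumed 4-bit quantization */

        for (int i = 0; i < 2; i++) {
            double gib = params[i] * bits_per_weight / 8.0 / (1024.0 * 1024.0 * 1024.0);
            printf("%2.0fB params @ 4-bit: ~%.1f GiB of weights\n", params[i] / 1e9, gib);
        }
        return 0;
    }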
The rate of token output is bottlenecked by the time it takes to transfer the model between RAM and CPU, not the time it takes to do the multiplication operations. If you have the latest and greatest mobile phone with 8GB (or 12GB) of LPDDR5 on a Snapdragon 8 Gen 2, you still only have 8.5 Gbps memory bandwidth (max, and less in actual phones running it at slower speeds). That's roughly 1 GB/s. So if your model is a 4-bit 7B-parameter model that's ~4GB in size, it'll take at least 4 seconds per token generated. That is SLOW.
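For what it's worth, here's that estimate written out as a sketch using the parent's own figures (4GB of weights, 1 GB/s of effective bandwidth): if every generated token has to stream the full set of weights past the CPU once, time per token is just model size divided by bandwidth. Plug in your own device's numbers, since the bandwidth figure is the contested part.

    /* Sketch of the parent's estimate: token latency when generation is */
    /* purely memory-bandwidth bound. Both constants are assumptions     */
    /* taken from the comment above, not measured values.                */
    #include <stdio.h>

    int main(void) {
        const double model_bytes   = 4.0e9;  /* 4GB: 4-bit 7B model, per the comment */
        const double bandwidth_Bps = 1.0e9;  /* 1 GB/s effective memory bandwidth    */

        double s_per_token = model_bytes / bandwidth_Bps;
        printf("~%.1f s/token (~%.2f tokens/s)\n", s_per_token, 1.0 / s_per_token);
        return 0;
    }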
It doesn't matter that the Snapdragon 8 Gen 2 has "AI" tensor cores or any of that. Memory bandwidth is the bottleneck for LLM inference. Phones have never needed HPC-like memory bandwidth and they don't have it. If Qualcomm is actually addressing this issue, that'd be amazing, but I highly doubt it. Memory bandwidth costs $$$, burns massive power, and needs volume/space that isn't available in the form factor.
Do you know of a smartphone that has more than 1GB/s of memory bandwidth? If so, I will be surprised. Otherwise I think it is you who will be surprised at how specialized their compute is and how slow they are at many general-purpose computing tasks (like transferring data from RAM).
People are unreasonably attracted to things that are "minimal"; at least 3 different local LLM codebase communities will tell you _they_ are the minimal solution.[1]
It's genuinely helpful to have a static target for technical understanding. Other projects end up with a lot of rushed Python defining the borders in a primordial ecosystem with too many people too early.
[1] Lifecycle:
A lone hacker wants to gain understanding of the complicated world of LLMs. They implement some suboptimal, but code-golfed, C code over a weekend. They attract a small working group and public interest.
Once the working group is outputting tokens, it sees an optimization.
This is landed.
It is applauded.
People discuss how this shows the open source community is where innovation happens. Isn't it unbelievable the closed source people didn't see this?[2]
Repeat N times.
Y steps into this loop, a new base model is released.
The project adds support for it.
However, it reeks of the "old" ways. There are even CLI arguments for the old thing from 3 weeks ago.
A small working group, frustrated, starts building a new, more minimal solution...
[2] The closed source people did. You have their model, not their inference code.
Optimization for this workload has arguably been in progress for decades. Modern AVX instructions can be found in laptops that are a decade old now, and most big inferencing projects are built around SIMD or GPU shaders. Unless your computer ships with onboard Nvidia hardware, there's usually not much difference in inferencing performance.
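To make that concrete, this is roughly the kind of SIMD kernel those projects lean on: a minimal AVX2/FMA dot product (a sketch, not llama.cpp's actual code), since the matrix-vector products in the decode loop reduce to dot products like this.

    /* Sketch, not llama.cpp's actual kernel: an AVX2 + FMA dot product, */
    /* the inner loop behind CPU matrix-vector multiplies.               */
    /* Build with: gcc -O2 -mavx2 -mfma dot.c                            */
    #include <immintrin.h>
    #include <stdio.h>

    static float dot_avx2(const float *a, const float *b, int n) {
        __m256 acc = _mm256_setzero_ps();
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_fmadd_ps(va, vb, acc);   /* acc += va * vb, 8 floats at a time */
        }
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);             /* horizontal sum of the 8 lanes */
        float sum = 0.0f;
        for (int k = 0; k < 8; k++) sum += lanes[k];
        for (; i < n; i++) sum += a[i] * b[i];    /* scalar tail */
        return sum;
    }

    int main(void) {
        float a[16], b[16];
        for (int i = 0; i < 16; i++) { a[i] = 1.0f; b[i] = (float)i; }
        printf("dot = %.1f\n", dot_avx2(a, b, 16));  /* 0 + 1 + ... + 15 = 120 */
        return 0;
    }

The NEON version on ARM looks almost the same, which is part of why this work ports so readily to phones.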
Pretty much all of Qualcomm's SoCs are built using stock ARM core designs. ARMnn is optimized for multicore A-series chips, which covers everything from the Snapdragon 410 to the 888 (roughly 2014 to the present day).
Even on a platform where they are fast, I personally haven't found a solid real-world use case for anything other than a GPT-4-quality LLM. Am I missing something?
Non-commercial entertainment. Which makes this move by Qualcomm all the weirder. I agree: the llamas and all the other foundational models and all of their fine-tunes are not really useful for helping with real tasks that have a wrong answer.