Performance of llama.cpp on Apple Silicon A-series (github.com/ggerganov)
100 points by mobilio on Dec 19, 2023 | 41 comments



I've been playing around a lot with llama.cpp recently, and it's making me re-think my predictions for the future...

Given how big these models are (and the steep cost of the GPUs to load them), I had been thinking that most people would interact with them via some hosted API (like what OpenAI is offering) or via some product like Bard or Copilot that offloads inference to a big cloud datacenter.

But given how well some of these models perform on the CPU when quantized down to 4, 6, or 8 bits, I'm starting to think that there will be quite a few interesting applications for fully local inference on relatively modest hardware
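
For a rough sense of why that quantization matters so much, some back-of-envelope math (weights only; the KV cache and runtime overhead come on top):

  # Rough weight memory for a 7B-parameter model at different precisions.
  params = 7e9
  for name, bits in [("fp16", 16), ("8-bit", 8), ("6-bit", 6), ("4-bit", 4)]:
      print(f"{name}: ~{params * bits / 8 / 1e9:.1f} GB")
  # fp16: ~14.0 GB, 8-bit: ~7.0 GB, 6-bit: ~5.2 GB, 4-bit: ~3.5 GB

At 4 bits a 7B model's weights fit comfortably in the RAM of a mid-range laptop, which is exactly what makes the fully local case interesting.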


I think in 3-5 years we will have way more competition in the GPU space and maybe even affordable ASICs on the market. Alternatively we may see CPUs gain AI oriented vector acceleration. Either way I expect it to be cheaper in the near future to run large models locally.

That being said your less technical users will probably still use them remotely, and the very largest models will also probably be hosted just because of the capital cost. It doesn’t make sense to spring for a super high end rig to run a massive model locally unless you are using it very very heavily or are a serious homelab enthusiast willing to shell out some bucks.

Of course there are wildcards here like breakthroughs in model efficiency, compression, or distributed execution and training. Any of that could change the game. I get the sense that we really don’t know how far we are from optimal efficiency right now. There may be far more efficient architectures or ways of compressing these things.


> maybe even affordable ASICs on the market

Part of me envisions a Coral-like device, but the kicker/difficulty/cost problem is always going to be fast, big memory. I think that puts a bit of a price floor on some of it. Can't wait to be wrong though!


The model running well is only one constraint for consumer adoption. The others, namely RAM, making the CPU run super warm, and rapidly draining the battery, are much harder to solve.


I'm somewhat bullish on the Edge Computing approach to this. For example, how about a co-op neighborhood LLM rack?


I currently use a TrueNAS as my home server, and I'm sure whatever I replace it with will add HW for local inference and act as my Home Brain for various automations and assistants.


Why would you do that, and when else has it worked?


  > how about a co-op neighborhood LLM rack?
It'll work about as well as those "wire your neighborhood for Internet as a collective" movements of the 1990s.

In fact, you'll need those to deal with latency issues, if my experience with consumer ISP quality is any indication.


Funny enough, I know two separate groups of people doing that. One in an apartment in SF running a shared Wi-Fi network with VLANs so they don't step on each other's frequencies, and another group sharing WISP infrastructure in semi-rural Utah where they can't convince ISPs to lay fiber. (Although I think the second group now uses that as a backup for Starlink.)


Yes, UBNT NanoBeam 5AC wireless[1] is dirt cheap. You can get a pair for about $199 for a wireless P2P link that works over several kilometers. I have deployed a number of these myself, along with more high-bandwidth airFiber gear.

1. https://store.ui.com/us/en/pro/category/all-wireless/product...


How is “Edge Computing” any different from what the GP was describing with current cloud hosting (e.g. OpenAI)?


There are different possible approaches here, but for me, the main benefit would be having control over the model(s) used. Other than that, low latency would be crucial for effective voice interaction.


I would much rather send my personal information to Microsoft and the NSA than people in my neighborhood


There seems to be a lot of potential for efficiency gains in models, where a much smaller model can achieve the same results. On the other hand, computing hardware keeps getting faster, and GPU size/performance has been growing exponentially for a good long while.

Local inference is quite inevitable.


what sort of applications are you thinking? live translations?


There won't, because there's no money in it


There is for whichever company wants to make use of it - say, for example, LLM-powered NPCs in singleplayer video games (which I'm working on) - which is pretty much economically unviable with the current model if you want to run everything in the cloud.

You're selling a game once rather than selling a subscription, so you're locked in to paying for cloud-based LLM services indefinitely, unless you want to shut the game down after X years, which really annoys people. Also, you're incentivised to be stingy with the LLM-based features because the more you use it the more it costs you, but by running it locally you can offload that cost to the consumer.


There’s indeed a lot of GPU time left unutilised even by modern games. I can definitely imagine new games utilising LLMs much more, even with a local inference option.


Would love to learn more and help out on NPC LLMs. What sort of performance hit are you seeing?


Sure, here's my writeup about it: https://jgibbs.dev/blogs/local-llm-npcs-in-unreal-engine. There's barely any performance hit, only a slight hitch when it starts generating a response.


At some point soon, LLMs matching current SOTA will run locally in a browser. If applications for LLMs emerge, having one available will be table stakes.


What's interesting is how there's so much emphasis on high end video cards which are prohibitively expensive for most people, yet many of the newer models, when quantized, run perfectly well on CPUs. Instead of chasing speed with money, seeing what can run decently on available hardware will end up having a much bigger potential impact on a greater number of people.

As an experiment, I've been running llama.cpp on an old 2012 AMD Bulldozer system, which most people consider to be AMD's equivalent of Intel's Pentium 4, with 64 gigs of memory, and with newer models it's surprisingly usable, if not entirely practical. It's much more usable, in my opinion, than spending energy trying to get everything to fit into more modest GPUs' smaller amounts of VRAM.

It certainly shows that people shouldn't be dissuaded from playing around just because they have an older GPU and/or a GPU without much VRAM.
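
If anyone wants to try the same experiment on whatever CPU they have lying around, here's a minimal sketch using the llama-cpp-python bindings (the model filename is just an example; any quantized GGUF that fits in RAM works):

  from llama_cpp import Llama

  # CPU-only inference with a 4-bit quantized GGUF model; n_threads is the
  # knob that matters most on an old many-core box.
  llm = Llama(
      model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # example filename
      n_ctx=2048,
      n_threads=8,
  )
  out = llm("Q: Name the planets in the solar system. A:", max_tokens=128)
  print(out["choices"][0]["text"])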


What’s the definition of “prompt processing” vs “token generation”?

Is that separately comparing the time it takes to preprocess the input prompt (prompt_length / pp_token_rate = time_to_first_token) and then the token generation rate is the time for each successive token?

I also see something about bs batch size. Is batching relevant for a locally run model? (Usually you only have one prompt at a time, right?)


Regarding prompt processing and token generation, you are correct.

It makes sense to benchmark them independently, since prompt processing is done in parallel over the prompt tokens and is compute bound, while token generation is sequential and bound by memory bandwidth.
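
To make that concrete, a rough latency estimate from those two rates (the numbers below are made up for illustration, not taken from the linked benchmarks):

  # pp = prompt processing rate (tokens/s, batched, compute bound)
  # tg = token generation rate (tokens/s, one at a time, memory-bandwidth bound)
  pp, tg = 100.0, 10.0                       # illustrative rates only
  prompt_tokens, new_tokens = 500, 200

  time_to_first_token = prompt_tokens / pp   # 5.0 s
  generation_time = new_tokens / tg          # 20.0 s
  print(f"first token after {time_to_first_token:.1f}s, "
        f"done after {time_to_first_token + generation_time:.1f}s")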


It says at the top that prompts are processed with a batch size of 512, while new tokens can only be generated one at a time with batch size of 1.

Prompt processing doesn't actually need to compute logits for each new token, just cache the KV values and so is much faster than actual inference.


Yeah, prompt processing is evaluating the input, which includes the system prompt, chat history and your chat text.

Generation is the response it sends back


Llama.cpp and other “inference at the edge” tools are really amazing pieces of engineering.


I agree, of course, though I also think there is/was a lot of "low hanging fruit" in inference. PyTorch is really a model training framework, and PyTorch/Python have not traditionally been well suited to fast inference, e.g. lacking quantization support as well as optimized inference code. That's changing; see for example this recent post: https://pytorch.org/blog/accelerating-generative-ai-2/

But "AI" is still very research focused (training) and not application focused (inference). Llama.cpp et al are filling in this gap.


I honestly would love to focus full time on making high-performance tools for ML / applied math, but it's a hard area to break into, or at least I’ve been going about it wrong.


Love that. I’ve been using Mistral 7B on my M1 and I thought it was tolerable, but it turned out I wasn’t utilizing Metal, and now it’s amazing.

8x7B nowadays

As long as Metal is used on an iPhone, I could see it working well there too. I use 5-bit quantization on my laptop, but 4-bit seems very practical.


What did you do to enable Metal?


I use LM Studio and 0.2.9 now has a checkmark for it
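
For anyone driving llama.cpp from code instead of LM Studio, Metal offload is just a parameter; here's a sketch with the llama-cpp-python bindings (model filename made up):

  from llama_cpp import Llama

  # n_gpu_layers=-1 asks llama.cpp to offload all layers to the GPU
  # (Metal on Apple Silicon); 0 keeps everything on the CPU.
  llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf", n_gpu_layers=-1)
  print(llm("The capital of France is", max_tokens=8)["choices"][0]["text"])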


How much RAM do you have? Ollama comes with a note: "this model requires 48GB of RAM".


How do you use 8x7B?


Running it with LM Studio. No command line, and it has a Hugging Face browser; very user friendly.


Use ollama [1]

$ ollama run mixtral

[1] https://github.com/jmorganca/ollama


Could you please ELI5 how Metal works?


Metal is Apple's API for running graphics and general-purpose computation on the GPU built into Apple devices; llama.cpp uses it to offload inference to that GPU.

Apple’s chip architecture puts that GPU and a variety of co-processors alongside the CPU on unified memory, which makes it remarkably well suited to transformers, in a convenient laptop form factor with quiet heat dissipation and a low energy footprint.


Apple's stinginess with RAM in phones may come back to bite them on LLMs.


I would think the opposite would happen, as they'll be very excited by people having to buy entirely new phones just to use LLMs ;P.


Testing the performance of an LLM without testing the quality isn’t really practical in the real world; if it’s fast but the output is gibberish, the speed won’t matter.

There should be 10-20 inputs and outputs tested for correctness, or something like that, in addition to t/s as a reference.





