I've been playing around a lot with llama.cpp recently, and it's making me re-think my predictions for the future...
Given how big these models are (and the steep cost of the GPUs needed to load them), I had been thinking that most people would interact with them via some hosted API (like what OpenAI is offering) or via some product like Bard or Copilot which offloads inference to some big cloud datacenter.
But given how well some of these models perform on the CPU when quantized down to 4, 6, or 8 bits, I'm starting to think that there will be quite a few interesting applications for fully local inference on relatively modest hardware.
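For anyone who wants to try it, here's a minimal sketch using the llama-cpp-python bindings (the model path is just a placeholder; point it at whatever quantized GGUF file you have):

    from llama_cpp import Llama

    # Load a 4-bit quantized model entirely on the CPU.
    # The filename below is a placeholder, not a specific recommendation.
    llm = Llama(
        model_path="./models/llama-2-7b.Q4_K_M.gguf",
        n_ctx=2048,     # context window
        n_threads=8,    # CPU threads to use
    )

    out = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
    print(out["choices"][0]["text"])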
I think in 3-5 years we will have way more competition in the GPU space and maybe even affordable ASICs on the market. Alternatively, we may see CPUs gain AI-oriented vector acceleration. Either way, I expect it to be cheaper in the near future to run large models locally.
That being said, your less technical users will probably still use them remotely, and the very largest models will also probably be hosted just because of the capital cost. It doesn't make sense to spring for a super high-end rig to run a massive model locally unless you are using it very, very heavily or are a serious homelab enthusiast willing to shell out some bucks.
Of course there are wildcards here like breakthroughs in model efficiency, compression, or distributed execution and training. Any of that could change the game. I get the sense that we really don’t know how far we are from optimal efficiency right now. There may be far more efficient architectures or ways of compressing these things.
Part of me envisions a Coral-like device, but the kicker/difficulty/cost problem is always going to be fast, big memory. I think that puts a bit of a price floor on some of it. Can't wait to be wrong though!
The model running well is only one constraint for consumer adoption. The others, namely RAM usage, making the CPU run super warm, and rapidly draining the battery, are much harder to solve.
I currently use TrueNAS as my home server, and I'm sure whatever I replace it with will add hardware for local inference and act as my Home Brain for various automations and assistants.
Funny enough, I know two separate groups of people doing that. One in an apartment in SF running a shared Wi-Fi network with VLANs so they don't step on each other's frequencies, and another group sharing WISP infrastructure in semi-rural Utah where they can't convince ISPs to lay fiber. (Although I think the second group now uses that as a backup for Starlink.)
Yes, the UBNT NanoBeam 5AC wireless[1] is pretty dirt cheap. You can get a pair for about $199 for a wireless P2P link that can work over several kilometers. I have deployed a number of these myself, along with more high-bandwidth airFiber gear.
There are different possible approaches here, but for me, the main benefit would be having control over the model(s) used. Other than that, low latency would be crucial for effective voice interaction.
There seems to be a lot of potential for efficiency gains in models, where a much smaller model can achieve the same results. On the other end, computing hardware always gets faster, and GPU size/performance has been growing exponentially for a good long while.
There is for whichever company wants to make use of it. Say, for example, LLM-powered NPCs in singleplayer video games (which I'm working on): that's pretty much economically unviable with the current model if you want to run everything in the cloud.
You're selling a game once rather than selling a subscription, so you're locked into paying for cloud-based LLM services indefinitely, unless you want to shut the game down after X years, which really annoys people. Also, you're incentivised to be stingy with the LLM-based features, because the more you use them the more it costs you, whereas by running the model locally you can offload that cost to the consumer.
There's indeed a lot of GPU time left unutilised even by modern games. I can definitely imagine new games making much heavier use of LLMs, with a local inference option.
At some point soon, LLMs matching current SOTA will run locally in a browser. If applications for LLMs emerge, having one available will be table stakes.
What's interesting is how there's so much emphasis on high-end video cards, which are prohibitively expensive for most people, yet many of the newer models, when quantized, run perfectly well on CPUs. Instead of chasing speed with money, seeing what can run decently on available hardware will end up having a much bigger potential impact on a greater number of people.
As an experiment, I've been running llama.cpp on an old 2012 AMD Bulldozer system, which most people consider to be AMD's equivalent of Intel's Pentium 4, with 64 gigs of memory, and with newer models it's surprisingly usable, if not entirely practical. It's much more useful, in my opinion, than spending energy trying to get everything to fit into the smaller amounts of VRAM on more modest GPUs.
It certainly shows that people shouldn't be dissuaded from playing around just because they have an older GPU and/or a GPU without much VRAM.
What’s the definition of “prompt processing” vs “token generation”?
Is that separately comparing the time it takes to preprocess the input prompt (prompt_length / pp_token_rate = time_to_first_token) and then the token generation rate, i.e. the time for each successive token?
I also see something about a batch size (bs). Is batching relevant for a locally run model? (Usually you only have one prompt at a time, right?)
Regarding prompt processing and token generation, you are correct.
It makes sense to benchmark them independently, since prompt processing is done in parallel for each token and is compute bound, while token generation is sequential and bound by memory bandwidth.
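A rough back-of-envelope sketch of why that matters (the numbers below are assumptions for a typical desktop, not measurements):

    # Token generation has to stream (roughly) all the weights from RAM for every
    # new token, so it is limited by memory bandwidth.
    model_bytes = 7e9 * 0.5      # ~7B params at 4-bit quantization, about 3.5 GB
    mem_bandwidth = 50e9         # ~50 GB/s, plausible dual-channel DDR4 figure
    print(f"generation ~ {mem_bandwidth / model_bytes:.0f} tokens/s")

    # Prompt processing pushes all prompt tokens through the weights in batched
    # matrix multiplies, so it is compute bound and much faster per token.
    prompt_length = 512
    pp_token_rate = 100          # assumed prompt tokens/s on the same machine
    print(f"time to first token ~ {prompt_length / pp_token_rate:.1f} s")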
I agree, of course, though I also think there is/was a lot of "low hanging fruit" in inference. PyTorch is really a model training framework, and PyTorch/Python have not traditionally been well suited to fast inference, e.g. lacking quantization support as well as optimized inference code. That's changing, for example this recent post: https://pytorch.org/blog/accelerating-generative-ai-2/
But "AI" is still very research focused (training) and not application focused (inference). Llama.cpp et al are filling in this gap.
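To give a flavour of the kind of inference-side tooling I mean, here's a toy sketch (my own illustration, not the blog post's code) using PyTorch 2.x:

    import torch
    import torch.nn as nn

    # Toy stand-in for a model; the real wins show up on large transformers.
    model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).eval()

    compiled = torch.compile(model)   # fuse ops / cut Python overhead

    with torch.inference_mode():      # skip autograd bookkeeping during inference
        x = torch.randn(1, 4096)
        y = compiled(x)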
I honestly would love to focus full-time on making high-performance tools for ML / applied math, but it's a hard area to break into, or at least I've been going about it wrong.
It's just a co-processor included in Apple devices that has the relevant components of a GPU and can accelerate graphics and general computation tasks.
Apple's chip architecture has a variety of co-processors alongside the CPU that make it very well suited for transformers, in a convenient laptop form factor with fanless heat dissipation and a low energy footprint.