I can run a certain 120B model on my M3 Max with 128GB of memory. However, I found that while the Q5 quant "fits", it was extremely slow. The story was different with Q4, though, which ran just fine at around 3.5-4 t/s.
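For anyone wondering why there's such a cliff between quants, the weight-only footprints are easy to ballpark. A quick sketch below; the bits-per-weight figures are rough approximations for llama.cpp-style K-quants, and this ignores KV cache and runtime overhead.

    # Rough weight-only footprint of a ~120B-param model at common quants.
    # Bits-per-weight values are approximate for llama.cpp-style K-quants.
    params = 120e9
    for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.5), ("Q8_0", 8.5)]:
        gb = params * bpw / 8 / 1e9
        print(f"{name}: ~{gb:.0f} GB of weights (KV cache and overhead not included)")
    # On a 128GB Mac, macOS only lets the GPU wire a portion of unified memory
    # by default, so a quant that nominally "fits" in RAM can still spill past
    # the usable GPU budget once the context grows.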
Now, this model is ~134B, right? It could be dog slow, but on the other hand it's a MoE, so there's a chance the results could be satisfactory.
So that would be runnable on an MBP with an M2 Max, but the context window must be quite small. I don't really find anything under about 4096 tokens that useful.
That's a tricky number. Does it run on an 80GB GPU, does it auto-shave some parameters to fit in 79.99GB like any artificially "intelligent" piece of code would do, or does it give up like an unintelligent piece of code?
Are you asking if the framework automatically quantizes/prunes the model on the fly?
Or are you suggesting the LLM itself should realize it's too big to run, and prune/quantize itself? Your references to "intelligent" almost lead me to the conclusion that you think the LLM should prune itself. Not only is this a chicken-and-egg problem, but LLMs are statistical models; they aren't inherently self-bootstrapping.
I realize that, but I do think it's doable to bootstrap it on a cluster and have it teach itself to self-prune, and I'm surprised nobody is actively working on this.
I hate software that complains (about dependencies, resources) when you try to run it, and I think this should be one of the first use cases for LLMs: getting to L5 autonomous software installation and execution.
The LLM itself should realize it's too big and only put the important parts on the GPU. If you're asking questions about literature, there's no need to have all the params on the GPU; just tell it to put only the ones for literature on there.
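To be fair, the mechanical half of that already exists in crude forms: with a MoE you can keep the hot experts on the GPU and leave the rest in CPU RAM. A toy PyTorch sketch of the idea (the usage statistics and class names here are made up for illustration):

    import torch
    import torch.nn as nn

    class OffloadedExperts(nn.Module):
        """Toy sketch: pin the most-used experts on the GPU, leave the rest on CPU."""
        def __init__(self, n_experts=8, d_model=256, gpu_budget=2):
            super().__init__()
            device = "cuda" if torch.cuda.is_available() else "cpu"
            self.experts = nn.ModuleList(
                nn.Linear(d_model, d_model) for _ in range(n_experts)
            )
            # Hypothetical routing counts, e.g. measured on a "literature" workload.
            usage = torch.rand(n_experts)
            hot = set(usage.topk(gpu_budget).indices.tolist())
            for i, expert in enumerate(self.experts):
                expert.to(device if i in hot else "cpu")

        def forward(self, x, expert_idx):
            expert = self.experts[expert_idx]
            dev = next(expert.parameters()).device
            # Move the (small) activation to wherever the (large) expert lives.
            return expert(x.to(dev)).to(x.device)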
The 9x speedup is a bit inflated... it's measured at a reference point of ~8k GPUs, on a workload that the A100 cluster is particularly bad at.
When measured at smaller, more realistic GPU counts, the speedup is somewhere between 3.5x and 6x. See the GTC keynote video at 38:50: https://youtu.be/39ubNuxnrK8?t=2330
Based on hardware specs alone, I think that training transformers with FP8 on H100 systems vs. FP16 on A100 systems should only be 3-4x faster. Definitely looking forward to external benchmarks over the coming months...
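For reference, quick arithmetic on the published peak tensor-core numbers (spec-sheet figures, rounded; which pair you compare determines whether you land nearer 3x or 6x, and real runs come in below peak anyway):

    # Peak tensor-core throughput from the public spec sheets, in TFLOPS (rounded).
    a100_fp16_dense, a100_fp16_sparse = 312, 624
    h100_fp8_dense, h100_fp8_sparse = 2000, 4000   # SXM figures
    print(f"H100 FP8 dense vs A100 FP16 dense:  {h100_fp8_dense / a100_fp16_dense:.1f}x")
    print(f"H100 FP8 dense vs A100 FP16 sparse: {h100_fp8_dense / a100_fp16_sparse:.1f}x")
    # Realized training speedups sit below the peak ratio once memory bandwidth,
    # communication, and pipeline overheads are accounted for.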
At inference time it will be possible to do 4000 TFLOPS using sparse FP8 :)
But keep in mind the model won't fit on a single H100 (80GB), because at 175B params it's ~90GB even with sparse FP8 weights, and then more is needed for live activation memory. So you'll still want at least 2+ H100s to run inference, and more realistically you would rent an 8xH100 cloud instance.
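The back-of-the-envelope version of that memory math, assuming GPT-3-scale 175B params:

    # Rough memory footprint for 175B parameters (weights only).
    params = 175e9
    fp16_gb = params * 2 / 1e9        # 2 bytes per weight  -> ~350 GB
    fp8_gb = params * 1 / 1e9         # 1 byte per weight   -> ~175 GB
    fp8_sparse_gb = fp8_gb * 0.5      # 2:4 sparsity keeps half the values -> ~88 GB
    print(f"FP16: ~{fp16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB, "
          f"FP8 + 2:4 sparse: ~{fp8_sparse_gb:.0f} GB + sparsity metadata")
    # Even the sparse-FP8 case is over a single 80GB card before you add
    # KV cache / activation memory, hence the 2+ GPU minimum.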
But yeah the latency will be insanely fast given how massive these models are!
1) NVIDIA will likely release a variant of H100 with 2x memory, so we may not even have to wait a generation. They did this for V100-16GB/32GB and A100-40GB/80GB.
2) In a generation or two, the SOTA model architecture will change, so it will be hard to predict the memory reqs... even today, for a fixed train+inference budget, it is much better to train Mixture-Of-Experts (MoE) models, and even NVIDIA advertises MoE models on their H100 page.
MoEs are more efficient in compute, but occupy a lot more memory at runtime. To run an MoE with GPT3-like quality, you probably need to occupy a full 8xH100 box, or even several boxes. So your min-inference-hardware has gone up, but your efficiency will be much better (much higher queries/sec than GPT3 on the same system).
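A minimal sketch of why that tradeoff exists: every expert's weights have to stay resident, but each token is only computed by the top-k of them, so params per FLOP go up. (Toy PyTorch with top-2 routing, nothing production-grade:)

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyMoE(nn.Module):
        """All experts occupy memory; each token only pays for k of them in compute."""
        def __init__(self, d_model=512, n_experts=8, k=2):
            super().__init__()
            self.k, self.n_experts = k, n_experts
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):                       # x: (tokens, d_model)
            weights, idx = self.router(x).topk(self.k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    moe = ToyMoE()
    print(moe(torch.randn(16, 512)).shape)          # torch.Size([16, 512])
    print(f"{sum(p.numel() for p in moe.parameters()) / 1e6:.1f}M params resident, "
          f"~{100 * moe.k / moe.n_experts:.0f}% of expert compute used per token")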
Oh I totally expect the size of models to grow along with whatever hardware can provide.
I really do wonder how much more you could squeeze out of a full pod of gen2 H100s. Obviously the model size would be ludicrous, but how far are we into the realm of diminishing returns?
Your point about MoE architectures certainly sounds like the more _useful_ deployment, but the research seems to be pushing towards ludicrously large models.
You seem to know a fair amount about the field, is there anything you'd suggest if I wanted to read more into the subject?
I agree! The models will definitely keep getting bigger, and MoEs are a part of that trend, sorry if that wasn’t clear.
A pod of gen2-H100s might have 256 GPUs with 40 TB of total memory, and could easily run a 10T param model. So I think we are far from diminishing returns on the hardware side :) The model quality also continues to get better at scale.
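Rough numbers behind that claim (the per-GPU memory is pure speculation about a hypothetical gen2 part):

    # Speculative back-of-the-envelope for a 256-GPU "gen2" pod.
    gpus = 256
    mem_per_gpu_gb = 160                    # hypothetical 2x of today's 80GB H100
    print(f"Pod memory: ~{gpus * mem_per_gpu_gb / 1000:.0f} TB")     # ~41 TB

    params = 10e12                          # a 10T-parameter model
    print(f"10T params: ~{params * 1 / 1e12:.0f} TB at FP8, "
          f"~{params * 2 / 1e12:.0f} TB at FP16")
    # Either precision leaves plenty of headroom for KV cache and activations.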
Re. reading material, I would take a look at DeepSpeed’s blog posts (not affiliated btw). That team is super super good at hardware+software optimization for ML. See their post on MoE models here: https://www.microsoft.com/en-us/research/blog/deepspeed-adva...
I think it depends what downstream task you're trying to do... DeepMind tried distilling big language models into smaller ones (think 7B -> 1B) but it didn't work too well... it definitely lost a lot of quality (for general language modeling) relative to the original model.
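For anyone unfamiliar, the distillation setup in question is the standard one: train a small student to match the teacher's output distribution on top of the usual next-token loss. A minimal sketch of that objective (the temperature and mixing weight are the usual Hinton-style knobs, nothing specific to the DeepMind work):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
        """KL(teacher || student) on temperature-softened logits, mixed with
        the ordinary cross-entropy against the true next tokens."""
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        hard = F.cross_entropy(student_logits, targets)
        return alpha * soft + (1 - alpha) * hard

    # Toy usage: vocab of 100, batch of 4 token positions.
    student = torch.randn(4, 100, requires_grad=True)
    teacher = torch.randn(4, 100)
    targets = torch.randint(0, 100, (4,))
    print(distillation_loss(student, teacher, targets))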
Unfortunately it will be hard to investigate properties of large, powerful neural networks without access to their trained weights. And industrial labs that spend millions of dollars training them will not be keen to share.
If academics want to do research on expensive cutting-edge tech, they will have to join industrial labs or pool together resources, similar to particle physics or drug discovery research today.
For general tasks like language modeling, we are still seeing predictable improvements (on the next-token-prediction loss) with increasing compute. We will very likely be able to scale things up by 10,000x or so and continue to see increasing performance.
But what does this mean for end users? We are probably going to see sigmoid-like curves, where qualitative features of these models (like being able to do math, or tell jokes, or tutor you in French, or provide therapy, or mediate international conflicts) will suddenly get a * lot * better at some point in the scaling curve. We saw this for simple arithmetic in the GPT-3 paper, where the small <1B param models were terrible at it, and then with 100B scale suddenly the model could do arithmetic with 80%+ accuracy.
Personally I would not expect diminishing returns with increased scale, instead there will be sudden leaps in ability that will be very economically valuable. And that is why Meta and others are so interested in scaling up these models.
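A toy illustration of that "smooth loss, sudden capability" picture (the constants are made up, not fitted to any real model family): the next-token loss improves as a gentle power law in compute, but a downstream task that only works once the loss crosses some threshold looks like a step change.

    import math

    def loss(compute):                         # smooth power-law improvement
        return 11.6 * compute ** -0.037

    def task_accuracy(compute, threshold=1.9, sharpness=25.0):
        # near-zero until the loss crosses the threshold, then rapidly ~1.0
        return 1.0 / (1.0 + math.exp(sharpness * (loss(compute) - threshold)))

    for c in [1e18, 1e20, 1e22, 1e24]:
        print(f"compute={c:.0e}  loss={loss(c):.2f}  task accuracy={task_accuracy(c):.2f}")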
> "Though the spike seen in the data generates more questions than answers, one thing is clear: A single (albeit large and busy) store’s decision to report a majority of its shoplifting incidents doubled the entire city’s monthly shoplifting rates."
Wow! How are you able to achieve the cost reductions? Is it different hardware, software optimizations, or both? Also does this suggest that OpenAI is charging 6x markups on their models... >:(