Hacker News | ml_hardware's comments

Looks like someone has got DBRX running on an M2 Ultra already: https://x.com/awnihannun/status/1773024954667184196?s=20


I find it a stretch to call 500 tokens "running".

Cool to play with for a few tests, but I can't imagine using it for anything.


I can run a certain 120B on my M3 Max with 128GB of memory. However, I found that while it “fits” at Q5, it was extremely slow. The story was different with Q4, which ran just fine at around ~3.5-4 t/s.

Now this model is ~134B, right? It could be bog slow, but on the other hand it's a MoE, so there might be a chance it could give satisfactory results.
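
For anyone doing the napkin math, here's a rough sketch of the weight footprint at different quantization levels (a minimal estimate that ignores KV cache and per-block quantization overhead; the bits-per-weight values are only approximations of Q4/Q5 K-quants):

    # Rough weight footprint: params (in billions) * bits per weight / 8 -> GB
    def weight_gb(params_b, bits):
        return params_b * bits / 8

    for params in (120, 134):          # dense 120B vs ~134B-total MoE
        for bits in (4.5, 5.5):        # roughly Q4_K vs Q5_K
            print(f"~{params}B @ ~{bits} bpw: ~{weight_gb(params, bits):.0f} GB")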


From the article, it should have the speed of a ~36B model.


And it appears to fit in ~80 GB of RAM via quantisation.


So that would be runnable on an MBP with an M2 Max, but the context window must be quite small; I don't really find anything under about 4096 that useful.


Can't wait to try this on my MacBook. I'm also just amazed at how wasteful Grok appears to be!


That's a tricky number. Does it run on an 80GB GPU, does it auto-shave some parameters to fit in 79.99GB like any artificially "intelligent" piece of code would do, or does it give up like an unintelligent piece of code?


Are you aware of how Macs present memory? Their 'unified' memory approach means you could run an 80GB model on a 128GB machine.

There's no concept of 'dedicated GPU memory' as there is on conventional amd64-arch machines.


What?

Are you asking if the framework automatically quantizes/prunes the model on the fly?

Or are you suggesting the LLM itself should realize it's too big to run, and prune/quantize itself? Your references to "intelligent" almost lead me to the conclusion that you think the LLM should prune itself. Not only is this a chicken-and-egg problem, but LLMs are statistical models; they aren't inherently self-bootstrapping.


I realize that, but I do think it's doable to bootstrap it on a cluster and have it teach itself to self-prune, and I'm surprised nobody is actively working on this.

I hate software that complains (about dependencies, resources) when you try to run it, and I think that should be one of the first use cases for LLMs: getting to L5 autonomous software installation and execution.


Make your dreams a reality!


The worst is software that doesn't complain but fails silently.


The LLM itself should realize it's too big and only put the important parts on the GPU. If you're asking questions about literature, there's no need to have all the params on the GPU; just tell it to put only the ones for literature on there.


That's great, but it did not really write the program that the human asked it to do. :)


That's because it's the base model, not the instruct tuned one.


Mosaic's MPT models are already supported in GGML: https://github.com/ggerganov/ggml

Here's MPT-30B running in 4-bit precision on CPU :) https://twitter.com/abacaj/status/1673133443339763712?s=20


Oh I didn't realize this. Everything is moving so fast that I can't even keep up with the features.


The repo for training and finetuning this model is open source here: https://github.com/mosaicml/llm-foundry


Did you actually read the blog? The very first sentence is:

> Try out our Stable Diffusion code here!
> https://github.com/mosaicml/diffusion-benchmark


The 9x speedup is a bit inflated... it's measured at a reference point of ~8k GPUs, on a workload that the A100 cluster is particularly bad at.

When measured at smaller, more realistic GPU counts, the speedup is somewhere between 3.5x and 6x. See the GTC Keynote video at 38:50: https://youtu.be/39ubNuxnrK8?t=2330

Based on hardware specs alone, I think that training transformers with FP8 on H100 systems vs. FP16 on A100 systems should only be 3-4x faster. Definitely looking forward to external benchmarks over the coming months...
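
For reference, a minimal back-of-envelope version of that comparison using the peak dense tensor-core numbers from NVIDIA's datasheets (realized training throughput will be well below peak on both chips):

    # Peak dense tensor-core throughput (TFLOPS, no sparsity), per NVIDIA datasheets
    a100_fp16 = 312
    h100_bf16 = 989
    h100_fp8 = 1979

    print(f"H100 BF16 vs A100 FP16 peak: {h100_bf16 / a100_fp16:.1f}x")   # ~3.2x
    print(f"H100 FP8  vs A100 FP16 peak: {h100_fp8 / a100_fp16:.1f}x")    # ~6.3x
    # Realized speedups land below the FP8 peak ratio once memory bandwidth,
    # communication, and non-GEMM work are accounted for.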


At inference time it will be possible to do 4000 TFLOPS using sparse FP8 :)

But keep in mind the model won't fit on a single H100 (80GB), because it's 175B params: ~90GB even with sparse FP8 model weights, plus more needed for live activation memory. So you'll still want at least 2 H100s to run inference, and more realistically you would rent an 8xH100 cloud instance.
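
Rough napkin math behind that, assuming 1 byte per param at FP8 and ~50% of weights kept under 2:4 structured sparsity (the sparsity factor is an assumption for illustration):

    params = 175e9                       # GPT-3-sized model
    fp8_gb = params * 1 / 1e9            # ~175 GB at 1 byte/param
    sparse_fp8_gb = fp8_gb / 2           # ~88 GB with 2:4 sparsity
    print(f"dense FP8: ~{fp8_gb:.0f} GB, 2:4-sparse FP8: ~{sparse_fp8_gb:.0f} GB")
    # Either way, more than one 80 GB H100 before activations / KV cache.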

But yeah the latency will be insanely fast given how massive these models are!


So, we're about a 25-50% memory increase off of being able to run GPT3 on a single machine?

Sounds doable in a generation or two.


Couple points:

1) NVIDIA will likely release a variant of H100 with 2x memory, so we may not even have to wait a generation. They did this for V100-16GB/32GB and A100-40GB/80GB.

2) In a generation or two, the SOTA model architecture will change, so it will be hard to predict the memory reqs... even today, for a fixed train+inference budget, it is much better to train Mixture-Of-Experts (MoE) models, and even NVIDIA advertises MoE models on their H100 page.

MoEs are more efficient in compute, but occupy a lot more memory at runtime. To run an MoE with GPT3-like quality, you probably need to occupy a full 8xH100 box, or even several boxes. So your min-inference-hardware has gone up, but your efficiency will be much better (much higher queries/sec than GPT3 on the same system).

So it's complicated!
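
To put toy numbers on the MoE tradeoff from point 2 (these dimensions and expert counts are made up for illustration, not any particular model):

    # FFN params dominate here; attention and embeddings are ignored.
    d_model, d_ff, n_layers = 8192, 32768, 80
    num_experts, top_k = 16, 2

    ffn_params_per_expert = n_layers * 2 * d_model * d_ff   # up-proj + down-proj
    total_params = num_experts * ffn_params_per_expert      # must sit in GPU memory
    active_params = top_k * ffn_params_per_expert           # compute done per token

    print(f"in memory: ~{total_params / 1e9:.0f}B, used per token: ~{active_params / 1e9:.0f}B")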


Oh I totally expect the size of models to grow along with whatever hardware can provide.

I really do wonder how much more you could squeeze out of a full pod of gen2 H100s. Obviously the model size would be ludicrous, but how far are we into the realm of diminishing returns?

Your point about MoE architectures certainly sounds like the more _useful_ deployment, but the research seems to be pushing towards ludicrously large models.

You seem to know a fair amount about the field, is there anything you'd suggest if I wanted to read more into the subject?


I agree! The models will definitely keep getting bigger, and MoEs are a part of that trend, sorry if that wasn’t clear.

A pod of gen2-H100s might have 256 GPUs with 40 TB of total memory, and could easily run a 10T param model. So I think we are far from diminishing returns on the hardware side :) The model quality also continues to get better at scale.
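
Napkin math for that claim (the per-GPU memory is a guess at a hypothetical 2x-H100 part, and activations/KV cache are ignored):

    gpus = 256
    mem_per_gpu_gb = 160                          # hypothetical 2x the H100's 80 GB
    pod_memory_tb = gpus * mem_per_gpu_gb / 1000  # ~41 TB

    params = 10e12                                # 10T-param model
    weights_tb = params * 2 / 1e12                # ~20 TB at 2 bytes/param (BF16)
    print(f"pod memory: ~{pod_memory_tb:.0f} TB, weights: ~{weights_tb:.0f} TB")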

Re. reading material, I would take a look at DeepSpeed’s blog posts (not affiliated btw). That team is super super good at hardware+software optimization for ML. See their post on MoE models here: https://www.microsoft.com/en-us/research/blog/deepspeed-adva...


Is it difficult/desirable to squeeze/compress an open-sourced 200B parameter model to fit into 40GB?

Are these techniques specific to certain architectures, or can they be made generic?


I think it depends on what downstream task you're trying to do... DeepMind tried distilling big language models into smaller ones (think 7B -> 1B), but it didn't work too well... it definitely lost a lot of quality (for general language modeling) relative to the original model.

See the paper here, Figure A28: https://kstatic.googleusercontent.com/files/b068c6c0e64d6f93...

But if your downstream task is simple, like sequence classification, then it may be possible to compress the model without losing much quality.
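
For anyone unfamiliar, the distillation setup being described boils down to training the small model against the big model's softened output distribution; a minimal PyTorch-style sketch (the temperature and scaling are the usual knobs, not values from the paper):

    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, temperature=2.0):
        # KL divergence between softened teacher and student next-token distributions
        t = temperature
        p_teacher = F.softmax(teacher_logits / t, dim=-1)
        log_p_student = F.log_softmax(student_logits / t, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)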



Unfortunately it will be hard to investigate properties of large, powerful neural networks without access to their trained weights. And industrial labs that spend millions of dollars training them will not be keen to share.

If academics want to do research on expensive cutting-edge tech, they will have to join industrial labs or pool together resources, similar to particle physics or drug discovery research today.


You may find this blog post useful for thinking about AI scaling: https://www.alignmentforum.org/posts/k2SNji3jXaLGhBeYP/extra...

For general tasks like language modeling, we are still seeing predictable improvements (on the next-token-prediction loss) with increasing compute. We will very likely be able to scale things up by 10,000x or so and continue to see increasing performance.
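
A toy version of what "predictable improvements" means, using a power-law loss curve in training compute (the coefficients here are made up for illustration, not fitted values from any paper):

    # Toy scaling law: loss falls as a power law in training compute C (FLOPs).
    def predicted_loss(compute, a=10.0, b=0.05, irreducible=1.7):
        return irreducible + a * compute ** -b

    for c in (1e21, 1e22, 1e23, 1e24, 1e25):
        print(f"{c:.0e} FLOPs -> predicted loss {predicted_loss(c):.2f}")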

But what does this mean for end users? We are probably going to see sigmoid-like curves, where qualitative features of these models (like being able to do math, or tell jokes, or tutor you in French, or provide therapy, or mediate international conflicts) will suddenly get a * lot * better at some point in the scaling curve. We saw this for simple arithmetic in the GPT-3 paper, where the small <1B param models were terrible at it, and then with 100B scale suddenly the model could do arithmetic with 80%+ accuracy.

Personally I would not expect diminishing returns with increased scale, instead there will be sudden leaps in ability that will be very economically valuable. And that is why Meta and others are so interested in scaling up these models.


> "Though the spike seen in the data generates more questions than answers, one thing is clear: A single (albeit large and busy) store’s decision to report a majority of its shoplifting incidents doubled the entire city’s monthly shoplifting rates."


Wow! How are you able to achieve the cost reductions? Is it different hardware, software optimizations, or both? Also does this suggest that OpenAI is charging 6x markups on their models... >:(


Mainly efficient usage of cloud hardware right now, with some software optimizations. Looking at more cost-effective hardware is on our roadmap too.

