Researchers run high-performing LLM on the energy needed to power a lightbulb (ucsc.edu)
147 points by geox on June 25, 2024 | 70 comments


Paper: https://arxiv.org/abs/2406.02528 -- always better than a press release.

Code: https://github.com/ridgerchu/matmulfreellm

---

Like others before them, the authors train LLMs using parameters consisting of ternary digits, or trits, with values in {-1, 0, 1}.
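
For concreteness, a BitNet-style "absmean" ternarization looks roughly like this (illustrative sketch only, not the authors' code; in practice the scale factor is kept around and folded back in elsewhere):

    import torch

    def ternarize(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
        # Scale by the mean absolute value, then round each weight
        # to the nearest of {-1, 0, +1}.
        scale = w.abs().mean().clamp(min=eps)
        return (w / scale).round().clamp(-1, 1)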

What's new is that the authors then build a custom hardware solution on an FPGA and run billion-parameter LLMs consuming only 13W, moving LLM inference closer to brain-like efficiency.

Sure, it's on an FPGA, and it's only a lab experiment, but we're talking about an early proof of concept, not a commercial product.

As far as I know, this is the first energy-efficient hardware implementation of tritwise LLMs. That seems like a pretty big deal to me.


The claim about moving closer to brain-like efficiency conveniently omits how that model compares to modern LLMs. You can put together a toy LLM that is much smaller and more efficient than ChatGPT but isn't as useful, call it "more efficient", and that still wouldn't mean much in practice.


They do cover that in the article:

> Although they reduced the number of operations, the researchers were able to maintain the performance of the neural network by introducing time-based computation in the training of the model. This enables the network to have a “memory” of the important information it processes, enhancing performance. This technique paid off — the researchers compared their model to Meta’s state-of-the-art algorithm called Llama, and were able to achieve the same performance, even at a scale of billions of model parameters.
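
The "time-based computation" here is essentially a recurrent, element-wise memory. A generic sketch of that idea (not the paper's exact token mixer, which differs in detail):

    import torch

    def gated_memory(x, w_f, w_c):
        # x: (seq_len, d) activations; w_f, w_c: per-channel gate weights (d,)
        # A generic element-wise gated recurrence: the hidden state h carries
        # a "memory" across time steps using only element-wise ops.
        h = torch.zeros(x.shape[-1])
        outs = []
        for x_t in x:                       # iterate over time steps
            f_t = torch.sigmoid(w_f * x_t)  # forget gate in (0, 1)
            c_t = torch.tanh(w_c * x_t)     # candidate update
            h = f_t * h + (1 - f_t) * c_t   # blend old memory with new input
            outs.append(h)
        return torch.stack(outs)            # (seq_len, d)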


> The claim about moving closer to brain-like efficiency conveniently omits how that model compares to modern LLMs.

I disagree. The authors aren't conveniently omitting anything. They show all details in a comparison against Llama models.

Moreover, all evidence I've seen so far suggests that tritwise models can scale up to state-of-the-art sizes.

---

PS. I'm talking about the paper, not the fluffy press release.


I took the critique as being against OP, not the paper.


Ah, that makes more sense :-)

Thanks for pointing it out!

PS. I added a PS to my comment above.


I've heard some claims that to get closer to brain-like energy efficiency you'd need to use a spiking neural network: https://en.wikipedia.org/wiki/Spiking_neural_network


I assume the ternary weight's memory representation requires two bits, so why do they use only three values instead of four? OTOH, I'm not sure what fourth value would be useful for LLMs doing math with -1, 0, and 1. Infinity? NaN?
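
(Tangent: you don't even need a full 2 bits per trit in storage. Since 3^5 = 243 fits in a byte, you can pack 5 trits per byte, about 1.6 bits each. Illustrative sketch only, not how any particular implementation stores them:)

    def pack_trits(trits):
        # trits: exactly five values from {-1, 0, 1}
        assert len(trits) == 5
        byte = 0
        for t in trits:
            byte = byte * 3 + (t + 1)   # each trit becomes a base-3 digit in {0, 1, 2}
        return byte                     # result is 0..242, so it fits in one byte

    def unpack_trits(byte):
        out = []
        for _ in range(5):
            out.append(byte % 3 - 1)
            byte //= 3
        return out[::-1]                # restore original order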


To me the most interesting part in all of this is the quantization used. Dedicated hardware for LLMs is likely to be the new norm in a few years anyway, with existing machines retrofitted via, say, USB-attached LLM accelerators and the like...


> moving LLM inference closer to brain-like efficiency.

Yeah but the brain does more than predictive text.


At least they're aware of what it is to be on the wrong track.

According to the researchers,

>all we had to do was fundamentally change how neural networks work,


> It costs $700,000 per day in energy costs to run ChatGPT 3.5, according to recent estimates, and leaves behind a massive carbon footprint in the process.

Compared to what? I wouldn't defend LLMs as "worth their electricity" quite yet, and they are definitely less efficient than a lot of other software, but I'd still like to see how this compares to gaming consoles, email servers, the advertising industry's hosting costs, cryptocurrency, and so on. Just doesn't seem worth pointing out the carbon footprint of AI just yet.


ChatGPT is very popular, with many users.

Any article citing the power usage without calculating it in terms of users or queries is just trying to push an agenda by omitting how many people are using it.


You had me till your thesis.

"Push an agenda"?

If they inserted a couple paragraphs saying "we estimate about 200M users a day, etc. etc.", would that add or detract from the article?

It's not relevant how they derived the number when you're reading; you only need an order-of-magnitude estimate, and the rest is distraction.


When you make something more available/cheaper, overall usage often goes up.

Overall energy use is an important metric regardless of energy per task.

Airplanes are WAY more efficient per passenger than they were in the past, but it's still valid to express concern over the energy usage and pollutants of air travel with so many more routes being flown.


How many people are using it because it's being offered at far below operating costs, even before you factor in the externalities of massive energy use?


Yeah, but the same 'number of users' claim could be fired at TikTok, Facebook, LinkedIn, Instagram, and myriad other similarly dubious endeavours.

Active user count is not necessarily correlated to worthwhile consumption of resources.


>Just doesn't seem worth pointing out the carbon footprint of AI just yet.

Of course it does. It's not like AI replaced anything you mentioned. Its carbon footprint comes on top of it.

The benefit is secondary if the end result just means more carbon dioxide.


I meant more that the percentage is so low that, even if the usage comes on top of all other usage (not a completely clear claim to make), it's like singling out any other item in the long tail of technology for leaving behind a "massive carbon footprint". Yes, it matters, especially in a report focused on sources of carbon emissions, but in general, saying "AI's carbon footprint is bad" just seems like wanting to give it a bad name. In reality, AI doesn't seem to be such a big contributor percentage-wise. Of course it should still be optimized; I'm not arguing against that.


"carbon footprint" for any computational technology is a ridiculous notion anyway: it's all electricity, and electricity is source independent.


Just because the carrier of energy is source-independent doesn't mean the consumer of that energy is not responsible for the carbon emissions of its production. Since we're talking about hundreds of TWh [1], the policies of those consumers can have a massive impact on global emissions.

[1] https://www.iea.org/energy-system/buildings/data-centres-and...


> It's not like AI replaced anything you mentioned.

I mean it's true it hasn't replaced anything the OP mentioned, but it has definitely replaced parts of the compute that I would normally use for e.g. searching.


But do you now spend less time on the computer?


If people are activating Google's servers less, that's some energy saved. I don't know how it compares, but I guess OpenAI aren't running a live bidding war with advertisers on every request.


> If people are activating Google's servers less, that's some energy saved

Instead of Google's servers, people are accessing OpenAI's servers, which waste even more power. How is that any better?


> they are definitely less efficient than a lot of other software

This is not a fair claim to make. Is a milling machine less efficient than a clock?

It does a different thing, it's not really comparable.


It's fair in some cases but, indeed, not in others. E.g., code completion with LLMs is a time saver and worth paying for compared to any other tech out there. At least that's how I see it; I know some people say it introduces mistakes, but that's up for debate.

I do think that for translation tasks, grammar correction, information lookup, etc., it adds a competitive convenience factor, but I can't call it efficient to run state-of-the-art GPUs at very high wattages for up to a minute to do what specialized software can do in milliseconds at much lower wattages. I'm referring to running multiple Google searches yourself to answer a question, or using a more traditional translation service, spell checker, and so on.


The only source I can find for this estimate is from a year ago. I feel like efficiency has gone up by a lot since then.


Same as usage



I asked GPT-4 to estimate how much CO2 this likely emits, in units of "typical car usage in a city." It suggests this emits roughly as much CO2 as Reno or Des Moines. That's staggering, but there are about 100 cities this size in the US, so decreasing car usage 1% would more than offset this. I know this is a bizarre comparison to make, but CO2 emissions are fungible.


> I asked GPT-4

What makes you confident it gave you an accurate answer?


"I asked GPT" is the new "I saw it on TV."


The press release is devoid of useful information, unsurprisingly. You can run an LLM under almost any energy envelope if you’re willing to wait long enough for the result. Total energy consumed and the time difference are the more important metrics.

The actual paper is here: https://arxiv.org/abs/2406.02528

The key part from the summary:

> To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency.

There is a lot of unnecessary obfuscation of the numbers going on in the abstract as well, which is unfortunate. Instead of quoting the numbers they call it "billion-parameter scale" and "beyond human readable throughput".


They did say that the answer is being produced faster than the human can read.


Yes, the average human reading speed is ~250 words per minute, but to be fair, it's not a widely known stat.


It's almost as if our brain is optimized for taking in more signals than just words on a page.


250 seems high? I want to believe, though.


That's 4 words per second.


Speaking is 150 wpm


Yeah... They are using a single-core 13W measurement to project out to a 64x parallelization, with no mention of any overhead due to parallelization or of the power needs of the supporting hardware. This is a key quote for me (page 12 of the PDF):

> The 1.3B parameter model, where L = 24 and d = 2048, has a projected runtime of 42ms, and a throughput of 23.8 tokens per second.

e.g. 64 x 13.67W ≈ 875 watts to run a 1.3B model at 23.8 t/s... I'm pretty sure my phone can do way better than that! Even half that power, given their assertions in the table, is still a lot for such a small model.


When you multiply by 64 you also get 64 times more tokens per second!! Your math is wrong.


That's their math; the 23.8 t/s is already the 64x figure, but they didn't 64x the other stats.


> "For the largest model size of 13B parameters, the MatMul-free LM uses only 4.19 GB of GPU memory and has a latency of 695.48 ms, whereas Transformer++ requires 48.50 GB of memory and exhibits a latency of 3183.10 ms"

That's a really, _really_ big difference in memory usage, and since this scales sub-linearly (the 300M-param model uses 0.21 GB, the 13B model uses 4.19 GB), a 70B model would fit on an RTX 4090. I think people currently often run 34B models with 4-bit quants on that card, so I would like to see some larger models trained on more tokens with this approach.
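
(Back-of-the-envelope, taking those reported numbers at face value:)

    # Linear extrapolation from the reported 13B figure; the sub-linear
    # trend they report would give an even smaller number.
    gb_per_param = 4.19 / 13e9          # ~0.32 bytes per parameter
    print(70e9 * gb_per_param)          # ~22.6 GB, under an RTX 4090's 24 GB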

Also, their 2.7B model took 173 hours on 8 NVIDIA H100 GPUs, and that seems to scale roughly linearly with the parameter count, so a company with access to a small cluster of those DGX pods (say 8) could train such a model in about 30 days. The 100B-token training set might be lackluster for SotA, though; maybe someone else could chime in on that.


> a 70B model would fit on an RTX 4090

With this technique, it will not, because it uses a custom FPGA, not consumer GPUs


They tested their models both on a conventional GPU and on a custom FPGA.


For reference, this is a real prototype of BitNet b1.58, which uses {-1, 0, 1} as the weights and thereby simplifies the matrix multiplication [0].

[0] https://arxiv.org/abs/2402.17764


I'm just curious how you close timing on a billion-parameter model! I used to TA a digital design course that heavily involved FPGAs, and any kind of image or sprite usage, even on the order of megabytes, would crank up compile times insanely and sometimes even fail 30-45 minutes into the build!

If anyone can offer insight, that would be greatly appreciated.


That's 13 watts, apparently, in non-American units.


Is there an American unit for power? I thought J/s is universal. 0.017433 Horsepower?

I had expected them to make the title more clickbaity, but that number is about right for a modern lightbulb.


The joke is that Americans like to use metaphors when describing measurements of things, like "as large as a baseball" or "as long as a football field".

However, there are non-SI units that are somewhat commonplace. Horsepower, foot-pounds per second, or BTU per second aren't unheard of.


It looks like this is a quantization method to flatten matrices for vector addition. Can anyone explain how this could allow LLMs to reach current benchmarks without losing performance?


The article mentions two things. First, quantization, which everyone is already doing: there's BitNet, or you can Google "1-bit quantization" (it's not really 1 bit; it's ternary, as the article says).

Once you do that, a dot product is just addition/subtraction. Matrix multiplications are just dot products, so you've removed multiplication.
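
(Purely as an illustration of that point, a ternary dot product needs no multiplies at all:)

    def ternary_dot(w, x):
        # w: ternary weights in {-1, 0, 1}; x: activations (same length)
        # No multiplications: each weight either adds, subtracts, or skips x_i.
        acc = 0.0
        for wi, xi in zip(w, x):
            if wi == 1:
                acc += xi
            elif wi == -1:
                acc -= xi
        return acc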

Then they built custom hardware that presumably only does that operation and doesn't use much electricity

Since this is just "more aggressive quantization" it's not too surprising that it reaches similar performance to other quantized models.

The network should structurally look the same as any other LLM; it's still using transformers etc. (afaiu)


Very sceptical of the claims that they don't lose any performance. Sounds like wishful thinking without enough effort put into measuring the performance loss that has to be there due to the heavy quantization. They even dropped computing the entire matrix addition, only focusing on some parts of it. If the benchmarks used don't show a quality drop, then that's because those benchmarks are not able to properly measure said quality drop. (edit: typo)


Because existing LLMs store no more than 2 bits of knowledge per parameter, despite having many more bits of precision: https://arxiv.org/abs/2404.05405


Don't FPGAs have some overhead? How much better could this be on a custom ASIC?


It’s cool that my ANN course from twenty years back gives me enough background to understand this and quantization.

My professor at the time was on his last legs in terms of impact, back when ANNs were looked down on, right before someone had the bright idea of using video cards.

I hope he’s doing well/retired on a high note.


I'd like to play around with ideas like this (training in ternary for example). Does anyone have any links to good reading materials or resources?

The FPGA ternary implementation is also really interesting!


[dupe]

Discussion a few weeks ago: https://news.ycombinator.com/item?id=40620955


Could this work offer some easy-to-use APIs, so it would be easy to apply to diffusion or other models?


I'm curious about the specs of the FPGA used - I imagine fairly high-end?


Lightbulbs are probably not the best thing to compare against; they're famously high in energy consumption and terribly inefficient. Even LEDs are absolutely awful and barely crack 30% total efficiency; most of what they produce is heat.


Does anyone know what's the patent situation around this?


Incandescent, I presume. Mistral-7B on an Nvidia 3060 draws about 100-odd watts of power.


Why presume? It's in the article. It's 13 watts.


Didn't read the article I presume.


You must be... Dr Livingstone?


And for candle power?


Why not release the FPGA code and the LLM algorithm so we can replicate it?



