Researchers run high-performing LLM on the energy needed to power a lightbulb (ucsc.edu)
147 points by geox on June 25, 2024 | 70 comments


Paper: https://arxiv.org/abs/2406.02528 -- always better than a press release.

Code: https://github.com/ridgerchu/matmulfreellm

---

Like others before them, the authors train LLMs using parameters consisting of ternary digits, or trits, with values in {-1, 0, 1}.
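
For concreteness, a BitNet-style "absmean" ternarization looks roughly like this (illustrative sketch only, not the authors' code; in practice the scale factor is kept around and folded back in elsewhere):

    import torch

    def ternarize(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
        # Scale by the mean absolute value, then round each weight
        # to the nearest of {-1, 0, +1}.
        scale = w.abs().mean().clamp(min=eps)
        return (w / scale).round().clamp(-1, 1)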

What's new is that the authors then build a custom hardware solution on an FPGA and run billion-parameter LLMs consuming only 13W, moving LLM inference closer to brain-like efficiency.

Sure, it's on an FPGA, and it's only a lab experiment, but we're talking about an early proof of concept, not a commercial product.

As far as I know, this is the first energy-efficient hardware implementation of tritwise LLMs. That seems like a pretty big deal to me.


The claim about moving closer to brain-like efficiency conveniently omits how that model compares to modern LLMs. You can put together a toy LLM that is much smaller and more efficient than ChatGPT but isn't as useful, call it "more efficient", and that still wouldn't mean much in practice.


They do cover that in the article:

> Although they reduced the number of operations, the researchers were able to maintain the performance of the neural network by introducing time-based computation in the training of the model. This enables the network to have a “memory” of the important information it processes, enhancing performance. This technique paid off — the researchers compared their model to Meta’s state-of-the-art algorithm called Llama, and were able to achieve the same performance, even at a scale of billions of model parameters.
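
The "time-based computation" here is essentially a recurrent, element-wise memory. A generic sketch of that idea (not the paper's exact token mixer, which differs in detail):

    import torch

    def gated_memory(x, w_f, w_c):
        # x: (seq_len, d) activations; w_f, w_c: per-channel gate weights (d,)
        # A generic element-wise gated recurrence: the hidden state h carries
        # a "memory" across time steps using only element-wise ops.
        h = torch.zeros(x.shape[-1])
        outs = []
        for x_t in x:                       # iterate over time steps
            f_t = torch.sigmoid(w_f * x_t)  # forget gate in (0, 1)
            c_t = torch.tanh(w_c * x_t)     # candidate update
            h = f_t * h + (1 - f_t) * c_t   # blend old memory with new input
            outs.append(h)
        return torch.stack(outs)            # (seq_len, d)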


> The claim about moving closer to brain-like efficiency conveniently omits how that model compares to modern LLMs.

I disagree. The authors aren't conveniently omitting anything. They show all details in a comparison against Llama models.

Moreover, all evidence I've seen so far suggests that tritwise models can scale up to state-of-the-art sizes.

---

PS. I'm talking about the paper, not the fluffy press release.


I took the critique as being against OP, not the paper.


Ah, that makes more sense :-)

Thanks for pointing it out!

PS. I added a PS to my comment above.


I've heard some claims that to get closer to brain-like energy efficiency you'd need to use a spiking neural network: https://en.wikipedia.org/wiki/Spiking_neural_network


I assume the ternary weight's memory representation requires two bits, so why do they use only three values instead of four? OTOH, I'm not sure what fourth value would be useful for LLMs doing math with -1, 0, and 1. Infinity? NaN?
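
(Tangent: you don't even need a full 2 bits per trit in storage. Since 3^5 = 243 fits in a byte, you can pack 5 trits per byte, about 1.6 bits each. Illustrative sketch only, not how any particular implementation stores them:)

    def pack_trits(trits):
        # trits: exactly five values from {-1, 0, 1}
        assert len(trits) == 5
        byte = 0
        for t in trits:
            byte = byte * 3 + (t + 1)   # each trit becomes a base-3 digit in {0, 1, 2}
        return byte                     # result is 0..242, so it fits in one byte

    def unpack_trits(byte):
        out = []
        for _ in range(5):
            out.append(byte % 3 - 1)
            byte //= 3
        return out[::-1]                # restore original order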


To me the most interesting part in all of this is the quantization used. Dedicated hardware for LLMs is likely to be the new norm in a few years anyway, with existing machines retrofitted via, say, USB-attached LLM accelerators and the like...


> moving LLM inference closer to brain-like efficiency.

Yeah but the brain does more than predictive text.


At least they're aware of what it is to be on the wrong track.

According to the researchers,

>all we had to do was fundamentally change how neural networks work,


> It costs $700,000 per day in energy costs to run ChatGPT 3.5, according to recent estimates, and leaves behind a massive carbon footprint in the process.

Compared to what? I wouldn't defend LLMs as "worth their electricity" quite yet, and they are definitely less efficient than a lot of other software, but I'd still like to see how this compares to gaming consoles, email servers, the advertising industry's hosting costs, cryptocurrency, and so on. Just doesn't seem worth pointing out the carbon footprint of AI just yet.


ChatGPT is very popular, with many users.

Any article citing the power usage without calculating it in terms of users or queries is just trying to push an agenda by omitting how many people are using it.


You had me till your thesis.

"Push an agenda"?

If they inserted a couple paragraphs saying "we estimate about 200M users a day, etc. etc.", would that add or detract from the article?

It's not relevant how they derived the number when you're reading; you only need an order-of-magnitude estimate, and the rest is distraction.


When you make something more available/cheaper, overall usage often goes up.

Overall energy use is an important metric regardless of energy per task.

Airplanes are WAY more efficient per passenger than they were in the past, but it's still valid to express concern over the energy usage and pollutants of air travel with so many more routes being flown.


How many people are using it because it's being offered at far below operating costs, even before you factor in the externalities of massive energy use?


Yeah, but the same 'number of users' claim could be fired at TikTok, Facebook, LinkedIn, Instagram, and myriad other similarly dubious endeavours.

Active user count is not necessarily correlated to worthwhile consumption of resources.


>Just doesn't seem worth pointing out the carbon footprint of AI just yet.

Of course it does. It's not like AI replaced anything you mentioned. Its carbon footprint comes on top of it.

The benefit is secondary if the end result just means more carbon dioxide.


I meant more that the percentage is so low that, even if the usage comes on top of all other usage (not a completely clear claim to make), it's like singling out any other item in the long tail of technology for leaving behind a "massive carbon footprint". Yes, it matters, especially in a report focused on sources of carbon emissions, but in general, saying "AI's carbon footprint is bad" just seems like wanting to give it a bad name. In reality, AI doesn't seem to be such a big contributor percentage-wise. Of course it should still be optimized; I'm not arguing against that.


"carbon footprint" for any computational technology is a ridiculous notion anyway: it's all electricity, and electricity is source independent.


Just because the carrier of energy is source-independent doesn't mean the consumer of that energy is not responsible for the carbon emissions of its production. Since we're talking about hundreds of TWh [1], the policies of those consumers can have a massive impact on global emissions.

[1] https://www.iea.org/energy-system/buildings/data-centres-and...


> It's not like AI replaced anything you mentioned.

I mean it's true it hasn't replaced anything the OP mentioned, but it has definitely replaced parts of the compute that I would normally use for e.g. searching.


But do you now spend less time on the computer?


If people are activating Google's servers less, that's some energy saved. I don't know how it compares, but I guess OpenAI aren't running a live bidding war with advertisers on every request.


> If people are activating Google's servers less, that's some energy saved

Instead of Google's servers, people are accessing OpenAI's servers, which waste even more power. How is that any better?


> they are definitely less efficient than a lot of other software

This is not a fair claim to make. Is a milling machine less efficient than a clock?

It does a different thing, it's not really comparable.


It's fair in some cases but, indeed, not in others. E.g., code completion with LLMs is a time saver and worth paying for compared to any other tech out there. At least that's how I see it; I know some people say it introduces mistakes, but that's up for debate.

I do think that for translation tasks, grammar correction, information lookup, etc., it adds a competitive convenience factor, but I can't call it efficient to run state-of-the-art GPUs at very high wattages for up to a minute to do what specialized software can do in milliseconds at much lower wattages. I'm referring to running multiple Google searches yourself to answer a question, or using a more traditional translation service, spell checker, and so on.


The only source I can find for this estimate is from a year ago. I feel like efficiency has gone up by a lot since then.


Same as usage



I asked GPT-4 to estimate how much CO2 this likely emits, in units of "typical car usage in a city." It suggests this emits roughly as much CO2 as Reno or Des Moines. That's staggering, but there are about 100 cities this size in the US, so decreasing car usage 1% would more than offset this. I know this is a bizarre comparison to make, but CO2 emissions are fungible.


> I asked GPT-4

What makes you confident it gave you an accurate answer?


"I asked GPT" is the new "I saw it on TV."


The press release is devoid of useful information, unsurprisingly. You can run an LLM under almost any energy envelope if you’re willing to wait long enough for the result. Total energy consumed and the time difference are the more important metrics.

The actual paper is here: https://arxiv.org/abs/2406.02528

The key part from the summary:

> To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency.

There is a lot of unnecessary obfuscation of the numbers going on in the abstract as well, which is unfortunate. Instead of quoting the numbers they call it "billion-parameter scale" and "beyond human readable throughput".


They did say that the answer is being produced faster than the human can read.


Yes, the average human reading speed is ~250 words per minute, but to be fair, it's not a widely known stat.


It's almost as if our brain is optimized for taking in more signals than just words on a page.


250 seems high? I want to believe, though.


That's 4 words per second.


Speaking is 150 wpm


Yeah... They are using a single-core 13W measurement to project out to a 64x parallelization, with no mention of any overhead due to parallelization or of the power needs of the supporting hardware. This is a key quote for me (page 12 of the PDF):

> The 1.3B parameter model, where L = 24 and d = 2048, has a projected runtime of 42ms, and a throughput of 23.8 tokens per second.

e.g. 64 x 13.67W ≈ 875 watts to run a 1.3B model at 23.8 t/s... I'm pretty sure my phone can do way better than that! Even half that power, given their assertions in the table, is still a lot for such a small model.


When you multiply by 64 you also get 64 times more tokens per second!! Your math is wrong.


That's their math; the 23.8 t/s is already the 64x figure, but they didn't 64x the other stats.


> "For the largest model size of 13B parameters, the MatMul-free LM uses only 4.19 GB of GPU memory and has a latency of 695.48 ms, whereas Transformer++ requires 48.50 GB of memory and exhibits a latency of 3183.10 ms"

That's a really, _really_ big difference in memory usage, and since this scales sub-linearly (the 300M-param model uses 0.21 GB, the 13B model uses 4.19 GB), a 70B model would fit on an RTX 4090. I think people currently often run 34B models with 4-bit quants on that card, so I would like to see some larger models trained on more tokens with this approach.
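
(Back-of-the-envelope, taking those reported numbers at face value:)

    # Linear extrapolation from the reported 13B figure; the sub-linear
    # trend they report would give an even smaller number.
    gb_per_param = 4.19 / 13e9          # ~0.32 bytes per parameter
    print(70e9 * gb_per_param)          # ~22.6 GB, under an RTX 4090's 24 GB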

Also, their 2.7B model took 173 hours on 8 NVIDIA H100 GPUs, and that seems to scale roughly linearly with the parameter count, so a company with access to a small cluster of those DGX pods (say 8) could train such a model in about 30 days. The 100B-token training set might be lackluster for SotA, though; maybe someone else could chime in on that.


> a 70B model would fit on an RTX 4090

With this technique, it will not, because it uses a custom FPGA, not consumer GPUs


They tested their models both on a conventional GPU and on a custom FPGA.


For reference, this is a real prototype of BitNet b1.58, which uses {-1, 0, 1} as the weights and thereby simplifies the matrix multiplication [0].

[0] https://arxiv.org/abs/2402.17764


I'm just curious how you close timing on a billion-parameter model! I used to TA a digital design course that heavily involved FPGAs, and any kind of image or sprite usage, even on the order of megabytes, would crank up compile times insanely and sometimes even fail 30-45 minutes into the build!

If anyone can offer insight, that would be greatly appreciated.


That's 13 watts, apparently, in non-American units.


Is there an American unit for power? I thought J/s is universal. 0.017433 Horsepower?

I had expected them to make the title more clickbaity, but that number is about right for a modern lightbulb.


The joke is that Americans like to use metaphors when describing measurements of things, like "as large as a baseball" or "as long as a football field".

However, there are non-SI units that are somewhat commonplace. Horsepower, foot-pounds per second, or BTU per second aren't unheard of.


It looks like this is a quantization method to flatten matrices for vector addition. Can anyone explain how this could allow LLMs to reach current benchmarks without losing performance?


The article mentions two things. First, quantization, which everyone is already doing: there's BitNet, or you can Google "1-bit quantization" (it's not really 1 bit; it's ternary, as the article says).

Once you do that, a dot product is just addition/subtraction. Matrix multiplications are just dot products, so you've removed multiplication.
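
(Purely as an illustration of that point, a ternary dot product needs no multiplies at all:)

    def ternary_dot(w, x):
        # w: ternary weights in {-1, 0, 1}; x: activations (same length)
        # No multiplications: each weight either adds, subtracts, or skips x_i.
        acc = 0.0
        for wi, xi in zip(w, x):
            if wi == 1:
                acc += xi
            elif wi == -1:
                acc -= xi
        return acc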

Then they built custom hardware that presumably only does that operation and doesn't use much electricity

Since this is just "more aggressive quantization" it's not too surprising that it reaches similar performance to other quantized models.

The network should structurally look the same as any other LLM; it's still using transformers etc. (afaiu)


Very sceptical of the claims that they don't lose any performance. Sounds like wishful thinking without enough effort put into measuring the performance loss that has to be there due to the heavy quantization. They even dropped computing the entire matrix addition, only focusing on some parts of it. If the benchmarks used don't show a quality drop, then that's because those benchmarks are not able to properly measure said quality drop. (edit: typo)


Because existing LLMs store no more than 2 bits of knowledge per parameter, despite having many more bits of precision: https://arxiv.org/abs/2404.05405


Don't FPGAs have some overhead? How much better could this be on a custom ASIC?


It’s cool that my ANN course from twenty years back gives me enough background to understand this and quantization.

My professor at the time was on his last legs in terms of impact, back when ANNs were looked down on, right before someone had the bright idea of using video cards.

I hope he’s doing well/retired on a high note.


I'd like to play around with ideas like this (training in ternary for example). Does anyone have any links to good reading materials or resources?

The FPGA ternary implementation is also really interesting!


[dupe]

Discussion a few weeks ago: https://news.ycombinator.com/item?id=40620955


Could this work offer some easy-to-use APIs, so it would be easy to apply to diffusion or other models?


I'm curious about the specs of the FPGA used - I imagine fairly high-end?


Lightbulbs are probably not the best thing to compare against; they're famously high in energy consumption and terribly inefficient. Even LEDs are absolutely awful and barely crack 30% total efficiency; most of what they produce is heat.


Does anyone know what's the patent situation around this?


Incandescent, I presume. Mistral-7B on an Nvidia 3060 draws about 100-odd watts of power.


Why presume? It's in the article. It's 13 watts.


Didn't read the article I presume.


You must be... Dr Livingstone?


And for candle power?


Why not release the FPGA code and the LLM algorithm so we can replicate it?



