LLaMa running at 5 tokens/second on a Pixel 6 (twitter.com/thiteanish)
221 points by pr337h4m on March 15, 2023 | 77 comments


This is really cool but the output is such garbage at that weight size that you might as well be running a Markov chain.


That's why Alpaca is so exciting: it instruction-tunes LLaMA to the point that even the tiny 7B model (the one that fits on a phone) produces useful output: https://simonwillison.net/2023/Mar/13/alpaca/


I’ve been playing with the Alpaca demo, and I’m really impressed! The outputs are generally excellent, especially for a model of that size, fine tuned on a $100 (!!) compute budget.

If the cloud of uncertainty around commercial use of derivative weights from LLaMA can be resolved, I think this could be the answer for a lot of domain-specific generative language needs. A model you can fine tune on your own data, and which you host and control, rather than depending on a cloud service not to arbitrarily up prices/close your account/apply unhelpful filters to the output/etc.


But they won’t give us the model… so it’s ultimately meaningless because they’ll just sell out


My understanding is they legally can't. It was trained using OpenAI's output, and OpenAI doesn't allow using its output to train new models. Someone would need to find another data source to fine-tune LLaMA.


You don't need to find a new data source, you just need to find an unencumbered third party. You can use the data publicly provided in the git repo as long as you haven't signed an agreement with OpenAI yourself.


Why can they train their model on copyrighted data, claiming fair use because they don't outright copy it, while disallowing training other models on their output? I understand revoking access, though.


They realistically cannot, I believe. If Microsoft is right that this is under fair use, then you are only limited by OpenAI's ToS, which also says that the copyright of generated content belongs to the user. Consider the following:

Person A uses GPT-3 to generate training data and publishes it on his blog, without representing it as human-generated. Person A does not give permission for it to be used by the Alpaca team.

The Alpaca team comes along, scrapes his blog, and uses it as training data without permission from Person A. Now this is fair use, so there is nothing Person A can do to stop it, just like how GitHub scraped our code for Copilot without permission.


It’s only a matter of time


Still waiting for the Stable Diffusion version of GPT.


What will OpenAI do, sue? Okay but now it's out there.


I believe the work was done by Stanford, so OpenAI could revoke Stanford's access to their API. That would inhibit Stanford's ability to do new research with this system.


Also, how could that even be under protection? As if they haven't been scraping copyrighted materials and sites with end-user agreements to train the model in the first place.


OpenAI knows they're just a few cycles behind open source models - probably why they struck a deal early with Microsoft.


If they don't release the model, recreating it doesn't look too hard. $100 worth of compute time to run the fine-tuning, and the training data they used is here: https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpac...

That would have the same licensing problems that they have though: that alpaca_data.json file was created using GPT3. But creating a "clean" training set of 52,000 examples doesn't feel impossible to me for the right group.
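
For anyone curious what that data looks like: alpaca_data.json is a list of instruction/input/output records, and fine-tuning runs on a flattened prompt built from those fields. A minimal Python sketch (the prompt template below is illustrative; the Stanford repo uses its own, similar wording):

    import json

    # Each record in the released file has "instruction", "input"
    # (possibly empty) and "output" fields.
    with open("alpaca_data.json") as f:
        examples = json.load(f)

    def to_prompt(ex):
        # Flatten one record into a single training string. The template
        # is illustrative, not the repo's exact wording.
        if ex["input"]:
            return (
                "Below is an instruction that describes a task, paired with an input.\n\n"
                f"### Instruction:\n{ex['instruction']}\n\n"
                f"### Input:\n{ex['input']}\n\n"
                f"### Response:\n{ex['output']}"
            )
        return (
            "Below is an instruction that describes a task.\n\n"
            f"### Instruction:\n{ex['instruction']}\n\n"
            f"### Response:\n{ex['output']}"
        )

    print(to_prompt(examples[0]))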


You're only bound by the terms of OpenAI's agreement if you agreed to the terms of use. If a third party obtained the data without signing an agreement with OpenAI (eg. by just downloading it from that repo) they are under no obligation to refrain from using it to compete with OpenAI. It is fair-use by the same argument OpenAI itself uses to train its own models on publicly available data.


How can one tune the model for a specific usage? Is there some place that teaches this?


It’s quite a bit bigger than GPT-2 which was a really big deal not very long ago (remember the unicorn news article example and the slow release because it was apparently too powerful?)


Our standards have gone up so much!

If you are talking about the video that's perfectly fluent English. There are some unusual elements to the story which probably wouldn't be there in a larger model.

I'd invite you to try that with a Markov model or even something like a LSTM based neural network and compare.


Afaik, the Llama sampler needs to be tuned to get more sensible outputs.

https://twitter.com/theshawwn/status/1632569215348531201


Isn't any LLM mathematically a Markov chain, such that the current state includes the context of the last (finite) n tokens?


The distinctive aspect of the 'transformer' family of neural networks is that they incorporate 'attention', a mechanism that identifies which parts of the input are critical to its meaning. You're correct in that modern transformer models are essentially Markov chains, but the function that derives the probabilities (of which the attention heads are part) is imbued with 'understanding' of related concepts during the training process. In contrast, a traditional Markov chain text generator might have a probability function that only takes into account the frequency of the n-grams in the training data, and so produces superficially coherent (but mostly meaningless) output.
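
To make the contrast concrete, here is roughly what a traditional n-gram Markov chain text generator amounts to: the "probability function" is nothing more than frequency counts over the training text. A toy Python sketch, not anything from llama.cpp:

    import random
    from collections import defaultdict, Counter

    def train_markov(text, n=2):
        # Count which word follows each (n-1)-word state in the training text.
        words = text.split()
        counts = defaultdict(Counter)
        for i in range(len(words) - n + 1):
            state = tuple(words[i:i + n - 1])
            counts[state][words[i + n - 1]] += 1
        return counts

    def generate(counts, start, length=20):
        # Sample the next word proportionally to its observed frequency.
        state = tuple(start)
        out = list(state)
        for _ in range(length):
            followers = counts.get(state)
            if not followers:
                break
            word = random.choices(list(followers), weights=followers.values())[0]
            out.append(word)
            state = tuple(out[-len(state):])
        return " ".join(out)

    counts = train_markov("the llama ran down the hill and the llama ate the grass")
    print(generate(counts, ["the"]))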


This is like saying all computers are state machines because they have finite amounts of memory and disk space. It's sort of true, but it's not a useful mathematical model.

With a Markov chain, you're assuming a state machine where each state has independent probabilities on outgoing edges. As the number of states gets larger, you have fewer training samples for each state. When n gets large enough, nearly all states have zero training samples; they've never been seen before. How do you estimate probabilities?

Better to just say it's a stateless function of the input.


From the video, the output seems fine.

But if it is a trimmed version, it is wrong to call it LLaMa.


Could call it Slim LLaMa


SLLaMa?


It's nonsensical, celeb announces they're going to rehab and notes it (?) is an issue affecting all women, at least, earlier today (??), they also noted it wasn't drugs or alcohol this time, but, a life (???)


Without instruction tuning, the perfect language model produces output which has the same level of intelligibility as random text from the training set. And the training set probably has a lot of spam and junk in it.


What are you comparing it to? Without instruction tuning and a two character prompt "He" I am not sure why you would expect it to perform any better.


I was replying to a comment that said it “seems fine.”

It does not seem fine.

It is incomprehensible and doesn’t match the results I’ve seen from 7B through 65B.

It is true that RLHF could improve it, and perhaps then an optimization this severe will seem fine.


I've heard a number of people say (from earlier) that the quantization and default sampling parameters are way whacked. Honestly even running that model size alone is the big achievement here and getting the accuracy to actually reach the benchmark is the beeg next step nao, I believe. <3 :'))))


If you run a quantized 60G model and the output is worse than a raw 7G model's, you can throw your quantizer out.


Until it thermally throttles 40 seconds later. But yeah, it's really cool how many platforms the vanilla code in llama.cpp can be easily compiled on. And somehow I doubt they did the quantization step on the Pixel itself. My favorite was the person who did it on the RPi 4. I know a guy working on getting it going on the RPi 3, but the ARMv7/v8 mixing, NEON support, and 64-bit ARM intrinsics are apparently non-trivial to convert.


I'm the original tweet author.

Currently typing this from my Pixel after running it countless times :)


I would need a step-by-step guide, but I would love to test it on my Galaxy S21 Ultra 5G. It has 16 GB of RAM and I have about 350 GB of storage available.


>And somehow I doubt they did the quantization step on the Pixel itself

You're probably right (because why would they?) but I don't see any reason they couldn't have done this if they wanted to.


I would have tried it but I didn't have enough storage on my phone to hold both the original and quantized weights.


Here is a thread on tweaking the parameters, which the model seems very sensitive to:

https://github.com/ggerganov/llama.cpp/issues/129
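
For reference, this is roughly what those knobs (temperature, top-k, top-p) do to the next-token distribution. A generic Python sketch, not a line-for-line copy of llama.cpp's sampler:

    import numpy as np

    def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.95):
        # Temperature scaling: flatten or sharpen the raw logits.
        logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)

        # Top-k: keep only the k highest-scoring tokens.
        if 0 < top_k < len(logits):
            cutoff = np.sort(logits)[-top_k]
            logits = np.where(logits < cutoff, -np.inf, logits)

        probs = np.exp(logits - np.max(logits))
        probs /= probs.sum()

        # Top-p (nucleus): keep the smallest set of tokens whose
        # cumulative probability reaches top_p.
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        mask /= mask.sum()

        return np.random.choice(len(probs), p=mask)

    print(sample_next_token([2.0, 1.5, 0.3, -1.0, -2.0]))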


Could the model itself be used to tweak its own parameters iteratively?


That's basically Model Extraction.


Do you have a link to an explainer of how it would work or a potential implementation?


This would be useful for predictive text. That's exactly what LLMs are actually built for.


LSTMs have been in the Google keyboard for years...


I'm waiting until it runs on my C64...


Did anyone get this to run on an iPhone or in a browser yet?


Most iPhones have only 4 GB of RAM (and even the latest iPhone 14 has only 6 GB). The Pixel 6 has 8 GB. But the bigger issue is that on iOS the OS limits how much RAM your app can use and might kill your app.
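
Back-of-the-envelope numbers for why the RAM figures matter (sizes are approximate; real files add overhead for context buffers and quantization scales):

    # Rough memory footprint of the 7B model at different precisions.
    params = 7e9

    for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
        gib = params * bits / 8 / 2**30
        print(f"{name}: ~{gib:.1f} GiB")

    # fp16: ~13.0 GiB (no phone will hold this)
    # int8: ~6.5 GiB (marginal even on an 8 GB device)
    # int4: ~3.3 GiB (fits on a Pixel 6; very tight on a 4-6 GB iPhone)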


I'm still amazed that Apple invests so much into every other bit of hardware on a high end phone, yet always gives you the bare minimum amount of RAM they can get away with.

There are so many use cases (like this) that require more RAM. And even if a use case doesn't theoretically require more RAM, getting a developer to dedicate time to optimizing RAM is time taken away from making a wonderful app.


> I'm still amazed that Apple invests so much into every other bit of hardware on a high end phone, yet always gives you the bare minimum amount of RAM they can get away with.

Advanced hardware makes for bullet points in advertising to sell the device; giving the bare minimum of RAM accelerates the device's planned obsolescence, so users will be forced to upgrade sooner to the next model.


I personally don't buy that it's planned obsolescence. I think most people just don't need that much RAM. iOS is really good at loading/unloading stuff as needed; outside of HN I'm not sure most consumers care about the exact amount of RAM.

Apple still does security updates for iOS - the last was 12.5.7 on 23 Jan 2023 - and that goes back to the iPhone 5S.

They've literally provided security updates for a 10 year old device, has any competitor even come close to that?


You should try using a 10-year-old device. None of the apps support the older OS, given Apple's dictatorial app approval process. And you can't get away with upgrading to iOS 14 given how slow it will run, and you really have no choice of another OS on the architecture.

Functionally they're useless


How is it that an iPhone 7 is completely current? Give me a branded Android from the same year of release that still gets even security updates, never mind features.


The problem with iPhones is that once updates stop, there's nothing you can do. The iPhone 7 isn't current; it's stuck on iOS 15 while the newest is 16. And while the Pixel 2 (which is only a month younger than the iPhone 7) only got official support up to Android 11, you actually own the device and can easily unlock the bootloader to upgrade to Android 13.


Apple still does security updates for iOS - the last was 12.5.7 on 23 Jan 2023 - and that goes back to the iPhone 5S.

Feature updates with the current iOS 16 go back to the iPhone 8.

Yeah you do lose feature updates and slowly app support after the latest version drops support, but it's not like they're dropping support after 2 years, and you can stay on it for years later if you'd like.

I'm not saying it couldn't be better but they're clearly far above the vast majority of their competition.


Also, old versions of android are very functional. Nearly every app will run fine even on ancient android.

Old versions of iOS quickly stop working as apps demand updates, and the updates require a new iOS version.


I wish that were the case with all Android manufacturers - Verizon versions of Samsung phones have locked bootloaders and (afaik) there aren't working bootloader unlocks for all of them.


Does this in theory mean it should be relatively easy to port to a Coral TPU?


Afaik that TPU has only 8 MB of RAM to fit models, so you'd have to continuously stream the weights - can't imagine that's workable.


All their tensor/math magic seems to happen in https://github.com/ggerganov/llama.cpp/blob/master/ggml.h.

So maybe if you reimplement ggml.c with TensorFlow/libcoral, you'd have a chance.


It is not really LLaMA, it is LLaMA quantized to 4-bit. Not even the quality of the original 7B. I could also quantize it to 1 bit and claim it runs on my RPi 3.


The quantization to four bits doesn't have that much effect on the output. 1 bit might not either, but someone would need to do some testing before making the claim that "1 bit … runs on my RPi 3", because "runs" here is overloaded to mean "runs and produces sensible output." I think you're missing that overloading.


It should also be mentioned that it isn't really that each weight is a 4-bit float, but rather that they're basically clustering the floats into 2^4 clusters and then grabbing from a lookup table the float associated with a 4-bit value as needed. So as long as the weights roughly fall into 16 clusters, you'll get identical results.
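
A toy version of that lookup-table idea, just to make it concrete. Note that llama.cpp's actual 4-bit format works block-wise with a per-block scale rather than one global codebook, so treat this as a sketch of the concept, not the real scheme:

    import numpy as np

    def quantize_codebook(weights, bits=4):
        # Map each float to the nearest of 2^bits representative values and
        # store only the small indices plus the codebook itself.
        levels = 2 ** bits
        codebook = np.linspace(weights.min(), weights.max(), levels)
        indices = np.abs(weights[:, None] - codebook[None, :]).argmin(axis=1)
        return indices.astype(np.uint8), codebook

    def dequantize(indices, codebook):
        # Lookup-table decode back to floats.
        return codebook[indices]

    w = np.random.randn(1024).astype(np.float32)
    idx, cb = quantize_codebook(w)
    err = np.abs(w - dequantize(idx, cb)).mean()
    print(f"mean absolute error after the 4-bit round trip: {err:.4f}")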


I haven't noticed 4-bit quantization affecting the quality of LLaMA-7B; it produces very coherent outputs. The trick is having a good example in your prompt so it has a good idea of what's expected of it.


Quality and quantity: I've had the best luck cramming a bunch of examples into the input, just like with GPT-J, where you're only working with 6B parameters. Make sure the format stays consistent and is ideally presented in the shape you'd encounter that same text in if you found it on a webpage somewhere.
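
Concretely, "a good example in your prompt" just means a few-shot prompt: a couple of worked examples in a consistent format, then the real query in the same shape. The format below is made up for illustration:

    # A made-up few-shot prompt for a base (non-instruction-tuned) model.
    prompt = (
        "Question: What is the capital of France?\n"
        "Answer: Paris\n\n"
        "Question: What is the capital of Japan?\n"
        "Answer: Tokyo\n\n"
        "Question: What is the capital of Peru?\n"
        "Answer:"
    )

    # Feed `prompt` to the model and let it continue after "Answer:".
    # The base model mostly imitates the pattern, which is exactly the point.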


The 4 bit quantization performs well, though. Does your 1 bit version?


1 bit will mathematically be guaranteed to be more efficient in performance-per-parameter, so to me it's a pretty clear eventuality one day, but I think the relative performance will still likely tank. Honestly impressed that it held up so well at 4 bits; I personally thought 8 bits was the ceiling.

However I can see fractional bits (via binary representations) and larger models happening first before that compression step.

And then we have the sub-bit range..... ;DDDD


Do you have the numbers? I suspect it is way worse. The original llama.cpp authors never measured any numbers either.


The Python implementation[1] ran some tests using the same quantization algorithm as llama.cpp (4-bit RTN).

1: https://github.com/qwopqwop200/GPTQ-for-LLaMa


Great, thanks a lot.

So we have numbers for PTB: original perplexity 8.79, quantized 9.68 - already 10% worse. And the PPL is reported per token, I suppose? Because word-level PPL for PTB must be around 20, not less than 10.

Any numbers on more complex tasks then, like QA?
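
For anyone following along, token-level perplexity like those 8.79 / 9.68 figures is just the exponential of the average negative log-likelihood over the tokens; word-level PPL re-normalizes by the word count instead, which is why PTB word PPL comes out much higher. A generic sketch, not the GPTQ repo's evaluation code:

    import math

    def perplexity(token_log_probs):
        # exp of the average negative log-likelihood assigned to each
        # ground-truth token.
        nll = -sum(token_log_probs) / len(token_log_probs)
        return math.exp(nll)

    # Hypothetical per-token log-probabilities from one evaluation run:
    print(perplexity([-2.1, -0.4, -3.0, -1.2, -0.7]))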



They're using GPTQ -- here you go: https://arxiv.org/abs/2210.17323. The authors benchmarked two families of models over a wide range of parameter counts.


llama.cpp is using RTN at the moment.


I used the 7B quantized to 4 bit and it needs a few tries for most things, but it's not useless.


Any more details? I'm guessing they're leveraging the NPU in the pixel?


I think they are using llama.cpp without any NPU/TPU patches. By default it only runs on CPU with support for various SIMD extensions.

https://github.com/ggerganov/llama.cpp


It uses the ARM NEON extensions to the instruction set for SIMD (as far as I understand).


So is this finally peak hipster coder, and from this point on will Rust diminish because all the cool kids start switching to zag?


Would be even cooler if it employed the accelerator!

(unless this ggml library is doing that under the hood)

I assume it has unified memory, but maybe not little numbers...



