LLaMa running at 5 tokens/second on a Pixel 6 (twitter.com/thiteanish)
221 points by pr337h4m on March 15, 2023 | 77 comments


This is really cool but the output is such garbage at that weight size that you might as well be running a Markov chain.


That's why Alpaca is so exciting: it instruction-tunes LLaMA to the point that even the tiny 7B model (the one that fits on a phone) produces useful output: https://simonwillison.net/2023/Mar/13/alpaca/


I’ve been playing with the Alpaca demo, and I’m really impressed! The outputs are generally excellent, especially for a model of that size, fine tuned on a $100 (!!) compute budget.

If the cloud of uncertainty around commercial use of derivative weights from LLaMA can be resolved, I think this could be the answer for a lot of domain-specific generative language needs. A model you can fine tune on your own data, and which you host and control, rather than depending on a cloud service not to arbitrarily up prices/close your account/apply unhelpful filters to the output/etc.


But they won’t give us the model… so it’s ultimately meaningless because they’ll just sell out


My understanding is they legally can't. It was trained using OpenAI's output, and OpenAI doesn't allow using its output to train new models. Someone would need to find another data source to fine-tune LLaMA.


You don't need to find a new data source, you just need to find an unencumbered third party. You can use the data publicly provided in the git repo as long as you haven't signed an agreement with OpenAI yourself.


Why can they train their model on copyrighted data, claiming fair use because they don't outright copy it, while disallowing training other models on their output? I understand revoking access, though.


They realistically cannot, I believe. If Microsoft is right that this is under fair use, then you are only limited by OpenAI's ToS, which also says that the copyright of generated content belongs to the user. Consider the following:

Person A uses GPT-3 to generate training data and publishes it on his blog, without representing it as human-generated. Person A does not give permission for it to be used by the Alpaca team.

The Alpaca team comes along, scrapes his blog, and uses it as training data without permission from Person A. Now this is fair use, so there is nothing Person A can do to stop it, just like how GitHub scraped our code for Copilot without permission.


It’s only a matter of time


Still waiting for the Stable Diffusion version of GPT.


What will OpenAI do, sue? Okay but now it's out there.


I believe the work was done by Stanford, so OpenAI could revoke Stanford's access to their API. That would inhibit Stanford's ability to do new research with this system.


Also, how could that even be under protection? As if they haven't been scraping copyrighted materials and sites with end-user agreements to train the model in the first place.


OpenAI knows they're just a few cycles behind open source models - probably why they struck a deal early with Microsoft.


If they don't release the model, recreating it doesn't look too hard. $100 worth of compute time to run the fine-tuning, and the training data they used is here: https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpac...

That would have the same licensing problems that they have though: that alpaca_data.json file was created using GPT3. But creating a "clean" training set of 52,000 examples doesn't feel impossible to me for the right group.
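
For anyone curious what that data looks like: alpaca_data.json is a list of instruction/input/output records, and fine-tuning runs on a flattened prompt built from those fields. A minimal Python sketch (the prompt template below is illustrative; the Stanford repo uses its own, similar wording):

    import json

    # Each record in the released file has "instruction", "input"
    # (possibly empty) and "output" fields.
    with open("alpaca_data.json") as f:
        examples = json.load(f)

    def to_prompt(ex):
        # Flatten one record into a single training string. The template
        # is illustrative, not the repo's exact wording.
        if ex["input"]:
            return (
                "Below is an instruction that describes a task, paired with an input.\n\n"
                f"### Instruction:\n{ex['instruction']}\n\n"
                f"### Input:\n{ex['input']}\n\n"
                f"### Response:\n{ex['output']}"
            )
        return (
            "Below is an instruction that describes a task.\n\n"
            f"### Instruction:\n{ex['instruction']}\n\n"
            f"### Response:\n{ex['output']}"
        )

    print(to_prompt(examples[0]))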


You're only bound by the terms of OpenAI's agreement if you agreed to the terms of use. If a third party obtained the data without signing an agreement with OpenAI (eg. by just downloading it from that repo) they are under no obligation to refrain from using it to compete with OpenAI. It is fair-use by the same argument OpenAI itself uses to train its own models on publicly available data.


How can one tune the model for a specific usage? Is there some place that teaches this?


It’s quite a bit bigger than GPT-2 which was a really big deal not very long ago (remember the unicorn news article example and the slow release because it was apparently too powerful?)


Our standards have gone up so much!

If you are talking about the video that's perfectly fluent English. There are some unusual elements to the story which probably wouldn't be there in a larger model.

I'd invite you to try that with a Markov model or even something like a LSTM based neural network and compare.


Afaik, the Llama sampler needs to be tuned to get more sensible outputs.

https://twitter.com/theshawwn/status/1632569215348531201


Isn't any LLM mathematically a Markov chain, such that the current state includes the context of the last (finite) n tokens?


The distinctive aspect of the 'transformer' family of neural networks is that they incorporate 'attention', a mechanism that identifies which parts of the input are critical to its meaning. You're correct in that modern transformer models are essentially Markov chains, but the function that derives the probabilities (of which the attention heads are part) is imbued with 'understanding' of related concepts during the training process. In contrast, a traditional Markov chain text generator might have a probability function that only takes into account the frequency of the n-grams in the training data, and so produces superficially coherent (but mostly meaningless) output.
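
To make the contrast concrete, here is roughly what a traditional n-gram Markov chain text generator amounts to: the "probability function" is nothing more than frequency counts over the training text. A toy Python sketch, not anything from llama.cpp:

    import random
    from collections import defaultdict, Counter

    def train_markov(text, n=2):
        # Count which word follows each (n-1)-word state in the training text.
        words = text.split()
        counts = defaultdict(Counter)
        for i in range(len(words) - n + 1):
            state = tuple(words[i:i + n - 1])
            counts[state][words[i + n - 1]] += 1
        return counts

    def generate(counts, start, length=20):
        # Sample the next word proportionally to its observed frequency.
        state = tuple(start)
        out = list(state)
        for _ in range(length):
            followers = counts.get(state)
            if not followers:
                break
            word = random.choices(list(followers), weights=followers.values())[0]
            out.append(word)
            state = tuple(out[-len(state):])
        return " ".join(out)

    counts = train_markov("the llama ran down the hill and the llama ate the grass")
    print(generate(counts, ["the"]))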


This is like saying all computers are state machines because they have finite amounts of memory and disk space. It's sort of true, but it's not a useful mathematical model.

With a Markov chain, you're assuming a state machine where each state has independent probabilities on outgoing edges. As the number of states gets larger, you have fewer training samples for each state. When n gets large enough, nearly all states have zero training samples; they've never been seen before. How do you estimate probabilities?

Better to just say it's a stateless function of the input.


From the video, the output seems fine.

But if it is a trimmed version, it is wrong to call it LLaMa.


Could call it Slim LLaMa


SLLaMa?


It's nonsensical, celeb announces they're going to rehab and notes it (?) is an issue affecting all women, at least, earlier today (??), they also noted it wasn't drugs or alcohol this time, but, a life (???)


Without instruction tuning, the perfect language model produces output which has the same level of intelligibility as random text from the training set. And the training set probably has a lot of spam and junk in it.


What are you comparing it to? Without instruction tuning and a two character prompt "He" I am not sure why you would expect it to perform any better.


I was replying to a comment that said it “seems fine.”

It does not seem fine.

It is incomprehensible and doesn’t match the results I’ve seen from 7B through 65B.

It is true that RLHF could improve it, and perhaps then an optimization this severe will seem fine.


I've heard a number of people say (from earlier) that the quantization and default sampling parameters are way whacked. Honestly even running that model size alone is the big achievement here and getting the accuracy to actually reach the benchmark is the beeg next step nao, I believe. <3 :'))))


If you run a quantized 60G model and the output is worse than a raw 7G model's, you can throw your quantizer out.


Until it thermally throttles 40 seconds later. But yeah, it's really cool how many platforms the vanilla code in llama.cpp can be easily compiled on. And somehow I doubt they did the quantization step on the Pixel itself. My favorite was the person who did it on the RPi 4. I know a guy working on getting it going on the RPi 3, but the ARMv7/v8 mixing, NEON support, and 64-bit ARM intrinsics are apparently non-trivial to convert.


I'm the original tweet author.

Currently typing this from my Pixel after running it countless times :)


I would need a step-by-step guide, but I would love to test it on my Galaxy S21 Ultra 5G. It has 16 GB of RAM and I have about 350 GB of storage available.


>And somehow I doubt they did the quantization step on the Pixel itself

You're probably right (because why would they?) but I don't see any reason they couldn't have done this if they wanted to.


I would have tried it but I didn't have enough storage on my phone to hold both the original and quantized weights.


Here is a thread on tweaking the parameters, which the model seems very sensitive to:

https://github.com/ggerganov/llama.cpp/issues/129
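
For reference, this is roughly what those knobs (temperature, top-k, top-p) do to the next-token distribution. A generic Python sketch, not a line-for-line copy of llama.cpp's sampler:

    import numpy as np

    def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.95):
        # Temperature scaling: flatten or sharpen the raw logits.
        logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)

        # Top-k: keep only the k highest-scoring tokens.
        if 0 < top_k < len(logits):
            cutoff = np.sort(logits)[-top_k]
            logits = np.where(logits < cutoff, -np.inf, logits)

        probs = np.exp(logits - np.max(logits))
        probs /= probs.sum()

        # Top-p (nucleus): keep the smallest set of tokens whose
        # cumulative probability reaches top_p.
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        mask /= mask.sum()

        return np.random.choice(len(probs), p=mask)

    print(sample_next_token([2.0, 1.5, 0.3, -1.0, -2.0]))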


Could the model itself be used to tweak its own parameters iteratively?


That's basically Model Extraction.


Do you have a link to an explainer of how it would work or a potential implementation?


This would be useful for predictive text. That's exactly what LLMs are actually built for.


LSTMs have been in the Google keyboard for years...


I'm waiting until it runs on my C64...


Did anyone get this to run on an iPhone or in a browser yet?


Most iPhones have only 4 GB of RAM (and even the latest iPhone 14 has only 6 GB). The Pixel 6 has 8 GB. But the bigger issue is that on iOS the OS limits how much RAM your app can use and might kill your app.
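
Back-of-the-envelope numbers for why the RAM figures matter (sizes are approximate; real files add overhead for context buffers and quantization scales):

    # Rough memory footprint of the 7B model at different precisions.
    params = 7e9

    for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
        gib = params * bits / 8 / 2**30
        print(f"{name}: ~{gib:.1f} GiB")

    # fp16: ~13.0 GiB (no phone will hold this)
    # int8: ~6.5 GiB (marginal even on an 8 GB device)
    # int4: ~3.3 GiB (fits on a Pixel 6; very tight on a 4-6 GB iPhone)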


I'm still amazed that Apple invests so much into every other bit of hardware on a high end phone, yet always gives you the bare minimum amount of RAM they can get away with.

There are so many use cases (like this) that require more RAM. And even if a use case doesn't theoretically require more RAM, getting a developer to dedicate time to optimizing RAM is time taken away from making a wonderful app.


> I'm still amazed that Apple invests so much into every other bit of hardware on a high end phone, yet always gives you the bare minimum amount of RAM they can get away with.

Advanced hardware makes for bullet points in advertising to sell the device; giving the bare minimum of RAM accelerates the device's planned obsolescence, so users will be forced to upgrade sooner to the next model.


I personally don't buy that it's planned obsolescence. I think most people just don't need that much RAM. iOS is really good at loading/unloading stuff as needed; outside of HN I'm not sure most consumers care about the exact amount of RAM.

Apple still does security updates for iOS - the last was 12.5.7 on 23 Jan 2023 - and that goes back to the iPhone 5S.

They've literally provided security updates for a 10 year old device, has any competitor even come close to that?


You should try using a 10-year-old device. None of the apps support the older OS, given Apple's dictatorial app approval process. And you can't get away with upgrading to iOS 14 given how slow it will run, and you really have no choice of another OS on the architecture.

Functionally they're useless


How is it that an iPhone 7 is completely current? Give me a branded Android from the same year of release that still gets even security updates, never mind features.


The problem with iPhones is that once updates stop, there's nothing you can do. The iPhone 7 isn't current; it's stuck on iOS 15 while the newest is 16. And while the Pixel 2 (which is only a month younger than the iPhone 7) only got official support up to Android 11, you actually own the device and can easily unlock the bootloader to upgrade to Android 13.


Apple still does security updates for iOS - the last was 12.5.7 on 23 Jan 2023 - and that goes back to the iPhone 5S.

Feature updates with the current iOS 16 go back to the iPhone 8.

Yeah you do lose feature updates and slowly app support after the latest version drops support, but it's not like they're dropping support after 2 years, and you can stay on it for years later if you'd like.

I'm not saying it couldn't be better but they're clearly far above the vast majority of their competition.


Also, old versions of android are very functional. Nearly every app will run fine even on ancient android.

Old versions of iOS quickly stop working as apps demand updates, and the updates require a new iOS version.


I wish that were the case with all Android manufacturers - Verizon versions of Samsung phones have locked bootloaders and (afaik) there aren't working bootloader unlocks for all of them.


Does this in theory mean it should be relatively easy to port to a Coral TPU?


Afaik that TPU has only 8 MB of RAM to fit models, so you'd have to continuously stream the weights - can't imagine that's workable.


All their tensor/math magic seems to happen in https://github.com/ggerganov/llama.cpp/blob/master/ggml.h.

So maybe if you reimplement ggml.c with TensorFlow/libcoral, you'd have a chance.


It is not really LLaMA, it is LLaMA quantized to 4-bit. Not even the quality of the original 7B. I could also quantize it to 1 bit and claim it runs on my RPi 3.


The quantization to four bits doesn't have that much effect on the output. 1 bit might not either, but someone would need to do some testing before making the claim that "1 bit … runs on my RPi 3", because "runs" here is overloaded to mean "runs and produces sensible output." I think you're missing that overloading.


It should also be mentioned that it isn't really that each weight is a 4-bit float, but rather that they're basically clustering the floats into 2^4 clusters and then grabbing from a lookup table the float associated with a 4-bit value as needed. So as long as the weights roughly fall into 16 clusters, you'll get identical results.
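
A toy version of that lookup-table idea, just to make it concrete. Note that llama.cpp's actual 4-bit format works block-wise with a per-block scale rather than one global codebook, so treat this as a sketch of the concept, not the real scheme:

    import numpy as np

    def quantize_codebook(weights, bits=4):
        # Map each float to the nearest of 2^bits representative values and
        # store only the small indices plus the codebook itself.
        levels = 2 ** bits
        codebook = np.linspace(weights.min(), weights.max(), levels)
        indices = np.abs(weights[:, None] - codebook[None, :]).argmin(axis=1)
        return indices.astype(np.uint8), codebook

    def dequantize(indices, codebook):
        # Lookup-table decode back to floats.
        return codebook[indices]

    w = np.random.randn(1024).astype(np.float32)
    idx, cb = quantize_codebook(w)
    err = np.abs(w - dequantize(idx, cb)).mean()
    print(f"mean absolute error after the 4-bit round trip: {err:.4f}")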


I haven't noticed 4-bit quantization affecting the quality of LLaMA-7B; it produces very coherent outputs. The trick is having a good example in your prompt so it has a good idea of what's expected of it.


Quality and quantity: I've had the best luck cramming a bunch of examples into the input, just like with GPT-J, where you're only working with 6B parameters. Make sure the format stays consistent and is ideally presented in the shape you'd encounter that same text in if you found it on a webpage somewhere.
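
Concretely, "a good example in your prompt" just means a few-shot prompt: a couple of worked examples in a consistent format, then the real query in the same shape. The format below is made up for illustration:

    # A made-up few-shot prompt for a base (non-instruction-tuned) model.
    prompt = (
        "Question: What is the capital of France?\n"
        "Answer: Paris\n\n"
        "Question: What is the capital of Japan?\n"
        "Answer: Tokyo\n\n"
        "Question: What is the capital of Peru?\n"
        "Answer:"
    )

    # Feed `prompt` to the model and let it continue after "Answer:".
    # The base model mostly imitates the pattern, which is exactly the point.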


The 4 bit quantization performs well, though. Does your 1 bit version?


1 bit will mathematically be guaranteed to be more efficient in performance-per-parameter, so to me it's a pretty clear eventuality one day, but I think the relative performance will still likely tank. Honestly impressed that it held up so well at 4 bits; I personally thought 8 bits was the ceiling.

However I can see fractional bits (via binary representations) and larger models happening first before that compression step.

And then we have the sub-bit range..... ;DDDD


Do you have the numbers? I suspect it is way worse. The original llama.cpp authors never measured any numbers either.


The Python implementation[1] ran some tests using the same quantization algorithm as llama.cpp (4-bit RTN).

1: https://github.com/qwopqwop200/GPTQ-for-LLaMa


Great, thanks a lot.

So we have numbers for PTB: original perplexity 8.79, quantized 9.68 - already 10% worse. And the PPL is reported per token, I suppose? Because word-level PPL for PTB must be around 20, not less than 10.

Any numbers on more complex tasks then, like QA?
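
For anyone following along, token-level perplexity like those 8.79 / 9.68 figures is just the exponential of the average negative log-likelihood over the tokens; word-level PPL re-normalizes by the word count instead, which is why PTB word PPL comes out much higher. A generic sketch, not the GPTQ repo's evaluation code:

    import math

    def perplexity(token_log_probs):
        # exp of the average negative log-likelihood assigned to each
        # ground-truth token.
        nll = -sum(token_log_probs) / len(token_log_probs)
        return math.exp(nll)

    # Hypothetical per-token log-probabilities from one evaluation run:
    print(perplexity([-2.1, -0.4, -3.0, -1.2, -0.7]))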



They're using GPTQ -- here you go: https://arxiv.org/abs/2210.17323. The authors benchmarked two families of models over a wide range of parameter counts.


llama.cpp is using RTN at the moment.


I used the 7B quantized to 4 bit and it needs a few tries for most things, but it's not useless.


Any more details? I'm guessing they're leveraging the NPU in the pixel?


I think they are using llama.cpp without any NPU/TPU patches. By default it only runs on CPU with support for various SIMD extensions.

https://github.com/ggerganov/llama.cpp


It uses the ARM NEON extensions to the instruction set for SIMD (as far as I understand).


So is this finally peak hipster coder, and from this point on will Rust diminish because all the cool kids start switching to zag?


Would be even cooler if it employed the accelerator!

(unless this ggml library is doing that under the hood)

I assume it has unified memory, but maybe not little numbers...



