I'm running TheBloke's wizard-vicuna-13b-superhot-8k.ggmlv3 with 4-bit quantization on a Ryzen 5 that's probably older than OP's laptop.
I get around 5 tokens a second using the web UI that comes with oobabooga, on default settings. If I understand correctly, this does not get me the 8k context length yet, because oobabooga doesn't have NTK-aware scaled RoPE implemented yet.
Running the same model with the newest kobold.cpp release should provide 8k context, but it's significantly slower.
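For reference, this is roughly how the larger window gets requested on a recent llama.cpp build (a sketch from memory, so treat the exact flag names as approximate; koboldcpp exposes the same thing through its own context-size option, and the model filename is whatever quantization you downloaded):

# SuperHOT was fine-tuned with linear RoPE scaling, so the scale should match (0.25 for 8k on a 2k base model)
./main -m models/wizard-vicuna-13b-superhot-8k.ggmlv3.q4_0.bin \
  -c 8192 --rope-freq-scale 0.25 \
  -p "Your long prompt here"
# NTK-aware scaling instead raises --rope-freq-base above the default 10000, rather than changing the scale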
Note that this model is great at creative writing, and sounding smart when talking about tech stuff, but it sucks horribly at stuff like logic puzzles or (re-)producing factually correct in-depth answers about any topic I'm an expert in. Still at least an order of magnitude below GPT4.
The model is also uncensored, which is amusing after using GPT4. It will happily elaborate on how to mix explosives and it has a dirty mouth.
Interestingly, the model speaks at least half a dozen languages much better than I do, and is proficient at translating between them (far worse than DeepL, of course). Which is mind-blowing for an 8 GByte binary. It's actual black magic.
"Note that this model is great at creative writing"
Could you elaborate on what you mean by that? Like, are you telling it to write you a short story and it does a good job? My experiments with using these models for creative writing have not been particularly inspiring.
Yes, having the model write an entire short story or chapter in one go doesn't work very well. It excels when you interact with it closely.
I tested it for creating NPCs for fantasy role-playing games. I think that's the primary reason kobold.cpp exists (hence the name).
You give it an (ideally long and detailed) prompt describing the character traits of the NPCs you want, and maybe even add some back-and-forth dialogue with other characters to the prompt.
And then you just talk to those characters in the scene you set.
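A concrete (entirely made-up) sketch of the kind of prompt I mean, here passed straight to llama.cpp on the command line for brevity; the web UIs have dedicated character fields for this:

# the character card is invented for illustration; -i / -r keep it an interactive back-and-forth
./main -m models/wizard-vicuna-13b-superhot-8k.ggmlv3.q4_0.bin -c 8192 --rope-freq-scale 0.25 \
  -i -r "You:" \
  -p "Greta is a grumpy dwarven blacksmith who secretly writes poetry. She distrusts strangers and answers in short, sarcastic sentences unless the topic is metallurgy.
You: Good morning, Greta. My sword broke on a troll's skull and I need it reforged by tomorrow.
Greta:"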
There's also "story mode", where you and the model take turns writing a complete story, not only dialogue. So both of you can also provide exposition and events, and the model usually only creates ~10 sentences at a time.
There's communities online providing extremely complex starting prompts and objectives (escape prison, assassin someone at a party and get away, ect.) for the player, and for me, the antagonistic ones (the models has control over NPCs that don't like you) are surprisingly fun.
Note that one of the main drivers of having uncensored open source LLMs is people wanting to role-play erotica with the model. That's why the model that first had scaled RoPE for 8k context length is called "superhot" - and the reason it has 8K context is that people wanted to roleplay longer scenes.
This is exactly a case in point for why people decide to pay OpenAI instead of rolling their own. I'm non-technical, but I have set up an image-gen app based on a custom SD model using diffusers, so I'm not entirely clueless.
But for LLMs I have no idea where to start quickly. Finding a model on a leaderboard, downloading and setting it up, then customising and benchmarking it is way too much time for me; I'll just pay for GPT4 if I ever need to, instead of chasing and troubleshooting to get some magical result. It'll be easier in the future, I'm sure, once an open model emerges as the SD1.5 of LLMs.
Here is a short test of a 7B 4-bit model on an Intel 8350U laptop with no AMD/Nvidia GPU.
On that laptop CPU from 2017, using a copy of llama.cpp I compiled 2 days ago (just "make", no special options, no BLAS, etc):
./main -m models/WizardLM-7B-uncensored.ggmlv3.q4_0.bin -n 128 -s 99 -p "A short test for Hacker News:"
llama_print_timings: sample time = 19.12 ms / 36 runs ( 0.53 ms per token, 1882.65 tokens per second)
llama_print_timings: prompt eval time = 886.82 ms / 9 tokens ( 98.54 ms per token, 10.15 tokens per second)
llama_print_timings: eval time = 5507.31 ms / 35 runs ( 157.35 ms per token, 6.36 tokens per second)
and a second run:
./main -m models/WizardLM-7B-uncensored.ggmlv3.q4_0.bin -n 128 -s 99 -p "Sherlock Holmes favorite dinner was "
llama_print_timings: sample time = 54.37 ms / 102 runs ( 0.53 ms per token, 1875.93 tokens per second)
llama_print_timings: prompt eval time = 876.94 ms / 9 tokens ( 97.44 ms per token, 10.26 tokens per second)
llama_print_timings: eval time = 16057.95 ms / 101 runs ( 158.99 ms per token, 6.29 tokens per second)
At 158 ms per token, if we guess a word is about 2.5 tokens, that's roughly 151 words per minute, much faster than most people can type. On a $250 laptop. Isn't the future neat?
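In case anyone wants to reproduce this, the whole setup was roughly the following; treat the model filename as a placeholder, since which GGML quantization you grab from Hugging Face is up to you:

# build llama.cpp from source (plain make, no BLAS or GPU offloading)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# put a 4-bit GGML model file into ./models (downloaded separately from Hugging Face), then run it
./main -m models/WizardLM-7B-uncensored.ggmlv3.q4_0.bin -n 128 -p "A short test for Hacker News:"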
Like, not vague hand-wavy stuff; specifically, what model and what inference code?
I get nothing like that performance for the 7B models, let alone the larger models, using llama.cpp on a PC without an Nvidia GPU.