I'm running TheBloke's wizard-vicuna-13b-superhot-8k.ggmlv3 with 4-bit quantization on a Ryzen 5 that's probably older than OP's laptop.
I get around 5 tokens a second using the web UI that comes with oobabooga, on default settings. If I understand correctly, this does not get me the 8k context length yet, because oobabooga doesn't have NTK-aware scaled RoPE implemented yet.
Running the same model with the newest kobold.cpp release should provide 8k context, but it's significantly slower.
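For reference, this is roughly how the larger window gets requested on a recent llama.cpp build (a sketch from memory, so treat the exact flag names as approximate; koboldcpp exposes the same thing through its own context-size option, and the model filename is whatever quantization you downloaded):

# SuperHOT was fine-tuned with linear RoPE scaling, so the scale should match (0.25 for 8k on a 2k base model)
./main -m models/wizard-vicuna-13b-superhot-8k.ggmlv3.q4_0.bin \
  -c 8192 --rope-freq-scale 0.25 \
  -p "Your long prompt here"
# NTK-aware scaling instead raises --rope-freq-base above the default 10000, rather than changing the scale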
Note that this model is great at creative writing, and sounding smart when talking about tech stuff, but it sucks horribly at stuff like logic puzzles or (re-)producing factually correct in-depth answers about any topic I'm an expert in. Still at least an order of magnitude below GPT4.
The model is also uncensored, which is amusing after using GPT4. It will happily elaborate on how to mix explosives and it has a dirty mouth.
Interestingly, the model speaks at least half a dozen languages much better than I do, and is proficient at translating between them (far worse than DeepL, of course). Which is mind-blowing for an 8 GByte binary. It's actual black magic.
"Note that this model is great at creative writing"
Could you elaborate on what you mean by that? Like, are you telling it to write you a short story and it does a good job? My experiments with using these models for creative writing have not been particularly inspiring.
Yes, having the model write an entire short story or chapter in one go doesn't work very well. It excels when you interact with it closely.
I tested it for creating NPCs for fantasy role-playing games. I think that's the primary reason kobold.cpp exists (hence the name).
You give it an (ideally long and detailed) prompt describing the character traits of the NPCs you want, and maybe even add some back-and-forth dialogue with other characters to the prompt.
And then you just talk to those characters in the scene you set.
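A concrete (entirely made-up) sketch of the kind of prompt I mean, here passed straight to llama.cpp on the command line for brevity; the web UIs have dedicated character fields for this:

# the character card is invented for illustration; -i / -r keep it an interactive back-and-forth
./main -m models/wizard-vicuna-13b-superhot-8k.ggmlv3.q4_0.bin -c 8192 --rope-freq-scale 0.25 \
  -i -r "You:" \
  -p "Greta is a grumpy dwarven blacksmith who secretly writes poetry. She distrusts strangers and answers in short, sarcastic sentences unless the topic is metallurgy.
You: Good morning, Greta. My sword broke on a troll's skull and I need it reforged by tomorrow.
Greta:"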
There's also "story mode", where you and the model take turns writing a complete story, not only dialogue. So both of you can also provide exposition and events, and the model usually only creates ~10 sentences at a time.
There's communities online providing extremely complex starting prompts and objectives (escape prison, assassin someone at a party and get away, ect.) for the player, and for me, the antagonistic ones (the models has control over NPCs that don't like you) are surprisingly fun.
Note that one of the main drivers of having uncensored open source LLMs is people wanting to role-play erotica with the model. That's why the model that first had scaled RoPE for 8k context length is called "superhot" - and the reason it has 8K context is that people wanted to roleplay longer scenes.
This is exactly a case in point for why people decide to pay OpenAI instead of rolling their own. I'm non-technical, but I have set up an image-gen app based on a custom SD model using diffusers, so I'm not entirely clueless.
But for LLMs I have no idea where to start quickly. Finding a model on a leaderboard, downloading and setting it up, then customising and benchmarking it is way too much time for me; I'll just pay for GPT4 if I ever need to, instead of chasing and troubleshooting to get some magical result. It'll be easier in the future, I'm sure, once an open model emerges as the SD1.5 of LLMs.
Here is a short test of a 7B 4-bit model on an Intel 8350U laptop with no AMD/Nvidia GPU.
On that laptop CPU from 2017, using a copy of llama.cpp I compiled 2 days ago (just "make", no special options, no BLAS, etc):
./main -m models/WizardLM-7B-uncensored.ggmlv3.q4_0.bin -n 128 -s 99 -p "A short test for Hacker News:"
llama_print_timings: sample time = 19.12 ms / 36 runs ( 0.53 ms per token, 1882.65 tokens per second)
llama_print_timings: prompt eval time = 886.82 ms / 9 tokens ( 98.54 ms per token, 10.15 tokens per second)
llama_print_timings: eval time = 5507.31 ms / 35 runs ( 157.35 ms per token, 6.36 tokens per second)
and a second run:
./main -m models/WizardLM-7B-uncensored.ggmlv3.q4_0.bin -n 128 -s 99 -p "Sherlock Holmes favorite dinner was "
llama_print_timings: sample time = 54.37 ms / 102 runs ( 0.53 ms per token, 1875.93 tokens per second)
llama_print_timings: prompt eval time = 876.94 ms / 9 tokens ( 97.44 ms per token, 10.26 tokens per second)
llama_print_timings: eval time = 16057.95 ms / 101 runs ( 158.99 ms per token, 6.29 tokens per second)
At 158 ms per token, if we guess a word is about 2.5 tokens, that's roughly 151 words per minute, much faster than most people can type. On a $250 laptop. Isn't the future neat?
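In case anyone wants to reproduce this, the whole setup was roughly the following; treat the model filename as a placeholder, since which GGML quantization you grab from Hugging Face is up to you:

# build llama.cpp from source (plain make, no BLAS or GPU offloading)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# put a 4-bit GGML model file into ./models (downloaded separately from Hugging Face), then run it
./main -m models/WizardLM-7B-uncensored.ggmlv3.q4_0.bin -n 128 -p "A short test for Hacker News:"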
Like, not vague hand-wavy stuff; specifically, what model and what inference code?
I get nothing like that performance for the 7B models, let alone the larger models, using llama.cpp on a PC without an Nvidia GPU.