
Off topic, but are we now back at the same performance ChatGPT-4 had when people said it worked like magic (meaning before the nerf that made it more politically correct but crashed its performance)?



I’ve been testing a lot of LLMs on my MacBook and I would say that all of them are far from being as good as GPT-4 at any point in its history. Many are as good as GPT-3, though. There are also a lot of models that are fine-tuned for specific tasks.

Language support is one big thing missing from open models. I’ve only found one model that can do anything useful with Norwegian, which has never been an issue with GPT-4.


Which ones have you tested? There were some huge ones released recently.


Samantha, Llama 2 PubMed, Marcoroni, OpenChat, FashionGPT, Falcon 180B, DeepSeek LLM Chat, Orca 2, Orca 2 Alpaca Uncensored, Meditron, TigerBot, Mixtral Instruct, WizardCoder, Gemma, Nous Hermes 2 Solar, Yarn Solar 64k, Nous Hermes 2 Yi, Nous Hermes 2 Mixtral, Nous Hermes Llama 2, StarCoder2, Hermes 2 Pro Mistral, NorskGPT Mistral and NorskGPT Llama.

Nous Hermes 2 Solar is the best model for Norwegian that I've tried so far. It's much better than NorskGPT Mistral/Llama. I actually got it to make fairly decent summaries of news articles, though it wouldn't follow stricter instructions like producing 5 keywords in a JSON list: it kept producing more than 5 keywords, and if I doubled down on the restriction it would start messing up the JSON.
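For what it's worth, a crude way around the keyword-count problem is to constrain the output format at the API level and enforce the count in code rather than in the prompt. This is only a minimal sketch, assuming a local Ollama server and a Nous Hermes 2 SOLAR build pulled under some tag; the model tag and the truncate-to-5 step are my own assumptions, not anything the model guarantees:

    # Minimal sketch: assumes an Ollama server on localhost and a locally pulled
    # Nous Hermes 2 SOLAR build; adjust MODEL to whatever tag you actually use.
    import json
    import requests

    MODEL = "nous-hermes2"    # hypothetical tag
    ARTICLE = "..."           # the news article text goes here

    prompt = (
        "Summarize the article, then return ONLY a JSON object of the form "
        '{"keywords": [...]} with exactly 5 Norwegian keywords.\n\n' + ARTICLE
    )

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": MODEL, "prompt": prompt, "format": "json", "stream": False},
    )
    data = json.loads(resp.json()["response"])

    # The model often ignores the count, so enforce it after the fact
    # instead of re-prompting and risking broken JSON.
    keywords = data.get("keywords", [])[:5]
    print(keywords)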

The best competitor to GPT-4 was Falcon 180B, but it's still terrible compared to GPT-4. Mixtral is my new favourite though: it's faster than Falcon and generally as good or better. Still, I'd pick GPT-4 over Mixtral any day of the week; it's leagues ahead.

Tigerbot has a very interesting trait. It tends to disagree when you try to convince it that it's wrong.

I haven't been able to test the new Mixtral 8x22B or Command R+ yet. Those are the next ones on my list!


Just tested out Command R+ with some niche SHACL constraint questions and it performs considerably worse than GPT-4. It might be a bit better than GPT-3.5 though, which is actually pretty amazing.


You need to use their begin/end-of-turn token scheme and set the repetition penalty to 1 to get good quality out of CR+.
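For anyone unfamiliar with that scheme, here's a rough illustration with llama-cpp-python. The special-token names follow Cohere's published chat template as I understand it (verify them against the model's tokenizer config), the GGUF filename is a placeholder, and repeat_penalty=1.0 is what "rep pen 1" means, i.e. the penalty is effectively off:

    # Rough sketch: token names per Cohere's chat template (double-check against
    # the tokenizer config); the model path is made up.
    from llama_cpp import Llama

    llm = Llama(model_path="command-r-plus-q4_k_m.gguf", n_ctx=8192)

    prompt = (
        "<BOS_TOKEN>"
        "<|START_OF_TURN_TOKEN|><|USER_TOKEN|>"
        "Give me 5 keywords for this article: ..."
        "<|END_OF_TURN_TOKEN|>"
        "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
    )

    out = llm(prompt, max_tokens=256, repeat_penalty=1.0)  # 1.0 = penalty disabled
    print(out["choices"][0]["text"])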


With open models, yes, we're at least back at the performance of the first release of GPT-4 in ChatGPT.


Could you recommend one or a few in particular?


The current best open-weights model is probably Cohere Command R+. Its memory requirements are quite high, though.
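To put "quite high" in numbers, here's a back-of-envelope estimate from the roughly 104B parameter count Cohere reports for Command R+; it covers weights only and ignores KV cache and runtime overhead, which add several more GB:

    # Back-of-envelope only: weights alone, no KV cache or runtime overhead.
    params = 104e9  # reported parameter count for Command R+

    for bits in (16, 8, 4):
        gb = params * bits / 8 / 1e9
        print(f"{bits}-bit weights: ~{gb:.0f} GB")

    # 16-bit: ~208 GB, 8-bit: ~104 GB, 4-bit: ~52 GB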


I really want to see some benchmarks with performance weighted by energy use. I think Mistral 7B would lead performance per watt by a huge margin. On many zero-shot classification tasks I get performance from Mistral equal to that of bigger models.
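Something like tokens per joule is easy to approximate by hand: time a completion, count the generated tokens, and divide by an average power draw measured separately (powermetrics on macOS, for instance). A sketch, where the model path and the 30 W figure are placeholders of mine rather than measurements:

    # Crude sketch: the wattage is an assumed, externally measured figure,
    # not something computed here.
    import time
    from llama_cpp import Llama

    llm = Llama(model_path="mistral-7b-instruct-q4_k_m.gguf")  # placeholder path

    t0 = time.time()
    out = llm("Classify the sentiment of: 'great battery life, awful keyboard'.",
              max_tokens=64)
    elapsed = time.time() - t0

    tokens = out["usage"]["completion_tokens"]
    avg_watts = 30.0  # assumption: read from powermetrics during the run

    print(f"{tokens / elapsed:.1f} tok/s")
    print(f"{tokens / (elapsed * avg_watts):.2f} tok/J")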





