
Off topic, but are we now back at the same performance ChatGPT-4 had when people said it worked like magic (meaning before the nerf that made it more politically correct but crashed its performance)?



I’ve been testing a lot of LLMs on my MacBook and I would say that all of them are far from being as good as GPT-4 at any point in its history. Many are as good as GPT-3, though. There are also a lot of models that are fine-tuned for specific tasks.

Language support is one big thing missing from open models. I’ve only found one model that can do anything useful with Norwegian, which has never been an issue with GPT-4.


Which ones have you tested? There were some huge ones released recently.


Samantha, Llama 2 PubMed, Marcoroni, OpenChat, FashionGPT, Falcon 180B, DeepSeek LLM Chat, Orca 2, Orca 2 Alpaca Uncensored, Meditron, TigerBot, Mixtral Instruct, WizardCoder, Gemma, Nous Hermes 2 Solar, Yarn Solar 64k, Nous Hermes 2 Yi, Nous Hermes 2 Mixtral, Nous Hermes Llama 2, StarCoder2, Hermes 2 Pro Mistral, NorskGPT Mistral and NorskGPT Llama.

Nous Hermes 2 Solar is the best model for Norwegian that I've tried so far. It's much better than NorskGPT Mistral/Llama. I actually got it to make fairly decent summaries of news articles, though it wouldn't follow stricter instructions like producing 5 keywords in a JSON list: it kept producing more than 5 keywords, and if I doubled down on the restriction it would start messing up the JSON.
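For what it's worth, a crude way around the keyword-count problem is to constrain the output format at the API level and enforce the count in code rather than in the prompt. This is only a minimal sketch, assuming a local Ollama server and a Nous Hermes 2 SOLAR build pulled under some tag; the model tag and the truncate-to-5 step are my own assumptions, not anything the model guarantees:

    # Minimal sketch: assumes an Ollama server on localhost and a locally pulled
    # Nous Hermes 2 SOLAR build; adjust MODEL to whatever tag you actually use.
    import json
    import requests

    MODEL = "nous-hermes2"    # hypothetical tag
    ARTICLE = "..."           # the news article text goes here

    prompt = (
        "Summarize the article, then return ONLY a JSON object of the form "
        '{"keywords": [...]} with exactly 5 Norwegian keywords.\n\n' + ARTICLE
    )

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": MODEL, "prompt": prompt, "format": "json", "stream": False},
    )
    data = json.loads(resp.json()["response"])

    # The model often ignores the count, so enforce it after the fact
    # instead of re-prompting and risking broken JSON.
    keywords = data.get("keywords", [])[:5]
    print(keywords)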

The best competitor to GPT-4 was Falcon 180B, but it's still terrible compared to GPT-4. Mixtral is my new favourite though: it's faster than Falcon and generally as good or better. Still, I'd pick GPT-4 over Mixtral any day of the week; it's leagues ahead.

Tigerbot has a very interesting trait. It tends to disagree when you try to convince it that it's wrong.

I haven't been able to test the new Mixtral 8x22B or Command R+ yet. Those are the next ones on my list!


Just tested out Command R+ with some niche SHACL constraint questions and it performs considerably worse than GPT-4. It might be a bit better than GPT-3.5 though, which is actually pretty amazing.


You need to use their begin/end-of-turn token scheme and set the repetition penalty to 1 to get good quality out of CR+.
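For anyone unfamiliar with that scheme, here's a rough illustration with llama-cpp-python. The special-token names follow Cohere's published chat template as I understand it (verify them against the model's tokenizer config), the GGUF filename is a placeholder, and repeat_penalty=1.0 is what "rep pen 1" means, i.e. the penalty is effectively off:

    # Rough sketch: token names per Cohere's chat template (double-check against
    # the tokenizer config); the model path is made up.
    from llama_cpp import Llama

    llm = Llama(model_path="command-r-plus-q4_k_m.gguf", n_ctx=8192)

    prompt = (
        "<BOS_TOKEN>"
        "<|START_OF_TURN_TOKEN|><|USER_TOKEN|>"
        "Give me 5 keywords for this article: ..."
        "<|END_OF_TURN_TOKEN|>"
        "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
    )

    out = llm(prompt, max_tokens=256, repeat_penalty=1.0)  # 1.0 = penalty disabled
    print(out["choices"][0]["text"])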


With open models, yes, we're at least back at the performance of the first release of GPT-4 in ChatGPT.


Could you recommend one or a few in particular?


The current best open-weights model is probably Cohere Command R+. Its memory requirements are quite high, though.
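To put "quite high" in numbers, here's a back-of-envelope estimate from the roughly 104B parameter count Cohere reports for Command R+; it covers weights only and ignores KV cache and runtime overhead, which add several more GB:

    # Back-of-envelope only: weights alone, no KV cache or runtime overhead.
    params = 104e9  # reported parameter count for Command R+

    for bits in (16, 8, 4):
        gb = params * bits / 8 / 1e9
        print(f"{bits}-bit weights: ~{gb:.0f} GB")

    # 16-bit: ~208 GB, 8-bit: ~104 GB, 4-bit: ~52 GB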


I really want to see some benchmarks with performance weighted by energy use. I think Mistral 7B would lead performance per watt by a huge margin. On many zero-shot classification tasks I get performance from Mistral equal to that of bigger models.
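Something like tokens per joule is easy to approximate by hand: time a completion, count the generated tokens, and divide by an average power draw measured separately (powermetrics on macOS, for instance). A sketch, where the model path and the 30 W figure are placeholders of mine rather than measurements:

    # Crude sketch: the wattage is an assumed, externally measured figure,
    # not something computed here.
    import time
    from llama_cpp import Llama

    llm = Llama(model_path="mistral-7b-instruct-q4_k_m.gguf")  # placeholder path

    t0 = time.time()
    out = llm("Classify the sentiment of: 'great battery life, awful keyboard'.",
              max_tokens=64)
    elapsed = time.time() - t0

    tokens = out["usage"]["completion_tokens"]
    avg_watts = 30.0  # assumption: read from powermetrics during the run

    print(f"{tokens / elapsed:.1f} tok/s")
    print(f"{tokens / (elapsed * avg_watts):.2f} tok/J")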





