Benchmarking GPT-4 Turbo – A Cautionary Tale (mentat.ai)
230 points by ja3k on Nov 9, 2023 | 111 comments



Aider has had an Exercism benchmarking suite for quite some time.

Interestingly, my benchmarks of GPT-4 Turbo show the opposite result: the new gpt-4-1106-preview did significantly better on the first try than the March and June models.

https://aider.chat/docs/benchmarks-1106.html

Aider benchmarks against the 133 Exercism Python exercises, not the JS exercises that mentat's benchmark uses. So this is not an apples-to-apples comparison, but there doesn't seem to be a strong reason to expect qualitatively different results.

I also notice that the instructions prompt that mentat uses seems to be inspired by the aider benchmark? Glad to see others adopting similar benchmarking approaches.

https://github.com/AbanteAI/mentat/blob/main/tests/benchmark...

https://github.com/paul-gauthier/aider/blob/main/benchmark/p...

Edit: Not sure if the mentat authors are in this thread? After looking around a bit, there seems to be a bunch of aider code in your repo. Some attribution would be appreciated. It might even be required under aider's Apache 2.0 license?


Hey Paul, I'm a Mentat author.

> I also notice that the instructions prompt that mentat uses seems to be inspired by the aider benchmark? Glad to see others adopting similar benchmarking approaches.

We were inspired by you to use Exercism as a benchmark, thank you! We will add attribution for that. We switched our original instruction prompts for that benchmark to be similar to Aider's to allow for a fair comparison.

> After looking around a bit, there seems to be a bunch of aider code in your repo. Some attribution would be appreciated.

We have an unused implementation of your output response format (https://github.com/AbanteAI/mentat/blob/main/mentat/parsers/...), but I don't know what else you are seeing? We implemented that to compare with our response formats and didn't find much difference in performance.


I didn't spend much time looking, but your benchmark prompting inspired me to search your repo for "aider". The results were 3 PRs where aider was mentioned in the conversations [0].

The "code map" PR in particular mentions being "inspired by aider", links to aider and seems to include a bunch of code from aider's old ctags based "repo map" implementation. This isn't an insignificant component of an AI coding tool.

Aider is open source and I try and share my learnings as I'm building it. So it's great when other projects get inspiration from aider! But it is polite to provide attribution for such inspiration, especially if you crib from code with an attribution license.

[0] https://github.com/search?q=repo%3AAbanteAI%2Fmentat+aider&t...


I’ve been using the new model with Aider since it was released, and my anecdata agrees: the “edits applied successfully” failure rate is much lower than with classic GPT-4.

Also THANK YOU for Aider! I talk it up to all my programmer friends; it really feels like a glimpse into the future of coding.


Isn't it a good thing that, of the benchmarks they ran, the newer model has fewer of the answers memorized (aka it's parroting less)?

Wouldn't this actually be proof that the model has improved over its predecessor, by having to solve the problem itself rather than relying on memory?

What use is a model that memorizes the answers to all the benchmarks? (See the 7B models on the Open LLM Leaderboard for more on that.)


I feel like I see this A LOT these days. If you do a Show HN (for example) and your project is directly inspired by somebody else's who came before you, the least you can do is give nominal attribution.

What is it about software development in particular that makes people so seemingly ethically unfettered by blatant plagiarism?


I am also noticing a massive improvement over the old model


Sorry about that. We updated the blog with attribution and put an attributing comment in our code base where we use your benchmarking prompts. We'll probably delete our implementation of your response format later today since we just had it for benchmarking.


Does aider work with C# at all?


Yes!

Thanks for asking. I've been meaning to address these kinds of questions in the aider FAQ [0]. Here's the entry I just added:

Aider supports pretty much all the popular coding languages. This is partly because GPT-4 is fluent in most mainstream languages, and familiar with popular libraries, packages and frameworks.

In fact, coding with aider is sometimes the most magical when you're working in a language that you are less familiar with. GPT often knows the language better than you, and can generate all the boilerplate to get to the heart of your problem. GPT will often solve your problem in an elegant way using a library or package that you weren't even aware of.

Aider uses tree-sitter to do code analysis and help GPT navigate larger code bases by producing a repository map [1].

Aider can currently produce repository maps for most mainstream languages, listed below. But aider should work quite well for other languages, even without repo map support.

  - C
  - C#
  - C++
  - Emacs Lisp
  - Elixir
  - Elm
  - Go
  - Java
  - JavaScript
  - OCaml
  - PHP
  - Python
  - QL
  - Ruby
  - Rust
  - TypeScript
[0] https://aider.chat/docs/faq.html#what-code-languages-does-ai...

[1] https://aider.chat/docs/repomap.html
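
To give a flavor of the repo map machinery mentioned above, here's a rough sketch of a tree-sitter pass over a single file. This is just an illustration using the tree-sitter-languages convenience package, not aider's actual repo map code:

```
# Rough sketch: list class/function definitions in one file with tree-sitter.
# Assumes the tree-sitter-languages package; aider's real repo map does much more.
from tree_sitter_languages import get_parser

source = b"class Greeter:\n    def greet(self, name):\n        return f'hello {name}'\n"
parser = get_parser("python")
tree = parser.parse(source)

def list_definitions(node, depth=0):
    # Print only definition nodes to keep the "map" terse.
    if node.type in ("class_definition", "function_definition"):
        name = node.child_by_field_name("name")
        print("  " * depth + source[name.start_byte:name.end_byte].decode())
    for child in node.children:
        list_definitions(child, depth + 1)

list_definitions(tree.root_node)
```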


I've just started playing with aider this week, and I find it extremely fun and exciting. But I will say that I've had middling results with an Elixir / Phoenix app. I don't think this has anything to do with aider - rather, I think that the GPT models haven't quite internalized the new approaches in Phoenix 1.7, since up until Turbo their training data was fairly old and probably still contains more pre 1.7 Phoenix examples than post 1.7.

In spite of these frustrations, I have had some genuinely amazing moments coding with GPT-4 lately. I upgraded to ChatGPT Plus recently and it's just mind-blowing how helpful it can be in the right contexts. I'm hoping that as I get better with aider I might just drop the ChatGPT sub and stick to API usage.

I totally understand the skepticism many have, because this stuff is still a bit finicky - but I'm overwhelmed by a sense of how fucking _cool_ this stuff is quite often.


I was actually wondering this myself yesterday. So it's not possible to plug a different tree-sitter implementation in for a niche language?


It should be possible, but it isn't supported currently. Aider would need a bit more configurability to be able to load arbitrary tree-sitter language implementations at runtime.

There's an open issue you might want to follow for updates:

https://github.com/paul-gauthier/aider/issues/321


The problem is that the discussed results compare proportions over a relatively small number of questions - 67. If you model this as a binomial distribution, then the 62/67 that GPT-4 Turbo scored gives a 95% confidence interval for the 'true' performance of 83.4% to 97.5%, i.e. it comfortably includes the proportion that GPT-4 achieved (64/67 = 95.5%).

I think the evidence from these tests is not strong enough to draw conclusions from.
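
For anyone who wants to check the arithmetic, here's a quick sketch. It assumes the interval above is the exact (Clopper-Pearson) binomial interval and uses SciPy to reproduce it from the 62/67 and 64/67 counts:

```
from scipy.stats import binomtest

# Exact (Clopper-Pearson) 95% confidence intervals for each model's pass rate.
for label, passed in [("gpt-4-1106-preview", 62), ("gpt-4", 64)]:
    ci = binomtest(passed, n=67).proportion_ci(confidence_level=0.95)
    print(f"{label}: {passed}/67 = {passed / 67:.1%}, "
          f"95% CI [{ci.low:.1%}, {ci.high:.1%}]")
```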


Yes. I see people make this mistake time and again when evaluating LLMs. For a proper comparison, it's not enough to simply throw fewer than a hundred questions at it and point to a single-digit difference. Not to mention that LLMs have some inherent randomness, so even if you passed the exact same tasks to the same model you would expect some variance.

I see a lot of room for improvement in how we apply statistics to understanding LLM performance.


I’m not surprised, most people can’t even tell the median from the mean.


> I think the evidence from these tests is not strong enough to draw conclusions from.

I used GPT-4 Turbo for some coding problems yesterday. It was worse. That's enough to draw conclusions for me.


The thing is, why do GPT-4 Turbo and the updated GPT-3.5 Turbo have an output limit of only 4,096 tokens?

Previous Model: gpt-3.5-turbo-16k, 16385 tokens context and completion (shared)

New Model: gpt-3.5-turbo-1106, 16385 tokens context, 4096 tokens completion

Previous Model: gpt-4, 8192 tokens context and completion (shared)

New Model: gpt-4-1106-preview, 128000 tokens context, 4096 tokens completion

Why would a GPT-3.5 model of the same 16K size now not allow larger completion sizes?

Why would the new GPT-4 reduce the completion tokens as well? gpt-4 can do 8192 and gpt-4-32k can do 32768 completion tokens; now the limit is 4096.

So you would need to change the way you prompt (split the responses) to be able to get a longer response.

---

So are these new models taking the old base models with 4K tokens of context and completion, and changing the context to 128000 while leaving the completion the same? If they could offer gpt-4 as both gpt-4-8k and gpt-4-32k, why couldn't this have been 128000 context and 32768 completion?
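
For what it's worth, the split-responses workaround looks roughly like this. A minimal sketch assuming the openai>=1.0 Python SDK: keep asking the model to continue whenever a completion gets cut off at the 4,096-token cap (finish_reason == "length"):

```
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

messages = [{"role": "user", "content": "Write out the full module we discussed."}]
chunks = []
while True:
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=messages,
        max_tokens=4096,  # the hard completion cap on the -1106 models
    )
    choice = resp.choices[0]
    chunks.append(choice.message.content)
    if choice.finish_reason != "length":  # "length" means we hit the token cap
        break
    # Feed the partial answer back and ask the model to keep going.
    messages.append({"role": "assistant", "content": choice.message.content})
    messages.append({"role": "user", "content": "Continue exactly where you left off."})

full_output = "".join(chunks)
```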


Probably because it's too expensive. The prompt can be processed quickly, but output tokens are generated much more slowly (and that makes them more expensive).

From my local test on a 13B model, output tokens are 20-30x more expensive than input tokens. So OpenAI's pricing structure is based on the expectation that there are many more input than output tokens in an average request. It didn't matter too much if a small percentage of requests used all 4k tokens for output, but with 128k it's a different story.


I believe OpenAI wants to lower the time it takes for requests to finish, to be able to accept more requests per server/GPU. I.e., money.


If I'm not mistaken, the model has to be trained for a specific context window.


More or less. There's stuff you can do to extend the window of an existing model fairly easily, i.e. on a LoRA-type training budget, O($1000).

But in practice, even when a max output token count as large as the context size was enabled, it simply couldn't make use of it, no matter how many prompt engineering tricks I threw at it. [1] And I've heard anecdotally that the same is true for that LoRA-type technique.

[1] TL;DR, about 1/5th the actual length: write 100 pages, 3 paragraphs each, number the pages as you go and write 1 page at a time until 100. Also write out "I have written page N and need to write 100 pages total" after each page.

Inevitably it would "get tired" and be like "end page 23...now page 100"


GPT-4 Turbo is dramatically worse at one task I often try:

Read the following passage from [new ML article]. Identify their assumptions, and tell me which mathematical operations or procedures they use depend upon these assumptions.

GPT-4: Usually correctly identifies the assumptions, and often quotes the correct mathematics in its reply.

GPT-4 Turbo: Sometimes identifies the assumptions, but is guaranteed to stop trying at that point and then give me a Wikipedia-like summary about the assumptions rather than finishing the task. Further prompting will not improve its result.


Do you have a link or gist of an example run you tried? I'd be curious to try something similar.


> We designed a test for this theory: we reran the benchmarks without showing the models the instructions to each exercise. Instead, we just told them that they were Exercism exercises, and gave them the exercise names and function stubs.

This summarizes all my skepticism about the AI field. It's pretty clear that they aren't solving the problems; they have them memorized.


Memorization often gets a bad rap as the underachiever's shortcut. However, it's a fundamental component of any learning process! Our ability to reason, solve problems, and innovate is all built upon a foundation of memorized information. In fact, it's precisely the reason humans have thrived for so long; we were able to memorize and pass down knowledge culturally long before the written word, not because we were 100 times smarter than our nearest cousins. Without memorization, be it in our brains or AI algorithms, there's no foundation to build upon for higher reasoning.


It's hard for me to decide without seeing the data. Even if you don't know the exact exercise, seeing the title and the function name/parameters is often enough for me to guess what the challenge is. I checked the public questions on Exercism, and almost all of those I spot-checked that contained the function name were extremely obvious. Knowing it's a programming challenge would also improve my guessing chances.

For example the function stubs I can find are "value_of_card(<card>)" in exercise "Black Jack", or "generate_seat_letters(<number>)" in exercise "Plane Tickets". I think I could guess those without seeing the rest of the question.


You can call it whatever you want; all I know is I used to write programs in lines of code, then in blocks of code at a time spit out by LLMs.

Using GPT-4 Turbo yesterday, I feel like I'm moving to pages of code at a time now.

Taking the ideas in my head and turning them into reality is so easy now


So how can it solve novel problems? The internet does not have solutions for every combination of task, programming language, library, and constraint. It can even solve problems with non-existent programming languages and libraries if you describe them. If that's just memorization, then I don't know what isn't.


If that's your takeaway from this then you really missed the point. The implication here is that gpt-4 to gpt-4-turbo represents a leap away from memorization and toward better reasoning with a more complete world model.


"They memorized all the problems" is not what was found here and still a wrong overcorrection.


“Gpt-4 has more problems memorized than gpt-4 turbo” was exactly what was found here.

That doesn’t mean it’s only able to solve problems in its training set (though it’s much better at those, obviously).


If you are shown only the title of a coding problem and the name of the site it's from, and you manage to solve it, you are showing that you either cheated or knew the answer.


On the contrary, it could mean you were, with some percentage of success, able to guess what the problem is, and then, with some multiplier on that percentage of success, solve it.

The key is, can you guess the problem from the title and the function name? I'd argue sure, at least half the time? Why not...


I mean sure, it memorized some of the answers. I'm not denying that. Clearly, it didn't memorize all of them.


When people say "oh look how amazing, it can solve programming problems!" when in fact they have only seen the models CHEAT, that is an enormous problem.

For cases where just finding the answer is the goal, that's perfectly fine, but it's not fine as support for claims that it can code. There's a huge difference.


It can generate never-before-seen strings of comprehensible language. It can react to the inherent logic embedded in words and text and provide a brute forced version of what a human could. That it can “solve” a problem only through “cheating” is an anthropomorphism that betrays the magic that is evident to anyone who has used these things.


I've seen it code on completely novel tasks, so I'm not sure what you're suggesting here. The model can unquestionably code.


Almost 2024 and people still can't accept that LLM can code...


Of course they can't. And self-driving cars also don't exist; they're like 10 years away at best.


Okay... Funny how forcing it to not CHEAT did not increase apparent ability.

"It can code" and "it has memorized some coding questions" are not mutually exclusive.


Though this is exactly what happened. The initial test was run on a model that "cheated" (aka had memorized the answers). The second test was run on a model that didn't "cheat" as much, yet it still scored only 2% lower. So the question is not really resolved. How much did the first model cheat, and how much did the second? If the second model "cheats" less, then it wins.

Also, I don't understand your obsession with the word cheating. If you have solved a problem before on a different website and solve it again, did you cheat? Or did you just use your brain to store the solution for later?


> Also, I don't understand your obsession with the word cheating.

It's all about the rule set, yeah. Since the rule set is not defined, technically nothing is cheating. I just interpret the rule set as "can it code?", and for this rule set it seems to me that it's cheating.


> How much did the first model cheat, and how much did the second? If the second model "cheats" less, then it wins.

They both cheated 100%. Because they both never saw the problem. AT ALL. They just saw the title and the name of the website.


> Okay... Funny how forcing it to not CHEAT did not increase apparent ability.

The article did the opposite. It forced the models to cheat to solve the problems. Which they did happily. They should have stated "there is no actual problem to solve here, you must supply a problem for me to solve".

> "It can code" and "it has memorized some coding questions" are not mutually exclusive

This I will give you. Many humans try to cheat at basic math because they are lazy, and so will this model. Maybe that's a sign of intelligence :P


Me: What's 6x6?

You: 36

Me: You cheated! You just cited the answer you memorized! You should have started from addition.

You: ...okay? 6+6=12, 12+6=18, 18+...

Me: You cheated again! You just have 6+6=12 memorized! You should make the rule of addition out of Peano axioms.

You: ...you're being annoying, but okay? First axiom, we define 0 as...

Me: You cheated again! You memorized Peano Axioms! Jesus Christ, is there any intelligent creature left?


TBH, people underestimate how much of coding is just memorization. I'm guessing those of us with bad memories understand this more than the ones with good memories. :)

I can't remember how many times I've googled "how do I create a directory in Python?". Now Bard often generates an inline answer for me.


But in this case it's not like that at all. They only saw the NAME of the problem. Like if I said "Page 23 of Mathbook Y, problem number 3". Which happens to be 6x6.


I know this is deep down a bad comment thread, but I thought I'd chime in here.

I have been writing function names and test names, and then telling GPT to fill in the test, which it usually does how I want (maybe with errors, but it tests the correct thing), and then I tell it to fill out the answers.

This is in a thing I'm building that's never been built before, with names that I made up (but that describe the functionality well).

It cannot have this memorized; I just invented it myself.


If I gave you a programming problem and all I told you was that the problem name was Traveling Salesman, you might be able to solve it based on that.

If not that, then if I just said "fizzbuzz" to you, I'm sure you would be able to give the solution without me needing to give any other description of the problem.


Again, because of memorization, not being able to code.


But in that case, not memorization of the specific problem set, but "programming background knowledge." Hardly something to blame the machine for when we rely on it every day.


Me: I was in such a blah blah situation... does article 3 of the Digital Government Act apply here?

My lawyer: Hmm the article 3 says--

Me: I knew it! Lawyers are not intelligent...


It said they gave the exercise name, which doesn't sound like just an exercise number but something mildly descriptive -- and they also gave it function stubs.


OK, but you understand there's a body of literature that shows that LLMs don't "just" memorize.


+100 to that. My biggest scepticism is people actually creating a new problem while thinking they are solving one. Don't get me wrong, translating natural language ideas into code is fun and all, but the truth is it's also code, just in an ambiguous language format given to the machine.

When did natural language become better for expressing development ideas than code? I know – when you don't know how to code in the first place. Then you have to bet on all of the ambiguities, cultural and metaphysical, that words carry in order to hack your thing together, instead of expressing yourself directly and explicitly.

Finally, what is beautiful about the strict code format we are so used to is that it is truly the fastest and shortest path to get your thing done, provided you possess the knowledge needed.


Natural language isn't superior to computer languages. NL allows you to describe a software concept in a language- and framework-neutral way. The LLM generates the code. The real benefit comes when you work across languages and frameworks. It is difficult to keep all of the details of all of the framework calls in your head all of the time.


Where is the evidence for that? Any real-world application made and running by describing software concepts to an LLM?

It is what it is – a novel search engine, lossy and non-credible. Effectively useless on codebases that extend beyond its fairly limited context


That sounds a lot like gatekeeping.

These tools will empower folks who aren’t developers to build stuff and maybe learn a bit more about how programming works.

They will enable folks who have ideas, but can’t express them, to actually be able to create what they are imagining.

That’s awesome.

Code isn’t beautiful (except for a few rare exceptions). Creating something with code is.


I agree it is a great tool for learning, but I don't believe anything more complex or of real use can be made AND maintained with it.


I think we’re probably way too early in the AI lifecycle to really form any strongly held beliefs yet.

In the 11 months since ChatGPT was released, things have come a long way. Who knows where we’ll be in another 11 months.


What I'm trying to say is that the problem – efficiently generating code by describing what you want – is not approachable this way at all. When you compress what you want into a prompt you lose the details, and in order to restore all of them you need a much larger prompt volume than the code generated. It is code itself that compresses an idea; no idea can compress the code well enough. In another 11 months it will be in exactly the same spot – it will not be able to be more efficient at this task, by the nature of it.


From a black-box point of view, and from one angle, GPT is a web filter that tries to find you the exact thing you are looking for, but from memory. Versus Google, where you have to distill all the info into what you need yourself.


"memorize" implies they can only recite things verbatim and that's ignoring the massive leap in being able to synthesize disjoint "memories" in new ways.


Even if it's not true AI, or even an architecture with the potential to become AI, LLMs are already good enough to provide real-world value. Obviously "super autocomplete" isn't as sexy as true AI, but it's still very useful.


If the benchmark is meant to replicate the experience most people have taking technical interviews, then this is a spot-on approach and serves the potential user well.


LLMs are lossy compression


All models are, including the human brain.


The human brain is a model?


It models the world around it, so it's fairly similar to what GPT does, especially with the newly-added image capabilities and stuff.


But the brain itself is not a model.


Consciousness itself is a model of the world.

Our experience of the world is a model executing.

Comparing the latest neuroscience to the latest neural networks, they look and behave very similarly.


Why are all the comments here so negative? This is a good thing: Turbo has less memorization but keeps the same reasoning ability. That's excellent and a relief.


People here spent a lot of time (and money) in school learning to do things that can now be automated. The whining is just beginning.


Or the programming quiz problems it tried to "solve" were in fact posted elsewhere too, so it cheated on the ones it got right as well.


I have reached similar conclusions so far. We have a custom data set (basically visual Q&A about web apps) and `gpt4` gets roughly 90% correct, while `gpt-4-1106-preview` gets only 86%. It's a little noisy (I haven't yet checked out the new seed functionality), but roughly consistent.

Since I created this dataset by hand, it can't really be memorized. I'm sure there's _similar_ data in the training set, but answering correctly still requires some reasoning-like capabilities.


Everyone who has any knowledge of machine learning knows that you don't evaluate your model by testing it on parts of its training data. The issue is, the training data for GPT-4 appears to be "everything".


I'm interested in more testing on the context side of things.

For my NLP pipelines, I batch n articles together to process (extract fields from) in one prompt (the final output is something like {"1":[{}], "2": [{},{}]...}), all in one message. Compute-wise it's inefficient, but OpenAI charges by the token so it doesn't matter. It's very reliable on gpt-4 8k.

I was also pretty happy with the results on 4-turbo initially but it seems that once you go past 30k-ish tokens in context (needs way more testing), it shits itself. The indexes don't match anymore and n_final_output is different from n_articles.

Still, great model and even if the limits are lower in practice I suspect I'll get good use out of it.

Edit: With better prompting, it feels stable at n=42, ~42000 prompt tokens.
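
For anyone curious, the batching looks roughly like this; the prompt wording, field names, and helper functions below are simplified for illustration, not my actual pipeline:

```
import json

def build_batched_prompt(articles):
    # Number each article so the model can key its JSON output by index.
    numbered = "\n\n".join(f"ARTICLE {i + 1}:\n{text}" for i, text in enumerate(articles))
    return (
        "Extract the fields {title, date, entities} from each article below.\n"
        'Respond with JSON shaped like {"1": [{...}], "2": [{...}, {...}], ...},\n'
        "one key per article number, and include every article exactly once.\n\n"
        + numbered
    )

def check_batch(articles, raw_response):
    # The sanity checks mentioned above: output count and index alignment.
    parsed = json.loads(raw_response)
    assert len(parsed) == len(articles), "n_final_output != n_articles"
    assert set(parsed) == {str(i + 1) for i in range(len(articles))}, "indexes don't match"
    return parsed
```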


Interesting. I was skeptical about some of their claims regarding longer context, since it's been my experience that these models just get lost after enough of it.


Yeah, degraded performance on long contexts has been observed in plenty of other models [https://arxiv.org/abs/2307.03172] so I was cautious too. Unfortunately I don't have access to 4-32k. I would have liked to test that out too.


In my day job we use GPT4 quite a bit and we shifted to GPT4 Turbo today. We got a 2-5% performance increase, and quite a bit of speed increase as well.

Not to say that the parent post is incorrect, of course. Only that it's not as cut-and-dried as "GPT-4 Turbo is distilled (read: watered down) GPT-4".


Interesting. What do you use it for?


Currently only for unstructured (OCR) text to structured text conversion.

We're transitioning from a legacy codebase full of regexes and undocumented functions that are understood only by the developer and god. The developers left and I don't believe in god. We tried throwing the unstructured mess at GPT, along with a few examples, and got surprisingly good results.


> undocumented functions that are understood only by the developer and god

oh the irony :)


I don't follow. Ironic how?


You replaced it with a system that is even worse than "undocumented functions that are understood only by the developer and god" by design. It's not even deterministic.


Oh yeah. One hundred percent true :D It just happens to be significantly better both in terms of precision and recall than the former solution.


This is a big problem with independent LLM testing. You need to make sure your test set isn't included in the training set, which isn't easy with closed-source models.

This makes me think of how hardware manufacturers optimize for benchmarks. Closed source LLMs can intentionally include likely test data in their training set to artificially inflate results. I'm not saying they are intentionally doing that now, but they could.


> Although the author OCR’ed the SAT questions and believes that they weren’t in the training data

I agree that the author of the tweet rather underestimates the potential portion of OCR'ed content in OpenAI's training data. In late August, Nougat [1], an OCR model, was released by Meta. Its performance is wild and the model is open source.

I can hardly believe that OpenAI does not spend effort on getting more training data from OCR'ed content. I also can hardly believe that OpenAI waited for a Meta paper to have a performant internal OCR model.

[1]: https://arxiv.org/abs/2308.13418


Very interesting and basically confirms that GPT-4 turbo is a faster but dumber model. When a task doesn't rely on memorization of the training set, it reasons similarly well to GPT-4. Where memorization is helpful, it performs worse (due to quantization-induced "memory loss").

This also makes me look at GPT-4 as a "weak reasoner with a lot of knowledge". That really aligns with my experience where it is immensely helpful and has a superhuman knowledge base but still needs handholding to solve real problems.


I've always been skeptical of benchmarking because of the memorization problem. I recently made up my own (simple) date reasoning benchmark to test this, and found that GPT-4 Turbo actually outperformed GPT-4: https://open.substack.com/pub/talcai/p/making-up-a-new-llm-b...


I like the test but do you take multiple samples / runs of a result? IMO for a proper benchmark you should ask it the same question 10+ times and show a confidence interval, otherwise you don't know if it's just a fluke or a lucky guess.


Ahh good suggestion, I should clarify this in the article. I tried to compensate with volume -- I used a set of 200 questions for the testing. I was using temperature 0, so I'd get the same answer if I ran a single question multiple times.


GPT-4 Turbo is still in preview; maybe wait until it is fully released before judging?


The point of a preview phase is to test the model in real world use.


This isn't really real-world use, any more than putting these same problems to people as a whiteboard coding exercise in an interview is real-world coding. Yet a lot of people seem to be generalising from this tiny sample to all manner of overarching statements about the model's performance in general: "it's faster but dumber", "this proves it only memorises", etc.


A slightly off-topic question: when people are talking about costs with GPT, like in the following link, does the cost concern only apply to the API? If you’re using the web UI and have a Plus account, is it always just the flat $20 amount?

https://news.ycombinator.com/item?id=38193978


Usually, yes (it's either the cost of the API, or the cost to serve for OpenAI).


Now do a programming task that requires more than 32k of context and see who’s “better”. If you don’t benchmark that, you cannot get an overall picture. GitHub Copilot, for example, could benefit a lot from the increased context.


Obviously it's a drawback, but the silver lining of the small context window is that it forces me to decouple everything and have very sensible and strict APIs, where I just write the docs and it writes the code.


We are working on creating "real world" benchmarks that require a lot of context, and will report when we have results!


I think it's interesting that forcing models out of memorization doesn't always show a steep drop in ability.

I've definitely had instances where GPT-4 had memorized a common puzzle and failed a subtly altered variant, but then got the variant after I changed variable names or otherwise made it look different from what it would have memorized.


That’s why it’s called 4 Turbo, not “4.5”. But the longer context length is a bigger cargo space.


I wonder how often a human could guess the exercise based on just the function stub.


Yeah, some of the exercises are like the following:

```
function helloWorld() {
  return "";
}

helloWorld()
```

But those sorts of obvious examples are mostly in the beginner exercises, so I wonder what the distribution of correct answers was. If it was guessing based on function stubs, the prediction would be that correct answers would be clustered around the beginner exercises, and that as the exercises advanced in difficulty there would be fewer correct answers.


This reflects my anecdotal findings in a very narrow domain: GPT-4 consistently resolves complex SQL from natural language, whereas GPT-4-Turbo is hit-and-miss, and similar to 3.5-Turbo in performance.


But that’s not what the article is saying at all?


Is it known what exactly OpenAI does in the background when they make these Turbo editions?

It seems like they're sacrificing some quality for large gains on speed and cost, but does anyone know more detail?


Don't think so, but there were some guesses on 3.5-turbo -- i.e. training a much smaller model on quality questions/answers from GPT-4. The same tactic has worked again and again for other LLMs.

I'm definitely curious about the context window increase -- I'm having a hard time telling if it's 'real' vs. a fast, specially trained summarization prework step. That being said, it's been doing a rather solid job of not losing info in that context window in my minor anecdotal use cases.


tl;dr: GPT-4 Turbo has a worse score on the synthetic benchmark on the first attempt because, they speculate, it's a smaller model and isn't able to memorize the responses as well.


OpenAI should be covered in lawsuits for nerfing a product people paid for and expect not to degrade over time while they keep paying the same amount.



