They also added both JSON and function support to the vision model - previously it didn't have those.
This means you can now use gpt-4-turbo vision to extract structured data from an image!
I was previously using a nasty hack where I'd run the image through the vision model to extract just the text, then run that text through regular gpt-4-turbo to extract structured data. I ditched that hack just now: https://github.com/datasette/datasette-extract/issues/19
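For anyone curious what the single-call flow looks like now, here's a minimal sketch using the openai Python SDK with a tool definition; the schema fields, image URL, and function name are purely illustrative, not taken from the linked project:

```python
# Minimal sketch: structured extraction from an image in one call, using the
# vision-capable gpt-4-turbo with a tool (function) schema.
# The "record_receipt" schema and the image URL are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "record_receipt",
        "description": "Record structured fields extracted from a receipt image",
        "parameters": {
            "type": "object",
            "properties": {
                "vendor": {"type": "string"},
                "date": {"type": "string"},
                "total": {"type": "number"},
            },
            "required": ["vendor", "total"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo",
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "record_receipt"}},
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the receipt details from this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.jpg"}},
        ],
    }],
)

# The structured result comes back as the tool call's arguments.
args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
print(args)
```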
I'm trying it for coding and have added it to my VS Code Copilot extension. Overall I'd say it's better at coding than the previous GPT-4 Turbo. https://double.bot if anyone wants to try it :)
Is being good at math that important to ChatGPT users, though? ChatGPT's ability to do math is so limited that I'm not sure what math problems we'd actually ask it to solve.
It means you don’t have to be as sketched out if you’re looking for something that requires basic math. Imagine generating the correct result of a unit test or something. I wouldn’t trust it either way, but I think this is a believable example.
Do you have examples? The other commenter's example of generating correct unit test results is indeed interesting, and I was wondering if there are other cases.
Not sure what you are looking for, but perhaps these thoughts can help. Better geometry understanding helps with spatial tasks. Better probability theory, discrete math, and linear algebra background helps with algo development. More broadly, being better at things that a mathematician is good at, could help tackle complicated tasks in finance, science, or engineering.
llama.cpp has a feature to enforce a certain structure on the LLM's output (though you still need the LLM to be capable of producing that structure, and it's beneficial to prompt it toward the exact result). So this particular story doesn't need to be the same across all LLMs.
llama.cpp grammars will get you results that definitely conform to your grammar, but that doesn't guarantee that they'll be semantically correct - the model could still hallucinate details incorrectly while returning valid JSON.
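For reference, here's a rough sketch of grammar-constrained sampling via the llama-cpp-python bindings; the GBNF grammar, model path, and prompt are illustrative, and the exact keyword arguments may differ by version:

```python
# Rough sketch of llama.cpp's grammar-constrained sampling through the
# llama-cpp-python bindings; model path and prompt are placeholders.
from llama_cpp import Llama, LlamaGrammar

# GBNF grammar: force output to be a flat JSON object with a "name" string field.
gbnf = r'''
root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" [^"]* "\""
ws     ::= [ \t\n]*
'''

grammar = LlamaGrammar.from_string(gbnf)
llm = Llama(model_path="model.gguf")  # any local GGUF model

out = llm(
    "Return a JSON object with the person's name: Ada Lovelace\n",
    grammar=grammar,   # output is guaranteed to match the grammar...
    max_tokens=64,
)
print(out["choices"][0]["text"])  # ...but the value itself can still be hallucinated
```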
In my testing I was better off running the image through AWS Textract then taking the output and feeding it to OpenAI. It was also much cheaper. Of course if all you are looking for is extraction then maybe you don't need OpenAI at all. I used it to clean up the OCR'd data and reformat it.
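Roughly what that pipeline can look like with boto3 and the openai SDK; the file name, prompt, and model choice are placeholders:

```python
# Sketch of the Textract-then-LLM pipeline described above:
# OCR with AWS Textract, then ask the model to clean up and reformat the text.
import boto3
from openai import OpenAI

textract = boto3.client("textract")
with open("scan.png", "rb") as f:
    ocr = textract.detect_document_text(Document={"Bytes": f.read()})

# Textract returns blocks; keep just the LINE-level text.
lines = [b["Text"] for b in ocr["Blocks"] if b["BlockType"] == "LINE"]
raw_text = "\n".join(lines)

client = OpenAI()
cleaned = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "Clean up this OCR output and reformat it as Markdown."},
        {"role": "user", "content": raw_text},
    ],
)
print(cleaned.choices[0].message.content)
```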
My experience is that it's pretty good at reading text and pretty bad at understanding layouts. So e.g. asking it to work with tables is asking for trouble.
Have you noticed GPT-4 Vision making weird date selections for 2019? While processing some data for work, I ended up switching to Textract and then passing the OCR'ed text to GPT-4 and mapping it to a Pydantic schema because of this issue.
Heh, cool to hear you're doing something like this too! We ended up in a similar place, but we also need good spatial relationships, which GPT-4V isn't great at, so we're using another OCR system and adding its result to the context.
Their naming and versioning choices have definitely created some confusion. Apparently GPT-4 Turbo doesn't just come with the new vision capabilities, but also with improved non-vision capabilities:
> "I’m hoping we can get evals out shortly to help quantify this. Until then - it’s a new model with various data and training improvements, resulting in better reasoning."[0]
> "- major improvements across the board in our evals (especially math)"[1]
In my experience the turbo models are significantly worse than the old GPT-4 models, especially in structured output and following instructions. I'm guessing it's due to a cheapened attention mechanism, but it's a bit disappointing that OpenAI tries to hide these limitations.
Worth noting that GPT can be accessed both through ChatGPT and via the OpenAI API. The link in this thread is pointing to documentation for the OpenAI API.
Vision means the model can see image inputs.
In the API, GPT with vision was previously available in a limited capacity (only some people, and didn't work with all features, like JSON mode and function calling). Now this model is available to everyone and it works with JSON mode and function calling. It also should be smarter at some tasks.
This model is now available in the API, and will roll out to users in ChatGPT. In the API, it's named `gpt-4-turbo-2024-04-09`. In ChatGPT, it will be under the umbrella of GPT-4.
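For anyone who has only used ChatGPT: a bare-bones vision request against that dated model name in the API might look something like this (the image URL and question are placeholders):

```python
# Bare-bones vision request against the dated model name mentioned above;
# the image URL and question are placeholders.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```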
People like to say AI is moving blazingly fast these days, but this has been like a year in the waiting queue. Guessing Sora will take equally long, if not way longer, before the general audience gets to touch it.
I think with Sora, "general availability" will be a much more expensive, higher-tiered sub with a limited number of gens per day, and I have my doubts that you'll just be able to sign up for this sub through the web; I wouldn't be surprised if it's an invite-only partners thing.
Here's a blog post about what I've been building with GPT-4 Turbo + Vision for structured data extraction into SQLite database tables from unstructured text and images:
Can I ask, how are people affording to play around with GPT4? Are you all doing it at work? Or is there some way I am unaware of to keep the costs down enough to play around with it for experimenting? It's so expensive!
How much are you using it? I have been accessing gpt-4-turbo via the API in a small Discord with a few friends using it as well. I have never gone above $5/month in usage.
If you want to use it over the API you need to pay for what you use. If you simply want to chat with it, currently the cheapest way is Cody by Sourcegraph ($9/month for unlimited GPT-4 and Opus, with a limit of 7k tokens). Phind is $20 per month, same models, 32k token limit.
This has made a huge difference in the way we extract structured data from images. Previously we had to perform a number of steps to ensure the JSON result was what we were looking for. Now we just get the function call exactly where we expect it.
I've used LLaVa a little bit through LM Studio, but it's really subpar, mostly due to the GUI I think. Is there a model and GUI that's better than LM Studio for vision adapters?
GPT-4V in OpenAI's chat interface is so seamless. LLM, text, speech input, speech output with tones and emotion, image generation, vision input, and soon it's going to be outputting video with Sora...
It's kind of amusing that early-2023 GPT-4 is still the benchmark for both closed-source and open-source models to compete against, even as the lead keeps expanding.
The interface, the fact that I need to load multiple adapters, and that I can't have a language model loaded at the same time. LM Studio is way better for language-model-only use at the moment.
Over Easter I asked GPT4 to count a mess of colored eggs on the floor that I was preparing for an egg hunt. They were mostly evenly separated and clearly visible (there were just 36).
I gave it two tries to respond and it wasn't even close to the correct answer.
Was it confused on colored eggs vs. "natural" eggs it might have been expecting? Should it have understood what I meant?
I imagine it would be better at describing an image in a general sense, but probably isn't processing it in a sense where it would actually be counting individual features. I could be wrong about that, but it seems like a combination of traditional CV and an LLM might be what's needed for more precise feature identification.
I'm pretty sure LLM vision capabilities are currently limited to something similar to subitizing in humans at best, i.e. perceiving the number of items without counting when there are fewer than ~7. Expecting it to actually count objects is a bit too much.
I've been disappointed with both GPT-4's and Gemini 1.5's image recognition abilities in general, not just counting. When I have asked them to describe a photo containing multiple objects--a street scene, a room--they identify some of the objects correctly but invariably hallucinate others, naming things that are not present. Usually the hallucinations are of things that might appear in similar photos but definitely are not in the photo I gave them.
It's cool to see this update and JSON mode and function calling will both be useful with vision. I wonder, though, if there were any other specific changes to the models since the `preview` versions besides that?
Slightly OT question - can I use GPT-4 vision to drive a web browser, e.g. to automate tasks like "sign up for this website using this email and password; don't subscribe to promotional emails"?
I believe tying it together might be a challenge. For example, if you were to use the model to get the text of buttons, you would still have to write code to find the HTML elements for those buttons and drive the click/fill actions.
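A sketch of that glue code using Selenium; the button label here is a stand-in for whatever the vision model actually returns, and the field names are made up:

```python
# Sketch of the glue code described above: the vision model suggests which
# button to press (by its visible label), and Selenium does the actual
# click/fill actions. `label_from_model` stands in for the model's output.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/signup")

label_from_model = "Sign up"  # hypothetical output from the vision model

# Find a button whose visible text matches the suggested label and click it.
button = driver.find_element(
    By.XPATH, f"//button[normalize-space()={label_from_model!r}]"
)
button.click()

# Fill in the credentials (field names here are assumptions).
driver.find_element(By.NAME, "email").send_keys("user@example.com")
driver.find_element(By.NAME, "password").send_keys("hunter2")
```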
I've used GPT-4 to help with selenium and it gets the answer eventually, but almost never on the first try. So automating this without human intervention sounds tricky.
For all of these posts saying function calling is now available, I feel like it's actually more of an optimization than a new capability.
All of the leading edge models will output JSON in any format you ask for, such as {"fn_name": {"arg1":10}}. I think this is about making it more accurate and having a standard input/output format.
You have to specify the schema for that regardless.
Function calling has been available for text inputs for a while; now it's also available for image inputs. OpenAI's function calling/structured data mode is much more strict and reliable at following the schema than just putting "return the output in JSON" in a system prompt.
Yes. But also note that the new function calling is actually “tool calling” where the model is also fine-tuned to expect and react to the output of the function (and there are various other nuances like being able to call multiple functions in parallel and matching up the outputs to function calls precisely).
When used in multi-turn “call/response” mode it actually does start to unlock some new capabilities.
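A sketch of that call/response loop with the openai SDK; the single `get_weather` tool and its stubbed result are made up purely for illustration:

```python
# Sketch of the multi-turn "call/response" loop described above. Tool outputs
# are matched back to the model via each call's tool_call_id, and the model
# may request several calls in parallel.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "Is it warmer in Oslo or in Madrid right now?"}]

resp = client.chat.completions.create(model="gpt-4-turbo", messages=messages, tools=tools)
msg = resp.choices[0].message

if msg.tool_calls:  # possibly several calls in parallel
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = {"city": args["city"], "temp_c": 12}  # stub; call a real API here
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,  # ties this output to the right call
            "content": json.dumps(result),
        })
    # Second turn: the model reads the tool outputs and answers in plain text.
    final = client.chat.completions.create(model="gpt-4-turbo", messages=messages, tools=tools)
    print(final.choices[0].message.content)
```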
Competition is great! But in this case, I don't know, adding JSON mode and function calling were pretty obvious next steps for the vision model - I bet they'd have happened anyway.
ChatGPT Plus was able to tell me what type of coffee capsules I should buy based on a picture containing multiple objects with text markings. The only coffeemaker in the shot had a sideways "NS" marking, so it recommended Nespresso-compatible ones.
Is this basically the same functionality except through API?
This news piece is, first and foremost, about the Model, not the ChatGPT System. (More about the difference between “Model” and “System”: https://ndurner.github.io/antropic-claude-amazon-bedrock). Not sure what their upgrade policy/process for ChatGPT is like, though.
> Updated GPT-3.5 Turbo
> The latest GPT-3.5 Turbo model with higher accuracy at responding in requested formats and a fix for a bug which caused a text encoding issue for non-English language function calls. Returns a maximum of 4,096 output tokens.