GPT-4 Turbo with Vision Generally Available (platform.openai.com)
220 points by davidbarker on April 9, 2024 | 107 comments


They also added both JSON and function support to the vision model - previously it didn't have those.

This means you can now use gpt-4-turbo vision to extract structured data from an image!

I was previously using a nasty hack where I'd run the image through the vision model to extract just the text, then run that text through regular gpt-4-turbo to extract structured data. I ditched that hack just now: https://github.com/datasette/datasette-extract/issues/19
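The new call looks roughly like this (a minimal sketch with the openai Python client; the event schema and image URL are made-up placeholders, not the actual datasette-extract code):

    # Sketch: structured extraction from an image via function calling.
    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "record_event",
            "description": "Record an event extracted from a poster image",
            "parameters": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "date": {"type": "string", "description": "ISO 8601 date"},
                    "location": {"type": "string"},
                },
                "required": ["title"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the event details from this poster."},
                {"type": "image_url", "image_url": {"url": "https://example.com/poster.jpg"}},
            ],
        }],
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "record_event"}},
    )

    print(response.choices[0].message.tool_calls[0].function.arguments)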


One of the OpenAI PMs was also saying the model got substantially better at math: https://x.com/owencm/status/1777770827985150022

I'm trying it for coding and have added it to my VS Code Copilot extension. Overall I'd say it's better at coding than the previous GPT-4 Turbo. https://double.bot if anyone wants to try it :)


Is being good at math that important to ChatGPT users, though? ChatGPT's ability to do math is so limited that I'm not sure what math problems we would ask ChatGPT to solve.


It means you don’t have to be as sketched out if you’re looking for something that requires basic math. Imagine generating the correct result of a unit test or something. I wouldn’t trust it either way, but I think this is a believable example.


Math is not just arithmetic and yes better math does help at least some GPT-4 users.


Do you have examples? The other commenter's example of generating correct unit test results is indeed interesting, and I was wondering if there are other cases.


Not sure what you are looking for, but perhaps these thoughts can help. Better geometry understanding helps with spatial tasks. Better probability theory, discrete math, and linear algebra background helps with algo development. More broadly, being better at things that a mathematician is good at, could help tackle complicated tasks in finance, science, or engineering.


We've been using `gpt-4-1106-vision-preview` and simply prompting the model to return json, with excellent results: https://github.com/OpenAdaptAI/OpenAdapt/pull/610 (work in progress).


> This means you can now use gpt-4-turbo vision to extract structured data from an image!

How consistent and reliable is the extracted structure?

Did they add any kind of mechanism along the lines of "for whatever token the model thinks comes next, 'unit test' it / make sure it passes some sort of rules"?


It's pretty good, but it's not reliable enough to exclude the need to check everything it does.

Same story as basically everything relating to LLMs to be honest.


llama.cpp has a feature to enforce a certain structure on the LLM's output (though you still need the LLM to be capable of producing that structure, and it's beneficial to prompt it towards the exact result). So this particular story doesn't have to be the same across all LLMs.

https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
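With the llama-cpp-python bindings it looks roughly like this (a sketch; the model path and the tiny grammar are purely illustrative):

    # Constrained decoding via a GBNF grammar (llama-cpp-python bindings).
    from llama_cpp import Llama, LlamaGrammar

    # Tiny grammar: output must be exactly {"answer": "yes"} or {"answer": "no"}.
    GRAMMAR = r'root ::= "{\"answer\": " ("\"yes\"" | "\"no\"") "}"'

    llm = Llama(model_path="./models/some-model.gguf")  # placeholder path
    grammar = LlamaGrammar.from_string(GRAMMAR)

    out = llm(
        "Is the sky blue? Reply in JSON.",
        grammar=grammar,   # sampling is restricted to strings matching the grammar
        max_tokens=32,
    )
    print(out["choices"][0]["text"])  # e.g. {"answer": "yes"}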


llama.cpp grammars will get you results that definitely conform to your grammar, but that doesn't guarantee that they'll be semantically correct - the model could still hallucinate details incorrectly while returning valid JSON.


Indeed it's not the solution to LLM hallucinations—as far as I know nobody knows a solution to it.

But it is the solution to needing to re-run the model and check the format of the output to ensure that it conforms to your expectations.


^ This.

I've generally found better success with Textract, then passing the OCR'ed text to OpenAI and using Pydantic to get it structured.


In my testing I was better off running the image through AWS Textract then taking the output and feeding it to OpenAI. It was also much cheaper. Of course if all you are looking for is extraction then maybe you don't need OpenAI at all. I used it to clean up the OCR'd data and reformat it.
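For anyone curious, the pipeline is roughly this (a sketch assuming boto3, the openai Python client, and Pydantic v2; the Invoice schema and file name are made up for illustration):

    # Sketch of the Textract -> GPT-4 Turbo -> Pydantic pipeline described above.
    import boto3
    from openai import OpenAI
    from pydantic import BaseModel

    class Invoice(BaseModel):
        vendor: str
        total: float
        date: str

    # 1. OCR with Textract
    textract = boto3.client("textract")
    with open("invoice.png", "rb") as f:
        blocks = textract.detect_document_text(Document={"Bytes": f.read()})["Blocks"]
    text = "\n".join(b["Text"] for b in blocks if b["BlockType"] == "LINE")

    # 2. Clean up / structure with GPT-4 Turbo in JSON mode
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "Return JSON with keys: vendor, total, date."},
            {"role": "user", "content": text},
        ],
    )

    # 3. Validate with Pydantic
    invoice = Invoice.model_validate_json(resp.choices[0].message.content)
    print(invoice)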


My experience is that it's pretty good at reading text and pretty bad at understanding layouts. So e.g. asking it to work with tables is asking for trouble.


Yeah it's absolutely horrible at layouts.

I'm not 100% sure it's related but if you ask it to draw bounding boxes around things it's always off by quite a bit.


Even in the GitHub link they just posted above, it made up the year as 2022 and hallucinated the start and end time.

https://github.com/datasette/datasette-extract/issues/19


It's very consistent. Check out this guy: he was able to structure LLM output using Pydantic in an elegant solution:

https://www.youtube.com/watch?v=yj-wSRJwrrc


> How consistent and reliable is the extracted structure?

That's the $100B question.


Or $T these days


You restrict the model's next prediction to valid JSON tokens. (If you mean format reliability.)


I'm waiting to be able to restrict output to a specific JSON Schema.


Providing examples generally helps me. Have you checked out the Instructor library too?
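For reference, Instructor usage looks roughly like this (a sketch; the patching call has changed across versions, and the Person model is just an example):

    # Sketch of Instructor: pass a Pydantic model as the response schema.
    import instructor
    from openai import OpenAI
    from pydantic import BaseModel

    class Person(BaseModel):
        name: str
        age: int

    client = instructor.from_openai(OpenAI())  # older releases used instructor.patch()

    person = client.chat.completions.create(
        model="gpt-4-turbo",
        response_model=Person,  # Instructor validates (and retries) against this schema
        messages=[{"role": "user", "content": "Extract: Jane Doe is 31 years old."}],
    )
    print(person)  # Person(name='Jane Doe', age=31)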


Great work, thank you, Simon.

Have you noticed GPT-4 Vision do weird selections of dates for 2019? In doing some processing of some data for work, I ended up switching to textract and then passing the OCR'ed text to GPT-4 and mapping to a Pydantic schema due to this issue.


Heh, cool to hear you're doing something like this too! We ended up close, but we also need good spatial relationships, which GPT-4V isn't great at, so we're using another OCR system and adding the result to the context.


Does anyone know if we can access the same on Azure OpenAI? It still shows vision-preview on my end in the West US region, with no JSON mode.


What time is the event? I think 18:00 - 22:00 is completely fabricated.


Their naming and versioning choices have definitely created some confusion. Apparently GPT-4 Turbo doesn't just come with the new vision capabilities, but also with improved non-vision capabilities:

> "I’m hoping we can get evals out shortly to help quantify this. Until then - it’s a new model with various data and training improvements, resulting in better reasoning."[0]

> "- major improvements across the board in our evals (especially math)"[1]

[0] https://twitter.com/owencm/status/1777784000712761430

[1] https://twitter.com/stevenheidel/status/1777789577438318625


In my experience the turbo models are significantly worse than the old GPT-4 models, especially in structured output and following instructions. I'm guessing it's due to the cheapened attention mechanism, but it's a bit disappointing that OpenAI tries to hide these limitations.


As a regular user of ChatGPT-4, this press release makes little sense to me.

What does Vision mean? I've already been able to upload docs, create images, etc. for a while now?

What are the "other improvements"? What is Turbo, what is 4.5, and what is this new one called?

How do I even see what version of the model I'm using in their interface when it just says "4"?


Worth noting that GPT can be accessed both through ChatGPT and via the OpenAI API. The link in this thread is pointing to documentation for the OpenAI API.

Vision means the model can see image inputs.

In the API, GPT with vision was previously available in a limited capacity (only some people, and didn't work with all features, like JSON mode and function calling). Now this model is available to everyone and it works with JSON mode and function calling. It also should be smarter at some tasks.

This model is now available in the API, and will roll out to users in ChatGPT. In the API, it's named `gpt-4-turbo-2024-04-09`. In ChatGPT, it will be under the umbrella of GPT-4.


People like to say AI is moving blazingly fast these days, but this has been like a year in the waiting queue. Guessing Sora will take equally long, if not way longer, before the general audience gets to touch it.


Sounds like a complaint about a chair in the sky... https://youtu.be/8r1CZTLk-Gk


The chair in the sky keeps on turnin'... and I don't know if I'll have access tomorrow.


I think about this bit a lot, mostly while I'm being mildly inconvenienced by something.


Haven't seen this in years, love this video


I could be misjudging the situation entirely, but Sora seems like it is on a much longer "general availability" timeline.


I think with Sora, "general availability" will be a much more expensive, higher-tiered sub with a limited number of gens per day. I have my doubts that you'll just be able to sign up for this sub through the web; I wouldn't be surprised if it's an invite-only partners thing.


Yeah, OpenAI's head of something (CTO?) said late fall, if it's safe enough. Gonna be a while until us normal people get our hands on it.


Here's a blog post about what I've been building with GPT-4 Turbo + Vision for structured data extraction into SQLite database tables from unstructured text and images:

https://www.datasette.cloud/blog/2024/datasette-extract/

YouTube video demo here: https://www.youtube.com/watch?v=g3NtJatmQR0 (3m43s)


Can I ask, how are people affording to play around with GPT4? Are you all doing it at work? Or is there some way I am unaware of to keep the costs down enough to play around with it for experimenting? It's so expensive!


How much are you using it? I have been accessing gpt-4-turbo via the API in a small Discord with a few friends using it as well. I have never gone above $5/month in usage.


Depends what "so expensive means".

Roughly, $100 will get you 5-10 novels worth of output, assuming 100-200k words.

That's a lot.


If you want to use it over the API you pay for what you use. If you simply want to chat with it, currently the cheapest way is Cody by Sourcegraph ($9/month for unlimited GPT-4 and Opus, with a 7k token limit). Phind is $20 per month, same models, 32k token limit.


Thanks everyone. Cody sounds like a good option. Alternatively maybe I can try something like this :) https://twitter.com/mertdumenci/status/1777882582136529130


How expensive? Plenty of folks here can afford to burn $1000 / mo to play with a hobby, I think that's how.


Apply for the Microsoft for Startups Founders Hub; you get $2.5k worth of credits for free.


Work for GPT-4 API calls; for personal use, GPT-3.5 Turbo and Ollama Mixtral.


This has made a huge difference in the way we extract structured data from images. Previously we had to perform a number of steps to ensure the JSON result was what we were looking for. Now we just get the function call exactly where we expect it.

If you're a C# developer looking to take advantage of this improvement, check out our free open-source implementation at https://github.com/BackslashDev-LLC/img-to-json and our sample application at https://github.com/BackslashDev-LLC/med-intake-demo


I've used LLaVA a little bit through LM Studio, but it's really subpar, mostly due to the GUI I think. Is there a model and GUI that's better than LM Studio for vision adapters?

GPT-4V in OpenAI's chat interface is so seamless: LLM, text, speech input, speech output with tones and emotion, image generation, vision input, and soon it's going to be outputting video with Sora...

It's kind of amusing that early-2023 GPT-4 is still the benchmark that both closed-source and open-source models compete against, when the lead is just expanding.


What do you dislike about it?


The interface, that I need to load multiple adapters, and that I don't have a language model available at the same time. LM Studio is way better for language-only models at the moment.


Is gpt-4-turbo-2024-04-09 basically an updated version of the gpt-4-1106-vision-preview ?


Over Easter I asked GPT4 to count a mess of colored eggs on the floor that I was preparing for an egg hunt. They were mostly evenly separated and clearly visible (there were just 36).

I gave it two tries to respond and it wasn't even close to the correct answer.

Was it confused on colored eggs vs. "natural" eggs it might have been expecting? Should it have understood what I meant?


I imagine it would be better at describing an image in a general sense, but probably isn't processing it in a sense where it would actually be counting individual features. I could be wrong about that, but it seems like a combination of traditional CV and an LLM might be what's needed for more precise feature identification.


The use case of "counting objects" is already basically solved by YOLOv8. There is no need to use an LLM for that.
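Something like this (a sketch with the ultralytics package; note that "egg" isn't a COCO class, so the parent's exact task would likely need a custom-trained model):

    # Counting detected objects with YOLOv8 via the ultralytics package.
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")             # pretrained nano checkpoint
    results = model("eggs_on_floor.jpg")   # hypothetical image path

    boxes = results[0].boxes               # one Results object per image
    print(f"Detected {len(boxes)} objects")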


Seems like we're pretty far away from "AGI" or even anything resembling a legit "multi model"


multi modal


I need an AI model to tell me which AI model/tool to use.


Then you might want to look into query routing or function calling.


I'm pretty sure LLM vision capabilities are currently limited to something similar to subitizing in humans at best, i.e. being able to perceive the number of items without counting when there are fewer than ~7. Expecting it to be able to actually count objects is a bit too much.


LLMs are very bad at counting.


I've been disappointed with both GPT-4's and Gemini 1.5's image recognition abilities in general, not just counting. When I have asked them to describe a photo containing multiple objects--a street scene, a room--they identify some of the objects correctly but invariably hallucinate others, naming things that are not present. Usually the hallucinations are of things that might appear in similar photos but definitely are not in the photo I gave them.


Could someone explain why, on their documentation page (https://help.openai.com/en/articles/8555496-gpt-4-vision-api), they link to https://web.archive.org/web/20240324122632/https://platform.... instead of https://platform.openai.com/docs/guides/vision? As a person who pays OpenAI a lot of money each month, I see this as a bit parasitic (unless they donate substantially to the Internet Archive).


I would assume a mistake, albeit one that raises questions about their QA processes. A static HTML page is hardly going to break their bank.


Seems to be changed to the correct link now


It's cool to see this update and JSON mode and function calling will both be useful with vision. I wonder, though, if there were any other specific changes to the models since the `preview` versions besides that?


Is it possible to upload an image to the OpenAI Chat Playground to try it out?


    curl https://api.openai.com/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "gpt-4-turbo",
        "messages": [
          {
            "role": "user",
            "content": [
              {
                "type": "text",
                "text": "What'\''s in this image?"
              },
              {
                "type": "image_url",
                "image_url": {
                  "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gf..."
                }
              }
            ]
          }
        ],
        "max_tokens": 300
      }'



No, but you can use the API via Postman.


Slightly OT question - can I use GPT-4 Vision to drive a web browser, e.g. to automate tasks like "sign up for this website using this email and password; don't subscribe to promotional emails"?


Yes, I played around with Skyvern and it works really well!

https://www.skyvern.com/

https://github.com/Skyvern-AI/Skyvern


I believe tying it together might be a challenge. For example, if you were to use the model to get the text of buttons, you would still have to write code to find the HTML elements for those buttons and drive the click/fill actions.
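E.g. you'd end up writing something along these lines (a sketch with Selenium; the URL and button text are placeholders standing in for whatever the model reports back from a screenshot):

    # Bridging the gap: ask the model for the visible button text,
    # then locate and click that element with Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com/signup")

    # Suppose GPT-4V looked at a screenshot and answered: "Sign up"
    button_text = "Sign up"
    driver.find_element(
        By.XPATH, f"//button[normalize-space()='{button_text}']"
    ).click()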


I've used GPT-4 to help with selenium and it gets the answer eventually, but almost never on the first try. So automating this without human intervention sounds tricky.


Rubs crystal ball: Widespread availability in Azure will take four months. No wait, it’s just a software change, I’m being silly… six months.


I bet they’ll bump you up in the queue if you buy some sweet sweet PTUs


Even the previous model version is unavailable in my entire region.


For all of these posts saying function calling is now available, I feel like it's actually more of an optimization than a new capability.

All of the leading edge models will output JSON in any format you ask for, such as {"fn_name": {"arg1":10}}. I think this is about making it more accurate and having a standard input/output format.


You have to specify the schema for that regardless.

Function calling has been available for text inputs for a while; now it's also available for image inputs. OpenAI's function calling/structured data mode is much stricter and more reliable at following the schema than just putting "return the output in JSON" in a system prompt.


Yes. But also note that the new function calling is actually “tool calling” where the model is also fine-tuned to expect and react to the output of the function (and there are various other nuances like being able to call multiple functions in parallel and matching up the outputs to function calls precisely).

When used in multi-turn “call/response” mode it actually does start to unlock some new capabilities.
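Roughly, that multi-turn loop looks like this (a sketch with the openai Python client; get_weather is the usual toy example, not a real API):

    # Tool calling round trip: the model requests a tool, you run it locally,
    # and you feed the result back, matched up by tool_call_id.
    import json
    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    messages = [{"role": "user", "content": "What's the weather in Paris?"}]
    resp = client.chat.completions.create(model="gpt-4-turbo", messages=messages, tools=tools)
    call = resp.choices[0].message.tool_calls[0]

    # Run the tool locally (stubbed here) and return its output to the model.
    result = {"city": json.loads(call.function.arguments)["city"], "temp_c": 18}
    messages += [
        resp.choices[0].message,
        {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
    ]
    final = client.chat.completions.create(model="gpt-4-turbo", messages=messages, tools=tools)
    print(final.choices[0].message.content)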


This is just responding to Anthropic, right? Funny how it took competition for them to make Vision-class models available.


GPT-4 Vision has been around for a while in beta; it's just GA now.

It's expensive though: Anthropic's Claude Haiku can process images significantly cheaper.


That's typically how competition works right? Your competitors push you to do better


Probably the Gemini 1.5 Pro GA that was announced hours earlier.


Competition is great! But in this case, I don't know, adding JSON mode and function calling was a pretty obvious next step for the vision model - I bet it'd have happened anyway.


`Moderation` doesn't support images yet, I believe. Does anybody have a good image moderation API they are using?


ChatGPT Plus was able to tell me what type of coffee capsules I should buy based on a picture of multiple objects with text markings. The sole coffeemaker had a sideways "NS" logo, so: Nespresso-compatible ones.

Is this basically the same functionality, except through the API?


Why does the production model merge vision with the base Turbo model such that the maximum output tokens remains 2048 instead of 4096?

I ask because, if we are using just text, why is the output size still reduced?


That's our mistake - it's still 4096. Assuming you saw this on the Playground, we'll fix it shortly. If you saw it somewhere else, please let me know.

https://platform.openai.com/playground/chat?model=gpt-4-turb...


Yes I saw it in the playground. I will keep checking for updates for 4096. Thank you for the clarification.

Edit (4:57 ET): "gpt-4-turbo" shows the updated 4096 in playground. "gpt-4-turbo-2024-04-09" remains 2048 in playground.


(9:32 PM ET) Update: "gpt-4-turbo-2024-04-09" now shows the updated 4096 in the playground too. Thank you for the fix! :)


Are there any additional improvements to function calling/json mode? In the non-turbo models it really struggled with enums.


What software do people use to interact with these models via chat?



I've had this for a while, was I in a test bucket?


Apparently Gemini by Google has a 20% LLM market share.


According to whom? How would you assess it?


Probably saw this tweet from Nat Friedman: https://twitter.com/natfriedman/status/1777739863678386268


This methodology would ignore every API-driven use of the models that doesn't go through the first-party web interfaces.


Is it worth still paying the $20?


This news piece is, first and foremost, about the Model, not the ChatGPT System. (More about the difference between “Model” and “System”: https://ndurner.github.io/antropic-claude-amazon-bedrock). Not sure what their upgrade policy/process for ChatGPT is like, though.


As per Steven Heidel's tweet, this version will be released to ChatGPT soon.



KPIs going down


"a fix for a bug" <laugh emoji>

gpt-3.5-turbo-0125 New

Updated GPT 3.5 Turbo The latest GPT-3.5 Turbo model with higher accuracy at responding in requested formats and a fix for a bug which caused a text encoding issue for non-English language function calls. Returns a maximum of 4,096 output tokens. Learn more.



