They also added both JSON and function support to the vision model - previously it didn't have those.
This means you can now use gpt-4-turbo vision to extract structured data from an image!
I was previously using a nasty hack where I'd run the image through the vision model to extract just the text, then run that text through regular gpt-4-turbo to extract structured data. I ditched that hack just now: https://github.com/datasette/datasette-extract/issues/19
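For anyone curious what the single-call flow looks like now, here's a minimal sketch using the openai Python SDK with a tool definition; the schema fields, image URL, and function name are purely illustrative, not taken from the linked project:

```python
# Minimal sketch: structured extraction from an image in one call, using the
# vision-capable gpt-4-turbo with a tool (function) schema.
# The "record_receipt" schema and the image URL are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "record_receipt",
        "description": "Record structured fields extracted from a receipt image",
        "parameters": {
            "type": "object",
            "properties": {
                "vendor": {"type": "string"},
                "date": {"type": "string"},
                "total": {"type": "number"},
            },
            "required": ["vendor", "total"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo",
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "record_receipt"}},
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the receipt details from this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.jpg"}},
        ],
    }],
)

# The structured result comes back as the tool call's arguments.
args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
print(args)
```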
I'm trying it for coding and have added it to my VS Code Copilot extension. Overall I'd say it's better at coding than the previous GPT-4 Turbo. https://double.bot if anyone wants to try it :)
Is being good at math that important to ChatGPT users, though? ChatGPT's ability to do math is so limited that I'm not sure what math problems we'd actually ask it to solve.
It means you don’t have to be as sketched out if you’re looking for something that requires basic math. Imagine generating the correct result of a unit test or something. I wouldn’t trust it either way, but I think this is a believable example.
Do you have examples? The other commenter's example of generating correct unit test results is indeed interesting, and I was wondering if there are other cases.
Not sure what you are looking for, but perhaps these thoughts can help. Better geometry understanding helps with spatial tasks. Better probability theory, discrete math, and linear algebra background helps with algo development. More broadly, being better at things that a mathematician is good at, could help tackle complicated tasks in finance, science, or engineering.
llama.cpp has a feature to enforce a certain structure on the LLM's output (though you still need the LLM to be capable of producing that structure, and it's beneficial to prompt it toward the exact result). So this particular story doesn't need to be the same across all LLMs.
llama.cpp grammars will get you results that definitely conform to your grammar, but that doesn't guarantee that they'll be semantically correct - the model could still hallucinate details incorrectly while returning valid JSON.
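For reference, here's a rough sketch of grammar-constrained sampling via the llama-cpp-python bindings; the GBNF grammar, model path, and prompt are illustrative, and the exact keyword arguments may differ by version:

```python
# Rough sketch of llama.cpp's grammar-constrained sampling through the
# llama-cpp-python bindings; model path and prompt are placeholders.
from llama_cpp import Llama, LlamaGrammar

# GBNF grammar: force output to be a flat JSON object with a "name" string field.
gbnf = r'''
root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" [^"]* "\""
ws     ::= [ \t\n]*
'''

grammar = LlamaGrammar.from_string(gbnf)
llm = Llama(model_path="model.gguf")  # any local GGUF model

out = llm(
    "Return a JSON object with the person's name: Ada Lovelace\n",
    grammar=grammar,   # output is guaranteed to match the grammar...
    max_tokens=64,
)
print(out["choices"][0]["text"])  # ...but the value itself can still be hallucinated
```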
In my testing I was better off running the image through AWS Textract then taking the output and feeding it to OpenAI. It was also much cheaper. Of course if all you are looking for is extraction then maybe you don't need OpenAI at all. I used it to clean up the OCR'd data and reformat it.
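Roughly what that pipeline can look like with boto3 and the openai SDK; the file name, prompt, and model choice are placeholders:

```python
# Sketch of the Textract-then-LLM pipeline described above:
# OCR with AWS Textract, then ask the model to clean up and reformat the text.
import boto3
from openai import OpenAI

textract = boto3.client("textract")
with open("scan.png", "rb") as f:
    ocr = textract.detect_document_text(Document={"Bytes": f.read()})

# Textract returns blocks; keep just the LINE-level text.
lines = [b["Text"] for b in ocr["Blocks"] if b["BlockType"] == "LINE"]
raw_text = "\n".join(lines)

client = OpenAI()
cleaned = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "Clean up this OCR output and reformat it as Markdown."},
        {"role": "user", "content": raw_text},
    ],
)
print(cleaned.choices[0].message.content)
```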
My experience is that it's pretty good at reading text and pretty bad at understanding layouts. So e.g. asking it to work with tables is asking for trouble.
Have you noticed GPT-4 Vision making weird date selections for 2019? While processing some data for work, I ended up switching to Textract and then passing the OCR'ed text to GPT-4 and mapping it to a Pydantic schema because of this issue.
Heh, cool to hear you're doing something like this too! We ended up in a similar place, but we also need good spatial relationships, which GPT-4V isn't great at, so we're using another OCR system and adding its result to the context.
Their naming and versioning choices have definitely created some confusion. Apparently GPT-4 Turbo doesn't just come with the new vision capabilities, but also with improved non-vision capabilities:
> "I’m hoping we can get evals out shortly to help quantify this. Until then - it’s a new model with various data and training improvements, resulting in better reasoning."[0]
> "- major improvements across the board in our evals (especially math)"[1]
In my experience the turbo models are significantly worse than the old GPT-4 models, especially in structured output and following instructions. I'm guessing it's due to a cheapened attention mechanism, but it's a bit disappointing that OpenAI tries to hide these limitations.
Worth noting that GPT can be accessed both through ChatGPT and via the OpenAI API. The link in this thread is pointing to documentation for the OpenAI API.
Vision means the model can see image inputs.
In the API, GPT with vision was previously available in a limited capacity (only some people, and didn't work with all features, like JSON mode and function calling). Now this model is available to everyone and it works with JSON mode and function calling. It also should be smarter at some tasks.
This model is now available in the API, and will roll out to users in ChatGPT. In the API, it's named `gpt-4-turbo-2024-04-09`. In ChatGPT, it will be under the umbrella of GPT-4.
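For anyone who has only used ChatGPT: a bare-bones vision request against that dated model name in the API might look something like this (the image URL and question are placeholders):

```python
# Bare-bones vision request against the dated model name mentioned above;
# the image URL and question are placeholders.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```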
People like to say AI is moving blazingly fast these days, but this has been like a year in the waiting queue. Guessing Sora will take equally long, if not way longer, before the general audience gets to touch it.
I think with Sora, "general availability" will be a much more expensive, higher-tiered sub with a limited number of gens per day, and I have my doubts that you'll just be able to sign up for this sub through the web; I wouldn't be surprised if it's an invite-only partners thing.
Here's a blog post about what I've been building with GPT-4 Turbo + Vision for structured data extraction into SQLite database tables from unstructured text and images:
Can I ask, how are people affording to play around with GPT4? Are you all doing it at work? Or is there some way I am unaware of to keep the costs down enough to play around with it for experimenting? It's so expensive!
How much are you using it? I have been accessing gpt-4-turbo via the API in a small Discord with a few friends using it as well. I have never gone above $5/month in usage.
If you want to use it over the API you need to pay for what you use. If you simply want to chat with it, currently the cheapest way is Cody by Sourcegraph ($9/month for unlimited GPT-4 and Opus, with a limit of 7k tokens). Phind is $20 per month, same models, 32k token limit.
This has made a huge difference in the way we extract structured data from images. Previously we had to perform a number of steps to ensure the JSON result was what we were looking for. Now we just get the function call exactly where we expect it.
I've used LLaVa a little bit through LM Studio, but it's really subpar, mostly due to the GUI I think. Is there a model and GUI that's better than LM Studio for vision adapters?
GPT-4V in OpenAI's chat interface is so seamless. LLM, text, speech input, speech output with tones and emotion, image generation, vision input, and soon it's going to be outputting video with Sora...
It's kind of amusing that early-2023 GPT-4 is still the benchmark for both closed-source and open-source models to compete against, even as the lead keeps expanding.
The interface, the fact that I need to load multiple adapters, and that I can't have a language model loaded at the same time. LM Studio is way better for language-model-only use at the moment.
Over Easter I asked GPT4 to count a mess of colored eggs on the floor that I was preparing for an egg hunt. They were mostly evenly separated and clearly visible (there were just 36).
I gave it two tries to respond and it wasn't even close to the correct answer.
Was it confused on colored eggs vs. "natural" eggs it might have been expecting? Should it have understood what I meant?
I imagine it would be better at describing an image in a general sense, but probably isn't processing it in a sense where it would actually be counting individual features. I could be wrong about that, but it seems like a combination of traditional CV and an LLM might be what's needed for more precise feature identification.
I'm pretty sure LLM vision capabilities are currently limited to something similar to subitizing in humans at best, i.e. perceiving the number of items without counting when there are fewer than ~7. Expecting it to actually count objects is a bit too much.
I've been disappointed with both GPT-4's and Gemini 1.5's image recognition abilities in general, not just counting. When I have asked them to describe a photo containing multiple objects--a street scene, a room--they identify some of the objects correctly but invariably hallucinate others, naming things that are not present. Usually the hallucinations are of things that might appear in similar photos but definitely are not in the photo I gave them.
It's cool to see this update and JSON mode and function calling will both be useful with vision. I wonder, though, if there were any other specific changes to the models since the `preview` versions besides that?
Slightly OT question - can I use GPT-4 vision to drive a web browser, e.g. to automate tasks like "sign up for this website using this email and password; don't subscribe to promotional emails"?
I believe tying it together might be a challenge. For example, if you were to use the model to get the text of buttons, you would still have to write code to find the HTML elements for those buttons and drive the click/fill actions.
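A sketch of that glue code using Selenium; the button label here is a stand-in for whatever the vision model actually returns, and the field names are made up:

```python
# Sketch of the glue code described above: the vision model suggests which
# button to press (by its visible label), and Selenium does the actual
# click/fill actions. `label_from_model` stands in for the model's output.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/signup")

label_from_model = "Sign up"  # hypothetical output from the vision model

# Find a button whose visible text matches the suggested label and click it.
button = driver.find_element(
    By.XPATH, f"//button[normalize-space()={label_from_model!r}]"
)
button.click()

# Fill in the credentials (field names here are assumptions).
driver.find_element(By.NAME, "email").send_keys("user@example.com")
driver.find_element(By.NAME, "password").send_keys("hunter2")
```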
I've used GPT-4 to help with selenium and it gets the answer eventually, but almost never on the first try. So automating this without human intervention sounds tricky.
For all of these posts saying function calling is now available, I feel like it's actually more of an optimization than a new capability.
All of the leading edge models will output JSON in any format you ask for, such as {"fn_name": {"arg1":10}}. I think this is about making it more accurate and having a standard input/output format.
You have to specify the schema for that regardless.
Function calling has been available for text inputs for a while; now it's also available for image inputs. OpenAI's function calling/structured data mode is much more strict and reliable at following the schema than just putting "return the output in JSON" in a system prompt.
Yes. But also note that the new function calling is actually “tool calling” where the model is also fine-tuned to expect and react to the output of the function (and there are various other nuances like being able to call multiple functions in parallel and matching up the outputs to function calls precisely).
When used in multi-turn “call/response” mode it actually does start to unlock some new capabilities.
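A sketch of that call/response loop with the openai SDK; the single `get_weather` tool and its stubbed result are made up purely for illustration:

```python
# Sketch of the multi-turn "call/response" loop described above. Tool outputs
# are matched back to the model via each call's tool_call_id, and the model
# may request several calls in parallel.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "Is it warmer in Oslo or in Madrid right now?"}]

resp = client.chat.completions.create(model="gpt-4-turbo", messages=messages, tools=tools)
msg = resp.choices[0].message

if msg.tool_calls:  # possibly several calls in parallel
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = {"city": args["city"], "temp_c": 12}  # stub; call a real API here
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,  # ties this output to the right call
            "content": json.dumps(result),
        })
    # Second turn: the model reads the tool outputs and answers in plain text.
    final = client.chat.completions.create(model="gpt-4-turbo", messages=messages, tools=tools)
    print(final.choices[0].message.content)
```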
Competition is great! But in this case, I don't know, adding JSON mode and function calling were pretty obvious next steps for the vision model - I bet they'd have happened anyway.
ChatGPT Plus was able to tell me what type of coffee capsules I should buy based on a picture containing multiple objects with text markings. The only coffeemaker in the shot had a sideways "NS" marking, so it recommended Nespresso-compatible ones.
Is this basically the same functionality except through API?
This news piece is, first and foremost, about the Model, not the ChatGPT System. (More about the difference between “Model” and “System”: https://ndurner.github.io/antropic-claude-amazon-bedrock). Not sure what their upgrade policy/process for ChatGPT is like, though.
> Updated GPT-3.5 Turbo
> The latest GPT-3.5 Turbo model with higher accuracy at responding in requested formats and a fix for a bug which caused a text encoding issue for non-English language function calls. Returns a maximum of 4,096 output tokens.