Improving image generation with better captions [pdf] (cdn.openai.com)
126 points by alievk on Oct 19, 2023 | 37 comments



Core takeaways:

- major improvement is from detailed image captioning

- they trained an image captioning model to produce short and detailed captions

- they use T5 text encoder

- they use GPT-4 to "upsample" short user prompts (see the sketch after this list)

- they train a U-net decoder and distill it to 2 denoising steps

- text rendering is still unreliable; they believe the model has a hard time mapping word tokens to letters in the image
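
For the GPT-4 "upsample" step, the idea is simply to have GPT-4 expand the user's terse prompt into a long, descriptive caption before it reaches the image model. Here's a minimal sketch of my own using the OpenAI Python SDK (v1+); the system prompt is paraphrased, not the one OpenAI actually uses:

```python
# Sketch of prompt "upsampling": rewrite a short user prompt into a detailed caption.
# My own illustration, not OpenAI's code; the system prompt is paraphrased.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

UPSAMPLE_SYSTEM_PROMPT = (
    "You expand short image prompts into long, highly detailed captions. "
    "Describe the subject, setting, style, lighting, and composition explicitly."
)

def upsample_prompt(short_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": UPSAMPLE_SYSTEM_PROMPT},
            {"role": "user", "content": short_prompt},
        ],
    )
    return resp.choices[0].message.content

print(upsample_prompt("a cat in a spacesuit"))
```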


They have only distilled the diffusion model used as a replacement for the VAE decoder, not the main diffusion model.


So the latent-to-pixel-space conversion is done with another diffusion model?

Is this like the refiner used in SDXL?


>So the latent-to-pixel-space conversion is done with another diffusion model?

Yup

>Is this like the refiner used in SDXL?

The refiner only works on the latent.
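
For intuition, here's a toy sketch (entirely my own, not OpenAI's code) of the structural difference: a VAE decoder turns the latent into pixels in a single forward pass, while DALL-E 3 swaps it for a small latent-conditioned diffusion model that, after consistency-style distillation, only needs 2 denoising steps.

```python
# Toy sketch (mine, not OpenAI's) of a latent-conditioned diffusion decoder
# distilled to two denoising steps, standing in for the usual one-shot VAE decode.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDecoderUNet(nn.Module):
    """Hypothetical stand-in for the real decoder U-Net."""
    def __init__(self, img_ch=3, latent_ch=4):
        super().__init__()
        self.net = nn.Conv2d(img_ch + latent_ch, img_ch, kernel_size=3, padding=1)

    def forward(self, noisy_img, latent, t):
        # Condition on the latent by upsampling and concatenating it (t is unused in this toy).
        latent_up = F.interpolate(latent, size=noisy_img.shape[-2:], mode="nearest")
        return self.net(torch.cat([noisy_img, latent_up], dim=1))  # predicts the clean image

@torch.no_grad()
def decode_latent(latent, model, steps=(0.999, 0.5)):
    """Simplified two-step decoding loop: noise -> refined pixel estimate."""
    b, _, h, w = latent.shape
    x = torch.randn(b, 3, h * 8, w * 8)  # start from noise at pixel resolution
    for t in steps:
        x = model(x, latent, t)
    return x

image = decode_latent(torch.randn(1, 4, 32, 32), ToyDecoderUNet())
print(image.shape)  # torch.Size([1, 3, 256, 256])
```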


> During testing, we have noticed that this capability is unreliable as words have missing or extra characters. We suspect this may have to do with the T5 text encoder we used: when the model encounters text in a prompt, it actually sees tokens that represent whole words and must map those to letters in an image.

Ah, interesting remark regarding text rendering.


This isn't a new problem. A Google paper from late 2022 mentioned it; the problem went away when they used a byte/character-level text encoder (vs. tokenizing the text first).


"Character-Aware Models Improve Visual Text Rendering" https://arxiv.org/abs/2212.10562 for anyone curious


This is an aside, but in general, aren’t tokens a pretty bad hack to improve context length? I don’t know enough about the theory to say, but it seems like they should make all kinds of things much more fragile, like how well the model can handle misspellings, or languages that don’t map well to the available tokens.


It still is an amazing state machine.


It's pretty unreal how much better DALL-E 3 is at following prompts accurately versus its largest for-profit competitor, Midjourney. Quality-wise I think MJ still has the edge, but if it takes 2000 v-rolls to get there (assuming you can at all), then I'd say MJ has a steep hill to climb.


If you haven't tried Ideogram or DeepFloyd, those are even better at the specific case of "writing verbatim text in your prompt". To the point that Ideogram's trending page was entirely taken over by images of Latina women's names with bling effects and Disney characters, last I looked.

DALLE3 is definitely amazingly higher quality, but I still feel it's kind of… useless. It's too hard to control in conversation form, because it's not a multimodal LLM; it just works by rewriting text prompts. ChatGPT doesn't really know what dalle can and can't do, the actual dalle model still frequently fights you and just generates whatever it wants, the rewritten prompts sometimes leave out details or contain conflicting ones, and every prompt is written in "friendly harmless AI voice", full of superfluous hype adjectives.

This is odd because GPT4V actually is a multimodal model, and asking it to describe images as text works really well IME.

Common failure modes I saw playing with it:

* if you set something in, e.g., France, ChatGPT will rewrite it to be diverse and inclusive by literally saying "diverse people" a lot, but then it also adds a lot of super-stereotypical details, and then dalle ignores half of it. So you get sometimes-diverse French people who are all wearing striped shirts and berets in front of the Eiffel Tower.

* You can have a conversation with it and tell it to edit images, which it does by writing new prompts, but it's dumb and thinks the prompts are part of the conversation. So a second round prompt will sometimes leave out all the details and just say "The girl from before" and produce a useless image.

* Composition is usually boring (very center-aligned and symmetrical) and if you try to control it by using words like "camera", it will put cameras in the picture half the time.

* Typical AI failure modes: "anime" generates a kind of commercial/fanart style that doesn't exist rather than frames from a TV show. You can't generate things if they're hard to explain in English, like "a keyboard whose keys are house keys".

But it's really good at putting things in Minecraft.


> Composition is usually boring (very center-aligned and symmetrical) and if you try to control it by using words like "camera", it will put cameras in the picture half the time.

- Be more descriptive and specific with your prompts. I always add "shot from afar" or "wide-angled" or "shot with an 85mm lens" - never had an issue with boring composition.

> Typical AI failure modes: "anime" generates a kind of commercial/fanart style that doesn't exist rather than frames from a TV show.

- Again, be more descriptive and specific with your prompts, e.g. 'a character in the style of Naruto'. You need to specify the show or artist instead of using the broad term "anime".

> You can't generate things if they're hard to explain in English, like "a keyboard whose keys are house keys".

- One more time, you just need to be more specific. Try: 'a computer keyboard, but instead of regular keys, replace each one with a metal house key.'


DALLE3 is explicitly designed around not having to do this; the point is that it'll write the detailed prompt for you in four different ways so you get more variation.

"Wide angle" and "fisheye" do work, but "lens" is dangerous; anything that can be read as an object will tend to cause that object to appear instead of being used metaphorically. (Though trying a bit more, that doesn't usually happen, if only because it rewrites the prompt to not mention it.)

> - Again, be more descriptive and specific with your prompts, e.g. 'a character in the style of Naruto'. You need to specify the show or artist instead of using the broad term "anime".

Explicitly against the rules. It'll block you if you try to use any real person's name or anything it thinks is copyrighted. You can paraphrase though, or argue with it which weirdly sometimes works.

> One more time, you just need to be more specific. Try: 'a computer keyboard, but instead of regular keys, replace each one with a metal house key.'

Tried it, doesn't work well. If it gets to dalle it's not smart enough to reliably do "instead of" (or generally anything that's a "deletion"). It'll just put in all three concepts.

https://imgur.com/a/l8RS1Lu

Synonyms can help, if one exists in English. Emoji do interesting things, but I can't tell how well they work.


My typical use cases don't usually involve actual text in the pictures themselves, but I've definitely seen lots of situations where it gets confused and tries to insert the description of the image into random speech bubbles. I can usually fix this by explicitly stating that there should be no text in the image.

I've gotten some pretty accurate images in one shot, though, that would've taken me 1000 rolls and prompt mangling to get MJ to do:

- Historical 1980s photograph of the Kool-Aid man busting through the Berlin Wall

I haven't tried DeepFloyd or Ideogram, so maybe I'll give them a shot. From what I've gathered, 99% of people who use these just generate anime or facial-portrait-type stuff, so most of the models out there are tuned for that kind of thing. DALL-E 3 is the first model I've used (including SDXL, etc.) that can actually get pretty close to matching my prompt.


>Typical AI failure modes: "anime" generates a kind of commercial/fanart style that doesn't exist rather than frames from a TV show.

People usually want anime artwork, not frames from a show. If you want that, take the effort to write it.


I mention this one because Dalle2 actually does work if you say "anime screenshot" (it knows studio names too). 3 does not understand "screenshot" (it makes pictures of people watching something on a TV) and blocks you if you try to reference anything copyrighted, so no "N64 game" either. Time periods like "80s anime" kinda work though.

But the main point is that it makes everything online-English-centric. That style isn't called "anime" in Chinese or Japanese, but if you try to prompt in those languages it translates the prompt first. Basically, if you could control it with images as well as text, that'd help.


I can fix most issues by describing them to ChatGPT. I'll tell it what DALL-E's limitations are and why its prompts don't work, and it (usually) "listens" and fixes things. It "knows" what a prompt is, so you could probably just tell it that it needs to be explicit with characters each time and it will be.

I've just been playing with it and generating silly images. I find working with the LLM to generate prompts really entertaining, and it can go in directions I wouldn't have gone myself. I can just ask for "software development memes", then "make some more but reminiscent of famous memes", and then maybe ask it to "create some images blending game development with cosmic horror", then "I like prompt #2, create some variants of that but in the style of Junji Ito", and on and on.


Yeah, it's good at that. It's very good at combining concepts it is familiar with and can do a decent job at comics.

I've seen other people do comics with "meme characters" like Pepe in them, but I don't know the trick; it usually complains about proper names, and when I tried a few paraphrases it inexplicably produced Reddit ragefaces.


I recognise some of what you're saying here. I've used DALL-E 3 in anger recently and while the results have really impressed me, every time I've tried to actually tweak and improve something I've ended up getting further and further away from what I want.


Quick test of DALL-E 3 (bing.com/create) vs Ideogram, and Ideogram is not even close in terms of quality.


Image quality, no; it's just especially good at spelling words compared to Midjourney.

DeepFloyd is a local model sponsored by Stability that's older than SDXL.


Is there evidence that Bing's creator does this? I remember when it first rolled out there, the generations were lower quality but didn't stray as far as ChatGPT's. It's just anecdotal though, so it might not be true.


Personally I think DALL-E produces better quality, especially for photorealistic stuff.

Here's a few of mine, many photorealistic.

https://www.karmatics.com/stuff/dalle.html


Ideogram is really far behind dalle-3 and midjourney


I have to say that the GPT-4 DALL-E plugin is very good at making what you ask for. I was a bit disappointed to lose access to the new model directly through the DALL-E interface, because I enjoyed pushing its limits to make weird images. But it's amazing to be able to do things like say "hey, let's come up with a YAML schema to describe an outfit, and then make me some images of that" and get very, very consistent results. When I was writing prompts myself, I felt I was just jumping into the soup of image vector space.


Just want to call out that this is a fresh release from OpenAI that actually talks about data preparation, their models (to some extent), and some negative learnings. It's still not as complete as what comes from their peers (Stability AI, Meta, adept.ai, etc.). Probably it just shows their core focus is on LLMs, or maybe there's been a change of heart?


I re-generated some of these images using the same prompts. It's truly amazing how good some of the artwork is. Most AI-generated images I've seen look good on the surface, but when you look closer there are obvious blemishes and weird things going on. But I've generated a bunch of images in Dall-E 3 and have been amazed at the results.

e.g. my version of the "fierce garden gnome warrior" in widescreen: https://1drv.ms/i/s!Avl8TQnojpIQrsRCFFRQQDpNcyNKNw?e=VAZfVe


Nice to see this - also, looks like one of the primary authors is jbetker, who built the best open source TTS model that I've seen yet (TorToiSe):

https://nonint.com/2023/09/23/dall-e-3/

https://github.com/neonbjb/tortoise-tts


Dang, they went far. Good for them.


See? Accessibility improves everything, even LLMs. Captions/alt-text are great, even for this. :)


Here you can try part of the process yourself (image to text):

https://huggingface.co/spaces/fffiloni/CLIP-Interrogator-2


Well, not quite. The linked HF space uses the ViT-H-14 OpenCLIP model, which was trained on the LAION-2B dataset [0], which I'd categorize as fitting the report's description of "noisy and inaccurate image captions" perfectly.

[0] https://laion.ai/blog/large-openclip/


I see, so they manually (but more diligently) relabelled a smaller, more representative subset to train the captioner, then used that captioner to relabel their large set (analogous to LAION) with more descriptive captions, and then trained DALL-E 3 on that.
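
The recaptioning step is easy to approximate at small scale with an off-the-shelf captioner. A rough sketch using BLIP from Hugging Face as a stand-in (OpenAI's captioner isn't released, and the folder name here is made up):

```python
# Sketch of the "relabel the dataset with a captioner" step, with BLIP as a stand-in.
from pathlib import Path
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def recaption(image_dir: str) -> dict[str, str]:
    """Replace original alt-text with generated descriptive captions."""
    captions = {}
    for path in sorted(Path(image_dir).glob("*.jpg")):
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=75)
        captions[path.name] = processor.decode(out[0], skip_special_tokens=True)
    return captions

print(recaption("./laion_subset"))  # hypothetical local folder of images
```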

By the way, in the case of CLIP-Interrogator-2, I wonder how they came up with their wordlists (included in the repo). Are they just all unique terms from OpenCLIP?


When I used GPT4-Image to get a full detailed description of an image, I was sure it was going to be used as a method for generating training data.


I just tried this and wow, I'm impressed. Input / output: https://imgur.com/a/nTSkJUK


The images seem to be broken.


The first image they show is a furry. They know that their biggest target audience is furries who commission images from artists.



