Improving image generation with better captions [pdf] (cdn.openai.com)
126 points by alievk on Oct 19, 2023 | 37 comments



Core takeaways:

- major improvement is from detailed image captioning

- they trained an image captioning model to produce short and detailed captions

- they use T5 text encoder

- they use GPT-4 to "upsample" short user prompts (see the sketch after this list)

- they train a U-net decoder and distill it to 2 denoising steps

- text rendering is still unreliable; they believe the model has a hard time mapping word tokens to letters in the image
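
For the GPT-4 "upsample" step, the idea is simply to have GPT-4 expand the user's terse prompt into a long, descriptive caption before it reaches the image model. Here's a minimal sketch of my own using the OpenAI Python SDK (v1+); the system prompt is paraphrased, not the one OpenAI actually uses:

```python
# Sketch of prompt "upsampling": rewrite a short user prompt into a detailed caption.
# My own illustration, not OpenAI's code; the system prompt is paraphrased.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

UPSAMPLE_SYSTEM_PROMPT = (
    "You expand short image prompts into long, highly detailed captions. "
    "Describe the subject, setting, style, lighting, and composition explicitly."
)

def upsample_prompt(short_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": UPSAMPLE_SYSTEM_PROMPT},
            {"role": "user", "content": short_prompt},
        ],
    )
    return resp.choices[0].message.content

print(upsample_prompt("a cat in a spacesuit"))
```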


They have only distilled the diffusion model used as a replacement for the VAE decoder, not the main diffusion model.


So the latent-to-pixel-space conversion is done with another diffusion model?

Is this like the refiner used in SDXL?


>So the latent-to-pixel-space conversion is done with another diffusion model?

Yup

>Is this like the refiner used in SDXL?

The refiner only works on the latent.
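
For intuition, here's a toy sketch (entirely my own, not OpenAI's code) of the structural difference: a VAE decoder turns the latent into pixels in a single forward pass, while DALL-E 3 swaps it for a small latent-conditioned diffusion model that, after consistency-style distillation, only needs 2 denoising steps.

```python
# Toy sketch (mine, not OpenAI's) of a latent-conditioned diffusion decoder
# distilled to two denoising steps, standing in for the usual one-shot VAE decode.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDecoderUNet(nn.Module):
    """Hypothetical stand-in for the real decoder U-Net."""
    def __init__(self, img_ch=3, latent_ch=4):
        super().__init__()
        self.net = nn.Conv2d(img_ch + latent_ch, img_ch, kernel_size=3, padding=1)

    def forward(self, noisy_img, latent, t):
        # Condition on the latent by upsampling and concatenating it (t is unused in this toy).
        latent_up = F.interpolate(latent, size=noisy_img.shape[-2:], mode="nearest")
        return self.net(torch.cat([noisy_img, latent_up], dim=1))  # predicts the clean image

@torch.no_grad()
def decode_latent(latent, model, steps=(0.999, 0.5)):
    """Simplified two-step decoding loop: noise -> refined pixel estimate."""
    b, _, h, w = latent.shape
    x = torch.randn(b, 3, h * 8, w * 8)  # start from noise at pixel resolution
    for t in steps:
        x = model(x, latent, t)
    return x

image = decode_latent(torch.randn(1, 4, 32, 32), ToyDecoderUNet())
print(image.shape)  # torch.Size([1, 3, 256, 256])
```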


> During testing, we have noticed that this capability is unreliable as words have missing or extra characters. We suspect this may have to do with the T5 text encoder we used: when the model encounters text in a prompt, it actually sees tokens that represent whole words and must map those to letters in an image.

Ah, interesting remark regarding text rendering.


This isn't a new problem. A Google paper from late 2022 mentioned it; the problem went away when they used a byte/character-level text encoder (vs. tokenizing the text first).


"Character-Aware Models Improve Visual Text Rendering" https://arxiv.org/abs/2212.10562 for anyone curious


This is an aside, but in general, aren’t tokens a pretty bad hack to improve context length? I don’t know enough about the theory to say, but it seems like they should make all kinds of things much more fragile, like how well the model can handle misspellings, or languages that don’t map well to the available tokens.


It still is an amazing state machine.


It's pretty unreal how much better DALL-E 3 is at following prompts accurately versus its largest for-profit competitor, Midjourney. Quality-wise I think MJ still has the edge, but if it takes 2000 v-rolls to get there (assuming you can at all), then I'd say MJ has a steep hill to climb.


If you haven't tried Ideogram or DeepFloyd, those are even better at the specific case of "writing verbatim text in your prompt". To the point that Ideogram's trending page was entirely taken over by images of Latina women's names with bling effects and Disney characters, last I looked.

DALLE3 is definitely amazingly higher quality, but I still feel it's kind of… useless. It's too hard to control in conversation form, because it's not a multimodal LLM; it just works by rewriting text prompts. ChatGPT doesn't really know what dalle can and can't do, the actual dalle model still frequently fights you and just generates whatever it wants, the rewritten prompts sometimes leave out details or contain conflicting ones, and every prompt is written in "friendly harmless AI voice", full of superfluous hype adjectives.

This is odd because GPT4V actually is a multimodal model, and asking it to describe images as text works really well IME.

Common failure modes I saw playing with it:

* if you set something in, e.g., France, ChatGPT will rewrite it to be diverse and inclusive by literally saying "diverse people" a lot, but then it also adds a lot of super-stereotypical details, and then dalle ignores half of it. So you get sometimes-diverse French people who are all wearing striped shirts and berets in front of the Eiffel Tower.

* You can have a conversation with it and tell it to edit images, which it does by writing new prompts, but it's dumb and thinks the prompts are part of the conversation. So a second round prompt will sometimes leave out all the details and just say "The girl from before" and produce a useless image.

* Composition is usually boring (very center-aligned and symmetrical) and if you try to control it by using words like "camera", it will put cameras in the picture half the time.

* Typical AI failure modes: "anime" generates a kind of commercial/fanart style that doesn't exist rather than frames from a TV show. You can't generate things if they're hard to explain in English, like "a keyboard whose keys are house keys".

But it's really good at putting things in Minecraft.


> Composition is usually boring (very center-aligned and symmetrical) and if you try to control it by using words like "camera", it will put cameras in the picture half the time.

- Be more descriptive and specific with your prompts. I always add "shot from afar" or "wide-angled" or "shot with an 85mm lens" - never had an issue with boring composition.

> Typical AI failure modes: "anime" generates a kind of commercial/fanart style that doesn't exist rather than frames from a TV show.

- Again, be more descriptive and specific with your prompts, e.g. 'a character in the style of Naruto'. You need to specify the show or artist instead of using the broad term "anime".

> You can't generate things if they're hard to explain in English, like "a keyboard whose keys are house keys".

- One more time, you just need to be more specific. Try: 'a computer keyboard, but instead of regular keys, replace each one with a metal house key.'


DALLE3 is explicitly designed around not having to do this; the point is that it'll write the detailed prompt for you in four different ways so you get more variation.

"Wide angle" and "fisheye" do work, but "lens" is dangerous; anything that can be read as an object will tend to cause that object to appear instead of being used metaphorically. (Though trying a bit more, that doesn't usually happen, if only because it rewrites the prompt to not mention it.)

> - Again, be more descriptive and specific with your prompts, e.g. 'a character in the style of Naruto'. You need to specify the show or artist instead of using the broad term "anime".

Explicitly against the rules. It'll block you if you try to use any real person's name or anything it thinks is copyrighted. You can paraphrase though, or argue with it which weirdly sometimes works.

> One more time, you just need to be more specific. Try: 'a computer keyboard, but instead of regular keys, replace each one with a metal house key.'

Tried it, doesn't work well. If it gets to dalle it's not smart enough to reliably do "instead of" (or generally anything that's a "deletion"). It'll just put in all three concepts.

https://imgur.com/a/l8RS1Lu

Synonyms can help, if one exists in English. Emoji do interesting things, but I can't tell how well they work.


My typical use cases don't usually involve actual text in the pictures themselves, but I've definitely seen lots of situations where it gets confused and tries to insert the description of the image into random speech bubbles. I can usually fix this by explicitly stating that there should be no text in the image.

I've gotten some pretty accurate images in one shot, though, that would've taken me 1000 rolls and prompt mangling to get MJ to do:

- Historical 1980s photograph of the Kool-Aid man busting through the Berlin Wall

I haven't tried DeepFloyd or Ideogram, so maybe I'll give them a shot. From what I've gathered, 99% of people who use these just generate anime or facial-portrait-type stuff, so most of the models out there are tuned for that kind of thing. DALL-E 3 is the first model I've used (including SDXL, etc.) that can actually get pretty close to matching my prompt.


>Typical AI failure modes: "anime" generates a kind of commercial/fanart style that doesn't exist rather than frames from a TV show.

People usually want anime artwork, not frames from a show. If you want that, take the effort to write it.


I mention this one because Dalle2 actually does work if you say "anime screenshot" (it knows studio names too). 3 does not understand "screenshot" (it makes pictures of people watching something on a TV) and blocks you if you try to reference anything copyrighted, so no "N64 game" either. Time periods like "80s anime" kinda work though.

But the main point is that it makes everything online-English-centric. That style isn't called "anime" in Chinese or Japanese, but if you try to prompt in those languages it translates the prompt first. Basically, if you could control it with images as well as text, that'd help.


I can fix most issues by describing them to ChatGPT. I'll tell it what DALL-E's limitations are and why its prompts don't work, and it (usually) "listens" and fixes things. It "knows" what a prompt is, so you could probably just tell it that it needs to be explicit with characters each time and it will be.

I've just been playing with it and generating silly images. I find working with the LLM to generate prompts really entertaining, and it can go in directions I wouldn't have gone myself. I can just ask for "software development memes", then "make some more but reminiscent of famous memes", and then maybe ask it to "create some images blending game development with cosmic horror", then "I like prompt #2, create some variants of that but in the style of Junji Ito", and on and on.


Yeah, it's good at that. It's very good at combining concepts it is familiar with and can do a decent job at comics.

I've seen other people do comics with "meme characters" like Pepe in them, but I don't know the trick; it usually complains about proper names, and when I tried a few paraphrases it inexplicably produced Reddit ragefaces.


I recognise some of what you're saying here. I've used DALL-E 3 in anger recently and while the results have really impressed me, every time I've tried to actually tweak and improve something I've ended up getting further and further away from what I want.


Quick test of DALL-E 3 (bing.com/create) vs Ideogram, and Ideogram is not even close in terms of quality.


Image quality, no; it's just especially good at spelling words compared to Midjourney.

DeepFloyd is a local model sponsored by Stability that's older than SDXL.


Is there evidence that Bing's creator does this? I remember when it first rolled out there, the generations were lower quality but didn't stray as far as ChatGPT's. It's just anecdotal though, so it might not be true.


Personally I think DALL-E produces better quality, especially for photorealistic stuff.

Here's a few of mine, many photorealistic.

https://www.karmatics.com/stuff/dalle.html


Ideogram is really far behind dalle-3 and midjourney


I have to say that the GPT-4 DALL-E plugin is very good at making what you ask for. I was a bit disappointed to lose access to the new model directly through the DALL-E interface, because I enjoyed pushing its limits to make weird images. But it's amazing to be able to do things like say "hey, let's come up with a YAML schema to describe an outfit, and then make me some images of that" and get very, very consistent results. When I was writing prompts myself, I felt I was just jumping into the soup of image vector space.


Just want to call out that this is a fresh release from OpenAI that actually talks about data preparation, their models (to some extent), and some negative learnings. It's still not as complete as what comes from their peers (Stability AI, Meta, adept.ai, etc.). Probably it just shows their core focus is on LLMs, or maybe there's been a change of heart?


I re-generated some of these images using the same prompts. It's truly amazing how good some of the artwork is. Most AI-generated images I've seen look good on the surface, but when you look closer there are obvious blemishes and weird things going on. But I've generated a bunch of images in Dall-E 3 and have been amazed at the results.

e.g. my version of the "fierce garden gnome warrior" in widescreen: https://1drv.ms/i/s!Avl8TQnojpIQrsRCFFRQQDpNcyNKNw?e=VAZfVe


Nice to see this - also, looks like one of the primary authors is jbetker, who built the best open source TTS model that I've seen yet (TorToiSe):

https://nonint.com/2023/09/23/dall-e-3/

https://github.com/neonbjb/tortoise-tts


Dang, they went far. Good for them.


See? Accessibility improves everything, even LLMs. Captions/alt-text are great, even for this. :)


Here you can try part of the process yourself (image to text):

https://huggingface.co/spaces/fffiloni/CLIP-Interrogator-2


Well, not quite. The linked HF space uses the ViT-H-14 OpenCLIP model, which was trained on the LAION-2B dataset [0], which I'd categorize as fitting the report's description of "noisy and inaccurate image captions" perfectly.

[0] https://laion.ai/blog/large-openclip/


I see, so they manually (but more diligently) relabelled a smaller, more representative subset to train the captioner, then used that captioner to relabel their large set (analogous to LAION) with more descriptive captions, and then trained DALL-E 3 on that.
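
The recaptioning step is easy to approximate at small scale with an off-the-shelf captioner. A rough sketch using BLIP from Hugging Face as a stand-in (OpenAI's captioner isn't released, and the folder name here is made up):

```python
# Sketch of the "relabel the dataset with a captioner" step, with BLIP as a stand-in.
from pathlib import Path
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def recaption(image_dir: str) -> dict[str, str]:
    """Replace original alt-text with generated descriptive captions."""
    captions = {}
    for path in sorted(Path(image_dir).glob("*.jpg")):
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=75)
        captions[path.name] = processor.decode(out[0], skip_special_tokens=True)
    return captions

print(recaption("./laion_subset"))  # hypothetical local folder of images
```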

By the way, in the case of CLIP-Interrogator-2, I wonder how they came up with their wordlists (included in the repo). Are they just all unique terms from OpenCLIP?


When I used GPT4-Image to get a full detailed description of an image, I was sure it was going to be used as a method for generating training data.


I just tried this and wow, I'm impressed. Input / output: https://imgur.com/a/nTSkJUK


The images seem to be broken.


The first image they show is a furry. They know that their biggest target audience is furries who commission images from artists.



