If you haven't tried Ideogram or DeepFloyd, those are even better at the specific case of "writing verbatim text in your prompt". To the point that Ideogram's trending page was entirely taken over by images of Latina women's names with bling effects and Disney characters, last I looked.
DALLE3 is definitely much higher quality but I still feel it's kind of… useless. It's too hard to control in conversation form, because it's not a multimodal LLM but rather just works by rewriting text prompts. ChatGPT doesn't really know what dalle can and can't do; the actual dalle model still frequently fights you and just generates whatever it wants; when ChatGPT generates prompts it sometimes leaves out details or writes conflicting ones; and it writes every prompt in "friendly harmless AI voice", full of superfluous hype adjectives.
This is odd because GPT4V actually is a multimodal model, and asking it to describe images as text works really well IME.
Common failure modes I saw playing with it:
* If you set something in, e.g., France, ChatGPT will rewrite the prompt to be diverse and inclusive by literally saying "diverse people" a lot, but then also adds a lot of super-stereotypical details, and then dalle ignores half of it. So you get sometimes-diverse French people who are all wearing striped shirts and berets in front of the Eiffel Tower.
* You can have a conversation with it and tell it to edit images, which it does by writing new prompts, but it's dumb about this and treats the prompts as part of the conversation. So a second-round prompt will sometimes leave out all the details, just say "the girl from before", and produce a useless image.
* Composition is usually boring (very center-aligned and symmetrical) and if you try to control it by using words like "camera", it will put cameras in the picture half the time.
* Typical AI failure modes: "anime" generates a kind of commercial/fanart style that doesn't exist rather than frames from a TV show. You can't generate things if they're hard to explain in English, like "a keyboard whose keys are house keys".
But it's really good at putting things in Minecraft.
> Composition is usually boring (very center-aligned and symmetrical) and if you try to control it by using words like "camera", it will put cameras in the picture half the time.
- Be more descriptive and specific with your prompts. I always add "shot from afar" or "wide-angled" or "shot with 85mm lens" - never had an issue with boring composition.
> Typical AI failure modes: "anime" generates a kind of commercial/fanart style that doesn't exist rather than frames from a TV show.
- Again, be more descriptive and specific with your prompts, e.g. 'a character in the style of Naruto'. You need to specify the show or artist instead of using the broad term "anime".
> You can't generate things if they're hard to explain in English, like "a keyboard whose keys are house keys".
- One more time, you just need to be more specific. Do: 'a computer keyboard, but instead of regular keys, replace each one with a metal house key.'
DALLE3 is explicitly designed around not having to do this; the point is that it'll write the detailed prompt for you in four different ways so you get more variation.
"Wide angle" and "fisheye" do work, but "lens" is dangerous; anything that can be read as an object will tend to cause that object to appear instead of being used metaphorically. (Though trying a bit more, that doesn't usually happen, if only because it rewrites the prompt to not mention it.)
> - Again, be more descriptive and specific with your prompts, e.g. 'a character in the style of Naruto'. You need to specify the show or artist instead of using the broad term "anime".
Explicitly against the rules. It'll block you if you try to use any real person's name or anything it thinks is copyrighted. You can paraphrase, though, or argue with it, which weirdly sometimes works.
> One more time, you just need to be more specific. Do: 'a computer keyboard, but instead of regular keys, replace each one with a metal house key.'
Tried it, doesn't work well. If it gets to dalle it's not smart enough to reliably do "instead of" (or generally anything that's a "deletion"). It'll just put in all three concepts.
My typical use cases don't usually involve actual text in the pictures themselves, but I've definitely seen lots of situations where it gets confused and tries to insert the description of the image into random speech bubbles. I can usually fix this by explicitly stating that there should be no text in the image.
I've gotten some pretty accurate images in one shot though that would've taken me 1000 rolls/mangling to get MJ to do:
- Historical 1980s photograph of the Kool-Aid man busting through the Berlin Wall
I haven't tried DeepFloyd or Ideogram, so maybe I'll give them a shot. From what I've gathered 99% of people who use these just use them to either generate anime, or facial portrait type stuff so most of the models out there are tuned for that kind of thing. DALL-E 3 is the first model I've used (including SDXL, etc) that can actually get pretty close to matching my prompt.
I mention this one because Dalle2 actually does work if you say "anime screenshot" (it knows studio names too). 3 does not understand "screenshot" (it makes pictures of people watching something on a TV) and blocks you if you try to reference anything copyrighted, so no "N64 game" either. Time periods like "80s anime" kinda work, though.
But the main point is that this makes it centered on online English. That style isn't called "anime" in Chinese or Japanese, but if you try to prompt in those languages it translates the prompt to English first. Basically, if you could control it with images as well as text, that'd help.
I can fix most issues by characterizing the issues to ChatGPT. I'll tell it what DALL-E's limitations are and why its prompts don't work and it (usually) "listens" and fixes things. It "knows" what a prompt is, so you could probably just tell it that it needs to be explicit with characters each time and it will be.
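To make that workaround concrete, here's a rough sketch of what "telling it about DALL-E's limitations" could look like if you were driving the chat API yourself rather than the ChatGPT UI. The instruction text and helper function are my own illustration, not an official recipe; it just front-loads the failure modes described above so the rewritten prompts avoid them.

```python
# Hypothetical sketch: pre-load the conversation with DALL-E's known
# limitations so ChatGPT's rewritten image prompts avoid them.
# The wording below is illustrative, distilled from the failure modes
# discussed in this thread.

DALLE_LIMITATIONS = """\
When you write image prompts for DALL-E, follow these rules:
- Repeat every visual detail explicitly in each prompt; never refer
  back to earlier images with phrases like "the girl from before".
- Avoid words naming physical objects ("camera", "lens") when you only
  mean them metaphorically, or the object may appear in the image.
- Never phrase anything as a deletion ("instead of", "without");
  describe only what should be present.
- Skip hype adjectives; keep prompts plain and concrete.
"""

def build_messages(user_request: str) -> list[dict]:
    """Return a chat message list that front-loads the limitation notes."""
    return [
        {"role": "system", "content": DALLE_LIMITATIONS},
        {"role": "user", "content": user_request},
    ]

messages = build_messages("A keyboard whose keys are metal house keys.")
```

You'd pass `messages` to whatever chat endpoint you're using; the point is only that the limitations travel with every request instead of being re-explained each session.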
I've been just playing with it and generating silly images, I find working with the LLM to generate prompts is really entertaining and can go in directions I wouldn't. I can just ask for "software development memes", then "make some more but reminiscent of famous memes", and then maybe ask it to "create some images blending game development with cosmic horror", then "I like prompt #2, create some variants of that but in the style of Junji Ito" and on and on.
Yeah, it's good at that. It's very good at combining concepts it is familiar with and can do a decent job at comics.
I've seen other people do comics with "meme characters" like Pepe in them, but I don't know the trick; it usually complains about proper names, and when I tried a few paraphrases it inexplicably produced Reddit ragefaces.
I recognise some of what you're saying here. I've used DALL-E 3 in anger recently and while the results have really impressed me, every time I've tried to actually tweak and improve something I've ended up getting further and further away from what I want.
Is there evidence that Bing Image Creator does this? I remember when it first rolled out, the generations were lower quality but didn't stray as far as ChatGPT's. It's just anecdotal, though, so might not be true.