Hacker News

It seems like their image-tokenization model might be useful for this, but also have you looked into stuff like ControlNet?



ControlNets are useful for defining the exact position and behaviour of people in a picture, I won't deny that. They don't let you get a specific, recurring character, however.


Ah, so like a multimodal model you can give something like "[picture] this guy but doing a handstand in a penguin suit"?

There have been a few attempts at prompt-driven editing (I think it was called Instruct-something), and there's a whole field of work on things like animation transfer (MegaPortraits and EMO come to mind for close-ups, and a few things out there do broader motion). But it's hard to suggest something without knowing your use case, and if you're after general-purpose editing, the research just doesn't seem to be there yet. That might be the kind of thing that needs a "world model".

I think ControlNet + some kind of LoRA trained on your specific character is probably the closest you can get right now, but it's definitely some extra work and not quite what you want.


Exactly that. I think GPT-4o has demonstrated those abilities, but since they haven’t enabled image output yet we can only hope.


ControlNet + the same prompt + the same seed tends to be pretty decent if you're not savvy enough to train a LoRA.



