Hacker News

It seems like their image-tokenization model might be useful for this, but also have you looked into stuff like ControlNet?



ControlNets are useful for defining the exact position and behaviour of people in a picture, I won't deny that. They don't let you get a specific, recurring character, however.


Ah, so like a multimodal model you can give something like "[picture] this guy but doing a handstand in a penguin suit"?

There have been a few attempts at prompt-driven editing (I think it was called Instruct-something), and there's a whole field of work on things like animation transfer (MegaPortraits and EMO come to mind for close-ups, and a few things out there do broader motion). But it's hard to suggest something without knowing your use case, and if you're after general-purpose editing, the research just doesn't seem to be there yet. That might be the kind of thing that needs a "world model".

I think ControlNet + some kind of LoRA trained on your specific character is probably the closest you can get right now, but it's definitely some extra work and not quite what you want.


Exactly that. I think GPT-4o has demonstrated those abilities, but since they haven’t enabled image output yet we can only hope.


ControlNet + the same prompt + the same seed tends to be pretty decent if you're not savvy enough to train a LoRA.



