ControlNET and Stable Diffusion: A Game Changer for AI Image Generation (uxdesign.cc)
135 points by belltaco on Feb 20, 2023 | 36 comments



Here is a preliminary test for video editing using ControlNet I made: https://www.youtube.com/watch?v=u52MOA4YaGk

As you can see, there is still quite a bit of flicker; I'm working to reduce that. But the consistency is much better compared to, say, img2img.

I'm hoping to ship a prototype this week.


Haven't read the paper yet, but curious how different ControlNet is from Text2LIVE ([1], [2]). Seems it's solving the same problem with temporal consistency, no?

[1] https://www.youtube.com/watch?v=8U9o5aZ2y5w

[2] https://text2live.github.io/


No, ControlNet wasn't made to solve temporal consistency; it was made to add more control (hence the name) to image models. I am using it in a way that the authors may not have thought of, since the paper doesn't mention video editing.


S-so the girl on the right in the second half of the video is not real...?


Correct, it's generated by SD.


Really curious about this too, OP. Is the face generated? How do you keep temporal consistency with that?


So, the video was generated by applying ControlNet to the input video frame by frame. Every inference setting is the same for every frame -- seed, prompt, CFG, steps, and sampler. The only thing that changes from frame to frame is that the pose shifts slightly. So actually, if SD were well behaved, you would expect the difference between adjacent frames to be small, because the change in the input is small. But SD is somewhat schizophrenic, so you get this amount of flicker even from small changes in input.

I also had to specify what the outfit should be (when I didn't, I got a lot more discrepancies from the outfit changing frame to frame). You can see that the outfit changes color in the second version; I bet you could get that to be even more consistent if you specified the color in the prompt too.

If you create a Dreambooth model of a character, you can probably also get consistency of the face that way. In this case I didn't need to, because I didn't care who I got; I just asked for an "average woman".
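
Roughly, the loop looks like this (a minimal sketch using the diffusers ControlNet pipeline; the checkpoint names, the OpenPose detector, and the scheduler choice are illustrative assumptions, not necessarily the exact setup used):

    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler
    from controlnet_aux import OpenposeDetector

    # Pose-conditioned SD pipeline; checkpoint names are the common public ones.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
        torch_dtype=torch.float16).to("cuda")
    pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)  # fixed sampler

    pose_detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

    prompt = "an average woman walking, plain grey hoodie and blue jeans"  # outfit pinned in the prompt
    seed, steps, cfg = 1234, 20, 7.5  # identical for every frame

    frames = [Image.open(f"frames/{i:04d}.png") for i in range(120)]
    outputs = []
    for frame in frames:
        pose = pose_detector(frame)  # the only per-frame change: the pose image
        generator = torch.Generator("cuda").manual_seed(seed)  # same seed => same initial noise
        out = pipe(prompt, image=pose, num_inference_steps=steps,
                   guidance_scale=cfg, generator=generator).images[0]
        outputs.append(out)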


Is there like a "temperature" setting you can change? And set it to 0 to produce less flickering?


The flickering comes from the fundamental nature of the de-noising mechanism involved in the diffusion model. The ability to create multiple novel images for the same input comes from adding noise with a random seed. Currently this is more or less done every frame, which is why you get the flickering. Keeping the same seed wouldn't be helpful if you want the image to move.

What could be of use here is a noise transformation layer that can use the same noise for every frame but transformed to match desired motion. For video conversion you could possibly extract motion vectors from successive frames to warp the noise.

I assume someone is working on this somewhere.
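
For illustration, here's a rough sketch of what that could look like (a conceptual assumption, not an existing feature of any pipeline): estimate dense optical flow between successive input frames and use it to warp one fixed latent noise tensor, so the noise follows the motion instead of sitting still.

    import cv2
    import numpy as np
    import torch
    import torch.nn.functional as F

    def estimate_flow(prev_frame, next_frame):
        """Dense optical flow (H, W, 2) between two BGR frames, via Farneback."""
        prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
        next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
        return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)

    def warp_noise(noise, flow):
        """Warp a (1, C, h, w) latent noise tensor along a full-resolution flow field."""
        _, _, h, w = noise.shape
        flow_small = cv2.resize(flow, (w, h)) / 8.0  # SD latents are 1/8 of image resolution
        ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        grid_x = (xs + flow_small[..., 0]) / (w - 1) * 2 - 1  # grid_sample wants [-1, 1] coords
        grid_y = (ys + flow_small[..., 1]) / (h - 1) * 2 - 1
        grid = torch.from_numpy(np.stack([grid_x, grid_y], axis=-1)).float().unsqueeze(0)
        return F.grid_sample(noise, grid, mode="bilinear",
                             padding_mode="border", align_corners=True)

    # Usage idea: draw one fixed noise tensor, then carry it forward frame to frame.
    noise = torch.randn(1, 4, 64, 64)  # latent noise for a 512x512 SD image
    # noise = warp_noise(noise, estimate_flow(frame_t, frame_t_plus_1))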


"The flickering comes from the fundamental nature of the de-noising mechanism involved in the diffusion model." -- agreed

"Keeping the same seed wouldn't be helpful if you want the image to move." -- No, I'm using the same seed (and prompt). The image moves because ControlNet opens up another channel of input, in this case the pose data.


Yes, but that still produces temporal aliasing, because the unmoving noise is battling the moving ControlNet input. I can't find it right now, but there was a good example showing a gallery of one-word prompts with the same seed. While the images were of different subjects, you could clearly see the impact of the noise controlling layout: what was a capital letter A in one image was a person's legs in another, but that same overall structure was visible in the same place in 90% of the images.


I wonder if putting an adversarial network on top would reduce the flickering: a mechanism that only accepts a frame if it is detected to be the next frame in a video of the same person, and otherwise regenerates.


Not really, but I think there are other things you can do to reduce flickering that I'm looking into.


That's really, really good. I have an overtrained Dreambooth model I was using with ControlNet, and even mine was flickering in the face more than this.


Are you using canny mode? One of the other modes (HED, segmentation, or depth) may give you more consistency. Lmk how it goes if you try this.
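
If you do switch, the only things that change are the preprocessor and the matching ControlNet checkpoint; everything else can stay identical. A rough sketch with depth (the depth-estimation model and checkpoint names here are just common choices, not a prescription):

    import numpy as np
    import torch
    from PIL import Image
    from transformers import pipeline
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    frame = Image.open("frames/0001.png")

    # Depth preprocessor: any monocular depth model works; DPT is a common choice.
    depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
    depth = np.array(depth_estimator(frame)["depth"])
    cond_image = Image.fromarray(np.stack([depth] * 3, axis=-1).astype(np.uint8))  # 3-channel map

    # The checkpoint has to match the conditioning type: depth here, not canny.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
        torch_dtype=torch.float16).to("cuda")

    out = pipe("an average woman walking", image=cond_image,
               num_inference_steps=20, guidance_scale=7.5,
               generator=torch.Generator("cuda").manual_seed(1234)).images[0]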


Impressive that it doesn’t just do pose transfer but applies the correct inverse kinematics too (hand on the wall/rail, etc.).


Yes, it's asking SD for an image with some set of characteristics, and SD has some notion of what is plausible from what it saw during training.


Can you mask the background out?


I’ve been using this for the past week. Game changer is absolutely right.

I spent hours trying to get a specific pose: hundreds of generations and seed changes, trying to dial in aspects of one vs. another, taking aspects I liked and clumsily patching together a Krita mash-up, then passing that back through img2img… only to get something kinda close.

One shot with canny or depth on Controlnet and I had exactly what I wanted.

Amazing how fast and easily it works. It’s been a WILD 6 months that this tech has been available.


You've convinced me to go for it. I'm downloading the models from HuggingFace and the source from Github.

I'm not sure how it all fits together, but I'll try to figure it out. I have experience with Python development and git LFS, and I've been following research papers for years, but this is my first attempt.

Is this all self-contained? Should I start elsewhere? I've been pretty intimidated by this stuff TBH, relying on hosted Midjourney / DALL-E for my artwork.

I notice the two repos have the same name. Do I merge the file-trees? How do they work together?



Excellent, thank you!



That's a fair recommendation, but I think there is a massive difference between having it offline and hands-on compared to this. At first, it's all iteration, which, even at its fastest, Hugging Face won't be able to keep up with. It's fair to try, but I think you really need a hands-on local setup.


Thank you, I'm new to the site and didn't know about Spaces. I'm more interested in trying to run a model locally. All I've ever used locally are toy implementations and 20-line DQN experiments that never seemed to converge.


If you want to run it locally, the Automatic1111 repo that was mentioned does have the most up-to-date models, but it's also messy and breaks often. Is your GPU up to the task?


Yes.


This is probably the best use of ControlNet I have found to combat temporal inconsistency: https://twitter.com/TDS_95514874/status/1626817468839911426

Also, fun fact: the human poses can be out-of-distribution and it still works: https://twitter.com/toyxyz3/status/1626977005270102016


Also a sample on Reddit caught my eye lately: https://www.reddit.com/r/StableDiffusion/comments/116azlb/yo...

The use of EbSynth to stabilise leads to a pretty incredible result.


It seems to work quite well, but Stable Diffusion's architecture is quite out of date, and the next one (DeepFloyd) may be totally different and not support it.

At least, we know it's capable of generating readable text, which SD sure isn't, and there are newer model papers out there like DiT that should beat latent diffusion on quality and speed.


ControlNets are just a clever method of fine-tuning a network to enable additional conditioning. Nothing about them is specific to latent diffusion, so why would another approach like pixel-space diffusion not work with them?
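
As a schematic sketch of that recipe (the module names here are illustrative, not from any particular codebase): freeze the pretrained block, make a trainable copy, feed the extra conditioning in through a zero-initialised convolution, and add the copy's output back through another one, so training starts from exactly the base model's behaviour. Nothing in this requires the base block to operate in latent space.

    import copy
    import torch
    import torch.nn as nn

    def zero_conv(channels):
        """1x1 conv initialised to zero: contributes nothing until training moves its weights."""
        conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(conv.weight)
        nn.init.zeros_(conv.bias)
        return conv

    class ControlledBlock(nn.Module):
        """One frozen pretrained block plus its trainable ControlNet copy."""
        def __init__(self, base_block, channels):
            super().__init__()
            self.base = base_block                  # frozen pretrained block
            self.copy = copy.deepcopy(base_block)   # trainable copy, same weights at init
            self.zero_in = zero_conv(channels)      # injects the extra conditioning
            self.zero_out = zero_conv(channels)     # injects the copy's output
            for p in self.base.parameters():
                p.requires_grad_(False)

        def forward(self, x, cond):
            base_out = self.base(x)
            ctrl_out = self.copy(x + self.zero_in(cond))
            # At initialisation zero_out(...) == 0, so the output equals the frozen model's.
            return base_out + self.zero_out(ctrl_out)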

Also, I'd hold my horses with calling latent diffusion "out of date". The Imagen paper notes that results greatly scaled with text-encoder size, and the model DeepFloyd (it's a team) is working on uses T5-XXL (the UNet is also a decent bit larger, but the Imagen paper claims that this should have a much smaller influence). Latent diffusion might also be able to work with text if you scale up the text encoder.

I'm very skeptical of Google's "trust me, I have the hottest model, it just goes to another school". Muse might have nice fidelity and be fast, but the example images have worse visual quality than Imagen. I definitely wouldn't be surprised if transformers took over, but I don't think it's a foregone conclusion.


All progress is progress. But what if your model is better, while Stable Diffusion has all these tools that let you more practically generate the images you want, plus a whole web UI with years of development behind it?

Your advanced model becomes a toy, like creating a Twitter clone on your local machine.

You then have to recreate all these things like ControlNet, because I know for sure that this kind of fine-grained control is way more important than a jump in base model quality.

This is why open source kicks ass. The SD people don't have to be the best, not with a community floating them.


Temporal consistency is still unresolved though. Most definitely looking forward to more work based on this.


January 1st next year will be great:

Steamboat Willie remixes.


So the next thing is a text-to-pose model, so that you can describe your pose and your image with separate prompts and have them combined via ControlNet.




Game asset creation is going to be much cheaper moving forward.



