
The models presented in the paper are trained on class-conditional ImageNet (where the input is Gaussian noise and one of 1000 classes, e.g., "car") and unconditional FFHQ (where the input is only Gaussian noise).
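
For intuition, a minimal sketch of what the inputs look like at sampling time in the two setups (the tensor shapes and class index below are illustrative, not the paper's exact values):

    import torch

    # Class-conditional ImageNet: pure Gaussian noise plus a class index.
    x_T = torch.randn(1, 3, 256, 256)           # starting noise (illustrative shape)
    class_id = torch.tensor([817])              # one of the 1000 ImageNet classes

    # Unconditional FFHQ: Gaussian noise only, no conditioning signal.
    x_T_uncond = torch.randn(1, 3, 1024, 1024)  # megapixel-scale noise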


Not yet, we focused on the architecture for this paper. I totally agree with you though - pixel space is generally less limiting than a latent space for diffusion, so we would expect good performance on inpainting and other editing tasks.
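
To illustrate why pixel space helps here, a RePaint-style masking step is one standard approach (a sketch, not something we've implemented for this model; `model.denoise_step` and the noise schedule are placeholders):

    import torch

    def inpaint_step(model, x_t, t, known_x0, mask, alphas_cumprod):
        # One RePaint-style pixel-space inpainting step (illustrative).
        # mask == 1 marks pixels that should stay fixed to known_x0.
        x_prev = model.denoise_step(x_t, t)  # hypothetical one-step denoiser
        # Forward-diffuse the known pixels to the matching noise level and
        # paste them in, keeping the generated region consistent with them.
        a_bar = alphas_cumprod[t - 1]
        noised_known = a_bar.sqrt() * known_x0 + (1.0 - a_bar).sqrt() * torch.randn_like(known_x0)
        return mask * noised_known + (1.0 - mask) * x_prev

Because the mask operates directly on pixels, there's no encoder/decoder round-trip that could blur the fixed region, which is what makes pixel space attractive for this kind of editing.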


The "input image" is just the noisy sample from the previous timestep, yes.

The overall architecture diagram does not explicitly show the conditioning mechanism, which is a small separate network. For this paper, we only trained on class-conditional ImageNet and completely unconditional megapixel-scale FFHQ.
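
For the class-conditional case, the conditioning network is conceptually something like this (an illustrative sketch, not the exact module from the paper):

    import torch
    import torch.nn as nn

    class ClassConditioning(nn.Module):
        # Small separate network: maps a class index to an embedding
        # that the backbone consumes alongside the timestep embedding.
        def __init__(self, num_classes=1000, dim=256):
            super().__init__()
            self.embed = nn.Embedding(num_classes, dim)
            self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

        def forward(self, class_id):
            return self.mlp(self.embed(class_id))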

Training large-scale text-to-image models with this architecture is something we have not yet attempted, although there's no indication that it wouldn't work with a few tweaks.


Thank you, I'm not used to reading this kind of research paper but I think I got the gist of it now.

Can this architecture be used to distill models that need fewer timesteps, like LCMs or SDXL Turbo?


Both Latent Consistency Models and Adversarial Diffusion Distillation (the method behind SDXL Turbo) do not depend on any specific properties of the backbone. So, as Hourglass Diffusion Transformers are just a new kind of backbone that can be used just like the diffusion U-Nets in Stable Diffusion (XL), these methods should also be applicable to them.
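
In code terms, the distillation loop only ever touches the backbone through a generic call signature, so any backbone can be dropped in. A deliberately simplified sketch (this is neither LCM nor ADD verbatim):

    import torch
    import torch.nn.functional as F

    def distillation_step(student, teacher, x_t, t, cond):
        # Student and teacher are opaque denoisers with the same signature;
        # a diffusion U-Net or an HDiT works equally well here.
        with torch.no_grad():
            target = teacher(x_t, t, cond)  # teacher prediction (simplified)
        pred = student(x_t, t, cond)        # few-step student prediction
        return F.mse_loss(pred, target)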


Thanks a lot!

Yeah, the main motivation was finding a way to enable transformers to do high-resolution image synthesis: transformers are known to scale well to extreme, multi-billion-parameter sizes and typically offer superior coherence & composition in image generation, but current architectures are too expensive to train at scale on high-resolution inputs.

By using a hierarchical architecture with local attention at high-resolution levels (but retaining global attention at low-resolution levels), it becomes viable to apply transformers at these scales. Additionally, this architecture can now be trained directly on megapixel-scale inputs and generate high-quality results without having to progressively grow the resolution over the course of training or apply the other "tricks" typically needed to make models work well at these resolutions.
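
To make the local-attention idea concrete, here's a heavily simplified stand-in that attends within non-overlapping windows (the actual model uses neighborhood attention; the window size, head count, and partitioning below are illustrative):

    import torch
    import torch.nn as nn

    class LocalAttention(nn.Module):
        # Attention restricted to w-by-w windows: cost grows linearly with
        # the number of pixels instead of quadratically, which is what
        # makes the high-resolution levels affordable.
        def __init__(self, dim, window=8, heads=4):
            super().__init__()
            self.w = window
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):  # x: (B, H, W, C), H and W divisible by window
            B, H, W, C = x.shape
            w = self.w
            xw = x.reshape(B, H // w, w, W // w, w, C)
            xw = xw.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
            out, _ = self.attn(xw, xw, xw)
            out = out.reshape(B, H // w, W // w, w, w, C)
            return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

At the lowest-resolution levels of the hourglass, the token count is small enough that this can simply be swapped for full global attention.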


It's already a thing [1]. They also have a project website [2] with some nice videos, although the code hasn't yet been released.

[1] https://arxiv.org/abs/2308.09713

[2] https://dynamic3dgaussians.github.io/


The recording from the point of view of the football they were tossing at each other made me feel things. My friend mentioned that it's like the 'braindance' from Cyberpunk 2077.


Wow - it didn't occur to me but it feels exactly like a braindance.

