Seems like a solid paper from a skim through it. My rough summary:
The popular large-scale diffusion models like StableDiffusion are CNN-based at their heart, with attention layers sprinkled throughout. This paper builds on recent research exploring whether competitive image diffusion models can be built purely out of transformers, with no CNN layers.
In this paper they build a similar U-Net-like structure, but out of transformer layers, to improve efficiency compared to a straight Transformer. They also use local attention at the high-resolution levels to save on computational cost, but regular global attention in the middle to maintain global coherence.
Based on ablation studies this allows them to maintain or slightly improve FID score compared to Transformer-only diffusion models that don't do U-net like structures, but at 1/10th the computation cost. An incredible feat for sure.
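Roughly, I picture the layout like the sketch below (my own illustration, not their code: the module names are placeholders, a plain Transformer layer stands in for the neighborhood attention used at the outer levels, and the learnable-lerp skip merge is my guess at how the skip connections work):

```python
# Rough structural sketch of an hourglass / U-Net-style transformer for
# pixel-space diffusion. Not the paper's code: the outer (high-resolution)
# levels would use neighborhood (local) attention, replaced here by a plain
# TransformerEncoderLayer to keep the sketch short.
import torch
import torch.nn as nn


def merge_tokens(x, h, w):
    """2x2 token merging: halves the spatial resolution, 4x the channels."""
    b, _, c = x.shape
    x = x.view(b, h, w, c)
    x = x.view(b, h // 2, 2, w // 2, 2, c).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, (h // 2) * (w // 2), 4 * c)


def split_tokens(x, h, w):
    """Inverse of merge_tokens: doubles the resolution, quarters the channels."""
    b, _, c = x.shape
    x = x.view(b, h, w, 2, 2, c // 4).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, (2 * h) * (2 * w), c // 4)


class HourglassSketch(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        def mk(d):
            return nn.TransformerEncoderLayer(
                d, heads, dim_feedforward=4 * d, batch_first=True, norm_first=True
            )
        self.down = mk(dim)      # high-res level: local attention in the paper
        self.mid = mk(4 * dim)   # low-res bottleneck: global attention
        self.up = mk(dim)        # high-res level again
        # learnable interpolation weight for the U-Net-style skip (my assumption)
        self.skip_weight = nn.Parameter(torch.tensor(0.5))

    def forward(self, x, h, w):
        # x: (batch, h*w, dim) tokens from patchified pixels
        skip = self.down(x)                         # process at full resolution
        z = merge_tokens(skip, h, w)                # downsample tokens 2x
        z = self.mid(z)                             # global attention at low res
        z = split_tokens(z, h // 2, w // 2)         # upsample back
        z = torch.lerp(skip, z, self.skip_weight)   # learnable skip merge
        return self.up(z)


tokens = torch.randn(2, 16 * 16, 64)   # e.g. 16x16 tokens, 64 channels
out = HourglassSketch()(tokens, 16, 16)
print(out.shape)  # torch.Size([2, 256, 64])
```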
There are a variety of details: RoPE positional encoding, GEGLU activations, RMSNorm, learnable skip connections, learnable cosine-sim attention, neighborhood attention for the local attention, etc.
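For reference, GEGLU (from Shazeer's "GLU Variants Improve Transformer", which I assume is what's meant here) and RMSNorm are pretty simple; a textbook-style sketch, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GEGLU(nn.Module):
    """GELU-gated linear unit: GEGLU(x) = GELU(x W_g) * (x W_v)."""

    def __init__(self, dim, hidden):
        super().__init__()
        self.proj = nn.Linear(dim, 2 * hidden)

    def forward(self, x):
        gate, value = self.proj(x).chunk(2, dim=-1)
        return F.gelu(gate) * value


class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of the features, no mean subtraction."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.scale
```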
The biggest gains in FID occur when the authors use "soft-min-snr" as the loss function; FID drops from 41 to 28!
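My understanding (not verified against their exact formula, so treat this as an assumption) is that soft-min-snr is a softened version of Min-SNR-gamma loss weighting, replacing the hard min(SNR, gamma) with a harmonic-mean-style term:

```python
import torch


def min_snr_weight(snr, gamma=5.0):
    """Min-SNR-gamma weighting (Hang et al.): clamp the SNR-based loss weight."""
    return torch.minimum(snr, torch.full_like(snr, gamma))


def soft_min_snr_weight(snr, gamma=5.0):
    """My reading of soft-min-SNR: a smooth approximation of min(SNR, gamma).
    The exact form used in the paper is an assumption here."""
    return snr * gamma / (snr + gamma)


snr = torch.logspace(-2, 2, steps=5)  # signal-to-noise ratios across noise levels
print(min_snr_weight(snr))
print(soft_min_snr_weight(snr))
```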
Lots of ablations were done across all their changes (see Table 1).
Training is otherwise completely standard: AdamW (lr 5e-4, weight decay 0.01), batch size 256, constant LR, 400k steps for most experiments at 128x128 resolution.
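In PyTorch terms that recipe is just the following, assuming the 5e-4 / 0.01 numbers are learning rate and weight decay:

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the actual diffusion backbone
opt = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)

batch_size = 256
total_steps = 400_000  # most experiments, at 128x128 resolution
# constant LR: no scheduler, just opt.step() every iteration
```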
So yeah, overall seems like solid work that combines a great mixture of techniques and pushes Transformer based diffusion forward.
If scaled up I'm not sure it would be "revolutionary" in terms of FID compared to SDXL or DALLE3, mostly because SD and DALLE already use attention (obviating the scaling issue), plus lots of tricks like diffusion-based VAEs. But it's likely to provide a nice incremental improvement in FID, since in general Transformers perform better than CNNs unless the CNNs are _heavily_ tuned.
And being pixel based rather than latent based has many advantages.
FID doesn't reward high-resolution detail. The Inception feature extractor's input size is 299x299! So we are forced to downsample our FFHQ-1024 samples to compute FID.
It doesn't punish poor detail either! This advantages latent diffusion, which can claim to achieve a high resolution without actually needing to have correct textures to get good metrics.
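Concretely, computing FID on high-res samples means squashing them to Inception's 299x299 input first; something like this (bilinear resize just as an illustration; the exact resize kernel matters for comparability between reported FIDs):

```python
import torch
import torch.nn.functional as F

samples = torch.rand(4, 3, 1024, 1024)  # e.g. FFHQ-1024 samples in [0, 1]

# InceptionV3 (the FID feature extractor) expects 299x299 inputs, so all the
# 1024x1024 detail gets thrown away before the metric ever sees it.
downsampled = F.interpolate(samples, size=(299, 299), mode="bilinear",
                            antialias=True)
print(downsampled.shape)  # torch.Size([4, 3, 299, 299])
```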