The article fails to highlight some important elements of the current state of image synthesis.
It's important to emphasise the dichotomy between the two current approaches in image synthesis today. 1. Implicit distribution learning - VAE or autoregressive based techniques 2. Explicit learning of the distribution - GAN based models. The way these two model the distribution is fundamentally different.
Fundamentally, GANs present major drawbacks when it comes to actual inference during synthesis. There have been dozens of models with workarounds, but most of these introduce new challenges of their own, with instability and mode collapse being the primary ones.
VQVAE2, as the most advanced VAE-based technique, has eliminated the major drawbacks of VAEs and GANs and has produced phenomenal quality [1].
However, the main challenge in the area is not synthesising just any kind of image; VQVAE2 already does that very well. Where none of the current techniques succeed today is multi-object image synthesis. That requires a new paradigm in both architecture and distribution learning.
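For readers unfamiliar with what the "VQ" in VQVAE2 refers to, here is a minimal NumPy sketch of the vector-quantization bottleneck at its core (toy sizes and names of my own choosing, not code from [1]): continuous encoder outputs are snapped to their nearest entries in a learned codebook, and a prior is then learned over the resulting discrete code indices.

```python
import numpy as np

rng = np.random.default_rng(0)

K, C = 512, 64                      # codebook size and code dimensionality (toy values)
codebook = rng.normal(size=(K, C))  # stands in for the learned embeddings e_1..e_K

def quantize(z_e):
    """Snap each encoder output vector to its nearest codebook entry.

    z_e: (N, C) continuous encoder outputs.
    Returns the discrete indices (which an autoregressive prior, PixelCNN
    in VQ-VAE-2, is later trained to model) and the quantized vectors.
    """
    # squared distance between every encoder vector and every code
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d.argmin(axis=1)                                       # nearest code per vector
    return idx, codebook[idx]

z_e = rng.normal(size=(16, C))      # stand-in for encoder outputs at 16 spatial positions
idx, z_q = quantize(z_e)
print(idx[:8])                      # the discrete codes the prior would be trained on
```

VQVAE2 actually stacks two such quantized latent maps (a top and a bottom level) and learns autoregressive priors over both; the sketch above only shows the single-level bottleneck.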
You mixed up implicit and explicit models. For anyone interested in the difference: implicit models such as GANs don't allow you to evaluate the probability density over datapoints; you can only sample from some surrogate model of the distribution, learned by minimizing some 'distance' between the surrogate and the true empirical distribution.
'Explicit' models (I think this term is nonstandard) parameterize the density directly and fit the parameters via maximum likelihood. This allows one, in theory, to both directly evaluate the density and sample from the learned distribution. VAEs (which only give a lower bound on the density), autoregressive models, and normalizing flows all fall under this category.
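To make the difference concrete, here is a minimal, self-contained sketch (toy NumPy models with made-up sizes, not code from the article or [1]): the autoregressive model admits an exact density query, while the GAN-style generator only admits sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- 'Explicit' model: a toy autoregressive Bernoulli model over D binary pixels.
# p(x) = prod_d p(x_d | x_<d), so log p(x) is a sum of conditional log-probs
# that can be evaluated exactly for any datapoint.
D = 8
mask = np.tril(np.ones((D, D)), k=-1)        # row d may only look at x_<d
W = rng.normal(0, 0.1, size=(D, D)) * mask   # (untrained) autoregressive weights

def ar_log_prob(x):
    """Exact log p(x) under the autoregressive model."""
    p = 1.0 / (1.0 + np.exp(-(W @ x)))       # p(x_d = 1 | x_<d) for every d
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# --- 'Implicit' model: a GAN-style generator. You can push noise through it
# to draw samples, but the density it defines is intractable to evaluate.
V = rng.normal(size=(D, 4))

def gan_sample():
    z = rng.normal(size=4)                   # latent noise
    return (V @ z > 0).astype(float)         # deterministic push-forward of z

x = gan_sample()
print("exact AR log p(x):", ar_log_prob(x))  # density query: only the explicit model has one
print("GAN sample:", x)                      # sampling: all the implicit model offers
# There is no gan_log_prob(x): the generator defines its distribution only
# implicitly, as the push-forward of the noise distribution through V.
```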
Note that while it is theoretically possible for 'explicit' models to go in both directions (sample and evaluate), one direction may be much more efficient than the other for certain models. For autoregressive models, e.g., the first two pages of [1] give a good explanation of why.
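A hypothetical illustration of that efficiency gap, reusing the toy autoregressive model from the sketch above: evaluating log p(x) for a known x is a single parallel pass over all the conditionals (teacher forcing, since every prefix x_<d is already available), while sampling must produce the D dimensions one at a time because each conditional depends on the previously sampled prefix.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
W = rng.normal(0, 0.1, size=(D, D)) * np.tril(np.ones((D, D)), k=-1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_prob(x):
    # Evaluation: all conditionals p(x_d | x_<d) are computed in one matrix
    # product, because the true prefix of x is already known everywhere.
    p = sigmoid(W @ x)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def sample():
    # Sampling: inherently sequential. x_d cannot be drawn until x_<d exists,
    # so generation costs D dependent steps instead of one parallel pass.
    x = np.zeros(D)
    for d in range(D):
        p_d = sigmoid(W[d] @ x)   # uses only the already-sampled prefix x_<d
        x[d] = float(rng.random() < p_d)
    return x

x = sample()
print("log p(x) =", log_prob(x))
```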
As shown in a figure in the VQGAN section, VQGAN offers superior quality to VQVAE2 for a given compute budget, and since the generator of VQGAN is based on an architecture similar to VQVAE-1/2 (as is DALL-E's), it does not suffer from the mode collapse or instability you mentioned.
~I think you have these two backwards: "1. Implicit distribution learning - VAE or autoregressive based techniques 2. Explicit learning of the distribution - GAN based models. The way these two model the distribution is fundamentally different."~
Edit: someone else pointed this out hours ago and provided a much more detailed answer.
[1](https://arxiv.org/abs/1906.00446)