
Author here. There are a few reasons, but the biggest one is simply the compression ratio.

The OG neural audio codec SoundStream (whose first author is Neil, now at Kyutai) can sound decent at 3kbps, whereas MP3 typically sits at around 128kbps, as you say. Interestingly, it was originally developed for audio compression in Google Meet, not for LLMs. Today's neural codecs have even better compression.

The more modern MP3 alternative is Opus, which can work ok at 12kbps, but it's still less efficient than neural audio codecs. However, these traditional codecs are a lot less CPU-hungry, so they have that going for them.



That makes sense.

Why RVQ though, rather than using the raw VAE embedding?

If I compare rvq-without-quantization-v4.png with rvq-2-level-v4.png, the quality seems oddly similar, but the former takes a 32-sized vector, while the latter takes two 32-sized (one-hot) vectors (2 = number of levels, 32 = number of quantization cluster centers). Isn't that more?


I had a part about this but I took it out: for compression, you could keep the embeddings unquantized and it would still compress quite well, depending on the embedding dimension and the number of quantization levels.
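
To make the quantized side concrete, here's a toy sketch of what residual vector quantization does (illustrative numpy, not any codec's actual code): each level only stores an index into its codebook, so two levels of 32 centroids cost 2 * log2(32) = 10 bits per frame, no matter how many dimensions the embedding has.

    import numpy as np

    def rvq_encode(x, codebooks):
        # Each level quantizes the residual left over by the previous levels,
        # and only the chosen index per level needs to be stored/transmitted.
        indices, residual = [], x.copy()
        for codebook in codebooks:
            idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
            indices.append(idx)
            residual = residual - codebook[idx]
        return indices

    def rvq_decode(indices, codebooks):
        # The reconstruction is just the sum of the selected centroids.
        return sum(cb[i] for cb, i in zip(codebooks, indices))

    # 2 levels x 32 centroids over a 32-dim embedding: 10 bits per frame
    # instead of 32 floats.
    rng = np.random.default_rng(0)
    codebooks = [rng.normal(size=(32, 32)) for _ in range(2)]
    x = rng.normal(size=32)
    x_hat = rvq_decode(rvq_encode(x, codebooks), codebooks)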

But categorical distributions are better for modelling. It's a little difficult to explain here without using diagrams. The intuition is that if you try to have a model predict the next embedding and not the next token, you can't model multimodal distributions - you'll end up predicting the mean of the possible continuations and not the mode, which is not what you want.
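
A toy example of the failure mode (just an illustration, not from the post): if the next value is 0 or 1 with equal probability, an MSE-trained regression head converges to 0.5, a value that never actually occurs, while a categorical head keeps both modes and can be sampled from.

    import torch
    import torch.nn.functional as F

    # Bimodal "next value": 0.0 or 1.0 with equal probability.
    targets = torch.tensor([0.0, 1.0]).repeat(500)

    # Regression head (predict the embedding directly), trained with MSE.
    pred = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([pred], lr=0.1)
    for _ in range(200):
        loss = ((pred - targets) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    print(pred.item())  # ~0.5: the mean of the two modes, not either mode

    # Categorical head (predict a token), trained with cross-entropy.
    logits = torch.zeros(2, requires_grad=True)
    opt = torch.optim.SGD([logits], lr=0.1)
    labels = targets.long()
    for _ in range(200):
        loss = F.cross_entropy(logits.expand(len(labels), 2), labels)
        opt.zero_grad(); loss.backward(); opt.step()
    print(torch.softmax(logits, dim=0))  # ~[0.5, 0.5]: both modes are kept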

Check out Section 5.3 and Figure 6 from PixelRNN, where they discuss this phenomenon: https://arxiv.org/pdf/1601.06759

At the bottom of the blog, I link two articles that do make continuous embeddings work. One of them is the Kyutai paper Continuous Audio Language Models: https://arxiv.org/abs/2509.06926


Hmm, I think a mixture of Beta distributions could work just as well as categorical here. I'm going to try it with PixelRNN, but it's going to take hours or days to train (it's a very inefficient, hard-to-parallelize architecture). I'll report back tomorrow.


Update 2:

After another 24 hours of training and around 100 epochs, we get down to 4.4 bits/dim and colors are starting to emerge[1]. However, an issue a friend brought up is that the log-likelihood of a Beta distribution weights values near 0 and 1 much more heavily:

     log(Beta likelihood) = (alpha - 1) * log(x) + (beta - 1) * log(1 - x) - log B(alpha, beta)

     (for alpha < 1 or beta < 1 the density has a pole at x = 0 or x = 1,
      so the log-likelihood of values exactly at the endpoints blows up to +oo)
This means the model is rewarded for pushing probability mass onto the endpoints, so we should see most outputs be pure colors: black, white, red, blue, green, cyan, magenta, or yellow. Indeed, 3.6% of the channels are 0 or 255, up from 1.4% after 50 epochs[2]. Apparently, an earth-mover loss might be better:

    E_{x ~ output distribution}[|correct - x|]
I could retrain this for another day or two, but PixelRNN is really slow, and I want to use my GPU for other things. Instead, I trained a 50x faster PixelCNN for 50 epochs with this new loss and... it just went to the average pixel value (0.5). There's probably a way to train a mixture of betas, but I haven't figured it out yet.
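
For the record, this is roughly what that loss looks like if you discretize the output distribution into 256 bins (a sketch only; my actual runs parameterize a mixture of Betas, so the details differ):

    import torch

    def earth_mover_loss(probs, target):
        # probs: (batch, 256) predicted distribution over pixel values,
        # target: (batch,) ground-truth pixel values scaled to [0, 1].
        # Computes E_{x ~ probs}[|target - x|] with bin centers at k/255.
        bins = torch.arange(256, dtype=probs.dtype, device=probs.device) / 255.0
        return (probs * (target.unsqueeze(1) - bins).abs()).sum(dim=1).mean()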

[1]: https://imgur.com/kGbERDg [2]: https://imgur.com/iJYwHr0


Update 1: After ~12 hours of training and 45 epochs on CIFAR, I'm starting to see textures.

https://imgur.com/MzKUKhH


Update 3:

Okay, so my PixelCNN masking was wrong... which is why it went to the mean. The earth-mover loss did get better results than negative log-likelihood, but I found a better solution!

The issue with negative log-likelihood was that the network could optimize solely around zero and one, because the density has poles there. The key insight is that the color value in the image isn't really an exact zero or one: if we are given #00, all we know is that the real-world brightness fell somewhere between #00 and #01, so we should integrate the probability density function from 0 to 1/256 to get the likelihood.
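
In code, the discretized likelihood is roughly this (a sketch; beta_cdf stands in for whatever incomplete-beta implementation you end up rolling, since torch.distributions.Beta doesn't ship a cdf()):

    import torch

    def discretized_beta_log_likelihood(pixels, alpha, beta, beta_cdf, n_bins=256):
        # pixels: integer tensor of values in [0, 255].
        # Instead of evaluating the density at the stored value, integrate it
        # over the bucket [k/256, (k+1)/256) that the quantized value represents.
        lo = pixels.float() / n_bins
        hi = (pixels.float() + 1.0) / n_bins
        prob = beta_cdf(hi, alpha, beta) - beta_cdf(lo, alpha, beta)
        return torch.log(prob.clamp_min(1e-12))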

It turns out PyTorch does not have a good implementation of Beta.cdf(), so I had to roll my own. Realistically, I just asked the chatbots to tell me what good algorithms there were and to write me code. I ended up with two:

(1) There's a known continued-fraction form for the CDF, so combined with Lentz's algorithm it can be computed.

(2) Apparently there's a pretty good closed-form approximation as well (Temme [1]).

The first one was a little unstable in training, but worked well enough (output: [2], color hist: [3]). The second was a little more stable in training, but had issues with NaNs near zero and one, so I had to clamp things there, which makes it a little less accurate (output: [4], color hist: [5]).
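
For reference, option (1) is basically the textbook incomplete-beta recipe (the continued fraction evaluated with the modified Lentz method, as in Numerical Recipes); a plain-Python sketch of it, before any batching or differentiability for PyTorch, looks something like this:

    import math

    def _betacf(a, b, x, max_iter=200, eps=3e-7, tiny=1e-30):
        # Continued fraction for the incomplete beta function,
        # evaluated with the modified Lentz method.
        qab, qap, qam = a + b, a + 1.0, a - 1.0
        c, d = 1.0, 1.0 - qab * x / qap
        d = tiny if abs(d) < tiny else d
        d = 1.0 / d
        h = d
        for m in range(1, max_iter + 1):
            m2 = 2 * m
            # Even step of the continued fraction.
            aa = m * (b - m) * x / ((qam + m2) * (a + m2))
            d = 1.0 + aa * d
            d = tiny if abs(d) < tiny else d
            c = 1.0 + aa / c
            c = tiny if abs(c) < tiny else c
            d = 1.0 / d
            h *= d * c
            # Odd step.
            aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
            d = 1.0 + aa * d
            d = tiny if abs(d) < tiny else d
            c = 1.0 + aa / c
            c = tiny if abs(c) < tiny else c
            d = 1.0 / d
            delta = d * c
            h *= delta
            if abs(delta - 1.0) < eps:
                break
        return h

    def beta_cdf(x, a, b):
        # Regularized incomplete beta function I_x(a, b), i.e. Beta(a, b).cdf(x).
        if x <= 0.0:
            return 0.0
        if x >= 1.0:
            return 1.0
        log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
        front = math.exp(log_front)
        # Use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a) for better convergence.
        if x < (a + 1.0) / (a + b + 2.0):
            return front * _betacf(a, b, x) / a
        return 1.0 - front * _betacf(b, a, 1.0 - x) / b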

The bits/dim gets down to ~3.5 for both of these, which isn't terrible, but there's probably something that can be done better to get it below 3.0. I don't have any clean code to upload, but I'll probably do that tomorrow and edit (or reply to) this comment. But that's it for the experiments!

Anyway, the reason for this experiment is that this sentence was really bothering me:

> But categorical distributions are better for modelling.

And when I investigated why you said that, it turns out the PixelRNN authors used a mixture of Gaussians, and even said they're probably losing some bits because Gaussians go out of bounds and need to be clipped! So I really wanted to say, "seems like a skill issue, just use Beta distributions," but then I had to go check whether that actually worked. My hypothesis was that Betas should work even better than a categorical distribution, because the categorical model has to learn that nearby values are indeed nearby, while this is baked into the Beta model. We see the issue show up in the PixelRNN paper, where their outputs are very noisy compared to mine (histogram for a random pixel: [6]).

[1]: https://ir.cwi.nl/pub/2294/2294D.pdf [2]: https://imgur.com/e8xbcfu [3]: https://imgur.com/z0wnqu3 [4]: https://imgur.com/Z2Tcoue [5]: https://imgur.com/p7sW4r9 [6]: https://imgur.com/P4ZV9n4



