I think it's easy to explain. If we split all those images into small 8x8 chunks and put the chunks into a fuzzy, slightly lossy hashtable, we'll see that many chunks are nearly identical and can be merged into one. To address this "space of 8x8 chunks" we'll apply PCA to them, much as JPEG applies the DCT (a fixed approximation of PCA for natural images) to its 8x8 blocks, and keep only the most significant components of each chunk's PCA vector.
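Here is a rough NumPy sketch of that idea, just to make it concrete. The function names, the synthetic test image, and the `step` quantization size are all my own assumptions, and this illustrates the analogy only, not how SD is actually implemented.

```python
import numpy as np

def extract_patches(img, size=8):
    """Split a 2-D grayscale image into non-overlapping size x size chunks (one row each)."""
    h, w = img.shape
    h, w = h - h % size, w - w % size  # drop ragged edges
    blocks = img[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.transpose(0, 2, 1, 3).reshape(-1, size * size)

def fit_pca(patches, k):
    """Return the mean chunk and the top-k principal directions (rows)."""
    mean = patches.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, vt[:k]

def encode(patches, mean, components):
    """Project chunks onto the kept components: the chunk's 'address'."""
    return (patches - mean) @ components.T

def decode(codes, mean, components):
    """Approximate reconstruction from the kept components only."""
    return codes @ components + mean

def merge_similar(codes, step):
    """Toy 'fuzzy hashtable': quantize each code, so nearby chunks share one slot."""
    table, slots = {}, []
    for code in codes:
        key = tuple(np.round(code / step).astype(int))
        slots.append(table.setdefault(key, len(table)))
    return table, np.array(slots)

# Hypothetical stand-in image: a smooth gradient plus a little noise.
rng = np.random.default_rng(0)
image = np.linspace(0, 1, 64)[None, :] * np.ones((64, 1)) + 0.01 * rng.normal(size=(64, 64))

patches = extract_patches(image)
mean, comps = fit_pca(patches, k=8)
codes = encode(patches, mean, comps)
table, slots = merge_similar(codes, step=0.5)
print(len(patches), "chunks stored in", len(table), "slots")
```

A bigger `step` merges more chunks into fewer slots at the cost of fidelity, and a smaller `k` makes the addresses shorter but the reconstruction blurrier.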

So in essence, this SD model is like a Library of Alexandria of visual elements, arranged on multidimensional shelves.