The key idea of Latent MHA is that "regular" multi-headed attention requires you to keep a bunch of giant key-value (KV) matrices around in memory to do inference. The "Latent" part just means that DeepSeek takes the `n` KV matrices in a given `n`-headed attention block and replaces them with a lower-rank approximation (think of this as compressing the matrices), so that they take up less VRAM on the GPU at the cost of a little extra compute and a little lost accuracy. So it's not caching, strictly speaking, but compression that trades compute for better memory usage, which is a good deal because the KV matrices are one of the more expensive parts of this transformer architecture. MoE addresses the other expensive part (the fully-connected layers) by making it so only a subset of the fully-connected layers is active on any given forward pass.
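
Here's a rough sketch of the low-rank idea in PyTorch, to make the compression concrete. This is not DeepSeek's actual implementation, and the class name, layer names, and dimensions are all made-up assumptions; it just shows the shape of the trick: project the hidden state down to a small latent, keep that around instead of the full per-head keys and values, and project back up when attention needs them.

```python
# Illustrative sketch only -- not DeepSeek's code; names and sizes are assumptions.
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    def __init__(self, d_model=4096, n_heads=32, d_head=128, d_latent=512):
        super().__init__()
        # Down-projection: compress the hidden state into a small latent vector.
        # This latent is what you'd keep in memory instead of full K and V.
        self.down = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections: reconstruct per-head keys and values from the latent
        # on the fly, paying a little extra compute at inference time.
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, h):                 # h: (batch, seq, d_model)
        latent = self.down(h)             # (batch, seq, d_latent) -- the compressed representation
        k = self.up_k(latent)             # (batch, seq, n_heads * d_head)
        v = self.up_v(latent)
        b, s, _ = h.shape
        k = k.view(b, s, self.n_heads, self.d_head)
        v = v.view(b, s, self.n_heads, self.d_head)
        return latent, k, v

# With these (made-up) numbers, each token stores d_latent = 512 values instead of
# 2 * n_heads * d_head = 8192 for full keys and values -- a 16x memory reduction,
# bought with the extra up-projection matmuls and some approximation error.
```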