First of all, the process of converting a stereo pair into a flat image and a disparity map would be lossy and introduce artifacts. Even assuming you could accurately capture pixels that were occluded from one viewpoint and not the other, the approach is inherently unable to handle effects such as partially-transparent or glossy surfaces.
Secondly, the limiting factor described in the post is not space efficiency, it's decoding performance. It doesn't do much good to halve the amount of data required to represent a frame if it takes twice as long to reconstruct the raw pixels for display.
The difference between two images can be encoded losslessly, if you like.
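A minimal sketch of that point, assuming 8-bit images held as NumPy arrays: if you store one view plus the wrapped per-pixel difference, the other view can be reconstructed bit-exactly, so the difference representation itself loses nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
left = rng.integers(0, 256, size=(4, 4), dtype=np.uint8)   # stand-in for the left view
right = rng.integers(0, 256, size=(4, 4), dtype=np.uint8)  # stand-in for the right view

# Modulo-256 difference: uint8 array arithmetic wraps, so no information is lost.
diff = right - left

# Reconstruct the right view exactly from the left view and the difference.
reconstructed = left + diff
assert np.array_equal(reconstructed, right)
```

Whether the codec then compresses that difference lossily is a separate choice; the differencing step itself is reversible.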
> partially-transparent or glossy surfaces.
It's all just RGB values; there is no gloss or transparency in an image. (Image layers can have transparency for compositing, but that's obviously something else.)
If audio encoding can have "joint stereo", why not visual coding?
Many areas of a stereo image are nearly identical, like the distant background.