Another approach to space efficiency would be to do away with the need for dual video streams entirely and just average the stereo images together to form a single monocular image. Then, send a disparity map along with the monocular video. Decode the mono video and use the disparity map to interpolate the view of either eye. You’ll have all the information you need for reconstruction and the disparity map can be efficiently compressed via normalization and perhaps even by sending just the vectors of the contours.
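A minimal sketch of what that reconstruction step could look like, assuming a per-pixel horizontal disparity map in pixel units (nearest-neighbour row shifts, clamped at the borders, no hole filling, so occlusions are simply ignored):

```python
import numpy as np

def synthesize_views(mono: np.ndarray, disparity: np.ndarray):
    """Warp a mono frame into approximate left/right eye views.

    mono: HxWx3 uint8 frame; disparity: HxW float array of pixel offsets
    between the two eyes (assumed layout, not any particular standard).
    """
    h, w, _ = mono.shape
    xs = np.arange(w)
    left = np.empty_like(mono)
    right = np.empty_like(mono)
    for y in range(h):
        # Each eye samples the mono row shifted by half the disparity in
        # opposite directions; nearest-neighbour, clamped at the image edge.
        half = disparity[y] / 2.0
        left[y] = mono[y, np.clip(np.round(xs + half).astype(int), 0, w - 1)]
        right[y] = mono[y, np.clip(np.round(xs - half).astype(int), 0, w - 1)]
    return left, right
```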
Another idea is to take advantage of the fact that head motions are really just a translation of the camera. There's no need to resend pixels that have merely shifted position unless they have also changed over time.
If I were designing such a system, I'd take advantage of the fact that not much in the scene fundamentally changes when you move your head: maintain some sort of state and only request the chunks of pixels that are actually needed. You wouldn't even have to use a traditional video codec, as preserving state would be far more efficient than thinking in terms of flat pixels and video.
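Roughly like this, for the stateful "only request what changed" idea; the tile size and change threshold here are arbitrary illustration values, not anything a real system mandates:

```python
import numpy as np

TILE = 32          # tile edge length in pixels (assumption)
THRESHOLD = 2.0    # mean absolute difference that counts as "changed"

def changed_tiles(state: np.ndarray, frame: np.ndarray):
    """Yield (y, x, pixels) for tiles that differ from the cached state."""
    h, w, _ = frame.shape
    for y in range(0, h, TILE):
        for x in range(0, w, TILE):
            old = state[y:y+TILE, x:x+TILE].astype(np.int16)
            new = frame[y:y+TILE, x:x+TILE].astype(np.int16)
            if np.abs(new - old).mean() > THRESHOLD:
                yield y, x, frame[y:y+TILE, x:x+TILE]

def apply_tiles(state: np.ndarray, tiles):
    """Patch the cached state with the received tiles."""
    for y, x, pixels in tiles:
        state[y:y+TILE, x:x+TILE] = pixels
    return state
```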
> Decode the mono video and use the disparity map to interpolate the view of either eye.
By "disparity map" are you thinking something like a heightmap applied to the scene facing the viewer and then you use that to skew things for each eye?
If so, how would that handle parts of the scene that are occluded/revealed to one eye but not the other?
How does video encoding like H.264 handle parts of a scene that are occluded in one frame, but not occluded in the next frame?
A three inch difference between two cameras producing simultaneous frames is similar to a three inch sideways step of one camera in time between two frames.
True, occlusions would be a problem, but we're talking about fake autostereoscopic 3D here, where most of the stereo rigs used for capture have only a modest baseline. Almost all of the depth perception comes from disparity. Regions occluded in one view would still be visible with the averaging method I described, and would land at the depth plane of the occluder, which is probably a good guess anyway. It's not as if your other eye would receive a correspondence for an occluded point in the real world.
FYI, there's online software[1] to recreate 3D/stereoscopic 3D imagery from the depth-enabled photos taken e.g. by Moto G5S (which has a dual-camera setup that computes the depth map, but no API to extract/store the image taken by the other camera).
My personal opinion is that true stereoscopic images feel better when there's enough detail; those occlusions do matter. For some imagery it doesn't matter as much though.
First of all, the process of converting a stereo pair into a flat image and a disparity map would be lossy and introduce artifacts. Even assuming you could accurately capture pixels that were occluded from one viewpoint and not the other, the approach is inherently unable to handle effects such as partially-transparent or glossy surfaces.
Secondly, the limiting factor described in the post is not space efficiency, it's decoding performance. It doesn't do much good to halve the amount of data required to represent a frame if it takes twice as long to reconstruct the raw pixels for display.
The difference between two images can be lossless, if you like.
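For instance, keeping one image plus a signed residual round-trips bit for bit (whether that residual then compresses well is a separate question):

```python
import numpy as np

left = np.random.randint(0, 256, (4, 4, 3), dtype=np.uint8)
right = np.random.randint(0, 256, (4, 4, 3), dtype=np.uint8)

residual = right.astype(np.int16) - left.astype(np.int16)    # signed difference
restored = (left.astype(np.int16) + residual).astype(np.uint8)

assert np.array_equal(restored, right)   # exact reconstruction
```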
> partially-transparent or glossy surfaces.
It's all just RGB values; there is no gloss or transparency in an image. (Image layers can have transparency for compositing, but that's obviously something else.)
If audio encoding can have "joint stereo", why not video encoding?
Many areas of a stereo image are nearly identical, like the distant background.
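A rough image analogue of mid/side joint stereo, just to make the idea concrete (this is a sketch, not a feature of any existing video codec): store the sum and difference of the two eyes instead of the eyes themselves. Wherever the views agree, like that distant background, the side channel is all zeros, which is exactly what an entropy coder likes, and reconstruction is exact.

```python
import numpy as np

def joint_encode(left: np.ndarray, right: np.ndarray):
    l, r = left.astype(np.int16), right.astype(np.int16)
    return l + r, l - r               # "mid" (sum) and "side" (difference)

def joint_decode(mid: np.ndarray, side: np.ndarray):
    left = (mid + side) // 2          # (l+r) + (l-r) == 2l
    right = (mid - side) // 2         # (l+r) - (l-r) == 2r
    return left.astype(np.uint8), right.astype(np.uint8)
```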
Yeah, there's a lot you could do, but unfortunately the only thing that makes decoding high-resolution video feasible on these devices is fixed-function video decoding hardware, which can't support new ideas like this. You'd have to lobby standards bodies to add VR-specific features to codecs and then wait many years for hardware to implement them.
What you are suggesting is essentially a new codec. It sounds like a good idea; however, that's a thing for the future.
The "disparity map" you suggest seems to exist in 3D Blu-rays (Multiview Video Coding), but there may be some technical limitations that make it unsuited for the Oculus Go.
Sure, those ideas might hold up for objects far away from the eyes, but for nearby objects there can be a pretty big difference between what each eye sees. I think the human brain would quickly call BS on an image processed through that kind of compression, and it would not be very immersive or realistic.
There is a difference between what the successive frames of a video depict. Yet, video compression heavily relies on encoding just the differences between successive frames, which is very effective.
I think this more-or-less corresponds to the first approach you described. It's part of the Multiview Video Coding amendment to H.264: https://en.wikipedia.org/wiki/2D_plus_Delta