From my experience, NeRF works great, but it depends on highly accurate camera pose information. Unless the VR device has this baked in, one must run a COLMAP-style or SfM-style process to generate those camera extrinsics. Is there anything special that HybridNeRF does around this?
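For context, by a "COLMAP-style process" I mean something roughly like this pycolmap sketch (paths are placeholders, and I have no idea what tooling HybridNeRF actually uses internally):

    import os
    import pycolmap

    image_dir = "frames/"   # images captured by the headset/camera (placeholder path)
    db_path = "colmap.db"
    out_dir = "sparse/"
    os.makedirs(out_dir, exist_ok=True)

    pycolmap.extract_features(db_path, image_dir)   # detect local features per image
    pycolmap.match_exhaustive(db_path)              # pairwise feature matching
    maps = pycolmap.incremental_mapping(db_path, image_dir, out_dir)

    # each reconstruction in `maps` stores the estimated per-image camera poses,
    # i.e. the extrinsics a NeRF pipeline needs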
Congrats on the paper! Any chance the code will be released?
Also, I’d be curious to hear: what are you excited about in terms of future research ideas?
Personally I’m excited by the trend of eliminating the need for traditional SfM preprocessing (sparse point clouds via COLMAP, camera pose estimation, etc.).
Thank you! The code is unlikely to be released (it's built upon Meta-internal codebases that I no longer have access to post-internship), at least not in the form we specifically used at submission time. The last time I caught up with the team, someone was expressing interest in releasing some broadly useful rendering code, but I really can't speak on their behalf, so no guarantees.
IMHO it's a really exciting time to be in the neural rendering / 3D vision space - the field is moving quickly and there's interesting work across all dimensions. My personal interests lean towards large-scale 3D reconstruction, and to that end, eliminating the need for traditional SfM/COLMAP preprocessing would be great. There's a lot of relevant recent work (https://dust3r.europe.naverlabs.com/, https://cameronosmith.github.io/flowmap/, https://vggsfm.github.io/, etc.), but scaling these methods beyond several dozen images remains a challenge.

I’m also really excited about using learned priors to improve NeRF quality in underobserved regions (https://reconfusion.github.io). IMO these priors will be super important to enabling dynamic 4D reconstruction (since it’s otherwise infeasible to directly observe every space-time point in a scene).

Finally, making NeRF environments more interactive (as other posts have described) would unlock many use cases, especially in simulation (ie: for autonomous driving). This is kind of tricky for implicit representations (like the original NeRF and this work), but there have been some really cool papers in the 3D Gaussian space (https://xpandora.github.io/PhysGaussian/) that are exciting.
I'm interested in using NeRF to generate interpolated frames from a set of images - I want to do a poor man's animation. I'm interested in finding a NeRF implementation with code, but it feels hard to find. Do you know of a good starting point? I tried running Nerfstudio and the results weren't great.
I might be misunderstanding your use case, but are you trying to interpolate movement between frames (ie: are you trying to reconstruct a dynamic 4D scene or is the capture fully static?)
If you are trying to capture dynamics, most of the Nerfstudio methods are geared towards static captures and will give poor results for scenes with movement. There are many dynamic NeRF works out there - for example, https://dynamic3dgaussians.github.io/ and https://github.com/andrewsonga/Total-Recon both provide code if you want to play around. With that being said, robust 4D reconstruction is still very much an open research problem (especially when limited to monocular RGB data / casual phone captures). I'd expect a lot of movement in the space in the months/years to come!
I'm trying to recreate my experiences with claymation when I was a kid. I want to take a picture of an object, like a Lego figure, then move it slightly, take another picture, then move the figure slightly again. Once I have some frames, I want to use a NeRF to interpolate between those frames.
When I was young and doing claymation, I would move the figure, shoot two frames, and do that 12 times per second of film. And the lights I used were often so hot that my clay would melt and alter the figure. It was a chore.
I thought I could capture fewer in-between frames and let the NeRF figure out the interpolation, and perhaps get some weird side effects - especially if it hallucinates.
I'm not sure if a NeRF is the right approach, but it seems like a good starting point.
Re: the 4FPS example, if one renders a VR180 view as an equirectangular image and lets the headset handle the 6DOF head movement, then 4FPS is plenty - especially if there’s one render per eye, and certainly if there are no objects within a meter of the ego camera.
So your motivating problem does not exist.
More FPS is better, and yes, we all want to find a hybrid of NeRF and splats that works well, but then you should emphasize your theoretical and experimental contributions. Flatly claiming that 4FPS doesn’t work is specious to most readers. Even Deva knows this is too aggressive for a paper like this.
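To be concrete about the equirectangular rendering I mean above: here's a rough numpy sketch (my own shorthand, not anything from the paper) of the per-pixel ray directions for a 180° equirect view that the headset can then reproject:

    import numpy as np

    def vr180_ray_directions(height, width):
        # pixel centers -> longitude in [-pi/2, pi/2], latitude in [-pi/2, pi/2]
        lon = (np.arange(width) + 0.5) / width * np.pi - np.pi / 2
        lat = np.pi / 2 - (np.arange(height) + 0.5) / height * np.pi
        lon, lat = np.meshgrid(lon, lat)
        # spherical -> unit direction (x right, y up, -z forward)
        return np.stack([np.cos(lat) * np.sin(lon),
                         np.sin(lat),
                         -np.cos(lat) * np.cos(lon)], axis=-1)

    # render this once per eye (ray origins offset by half the IPD along x); the
    # headset handles head motion by reprojecting, so the NeRF itself only needs
    # to refresh at a few FPS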
Can regular phones capture the data required? How does one get into this as a hobbyist? I’m interested in the possibilities of scanning coral reefs and other ecological settings.
One of the datasets we evaluated against in our paper uses a bespoke capture rig (https://github.com/facebookresearch/EyefulTower?tab=readme-o...) but you can definitely train very respectable NeRFs using a phone camera. In my experience it's less about camera resolution and more about getting a good capture - many NeRF methods assume that the scene is static, so minimizing things like lighting changes and transient shadows can make a big difference. If you're interested in getting your feet wet, I highly recommend Nerfstudio (https://docs.nerf.studio)!
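If it helps, the basic Nerfstudio flow for a phone capture looks roughly like this (a hedged sketch with placeholder paths - double-check the flags against the current docs):

    import subprocess

    capture = "reef_capture/video.mp4"   # placeholder path to a phone video
    processed = "reef_processed"

    # 1. Estimate camera poses (runs COLMAP under the hood) and organize the data
    subprocess.run(["ns-process-data", "video",
                    "--data", capture, "--output-dir", processed], check=True)

    # 2. Train a NeRF - nerfacto is the sensible default method
    subprocess.run(["ns-train", "nerfacto", "--data", processed], check=True)

    # ns-train prints a local viewer URL while training so you can inspect the
    # reconstruction interactively

The same flow works for a folder of still photos via ns-process-data images.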