> Do you need position data to go along with the photos or just the photos?
Short answer: Yes.
Long answer: Yes, but it can typically be derived from the images themselves. Structure-from-motion methods are used to estimate the lens parameters and camera pose for each photo in the training set. Those are then used to train both Zip-NeRF (our teacher) and SMERF (our model).
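For anyone curious what that derived pose data looks like: COLMAP, a common structure-from-motion tool, writes each camera pose as a unit quaternion plus a translation (a world-to-camera transform) in its `images.txt` output. A minimal sketch of recovering the camera position in world space from one such pose; the pose values below are made up for illustration:

```python
import numpy as np

def quat_to_rot(qw, qx, qy, qz):
    """Convert a unit quaternion (COLMAP order: QW QX QY QZ) to a 3x3 rotation matrix."""
    return np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qw*qz),     2*(qx*qz + qw*qy)],
        [2*(qx*qy + qw*qz),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qw*qx)],
        [2*(qx*qz - qw*qy),     2*(qy*qz + qw*qx),     1 - 2*(qx*qx + qy*qy)],
    ])

def camera_center(qw, qx, qy, qz, tx, ty, tz):
    """The stored pose maps world points to camera space as x_cam = R x_world + t,
    so the camera position in world space is C = -R^T t."""
    R = quat_to_rot(qw, qx, qy, qz)
    t = np.array([tx, ty, tz])
    return -R.T @ t

# Hypothetical pose: identity rotation, camera one unit behind the world origin.
print(camera_center(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0))
```

This is just the pose-extraction step; in practice the whole pipeline (feature extraction, matching, bundle adjustment) is run by the SfM tool before any NeRF training starts.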
For VR, the depth data around those reflections is going to be very weird, but maybe it wouldn’t be so bad once you’re in the headset.