If your entire existence were constrained to seeing 2D images, not of your choosing, could a perplexity-optimizing process "learn the physics"?
Basic things that are inaccessible to such a learning process:
- moving around to get a better view of a 3D object
- seeing actual motion
- measuring the mass of an object participating in an interaction
- setting up an experiment and measuring its outcomes
- choosing to look at a particular sample at higher resolution (e.g. microscopy)
- seeing what's out of frame in a given image
I think we have a lot of evidence at this point that optimizing models to understand distributions of images is not the same thing as understanding the things in those images. In 2015 it was 'DeepDream' dog worms; in 2018 it was "this person does not exist" portraits where people's garments or hair or jewelry fused together or merged with their background. In 2022 it was diffusion images of people with too many fingers, or whose hands melted together if you asked for people shaking hands. In the Sora announcement earlier this year it was a woman's jacket morphing while the shot zoomed in on her face.
I think that, in the same way LLMs do better at some reasoning tasks by generating a program to produce the answer, models trained to generate 3D geometry and scenes, then run a simulation -> renderer -> style transfer pipeline, may end up being the better path to image models that "know" about physics.
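To make the shape of that pipeline concrete, here is a deliberately toy sketch; every name in it is made up for illustration. The "generated geometry" is one sphere, the simulator is a few lines of explicit Euler integration, and the renderer prints characters. A real system would replace each stage with a learned scene generator, a proper physics engine, and a neural style-transfer pass.

```python
# Toy "geometry -> simulate -> render" pipeline (all names hypothetical).
from dataclasses import dataclass

@dataclass
class Ball:                      # the "generated geometry": one sphere
    y: float                     # height in metres
    vy: float = 0.0              # vertical velocity

def simulate(ball, dt=0.1, g=-9.8):
    """One explicit-Euler physics step: gravity, with a floor at y=0."""
    ball.vy += g * dt
    ball.y = max(0.0, ball.y + ball.vy * dt)
    return ball

def render(ball, rows=10):
    """Toy 'renderer': a column of characters with the ball drawn as 'o'."""
    row = min(rows - 1, int(ball.y))
    return "\n".join("o" if r == row else "." for r in reversed(range(rows)))

ball = Ball(y=9.0)
frames = []
for _ in range(20):              # roll the simulation forward, render each step
    frames.append(render(ball))
    simulate(ball)
print(f"ball settles at y={ball.y:.2f}")   # gravity pulls it to the floor
```

The point of the separation is that physical plausibility comes from the simulation stage for free; the learned parts only have to produce geometry and appearance, not rediscover gravity from pixels.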
Indeed. It will be very interesting when we start letting models choose their own training data. Humans and other animals do this simply by interacting with the world around them. If you want to know what is on the back of something, you simply turn it over.
My guess is that the models will come up with much more interesting and fruitful training sets than what a bunch of researchers can come up with.
They're being trained on video: 3D patches (the third dimension being time) are fed into the ViT instead of just 2D patches. So they should learn about motion. But they can't interact with the world, so they may not have an intuitive understanding of weight yet. Until embodiment, at least.
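The 3D-patch idea can be sketched concretely. This is a minimal numpy version of "tubelet" embedding as used in video transformers such as ViViT: instead of 2D patches cut from single frames, non-overlapping time x height x width blocks are flattened into tokens, so each token carries motion information. Shapes and patch sizes here are illustrative, and a random, untrained projection stands in for the learned one.

```python
# Minimal "tubelet" patch embedding sketch for a video ViT.
import numpy as np

def tubelet_embed(video, t=2, p=16, d_model=64, rng=np.random.default_rng(0)):
    """video: (T, H, W, C) -> tokens: (num_tubelets, d_model)."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    # Cut the video into non-overlapping t x p x p tubelets.
    tubes = video.reshape(T // t, t, H // p, p, W // p, p, C)
    tubes = tubes.transpose(0, 2, 4, 1, 3, 5, 6)     # (nT, nH, nW, t, p, p, C)
    tokens = tubes.reshape(-1, t * p * p * C)        # flatten each tubelet
    # A random (untrained) linear projection stands in for the learned one.
    W_proj = rng.standard_normal((t * p * p * C, d_model))
    return tokens @ W_proj

video = np.zeros((8, 32, 32, 3))                     # 8 frames of 32x32 RGB
tokens = tubelet_embed(video)
print(tokens.shape)                                  # (4*2*2, 64) = (16, 64)
```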
I mean, the original article doesn't say anything about video models (where, frankly, spotting fakes is currently much easier), so I think you're shifting what "they" are.
But still:
- input doesn't distinguish real motion from constructed, non-physical motion (e.g. animations, moving title cards)
- input doesn't distinguish camera motion from motion of the portrayed objects
- input doesn't distinguish which changes are filmic techniques (e.g. a cut between shots, a fade-in/out) and which are in the footage itself
Some years ago, I saw a series of results about GANs for image completion, and they had an accidental tendency to add points of interest. If you showed one the left half of a photo of just the ocean, horizon and sky, and asked for the right half, it would try to put in a boat or an island, because generally people don't take and publish images of just the empty ocean -- though most chunks of the horizon probably are quite empty. The distribution over images is not like reality.