If your entire existence were constrained to seeing 2D images, not of your choosing, could a perplexity-optimizing process "learn the physics"?
Basic things that are inaccessible to such a learning process:
- moving around to get a better view of a 3D object
- seeing actual motion
- measuring the mass of an object participating in an interaction
- setting up an experiment and measuring its outcomes
- choosing to look at a particular sample at higher resolution (e.g. microscopy)
- seeing what's out of frame in a given image
I think we have a lot of evidence at this point that optimizing models to understand distributions of images is not the same thing as understanding the things in those images. In 2015 it was 'DeepDream' dog worms; in 2018 it was "this person does not exist" portraits where people's garments or hair or jewelry fused together or merged with their background. In 2022 it was diffusion images of people with too many fingers, or whose hands melted together if you asked for people shaking hands. In the Sora announcement earlier this year it was a woman's jacket morphing while the shot zoomed in on her face.
I think that, in the same way LLMs do better at some reasoning tasks by generating a program to produce the answer, models trained to generate 3D geometry and scenes, then run a simulation -> renderer -> style transfer pipeline, may end up being the better path to image models that "know" about physics.
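To make the shape of that pipeline concrete, here is a deliberately toy sketch; every name in it is made up for illustration. The "generated geometry" is one sphere, the simulator is a few lines of explicit Euler integration, and the renderer prints characters. A real system would replace each stage with a learned scene generator, a proper physics engine, and a neural style-transfer pass.

```python
# Toy "geometry -> simulate -> render" pipeline (all names hypothetical).
from dataclasses import dataclass

@dataclass
class Ball:                      # the "generated geometry": one sphere
    y: float                     # height in metres
    vy: float = 0.0              # vertical velocity

def simulate(ball, dt=0.1, g=-9.8):
    """One explicit-Euler physics step: gravity, with a floor at y=0."""
    ball.vy += g * dt
    ball.y = max(0.0, ball.y + ball.vy * dt)
    return ball

def render(ball, rows=10):
    """Toy 'renderer': a column of characters with the ball drawn as 'o'."""
    row = min(rows - 1, int(ball.y))
    return "\n".join("o" if r == row else "." for r in reversed(range(rows)))

ball = Ball(y=9.0)
frames = []
for _ in range(20):              # roll the simulation forward, render each step
    frames.append(render(ball))
    simulate(ball)
print(f"ball settles at y={ball.y:.2f}")   # gravity pulls it to the floor
```

The point of the separation is that physical plausibility comes from the simulation stage for free; the learned parts only have to produce geometry and appearance, not rediscover gravity from pixels.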
Indeed. It will be very interesting when we start letting models choose their own training data. Humans and other animals do this simply by interacting with the world around them. If you want to know what is on the back of something, you simply turn it over.
My guess is that the models will come up with much more interesting and fruitful training sets than what a bunch of researchers can come up with.
They're being trained on video: 3D patches (the third dimension being time) are fed into the ViT instead of just 2D patches. So they should learn about motion. But they can't interact with the world, so they may not have an intuitive understanding of weight yet. Until embodiment, at least.
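The 3D-patch idea can be sketched concretely. This is a minimal numpy version of "tubelet" embedding as used in video transformers such as ViViT: instead of 2D patches cut from single frames, non-overlapping time x height x width blocks are flattened into tokens, so each token carries motion information. Shapes and patch sizes here are illustrative, and a random, untrained projection stands in for the learned one.

```python
# Minimal "tubelet" patch embedding sketch for a video ViT.
import numpy as np

def tubelet_embed(video, t=2, p=16, d_model=64, rng=np.random.default_rng(0)):
    """video: (T, H, W, C) -> tokens: (num_tubelets, d_model)."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    # Cut the video into non-overlapping t x p x p tubelets.
    tubes = video.reshape(T // t, t, H // p, p, W // p, p, C)
    tubes = tubes.transpose(0, 2, 4, 1, 3, 5, 6)     # (nT, nH, nW, t, p, p, C)
    tokens = tubes.reshape(-1, t * p * p * C)        # flatten each tubelet
    # A random (untrained) linear projection stands in for the learned one.
    W_proj = rng.standard_normal((t * p * p * C, d_model))
    return tokens @ W_proj

video = np.zeros((8, 32, 32, 3))                     # 8 frames of 32x32 RGB
tokens = tubelet_embed(video)
print(tokens.shape)                                  # (4*2*2, 64) = (16, 64)
```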
I mean, the original article doesn't say anything about video models (where, frankly, spotting fakes is currently much easier), so I think you're shifting what "they" are.
But still:
- input doesn't distinguish real motion from constructed, non-physical motion (e.g. animations, moving title cards)
- input doesn't distinguish camera motion from motion of the portrayed objects
- input doesn't distinguish which changes are filmic techniques (e.g. a cut between shots, a fade-in/out) and which are in the footage itself
Some years ago, I saw a series of results about GANs for image completion, and they had an accidental tendency to add points of interest. If you showed one the left half of a photo of just the ocean, horizon and sky, and asked for the right half, it would try to put in a boat or an island, because generally people don't take and publish images of just the empty ocean -- though most chunks of the horizon probably are quite empty. The distribution over images is not like reality.