I don’t like to bring unrealistic expectations to this sort of thing, but even so, all the examples look pretty bad. Am I missing something?
In addition to all the noise and haze -- so the intermediate frames wouldn’t be usable alongside the originals -- the start and end points of each element hardly ever connect up. Each wall, door, etc. flies vaguely toward its destination, but fades out just as the “same” element fades in at its final position a few feet away.
It’s a lovely idea, though, and it would be great to see an actually working version.
Yes, it looks pretty bad imho. It seems that the researchers learned about a handful of recent techniques, such as Gaussian Splatting, and decided to apply them to a novel domain (hand-drawn images) without any deeper understanding.
Gaussian Splatting is, in my opinion, simply the wrong tool for geometrically inconsistent images even if you manually annotate a bunch of keypoints. Another thing is that the spherical harmonics color representation makes it easy for the model to "cheat" when there are relatively few views, i.e. even when the Gaussian is completely geometrically wrong, it can still show the right color in the directions of the views. Perhaps they should have just disabled the spherical harmonics thing (i.e. making each Gaussian the same color regardless of which direction you're looking at it from), since most cartoons have flat-ish shading anyway.
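To make the "cheating" concrete, here's a minimal sketch (my own toy example, not the paper's code) of how even degree-1 spherical harmonics let a single Gaussian reproduce two contradictory observed colors from two view directions -- something a flat, view-independent color can't do:

```python
import numpy as np

# Standard SH constants used in Gaussian Splatting implementations.
C0 = 0.28209479177387814   # Y_0^0
C1 = 0.4886025119029199    # |Y_1^m|

def sh_basis(d):
    """Degree-1 SH basis evaluated at unit view direction d = (x, y, z)."""
    x, y, z = d
    return np.array([C0, -C1 * y, C1 * z, -C1 * x])

# Two cameras look at the same (geometrically wrong) Gaussian from
# opposite sides, and the training images demand different brightness.
dirs = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
targets = np.array([1.0, 0.0])  # one channel: bright in view 1, dark in view 2

B = np.vstack([sh_basis(d) for d in dirs])            # 2 x 4 basis matrix
coeffs, *_ = np.linalg.lstsq(B, targets, rcond=None)  # min-norm exact fit

# With view-dependent color, both contradictory targets are matched exactly,
# so the photometric loss gives no signal that the geometry is wrong.
print(B @ coeffs)  # ~ [1.0, 0.0]

# A flat (degree-0-only) color can at best output the average brightness,
# leaving a visible residual that would push the optimizer to fix geometry.
flat_color = np.mean(targets)
```

With only a handful of views per Gaussian, there are far more SH degrees of freedom than constraints, which is exactly why restricting everything to the DC term seems like the safer choice for flat-shaded cartoons.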
Furthermore, they didn't attempt any sort of photometric calibration or color estimation between the different views. For example, the paintings each show the building in a very different lighting condition, and it seems they made no attempt to handle that at all, leading to a very ugly reconstruction.
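For what it's worth, even the simplest version of this step is cheap. A rough sketch (entirely my own toy example, with made-up corresponding colors standing in for annotated keypoints): fit a per-view affine color correction (per-channel gain and bias) that maps each painting onto a shared reference before optimizing geometry:

```python
import numpy as np

# Made-up corresponding colors: the same surface points sampled in a
# reference view and in a second view with different lighting/exposure.
ref_colors = np.array([[0.8, 0.6, 0.4],
                       [0.2, 0.3, 0.5],
                       [0.5, 0.5, 0.5]])
view_colors = ref_colors * 1.5 + 0.1   # this view is brighter and lifted

def fit_affine(view, ref):
    """Per-channel least-squares gain and bias: ref ~= gain * view + bias."""
    gains, biases = [], []
    for ch in range(3):
        A = np.stack([view[:, ch], np.ones(len(view))], axis=1)
        (g, b), *_ = np.linalg.lstsq(A, ref[:, ch], rcond=None)
        gains.append(g)
        biases.append(b)
    return np.array(gains), np.array(biases)

gain, bias = fit_affine(view_colors, ref_colors)
corrected = view_colors * gain + bias   # now photometrically aligned with ref
```

Real pipelines do fancier things (per-image latent appearance embeddings, etc.), but even this kind of crude normalization would have removed most of the lighting mismatch between the paintings.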
Finally, this method requires a significant amount of manual annotation work for a very subpar result, which makes me wonder what the point of it is. It would seem to me that diffusion models like Sora or Veo could do a much better job if you just want to interpolate between different views. It isn't much different from image inpainting, which diffusion models excel at.