We do have a general and convincing model of how motion should look: we have video examples of real motion, so we can optimize models with the explicit goal of producing motion that's indistinguishable from it, e.g. with generative adversarial networks (GANs).
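To make the adversarial idea concrete, here is a minimal sketch of that training loop in plain numpy: a one-dimensional toy where "real motion" is just samples from a fixed distribution, the generator is a linear map from noise, and the discriminator is logistic regression. All the names and the 1-D setup are invented for illustration; a real motion model would use deep networks over pose sequences, but the objective is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for "real motion": samples of a single motion feature, here N(4, 1).
def real_batch(n):
    return rng.normal(4.0, 1.0, n)

# Generator: maps noise z to a sample, G(z) = w*z + b (starts far from real data).
w, b = 1.0, 0.0
# Discriminator: logistic regression, D(x) = sigmoid(a*x + c).
a, c = 0.1, 0.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.05
for step in range(2000):
    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    xr = real_batch(64)
    z = rng.normal(0.0, 1.0, 64)
    xf = w * z + b
    dr = sigmoid(a * xr + c)
    df = sigmoid(a * xf + c)
    # Gradients of the binary cross-entropy loss w.r.t. a and c.
    grad_a = np.mean((dr - 1.0) * xr) + np.mean(df * xf)
    grad_c = np.mean(dr - 1.0) + np.mean(df)
    a -= lr * grad_a
    c -= lr * grad_c

    # Generator step: push D(fake) toward 1, i.e. fool the discriminator.
    z = rng.normal(0.0, 1.0, 64)
    xf = w * z + b
    df = sigmoid(a * xf + c)
    # Non-saturating generator loss -log D(G(z)), chained through G's params.
    grad_w = np.mean((df - 1.0) * a * z)
    grad_b = np.mean((df - 1.0) * a)
    w -= lr * grad_w
    b -= lr * grad_b

fake = w * rng.normal(0.0, 1.0, 5000) + b
print(f"generated mean ~ {fake.mean():.2f}, std ~ {fake.std():.2f}")
```

After training, the generator's output distribution has drifted from its starting point toward the real one, even though it never sees real samples directly, only the discriminator's gradient. That indirect "indistinguishability" signal is exactly what the video-based motion models exploit at scale.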
Things like https://grail.cs.washington.edu/projects/AudioToObama/ are just the start: we can already generate facial expressions from scratch with reasonable quality today, and we can work to make those generated expressions better and more realistic in the future.
A single sample is not sufficient to build a model, but enough samples to infer the whole distribution are.
A thousand pictures of a particular surface may easily carry enough information to render a million new pictures that are indistinguishable from real ones; a thousand motion-capture recordings of different people walking carry enough information to generate new motion that differs from every original sample, yet is indistinguishable from real people walking.