This would be very useful in busy tourist spots where you want a video of something, the Leaning Tower of Pisa for example, but minus all the annoying people pretending to push it over.
Isn't that what "long exposure photography" (and photo stacking in Photoshop) is used for? ;-)
I mean: if the crowd is too dense, you have to wait a long time to capture every little part of what's behind it. So I'm not sure ML would be useful, except maybe to detect "humans" and decide which parts need to be replaced.
Yes! Just take a lot of photos over a couple of minutes (depending on how busy your scene is) from a fixed position, or just a video if you're lazy, then use ImageMagick and combine them using "median" (NOT average, which leaves faint ghosts of everyone who walked through). It's not always perfect but delivers most of the time. That way even a command line dork like me can do it. :-)
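For anyone wondering why median beats average here, a toy sketch in plain Python. The frames are made-up 1x3 grayscale "photos" stored as nested lists, and `stack` is a hypothetical helper, not anything from ImageMagick. It assumes the background is visible at each pixel in more than half of the frames.

```python
from statistics import mean, median


def stack(frames, combine):
    """Combine same-sized grayscale frames pixel by pixel."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [combine([frame[y][x] for frame in frames]) for x in range(width)]
        for y in range(height)
    ]


# Five made-up frames of a static background (brightness 100).
# A passer-by (brightness 0) covers a different pixel in some frames.
frames = [
    [[100, 100, 100]],
    [[0, 100, 100]],
    [[100, 0, 100]],
    [[100, 100, 0]],
    [[0, 100, 100]],
]

median_result = stack(frames, median)  # background recovered at every pixel
mean_result = stack(frames, mean)      # passers-by leave darker "ghost" pixels

print(median_result)
print(mean_result)
```

The median only needs the background to win the vote at each pixel, so a pixel that is covered in two of five frames still comes out clean. The average drags that pixel toward the passer-by instead.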
A similar simple trick lets you reduce reflections in photos of flat objects under museum glass (books, pictures, coins): take several photos from slightly different angles, co-register them using some panorama assembly technique (I've used a simple Python OpenCV script), and merge them with the minimum or some low quantile, since reflections are additive.
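A rough sketch of just the merge step, in plain Python on made-up numbers. It assumes the shots are already co-registered (the alignment part is skipped entirely) and that every pixel is reflection-free in at least one shot. `merge_min` is a hypothetical helper name, not the script mentioned above.

```python
def merge_min(frames):
    """Keep the darkest value seen at each pixel across aligned frames."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [min(frame[y][x] for frame in frames) for x in range(width)]
        for y in range(height)
    ]


# Made-up brightness of the object itself (1x3 grayscale image).
scene = [[50, 80, 120]]

# Three aligned shots; each angle adds a reflection at a different pixel.
shots = [
    [[50 + 90, 80, 120]],
    [[50, 80 + 60, 120]],
    [[50, 80, 120 + 40]],
]

cleaned = merge_min(shots)
print(cleaned)
```

Because a reflection can only add light, the darkest sample at each pixel is the one closest to the object itself. With noisy real photos a low quantile is the gentler choice, since the plain minimum also picks the darkest noise.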
Also useful if you want to live stream in a country where privacy laws make it illegal unless you get everyone's consent. Live face blurring would probably be enough, however.