3D photographic inpainting from a single source image (shihmengli.github.io)
223 points by dsr12 on April 10, 2020 | 59 comments



Just take a look at the results on the historical photos! That's one of the coolest things I've seen in well over a year. I've come across most of these photos before in textbooks and on the web, but this demo makes all of those historical figures and moments in time feel as real as if they were happening in the world today. The effect is much more pronounced than colorizing black-and-white photos. I feel connected.

I'm convinced that the folks that present at SIGGRAPH are capable of nothing short of black magic wizardry. I can understand the technology being used, but my visual cortex only sees the impossible becoming real.

This is truly deserving of the word awesome, because I am in awe.


Problem is, this paper won't be able to produce those results from historical photos. I find them quite misleading, to be honest.

You'll need to either use a separate 3D depth estimation AI, or (more likely) have someone do a manual stereoscopic 3D conversion of your historical image. Only then (when you have depth data) can the algorithm presented in this paper start its work.


> You'll need to either use a separate 3D depth estimation AI

They seem to be using MiDaS https://github.com/intel-isl/MiDaS for depth estimation, which does a reasonable job on a random image pulled from Pixabay: https://i.imgur.com/IfbeaqY.jpg
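
For anyone who wants to try it, here's a minimal sketch of pulling MiDaS from torch.hub and running it on a single image. The entry-point names follow the MiDaS README, but treat the exact API as an assumption, and the file name is a placeholder:

    import cv2
    import torch

    # Load the MiDaS model and its preprocessing transform from torch.hub.
    midas = torch.hub.load("intel-isl/MiDaS", "MiDaS")
    midas.eval()
    transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
    transform = transforms.default_transform

    # Read an image (placeholder path) and convert BGR -> RGB.
    img = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

    with torch.no_grad():
        prediction = midas(transform(img))
        # Resize the predicted inverse depth back to the input resolution.
        depth = torch.nn.functional.interpolate(
            prediction.unsqueeze(1),
            size=img.shape[:2],
            mode="bicubic",
            align_corners=False,
        ).squeeze().cpu().numpy()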


That looks good, indeed :)


In the meantime, I tried out MiDaS on other images and it failed most of the time. It really only works for those stock-ish pictures with a clear foreground and background, and I believe it mainly just detects bokeh blur rather than actually understanding the scene.


Yes, you are right. Existing single-image depth estimation models are still far from perfect. We hope to see future development in that direction to further improve the visual quality of 3D photos.


If you like this then you'll love neural radiance fields: http://www.matthewtancik.com/nerf


Very cool! Not quite the same, because they use 20-50 input images and this uses one. But it looks like it could be really useful for photogrammetry. Traditional photogrammetry software sucks.

A very cool application of Matterport I saw recently is scanning Airbnbs, e.g. https://breconretreat.co.uk/accommodation/swn-y-nant/floor-p...


Very, very impressive 3D reconstruction. Apparently it can even reconstruct ropes and fine details very well. That's amazing! The best reconstruction system I've seen this year... Bravo.


Thank you for sharing! This is really cool tech.


This is blowing my mind. Incredible.


I wonder if this can be tweaked to generate the 45-perspective quilt files required to feed this into a Looking Glass. https://lookingglassfactory.com/


After reviewing the Looking Glass docs, yes. The paper presented here can produce depth layers. If you then flattened those depth layers at 45 different parallax angles and concatenated the results, you'd have something similar to their quilt files.
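
A rough sketch of that idea, assuming you already have the layers as float RGBA arrays plus a per-layer disparity value. render_view, make_quilt and the layer format are all hypothetical, and the quilt tiling parameters differ per device:

    import numpy as np

    def render_view(layers, disparities, shift):
        # Composite RGBA layers back to front, shifting each one
        # horizontally in proportion to its disparity (nearer layers
        # move more) to fake a change of viewpoint.
        # (np.roll wraps at the edges; fine for a toy.)
        h, w, _ = layers[0].shape
        out = np.zeros((h, w, 3), dtype=np.float32)
        for layer, d in sorted(zip(layers, disparities), key=lambda p: p[1]):
            shifted = np.roll(layer, int(round(shift * d)), axis=1)
            alpha = shifted[..., 3:4]
            out = shifted[..., :3] * alpha + out * (1.0 - alpha)
        return out

    def make_quilt(layers, disparities, views=45, cols=5, rows=9, max_shift=8):
        # Render the views and tile them into a single quilt image.
        shifts = np.linspace(-max_shift, max_shift, views)
        rendered = [render_view(layers, disparities, s) for s in shifts]
        h, w, _ = rendered[0].shape
        quilt = np.zeros((rows * h, cols * w, 3), dtype=np.float32)
        for i, view in enumerate(rendered):
            r, c = divmod(i, cols)
            quilt[r * h:(r + 1) * h, c * w:(c + 1) * w] = view
        return quilt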


Wow how does that work?


It's the same principle as those plastic toys with printed 3D images that you've surely seen: https://www.youtube.com/watch?v=jIfAi_zJ2F4


Yeah, just like that.

Except that there are 45 distinct planes and it interfaces directly with Unity, Unreal and Three.js in minutes.

Sorry if I'm reading too deeply into your note. There are just so many haters. Meanwhile, I backed these guys on Kickstarter, have had a unit on my desk for 18 months and think it's one of the most incredible things I've ever had to experiment with... and it cost me under $500.


I agree his comment is somewhat dismissive of the product, but he's not wrong that it's the same principle.


I think we can safely agree that the video they linked to was a solid tip-off that snark was dialed to 11.


They probably don't know how difficult it is to generate good lenticular printing data.

I'm the author of a lenticular V-Ray plugin, and people love the technology for product design previews. Plus, the fact that it's being mass-produced as children's toys makes it very affordable.


Can you explain it more? Or is there something I can read up on? It seems like quite a breakthrough.


Can you be a bit more specific?

Are you asking for more information on https://lookingglassfactory.com/ or https://www.youtube.com/results?search_query=lenticular ?


It's a bit sad that their active resolution is so low, with 2560x1600 being divided into 9x5 quilt frames. That would mean an effective resolution of roughly 320x280 for the 8.9" display.



Outstanding work. I've been really impressed with the quality of recent colorization, perspective, and texture reconstruction tools and think they do a wonderful job of bringing historical or degraded images 'back to life.' I wonder if we are headed for a future where many of these tools reside client-side and image/video transmission and storage can be reduced to a lightweight stream of vector data.


Take a moment and have a deep breath before you get excited about full 3d reconstruction with a single image.

That isn't what this is.

Watch the video in the bottom right corner, entitled "Comparison with State of the art".

Now go and rewatch the examples and actually look at the edges of the objects as they move the camera 'a bit'. You'll see a tonne of artifacting. Less, clearly, than the existing state of the art, so I tip my hat to the efforts here.

...but all this is doing is exposing an 'empty gap' created by the perspective shift and then using the equivalent of Photoshop's content-aware fill to fill that gap with plausible pixels.

Since humans don't really pay much attention to edge details, it's quite plausible.

To quote the paper:

> In this work we present a new learning-based method that generates a 3D photo from an RGB-D input.

I.e., this work takes a depth image as the input and works on that to generate a 3D photo, rebuilding the full content of each 2D layer in the image.

I.e., the output is not a 3D model; it is a series of 2D images at depth intervals, where the occluded content in each layer is in-painted (i.e., generated artificially).

(NB: the 'from a single source image' part is not novel; they're just using existing approaches to estimate a depth image.)
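
To make that concrete, here's a toy sketch of the layered idea. This is emphatically not the paper's learned inpainting; it just slices an RGB-D image into depth bands and uses OpenCV's classical inpaint as a stand-in for 'content aware fill':

    import cv2
    import numpy as np

    def naive_depth_layers(rgb, depth, n_layers=4):
        # rgb: HxWx3 uint8, depth: HxW float (larger = farther).
        layers = []
        edges = np.linspace(depth.min(), depth.max(), n_layers + 1)
        for lo, hi in zip(edges[:-1], edges[1:]):
            band = (depth >= lo) & (depth <= hi)
            layer = rgb * band[..., None].astype(np.uint8)
            # Pixels hidden behind anything closer than this band are
            # unknown and must be hallucinated from their surroundings.
            occluded = (depth < lo).astype(np.uint8) * 255
            filled = cv2.inpaint(layer, occluded, 3, cv2.INPAINT_TELEA)
            layers.append((filled, band))
        return layers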


> all

It might be "all" that it's doing, but it does it quite well and in a way that's quite believable, which is significantly better than what came before and makes it almost realistic. That's what you said, but the way you said it felt like it lessened the achievement.


> which is significantly better than what came before

It's a bit better. That's the point I'm making; it's an incremental improvement on an existing process. Read the actual paper, e.g. under 'Quantitative comparison'.

If you think I'm belittling the effort, I'm sorry; that's not my intention. But, for example, the other comments talking about using it to generate a full 3D model to display on a Looking Glass surface, or in VR, display a total lack of understanding of what has been achieved here.


Their example outputs are still 2D, but this is potentially very interesting for making 360-degree videos much more immersive in virtual reality. (There may be some extra steps needed to ensure consistent rendering across adjacent frames?)


It would be awesome if this could be integrated into gallery software as a 3D Ken Burns effect. The artificial camera wouldn't have to move much, so the inevitable artifacts would be much less visible.


The main issue that I see with this work is that they require RGBD data, meaning you have to do a lidar scan or something similar to measure the depth map. Alternatively, you could pay someone to draw it by hand, but that takes a long time.

So basically, if you have impossible to get input data, this network can do its magic.

What it then does is hallucinatory inpainting, something like Photoshop's content-aware fill. If there's a tree in your photo, it will make up a fake background behind it, so that you could move or remove the tree without things looking weird.


> So basically, if you have impossible to get input data, this network can do its magic.

Except they evidently got the input data for the examples in the paper, so it can't be impossible to get.

They cite at least two different methods for adding depth information to a single image to generate the necessary RGBD data, different views of which can then be rendered with their inpainting applied:

- https://arxiv.org/abs/1907.01341

- https://research.cs.cornell.edu/megadepth/


Yes, you can always create that data by hand, but it's too expensive and, hence, impossible to scale.

As for using other AIs, they tend to not work too well on more complex images.

But in any case, getting the data that you need to be able to use the paper here is very challenging.

Edit: I should probably say that I have hands-on experience with MegaDepth and MiDaS and that it was underwhelming. Both of them assume a depth gradient from close at the bottom to far at the top, and both assume that the optical variation will be in the foreground. A photo of a dining table from the side is already enough to confuse both of them.


True, but their Colab demo uses a separate network to infer depth information from a plain RGB picture (it's a classical task nowadays), so it is not a problem in practice.


Sadly not; it's still an unsolved problem in the general case. Yes, it works tolerably well for some images, but converting RGB to RGBD is anything but easy. That's why the 3D-ification of cinema movies still requires thousands of people and millions in budget.


True, we get results good enough for that usage but not for general applications.

But it is a well-studied problem in machine learning, and you can get off-the-shelf networks to use in conjunction with this paper.



Kind of an interesting tangent, but the founder of Lytro (Ren Ng) is the advisor on a really cool paper that came out recently and reminds me a lot of this: http://www.matthewtancik.com/nerf

Basically, it's similar to this but also re-lights reflective parts at the cost of needing more than one source image.


It does, but applied in reverse. Check the Legacy Photos video for examples.


Now we just need an automated image stabilizer for those shaky videos...

Just kidding. This is awesome! Just imagine the possibilities with enough computing power.


Anyone with ML development experience know what changes are needed to make this work on a CPU without a CUDA GPU? Seems heavily coupled to CUDA.


Here is a quick hack to make it work on CPU: https://github.com/983/3d-photo-inpainting

Maxed out at 4 GB RAM for 256x144 images for me.

PyTorch CPU installation instructions: https://pytorch.org/get-started/previous-versions/

OpenCV should work without CUDA. If not, build it from source and consider the `WITH_CUDA` flag.


Port it from Torch to the TensorFlow framework, export the graph to TFLite, compile with XLA AOT, and you end up with a C++ library running on CPU.


You could translate this to a non-CUDA GPU, such as a mobile GPU, but even that would require a bit of effort to condense it so that it wasn't a total lag fest. Executing this on a CPU seems damn near impossible from a usability standpoint given the large matrix multiplications involved. You really need the parallel capabilities of a GPU.


It relies on Torch and OpenCV:

- I have never tried running OpenCV explicitly on CPU, but I believe it is doable.

- It is trivial to run Torch on CPU instead of GPU (just comment out the line that sends the model to the GPU).
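
For reference, the usual PyTorch pattern looks something like this (illustrative only; assuming the repo's CUDA coupling is mostly hard-coded .cuda() calls, those would become .to(device)):

    import torch

    # Fall back to the CPU when no CUDA device is available.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = torch.nn.Linear(8, 8)        # stand-in for the real network
    model = model.to(device)

    x = torch.randn(1, 8, device=device)
    with torch.no_grad():
        y = model(x)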


It would be interesting to see this applied as a photo viewing app for VR, or applied to videos.


Is this what is already implemented on Facebook's newsfeed for certain photos?


I can't speak for all photos - but I'm fairly certain those 3D-looking photos on FB must be taken with a stereoscopic camera (2+ lenses). They can calculate depth when they know the distance between the lenses (and differences in focal lengths, etc.).

That isn't to say they couldn't do it retroactively for single-lens photos, but I'm guessing right now they're not.
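
For the curious, computing depth from a rectified stereo pair is a classical computer-vision exercise. A minimal OpenCV sketch (file names are placeholders, and a real pipeline needs calibration and rectification first):

    import cv2

    # Load a rectified stereo pair as grayscale (placeholder file names).
    left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
    right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

    # Block-matching disparity; numDisparities must be a multiple of 16.
    stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
    disparity = stereo.compute(left, right)  # fixed-point, scaled by 16

    # depth = focal_length_px * baseline / (disparity / 16), where disparity > 0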


Not at the moment. They even compare with Facebook's algorithm (the proposed new algorithm seems to do much better thanks to inpainting).


The experience of Mark Twain's shoes was worth it for me. So dapper.


This would be great as a CSS scroll effect /s


Very cool work, but it's a little jarring to see photos of segregation, war and famine being used to show off the algorithm (https://filebox.ece.vt.edu/~jbhuang/project/3DPhoto/3DPhoto_...)


I don't see the problem with it. I think their point is that they are important historical photos, and seeing them this way puts some new life in them and can enhance their emotional resonance.

It's not like they come off as "let's remember the good ol' days with the separate water fountains for the colored folks." Not to me anyway.


I'm not sure I agree. It seemed moving to me to see these historical pictures come alive. It did not seem in bad taste to me.


The ensemble was effective in delivering an emotional impact. It's not often that a tech demo makes me feel such a range of emotions, moving me through profound sadness and hope. It left an impact far greater than typical marketing videos, giving me the subliminal impression (of which I am consciously skeptical) that this technology was meant for legitimate art.


If they didn’t do stuff like this then the only purpose of these efforts would be to make your cat pics 0.1% more monetizable for Mark Zuckerberg. It’s the application to historical photos like this that is perhaps the most important result of this work.


These are some of humanity’s most important photos, because they show segregation, war, and famine. Bringing them to life arguably increases their impact.


Please do not allow me to see that which I do not want to see.


It's weird to use negative content in a promotional context. I agree with Robotbeat but it's still fair to say that it's "jarring".


Are you trying to paraphrase the parent? I disagree with the person you are responding to, but I feel you severely misrepresent what they actually said.



