3D photographic inpainting from a single source image (shihmengli.github.io)
223 points by dsr12 on April 10, 2020 | 59 comments



Just take a look at the results on the historical photos! That's one of the coolest things I've seen in well over a year. I've come across most of these photos before in textbooks and on the web, but this demo makes all of those historical figures and moments in time feel as real as if they were happening in the world today. The effect is much more pronounced than colorizing black-and-white photos. I feel connected.

I'm convinced that the folks that present at SIGGRAPH are capable of nothing short of black magic wizardry. I can understand the technology being used, but my visual cortex only sees the impossible becoming real.

This is truly deserving of the word awesome, because I am in awe.


Problem is, this paper won't be able to produce those results from historical photos. I find them quite misleading, to be honest.

You'll need to either use a separate 3D depth estimation AI, or (more likely) have someone do a manual stereoscopic 3D conversion of your historical image. Only then (when you have depth data) can the algorithm presented in this paper start its work.


> You'll need to either use a separate 3D depth estimation AI

They seem to be using MiDaS https://github.com/intel-isl/MiDaS for depth estimation, which does a reasonable job on a random image pulled from Pixabay: https://i.imgur.com/IfbeaqY.jpg
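
For anyone who wants to try it, here's a minimal sketch of pulling MiDaS from torch.hub and running it on a single image. The entry-point names follow the MiDaS README, but treat the exact API as an assumption, and the file name is a placeholder:

    import cv2
    import torch

    # Load the MiDaS model and its preprocessing transform from torch.hub.
    midas = torch.hub.load("intel-isl/MiDaS", "MiDaS")
    midas.eval()
    transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
    transform = transforms.default_transform

    # Read an image (placeholder path) and convert BGR -> RGB.
    img = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

    with torch.no_grad():
        prediction = midas(transform(img))
        # Resize the predicted inverse depth back to the input resolution.
        depth = torch.nn.functional.interpolate(
            prediction.unsqueeze(1),
            size=img.shape[:2],
            mode="bicubic",
            align_corners=False,
        ).squeeze().cpu().numpy()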


That looks good, indeed :)


In the meantime, I tried out MiDaS on other images and it failed most of the time. It really only works for those stock-ish pictures with a clear foreground and background, and I believe it mainly just detects bokeh blur rather than actually understanding the scene.


Yes, you are right. Existing single-image depth estimation models are still far from perfect. We hope to see future development in that direction to further improve the visual quality of 3D photos.


If you like this then you'll love neural radiance fields: http://www.matthewtancik.com/nerf


Very cool! Not quite the same, because they use 20-50 input images and this uses one. But it looks like it could be really useful for photogrammetry. Traditional photogrammetry software sucks.

A very cool application of Matterport I saw recently is scanning Airbnbs, e.g. https://breconretreat.co.uk/accommodation/swn-y-nant/floor-p...


Very, very impressive 3D reconstruction. Apparently it can even reconstruct ropes and fine details very well. That's amazing! The best reconstruction system I've seen this year... Bravo.


Thank you for sharing! This is really cool tech.


This is blowing my mind. Incredible.


I wonder if this can be tweaked to generate the 45-perspective quilt files required to feed this into a Looking Glass. https://lookingglassfactory.com/


After reviewing the Looking Glass docs, yes. The paper presented here can produce depth layers. If you then flattened those depth layers at 45 different parallax angles and concatenated the results, you'd have something similar to their quilt files.
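
A rough sketch of that idea, assuming you already have the layers as float RGBA arrays plus a per-layer disparity value. render_view, make_quilt and the layer format are all hypothetical, and the quilt tiling parameters differ per device:

    import numpy as np

    def render_view(layers, disparities, shift):
        # Composite RGBA layers back to front, shifting each one
        # horizontally in proportion to its disparity (nearer layers
        # move more) to fake a change of viewpoint.
        # (np.roll wraps at the edges; fine for a toy.)
        h, w, _ = layers[0].shape
        out = np.zeros((h, w, 3), dtype=np.float32)
        for layer, d in sorted(zip(layers, disparities), key=lambda p: p[1]):
            shifted = np.roll(layer, int(round(shift * d)), axis=1)
            alpha = shifted[..., 3:4]
            out = shifted[..., :3] * alpha + out * (1.0 - alpha)
        return out

    def make_quilt(layers, disparities, views=45, cols=5, rows=9, max_shift=8):
        # Render the views and tile them into a single quilt image.
        shifts = np.linspace(-max_shift, max_shift, views)
        rendered = [render_view(layers, disparities, s) for s in shifts]
        h, w, _ = rendered[0].shape
        quilt = np.zeros((rows * h, cols * w, 3), dtype=np.float32)
        for i, view in enumerate(rendered):
            r, c = divmod(i, cols)
            quilt[r * h:(r + 1) * h, c * w:(c + 1) * w] = view
        return quilt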


Wow how does that work?


It's the same principle as those plastic toys with printed 3D images that you've surely seen: https://www.youtube.com/watch?v=jIfAi_zJ2F4


Yeah, just like that.

Except that there are 45 distinct planes and it interfaces directly with Unity, Unreal and Three.js in minutes.

Sorry if I'm reading too deeply into your note. There are just so many haters. Meanwhile, I backed these guys on Kickstarter, have had a unit on my desk for 18 months and think it's one of the most incredible things I've ever had to experiment with... and it cost me under $500.


I agree his comment is somewhat dismissive of the product, but he's not wrong that it's the same principle.


I think we can safely agree that the video they linked to was a solid tip-off that snark was dialed to 11.


They probably don't know how difficult it is to generate good lenticular printing data.

I'm the author of a lenticular V-Ray plugin, and people love the technology for product design previews. Plus, the fact that it's being mass-produced as children's toys makes it very affordable.


Can you explain it more? Or is there something I can read up on? It seems like quite a breakthrough.


Can you be a bit more specific?

Are you asking for more information on https://lookingglassfactory.com/ or https://www.youtube.com/results?search_query=lenticular ?


It's a bit sad that their active resolution is so low, with 2560x1600 being divided into 9x5 quilt frames. That would mean an effective resolution of roughly 320x280 for the 8.9" display.



Outstanding work. I've been really impressed with the quality of recent colorization, perspective, and texture reconstruction tools and think they do a wonderful job of bringing historical or degraded images 'back to life.' I wonder if we are headed for a future where many of these tools reside client-side and image/video transmission and storage can be reduced to a lightweight stream of vector data.


Take a moment and have a deep breath before you get excited about full 3d reconstruction with a single image.

That isn't what this is.

Watch the video in the bottom right corner, entitled "Comparison with State of the art".

Now go and rewatch the examples and actually look at the edges of the objects as they move the camera 'a bit'. You'll see a tonne of artifacting. Less, clearly, than the existing state of the art, so I tip my hat to the efforts here.

...but all this is doing is exposing an 'empty gap' created by the perspective shift and then using the equivalent of Photoshop's content-aware fill to fill that gap with plausible pixels.

Since humans don't really pay much attention to edge details, it's quite plausible.

To quote the paper:

> In this work we present a new learning-based method that generates a 3D photo from an RGB-D input.

I.e., this work takes a depth image as the input and works on that to generate a 3D photo, rebuilding the full content of each 2D layer in the image.

I.e., the output is not a 3D model; it is a series of 2D images at depth intervals, where the occluded content in each layer is in-painted (i.e., generated artificially).

(NB: the 'from a single source image' part is not novel; they're just using existing approaches to estimate a depth image.)
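
To make that concrete, here's a toy sketch of the layered idea. This is emphatically not the paper's learned inpainting; it just slices an RGB-D image into depth bands and uses OpenCV's classical inpaint as a stand-in for 'content aware fill':

    import cv2
    import numpy as np

    def naive_depth_layers(rgb, depth, n_layers=4):
        # rgb: HxWx3 uint8, depth: HxW float (larger = farther).
        layers = []
        edges = np.linspace(depth.min(), depth.max(), n_layers + 1)
        for lo, hi in zip(edges[:-1], edges[1:]):
            band = (depth >= lo) & (depth <= hi)
            layer = rgb * band[..., None].astype(np.uint8)
            # Pixels hidden behind anything closer than this band are
            # unknown and must be hallucinated from their surroundings.
            occluded = (depth < lo).astype(np.uint8) * 255
            filled = cv2.inpaint(layer, occluded, 3, cv2.INPAINT_TELEA)
            layers.append((filled, band))
        return layers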


> all

It might be "all" that it's doing, but it does it quite well and in a way that's quite believable, which is significantly better than what came before and makes it almost realistic. That's what you said, but the way you said it felt like it lessened the achievement.


> which is significantly better than what came before

It's a bit better. That's the point I'm making; it's an incremental improvement on an existing process. Read the actual paper, e.g. under 'Quantitative comparison'.

If you think I'm belittling the effort, I'm sorry; that's not my intention. But, for example, the other comments talking about using it to generate a full 3D model to display on a Looking Glass surface, or in VR, display a total lack of understanding of what has been achieved here.


Their example outputs are still 2D, but this is potentially very interesting for making 360-degree videos much more immersive in virtual reality. (There may be some extra steps needed to ensure consistent rendering across adjacent frames?)


It would be awesome if this could be integrated into gallery software as a 3D Ken Burns effect. The artificial camera wouldn't have to move much, so the inevitable artifacts would be much less visible.


The main issue that I see with this work is that they require RGBD data, meaning you have to do a lidar scan or something similar to measure the depth map. Alternatively, you could pay someone to draw it by hand, but that takes a long time.

So basically, if you have impossible to get input data, this network can do its magic.

What it then does is hallucinatory inpainting, something like Photoshop's content-aware fill. If there's a tree in your photo, it will make up a fake background behind it, so that you could move or remove the tree without things looking weird.


> So basically, if you have impossible to get input data, this network can do its magic.

Except they evidently got the input data for the examples in the paper, so it can't be impossible to get.

They cite at least two different methods for adding depth information to a single image to generate the necessary RGBD data, different views of which can then be rendered with their inpainting applied:

- https://arxiv.org/abs/1907.01341

- https://research.cs.cornell.edu/megadepth/


Yes, you can always create that data by hand, but it's too expensive and, hence, impossible to scale.

As for using other AIs, they tend to not work too well on more complex images.

But in any case, getting the data that you need to be able to use the paper here is very challenging.

Edit: I should probably say that I have hands-on experience with MegaDepth and MiDaS and that it was underwhelming. Both of them assume a depth gradient from close at the bottom to far at the top, and both assume that the optical variation will be in the foreground. A photo of a dining table from the side is already enough to confuse both of them.


True, but their Colab demo uses a separate network to infer depth information from a plain RGB picture (it's a classical task nowadays), so it is not a problem in practice.


Sadly not; it's still an unsolved problem in the general case. Yes, it works tolerably well for some images, but converting RGB to RGBD is anything but easy. That's why the 3D-ification of cinema movies still requires thousands of people and millions in budget.


True, we get results good enough for that usage but not for general applications.

But it is a well-studied problem in machine learning, and you can get off-the-shelf networks to use in conjunction with this paper.



Kind of an interesting tangent, but the founder of Lytro (Ren Ng) is the advisor on a really cool paper that came out recently and reminds me a lot of this: http://www.matthewtancik.com/nerf

Basically, it's similar to this but also re-lights reflective parts at the cost of needing more than one source image.


It does, but applied in reverse. Check the Legacy Photos video for examples.


Now we just need an automated image stabilizer for those shaky videos...

Just kidding. This is awesome! Just imagine the possibilities with enough computing power.


Anyone with ML development experience know what changes are needed to make this work on a CPU without a CUDA GPU? Seems heavily coupled to CUDA.


Here is a quick hack to make it work on CPU: https://github.com/983/3d-photo-inpainting

Maxed out at 4 GB RAM for 256x144 images for me.

PyTorch CPU installation instructions: https://pytorch.org/get-started/previous-versions/

OpenCV should work without CUDA. If not, build it from source and consider the `WITH_CUDA` flag.


Port it from Torch to the TensorFlow framework, export the graph to TFLite, compile with XLA AOT, and you end up with a C++ library running on CPU.


You could translate this to a non-CUDA GPU, such as a mobile GPU, but even that would require a bit of effort to condense it so that it wasn't a total lag fest. Executing this on a CPU seems damn near impossible from a usability standpoint given the large matrix multiplications involved. You really need the parallel capabilities of a GPU.


It relies on Torch and OpenCV:

- I have never tried running OpenCV explicitly on CPU, but I believe it is doable.

- It is trivial to run Torch on CPU instead of GPU (just comment out the line that sends the model to the GPU).
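
For reference, the usual PyTorch pattern looks something like this (illustrative only; assuming the repo's CUDA coupling is mostly hard-coded .cuda() calls, those would become .to(device)):

    import torch

    # Fall back to the CPU when no CUDA device is available.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = torch.nn.Linear(8, 8)        # stand-in for the real network
    model = model.to(device)

    x = torch.randn(1, 8, device=device)
    with torch.no_grad():
        y = model(x)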


It would be interesting to see this applied as a photo viewing app for VR, or applied to videos.


Is this what is already implemented on Facebook's newsfeed for certain photos?


I can't speak for all photos - but I'm fairly certain those 3D-looking photos on FB must be taken with a stereoscopic camera (2+ lenses). They can calculate depth when they know the distance between the lenses (and differences in focal lengths, etc.).

That isn't to say they couldn't do it retroactively for single-lens photos, but I'm guessing right now they're not.
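
For the curious, computing depth from a rectified stereo pair is a classical computer-vision exercise. A minimal OpenCV sketch (file names are placeholders, and a real pipeline needs calibration and rectification first):

    import cv2

    # Load a rectified stereo pair as grayscale (placeholder file names).
    left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
    right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

    # Block-matching disparity; numDisparities must be a multiple of 16.
    stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
    disparity = stereo.compute(left, right)  # fixed-point, scaled by 16

    # depth = focal_length_px * baseline / (disparity / 16), where disparity > 0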


Not at the moment. They even compare with Facebook's algorithm (the proposed new algorithm seems to do much better thanks to inpainting).


The experience of Mark Twain's shoes was worth it for me. So dapper.


This would be great as a CSS scroll effect /s


Very cool work, but it's a little jarring to see photos of segregation, war and famine being used to show off the algorithm (https://filebox.ece.vt.edu/~jbhuang/project/3DPhoto/3DPhoto_...)


I don't see the problem with it. I think their point is that they are important historical photos, and seeing them this way puts some new life in them and can enhance their emotional resonance.

It's not like they come off as "let's remember the good ol' days with the separate water fountains for the colored folks." Not to me anyway.


I'm not sure I agree. It seemed moving to me to see these historical pictures come alive. It did not seem in bad taste to me.


The ensemble was effective in delivering an emotional impact. It's not often that a tech demo makes me feel such a range of emotions, moving me through profound sadness and hope. It left an impact far greater than typical marketing videos, giving me the subliminal impression (of which I am consciously skeptical) that this technology was meant for legitimate art.


If they didn’t do stuff like this then the only purpose of these efforts would be to make your cat pics 0.1% more monetizable for Mark Zuckerberg. It’s the application to historical photos like this that is perhaps the most important result of this work.


These are some of humanity’s most important photos, because they show segregation, war, and famine. Bringing them to life arguably increases their impact.


Please do not allow me to see that which I do not want to see.


It's weird to use negative content in a promotional context. I agree with Robotbeat but it's still fair to say that it's "jarring".


Are you trying to paraphrase the parent? I disagree with the person you are responding to, but I feel you severely misrepresent what they actually said.



