FastPhotoStyle from Nvidia (github.com/nvidia)
231 points by scraft on Feb 20, 2018 | hide | past | favorite | 79 comments


I'm probably missing something obvious here, maybe someone can explain the following to me.

- Their approach is a composition of 2 steps, what they call "stylization" and "smoothing".

- Top left of 2nd page they claim: "Both of the steps have closed-form solutions"

- Equation 5 is the closed-form solution for the "smoothing" step.

My question: Where's the closed-form solution for the stylization step that they're claiming?

Are they calling equation 3 a closed-form expression? In that case the title and the claim in the introduction are rather misleading, because computing equation 3 requires you to train autoencoders.


You don't train it for every image; in this way, a neural network often is a "closed-form solution": it provides you an equation, admittedly a very convoluted one, which can be used to obtain its solution, admittedly usually an approximation, in a finite amount of time. The normal solution to this problem (according to the paper) is an iterative technique "to solve an optimization problem of matching the Gram matrices of deep features extracted from the content and style photos", whereas this one is simply two passes: stylization and smoothing.


Not sure I understand. Doesn't every neural network produce some approximation in finite time? In what sense is this approach "closed-form"?


Previous stylisation was slow because it needed to run SGD optimisation for each image to be stylised. This uses a NN trained once. Once you've trained a NN, it is precisely a closed-form solution, in the style y = max(0, 3x + 4). However, they are normally a little longer to write down :P


Ah okay right this is the answer. Previous approaches [1] are deep generative models that you have to optimize for each input, whereas here you run just a forward evaluation on a model that you've trained beforehand.

I would still argue the term closed-form is misleading here, because:

- Even during training, at any given time you can read off a "closed-form expression" of a neural network of this type, so closed-form in this broad sense really doesn't mean much. Furthermore, any result of any numerical computation ever is also a closed-form solution according to this, on the grounds that it results from a computation that completed in a finite number of steps. So whenever you ask a grad student to run some numerical simulation, expect them to come back saying "Hey, I found a closed-form expression!"

- The reason the above is absurd is that these trained NNs aren't really solutions to the optimization problem, but approximations. So this is really saying: I have a problem, I don't know how to solve it, but I can produce an infinite sequence of approximations. Now I'm going to truncate this sequence of approximations and call that a closed-form solution.

The high-school math analogy would be an infinite sum with no closed form: instead of evaluating it exactly, just add up the terms to some large N and call the partial sum a closed-form solution.

[1] e.g. https://arxiv.org/pdf/1508.06576.pdf
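The distinction being argued here can be sketched with a toy problem (purely illustrative, nothing to do with the paper's actual networks): solving the same minimization once by running an optimization loop per input, and once by evaluating a fixed expression derived ahead of time.

```python
# Toy contrast between the two regimes discussed above (illustrative only):
# minimizing f(x) = (x - t)^2 by per-input gradient descent, vs. evaluating
# a fixed expression that was worked out once in advance.

def solve_iteratively(t, steps=1000, lr=0.1):
    """Run an optimization loop for every new input t (the old approach)."""
    x = 0.0
    for _ in range(steps):
        x -= lr * 2 * (x - t)  # gradient of (x - t)^2
    return x

def solve_feedforward(t):
    """A fixed formula evaluated once per input (the 'closed-form' claim)."""
    return t  # minimizer of (x - t)^2, known in advance

approx = solve_iteratively(3.0)  # close to 3.0 only after many steps
exact = solve_feedforward(3.0)   # exactly 3.0 in one evaluation
print(approx, exact)
```

A trained style-transfer network is like `solve_feedforward` here: still only an approximate solution to the underlying problem, but evaluated in one fixed pass rather than re-optimized per image.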


Actually, I agree with you. Initially you seemed to object to the term "closed form"; this now highlights the more pertinent point - these models are 100% closed form, but 0% "solution" in the formal sense.


Someone correct me if I'm wrong, but I believe this refers to the fact that it can be expressed in terms of certain simple mathematical operations like addition, subtraction, multiplication, powers, roots etc.—and as a consequence, the execution is very efficient. My understanding is that 'closed form' solution is essentially something that resembles a polynomial (again, accepting corrections!).


Closed form just means you can do it in a finite number of operations. So just "run X" rather than the previous versions of this kind of thing which are "repeat X until measure Y is lower than the limit I care about". (my basic understanding)


I checked the Wikipedia article, and the sorts of operations involved do appear to be a part of the definition: https://en.wikipedia.org/wiki/Closed-form_expression —though it sounds like it's a somewhat loosely defined term.


I really don't think it's fair to call neural networks closed-form solutions. The term immediately makes me assume that it enables you to bypass the training stage altogether.


Running a trained net is, as long as you only have to do a single forward pass. It's a complex formula, but it is a closed-form one.


Notice that all of the examples illustrated in the paper contain similar scenes. The content image is a building, and the style image is also a building. Or an image of trees is styled using another image of trees.

But how well does it fare when you give it an image of a house and an image of something completely different, like a dog or a slipper?


Download the code, run it, and let us know!


What would you expect the outcome to be?

What is the correct answer to a question that's not well-formed?


The interesting question, then, is how far off can this be and still work? Is the limit "reasonable", or is there room for improvement of the algorithm?

E.g. I think most humans would say taking this content picture:

https://wallpapershome.com/images/pages/pic_hs/10150.jpg

and styling it with this picture:

https://c2.staticflickr.com/4/3499/3876547311_c2e32759d9_z.j...

is a pretty well-posed operation. How does that look using this algorithm?


Your first link just redirects to their homepage for me, can you explain which picture it was?


It shows a red crab on a beach in front of the bright blue ocean with a blue sky and white clouds.

I guess transfer of the wooden house amidst yellow fields with a reddening sky might lead to a wooden crab on a yellow field in front of a reddish-yellow ocean with red sky and clouds, or something.


It actually looks better than expected: https://imgur.com/a/5BjvC


Looks nice!

Did you have to do anything extra to get it working? I've set things up according to the documentation (I think), but I get dimension size errors when running it.


Haha, yes, I had to rewrite their code a bit. All the .unsqueeze(1).expand_as(...) calls in photo_wct.py need to be replaced by just .expand_as(...), and the return value of __feature_wct needs to be wrapped in torch.autograd.Variable.

I'm going to submit a PR, but it took me a bit of experimentation to fix these errors, so the code is a bit messier than I'd like.
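For anyone hitting the same dimension errors: the shape mismatch can be illustrated without PyTorch. Here's a hypothetical NumPy analogue of the .unsqueeze(1) issue (not the repo's actual code): inserting an extra axis into a tensor that is already 2-D makes it impossible to broadcast against the feature map.

```python
import numpy as np

# Hypothetical NumPy analogue of the PyTorch shape error described above.
feat = np.zeros((3, 8))                    # C channels x N pixels
mean = feat.mean(axis=1, keepdims=True)    # per-channel mean, shape (3, 1)

# Broadcasting (3, 1) across (3, 8) works fine (like plain .expand_as(...)):
expanded = np.broadcast_to(mean, feat.shape)

# Inserting an extra axis first (like .unsqueeze(1)) yields (3, 1, 1),
# which cannot be broadcast down to the 2-D feature shape:
try:
    np.broadcast_to(mean[:, :, None], feat.shape)
    extra_axis_failed = False
except ValueError:
    extra_axis_failed = True

print(expanded.shape, extra_axis_failed)  # (3, 8) True
```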


Ahh, that looks like the error I was hitting, thanks. I might try replacing those bits as well, though I just upgraded pytorch from 0.1.12 to 0.3 and it became much slower (I killed it after 5-6 minutes of setup).


My fork is here: https://github.com/Yorwba/FastPhotoStyle

I was using the pytorch 0.1.12 installed with conda (following their USAGE.md) and it took ~30s total for the transfer.


Much appreciated thanks!

For some reason it's taking me about 4-5 minutes for the transfer, but the code now runs and the rest of the runtime is only a few seconds.


Wow, that's pretty good! So this thing can do fairly well on complex transfers.


For those interested in this technology: I made two videos 18 months ago. No optical flow, and YouTube compression kills everything, but it's still decent if watched in 4K on a big screen :)

https://www.youtube.com/watch?v=2YRVt80g2Ek

https://www.youtube.com/watch?v=i69cBYI6f-w


I'm going to be that person - why a non-OSI approved license? Given that it's CUDA-specific, I'd expect NVIDIA to want people to use it.


> Licensed under the CC BY-NC-SA 4.0 license

Seems fine to me. If you want to develop something commercial you'd roll your own anyway. Nothing else is restricted by this license.


Consider artists. There's a tremendous potential in using technology like this in art, and preventing someone from selling their works will often put them off of using it at all.


What does the license of the product have to do with the output of the product? You can use GIMP and GCC commercially, for example, and libraries used with GCC often have runtime exemptions for their output.


Because this tool is licensed non-commercial. Using it for art that you sell would be a commercial use, and a violation of the license.


Hmmmm. Does the licence of the tool affect the output from the tool? Photoshop is proprietary, but Adobe doesn't have to explicitly grant me rights to the work I create with it.


Usually no, unless say, the tool put some part of itself in the output.

The license of GCC doesn't affect the license of your binaries.

The license of python doesn't affect the license of your software.

etc.


You only need a license for the copyright, though. In the worst case you waive your right to distribute your derivative code if it has been used for commercial applications (which would be a weird interpretation, but I can't find a precise explanation of what the 'non-commercial' license covers).


Contrary to modern software developers, artists are used to the notion that tools developed by other people are worthy of some kind of compensation, even if found on some flea market.


Certainly, but do you see where you can buy a commercial license for this? I don't.


I wonder whether this could be applied to a real-time scenario. Modern real-time renderers for games often have a tone-mapping step that lets artists color grade the final output. The paper cites an 11+ second runtime for 1K inputs, which is orders of magnitude off what it would need to be, but perhaps a simpler version run on the GPU is feasible.


Notice that the research was done by Nvidia


Nvidia is pretty big in the machine learning space in general, not game specific these days - GPUs are pretty general purpose highly multithreaded number crunchers and Nvidia's been making moves further in this direction with their own CUDA-based training tools, the DGX-1, the Jetson and other products.


Paper this is based on: https://arxiv.org/pdf/1802.06474

It's really great that NVIDIA is releasing code for their deep learning research.


(a) This problem has long been known as color/contrast transfer, and it was solved more than 10 years ago; (b) the results shown in this paper aren't objectively or subjectively better or more photo-realistic than Kokaram et al.'s work; and (c) I question whether this task even requires deep learning at all.

https://francois.pitie.net/colour/




These very low res example images aren't particularly useful for judging how good this actually is.


Only tangentially related, but has anyone ever tried to apply style-transfer on human faces for artificial aging or rejuvenation? Like for the movie industry or something?


FaceApp does this (including gender swapping) and it's quite fun for an hour or so of messing around.


The examples seem to be too good to be true. I don't have a GPU lying around so I cannot try it unfortunately.


paperspace provides pretty easy setup cloud GPUs for ~$0.40/hr if that's of interest :)


The only machines with decent GPUs in them I have access to run Windows and Windows Subsystem for Linux doesn't allow GPU access. Other than dual-booting or running Linux in VirtualBox - is there any way I can try this?


None of the dependencies seemed to be Linux-specific at a quick glance. You might be able to install all that on Windows (not sure how pleasant an experience it'll be).

Virtualbox won't help you, because you can't give proper access to the GPU for the VM guest unless you set up PCI-e passthrough and dedicate your whole GPU to the VM guest (and use your integrated graphics for the host). Not sure if this is even possible if Windows is the host.

If you don't feel like setting up a Linux install on your box, you could try some of the GPU cloud services.


Also I am told the proprietary nvidia drivers have a software lock that prevents you from using GPU passthrough unless you buy certain more expensive models.


With PCI-e passthru using intel_iommu, you can set this up with a gaming GPU. The driver can't tell that it's not running on bare metal.

This requires dedicating the whole GPU and the PCI-e slot to the virtual machine guest.

For more flexible virtualization setups, you need the professional quality cards.


There is a workaround. A number of GeForce cards have the exact same chipset as a Quadro card, but with a resistor pulling down an external pin. That resistor can be changed to make the card identify as a Quadro.

http://www.eevblog.com/forum/chat/hacking-nvidia-cards-into-...

Apparently this can also be done from software

http://archive.techarp.com/showarticleefc1.html


This is just spoofing the PCI VID:PID numbers to the driver and relying on driver bugs(?) to function. You could do the same with a few lines of kernel hacks far easier than soldering. It does not enable any features that are fused off in the hardware. This setup is not reliable.

Also, these posts are from 2008 and 2013, 5 and 10 years old. These hacks probably don't work any more.


OT, but why do those machines need to run Windows? Why can't you install Linux?


They are dev boxes for Windows VR apps. I'd like to play with this out of curiosity. It's not worth the hassle of a dual boot for that.


The user manual literally has a setup for Ubuntu, using CUDA & cupy.


I'm not sure I understand how that helps me.


Anaconda can probably help


No


Is it really all that hard to have a demo site for these things? It would be a lot of fun to play with crossing pictures. I'm guessing it's because using a graphics card in the browser isn't good enough yet?


I'm not sure how fast their FastPhotoStyle approach is, but a TensorFlow implementation of the original neural style transfer can take upwards of 20 minutes to create the final stylized image. If someone had the pre-trained model and neural net code in JS to read it and you could do it all client side then it would be possible, but still very slow.


The tech has come a long long way since the original, even before this FastPhotoStyle project.

A few months ago, there was TensorFire [0] that was able to do it in the browser. Quick google also gives other results [1]. There's also many apps that can do it in seconds. Speed definitely isn't an issue anymore, but getting it to work in browser can be tricky.

[0] https://tenso.rs/demos/fast-neural-style/

[1] https://reiinakano.github.io/fast-style-transfer-deeplearnjs...


That top left style will be perfect for the family xmas photo.


Is there some research doing the same in voice area? - Fix/change accent, - Improve person's voice, - Perhaps even make one sound like another.


I saw a clip from adobe a while ago

https://www.youtube.com/watch?v=I3l4XLZ59iw



https://lyrebird.ai/ -- they do the last thing. But they all seem related.


Yes, Adobe's VoCo, is one example.

Images and speech require different architectures (CNNs vs RNNs).


Has anyone had luck using this for their tinder profile?


Unfortunately, it only transfers style, not attractiveness


I assume you need a Nvidia card for this? Also has anyone tested it and seen how long it takes to render?


> Preparation 1: Setup environment and install required libraries

> Python Library Dependency

> conda install pytorch torchvision cuda90 -y -c pytorch

What is conda? How do I install it on Ubuntu 16.04?


Conda is basically an alternative to pip and virtualenv, used by the Anaconda python distribution that's really popular in the data science and machine learning community. The easiest way to get it is to install miniconda: https://conda.io/docs/user-guide/install/linux.html
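As a rough sketch, the whole setup on Ubuntu might look something like this (the Miniconda installer URL and filename change over time, so check the install page for the current one):

```shell
# Install Miniconda (installer name/URL may differ; see the docs linked above)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Then, per the repo's USAGE.md:
conda install pytorch torchvision cuda90 -y -c pytorch
```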


Conda is the package manager of the Anaconda python distribution, widely used for machine learning and numerical computing.


Is it faster than previous implementations?


Looks like it's a lot faster. They compare their approach to the Luan et al. approach, and for a 1024x512 image they are about 30-60x faster. They also seem to be more accurate, with better results.


Oooooh no, I'm going to get back to nerding out on this 100% of my time :'(


What's the max resolution with this?


What witchcraft is this?


Of course there is a relevant XKCD: https://xkcd.com/1838/



