The defense should train a network on faces of the jury and then show how the same technique, run by their biased network, now places each of them at the scene of the crime :)
Comparing against nearest neighbor, instead of a more reasonable linear filter, or (heaven forbid) some basic edge-directed interpolator... is a little cheaty.
Agreed, it would have been nice to show other upscaling algorithms. But neural net super-resolution generators can still produce significantly more detail at 4-8x, as shown here:
http://arxiv.org/abs/1609.04802
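For anyone who wants to eyeball those baselines themselves, here's a rough sketch using Pillow; this isn't part of the project, and the file name and 4x factor are just placeholders:

```python
# Sketch: generate conventional upscalings to compare against the neural output.
# "input.png" and the 4x zoom factor are placeholders, not project defaults.
from PIL import Image

img = Image.open("input.png")
target = (img.width * 4, img.height * 4)

baselines = {
    "nearest": Image.NEAREST,    # blocky baseline used in the GIFs
    "bilinear": Image.BILINEAR,  # simple linear filter
    "bicubic": Image.BICUBIC,    # the usual "fair" comparison
    "lanczos": Image.LANCZOS,    # windowed sinc, about as good as non-NN filtering gets
}

for name, resample in baselines.items():
    img.resize(target, resample=resample).save(f"upscaled_{name}.png")
```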
(Author here.) Yeah, I knew this would come up but decided to proceed with the pixelated comparison anyway. I couldn't get the GIFs to reflect the results because of 8-bit quantization/dithering. The images show the neural network inputs and outputs, not a comparison with other super-resolution algorithms (still fascinating :-).
I'm working on the Docker instance now; that should help anyone with interest/experience in the field compare results easily.
A friend of mine suggested that an approach similar to this could be used to upscale old standard definition TV shows (specifically, those shot on video rather than film). I'd imagine that multiple specially trained networks would be employed for different parts of the image (trained on pictures of individual performers or types of set/background). Pleased to see that this is possible. Is there anyone doing something along those lines already?
It should also be possible to train it on itself to improve moving scenes, using the motion itself for temporal super-sampling, just like the human eye does.
This works quite well, and it does not necessarily require any NN/machine learning. See the YouTube video for this paper: https://www.disneyresearch.com/publication/scenespace/
tl;dr: a simple brute-force weighted average of samples from many frames, combined with a noisy/low-quality depth-from-motion estimate, can be used to de-noise, increase resolution, and otherwise manipulate video footage. Very cool paper with great results from a simple technique.
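A minimal sketch of that idea with plain OpenCV: align neighbouring frames onto a reference frame with dense optical flow and average them. This is flow-aligned averaging, a much cruder stand-in for the paper's depth-based resampling, and the frame file names and reference index are made up:

```python
# Flow-aligned multi-frame averaging: warp neighbouring frames onto a reference
# frame and average to reduce noise. Frame paths and ref_idx are placeholders.
import cv2
import numpy as np

frames = [cv2.imread(f"frame_{i:03d}.png").astype(np.float32) for i in range(5)]
ref_idx = 2
ref_gray = cv2.cvtColor(frames[ref_idx].astype(np.uint8), cv2.COLOR_BGR2GRAY)

h, w = ref_gray.shape
grid_x, grid_y = np.meshgrid(np.arange(w, dtype=np.float32),
                             np.arange(h, dtype=np.float32))

accum = frames[ref_idx].copy()
weight = 1.0
for i, frame in enumerate(frames):
    if i == ref_idx:
        continue
    gray = cv2.cvtColor(frame.astype(np.uint8), cv2.COLOR_BGR2GRAY)
    # Dense optical flow from the reference frame to this neighbouring frame.
    flow = cv2.calcOpticalFlowFarneback(ref_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Warp the neighbouring frame back onto the reference frame's pixel grid.
    aligned = cv2.remap(frame, grid_x + flow[..., 0], grid_y + flow[..., 1],
                        cv2.INTER_LINEAR)
    accum += aligned
    weight += 1.0

cv2.imwrite("denoised.png", (accum / weight).astype(np.uint8))
```

For actual super-resolution you would resample onto a finer grid than the reference frame, but the alignment-and-average core is the same.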
As you suggested, continuity of appearance is what makes this problem so difficult.
As a child, I recall watching a movie that had been converted from black-and-white to color. There were many distracting artifacts. Most notably, the actors' hairlines would shift as they rotated their heads. It made the film unwatchable.
(Author here.) Absolutely! With multiple super-resolution networks, not only would continuity present problems, but so would blending between different regions. I agree there's a lot of value in domain-specific networks here, as you can see from the faces example on GitHub.
I'd be curious to see ensemble-based super-resolution, where each model outputs a confidence for each pixel region, and another network learns to blend the results.
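The blending stage could look something like this, where the per-pixel confidence maps would come from an extra output head on each model; all the names here are hypothetical, not anything from the project:

```python
# Hypothetical confidence-weighted blending of several super-resolution outputs.
import numpy as np

def blend(outputs, confidences, eps=1e-6):
    """Blend per-model outputs (H, W, 3) using per-pixel confidence maps (H, W)."""
    stacked = np.stack(outputs)                  # (N, H, W, 3)
    weights = np.stack(confidences)[..., None]   # (N, H, W, 1)
    return (stacked * weights).sum(axis=0) / (weights.sum(axis=0) + eps)

# e.g. blend([face_model_out, texture_model_out], [face_conf, texture_conf])
```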
Conversely, these results are achieved using a single top-of-the-range GPU. Everything fits in memory with a batch size of 15 at 192x192. By distributing the training somehow, you could make the network 10x bigger, train for a whole week, and likely get much better general-purpose results.
> Is there anyone doing something along those lines already?
I have a side business doing film restoration and am not aware of any solution like that. Probably the best upscaling solution out there is from Teranex, which was acquired by Blackmagic Design. Evertz probably also has something in their offering.
It should work, and I don't think you need to bother with training it on individual performers. Someone made a tool like this to improve low-res anime, and it worked well.
In theory you could use this to increase temporal resolution as well: turn 24 fps movies into 60 fps, and upscale regular HD to 4K.
It definitely makes a significant qualitative improvement, making the picture appear more in sync with what our brain interprets as a higher-resolution picture, but my first thought is whether this particular example goes beyond aesthetics. Is there really any instance where this method could, for instance, turn an unintelligible picture of a license plate into something in which the characters can be recognised? More generally, I wonder whether there has been any research on the limits, i.e., what the minimal combined size of the information stored in the neural network plus the information in its inputs needs to be before the output can be said to be true to the source with probability x.
I imagine that if you trained this on a set of license plate photos, it would be able to enhance license plates illegible to an untrained human such that they're readable. However, I doubt it would be better than a human specifically trained at this task.
I've seen some videos of Cold War satellite photo analysts, and it's remarkable the way they can look at some tiny gray blobs and go "That's a T-64 tank, that's a T-62 tank, that's an SA-2 launcher", etc.
Well, it doesn't create any information that wasn't in the original data (nothing can do that; you can only lose information in processing), so if, e.g., the characters can be recognized in the processed image of a licence plate, then by definition they could have been recognized from the original data as well in some manner.
However, it can make things more easily interpretable by humans. A rough analogy is turning up the contrast: given a very dark image of a licence plate where the black parts are totally black (#000000) and the white parts are just very dark (#010101), the characters can definitely be recognized, even though a human in normal conditions would just see the image as totally black, and processing would help.
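In code, that analogy is a one-liner of contrast stretching; the file name below is made up, but the point stands that values 0 and 1 are indistinguishable to the eye while trivially separable to the machine:

```python
# Toy contrast stretch: pixel values 0 and 1 look identical on screen,
# but rescaling the range makes the plate text readable. Path is a placeholder.
import numpy as np
from PIL import Image

dark = np.array(Image.open("dark_plate.png").convert("L"))  # values only 0 and 1
span = max(int(dark.max()) - int(dark.min()), 1)
stretched = (dark.astype(np.float32) - dark.min()) / span
Image.fromarray((stretched * 255).astype(np.uint8)).save("stretched_plate.png")
```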
> Well, it doesn't create any information that wasn't in the original data (nothing can do that, you can only lose information in processing)
I'm not sure this is correct. In a sense, it does contain information that wasn't in the original inputs, i.e. information added by the weights in the neural network, which were themselves obtained by extracting information from an enormous number of previous samples. Of course, the largest and best-trained neural network won't be able to tell the license number given 2 pixels of information, but I am curious about the theoretical limits of what can be achieved in extreme cases with very little information as input and a neural network that has almost limitless resources.
This is amazing. The surprise is that while the higher-resolution images seem real, they are reconstructions based on the previous learning, and can be very different from the actual image.
Nice test images to include would have been an original image, a downsampled image, and the reconstructed image. If the author is reading this, could they add this to the README?
Would also be interesting to see some pixel art run through this. It probably won't work that well, given that it's trained on real downsampled photos, but who knows.
This technique is akin to hiring an artist to draw a high-resolution version of your pixelated photos.
A good example of this is "Day of the Tentacle Remastered" (http://dott.doublefine.com/). The new game looks extremely similar to the old one, but it has been redrawn.
As someone suggested, you should be able to take an old TV show, train the neural network with HD pictures of the cast, and let it redraw the show in its own "artistic" interpretation of the images.
(Author here.) Did you see the faces example on the GitHub page? It was a domain-specific network trained adversarially for that purpose, but I have yet to see any super-resolution of that quality with or without machine learning.
Most other approaches don't even try to inject high-frequency detail into the high-resolution images because the PSNR/SSIM benchmark scores drop. Until those metrics/benchmarks are abandoned, there'll be little further progress in super-resolution.
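For reference, this is roughly the evaluation those papers report, assuming a recent scikit-image; the image paths are placeholders:

```python
# Compute the PSNR/SSIM scores typically reported in super-resolution papers.
# Requires scikit-image >= 0.19 for channel_axis; paths are placeholders.
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

ground_truth = np.array(Image.open("original.png").convert("RGB"))
reconstruction = np.array(Image.open("super_resolved.png").convert("RGB"))

psnr = peak_signal_noise_ratio(ground_truth, reconstruction, data_range=255)
ssim = structural_similarity(ground_truth, reconstruction,
                             channel_axis=-1, data_range=255)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```

Both metrics reward pixel-wise fidelity, which is why sharp hallucinated texture that looks better to a human can still score worse than a blurry average.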
I ran that image through the library with the default settings, and it produced an image that is, in my opinion, much better than all of the approaches shown there.
(Author here.) Maybe it's worth moving this to a GitHub issue. Try `--model=small`. The demo server limits the number of pixels to around 320x200 or 256x256 and can only process 4 at the same time to fit in RAM.
I do photo restorations on Reddit, where people often submit blurry photos that sharpening just can't fix. It would be great if this were offered as an online service.
Question for the more experienced deep learning folks: if I wanted to use this to upscale textures for a game, would I have to train it on the same type of texture? In other words, additional wood textures when upscaling wood, brick textures when upscaling brick, and so on?
(Author here.) If you have the luxury of training on domain-specific textures, the results will definitely be better. That's why I included all the training code in the repository as well, to allow for this kind of solution.
If you scroll down on GitHub to the faces examples, those are achieved by a domain-specific network. I suspect you'll similarly get extremely high-quality results if you have good input images.
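As a hedged sketch of what training on domain-specific textures means in practice, you build your own (low-res, high-res) pairs by downsampling the textures you care about; the directory layout below is invented, not the repository's actual data pipeline:

```python
# Prepare (low-res, high-res) training pairs from a texture library by
# bicubic downsampling. Directory names and SCALE are placeholders.
from pathlib import Path
from PIL import Image

SCALE = 2  # match the zoom factor you intend to train for

low_dir, high_dir = Path("dataset/lowres"), Path("dataset/hires")
low_dir.mkdir(parents=True, exist_ok=True)
high_dir.mkdir(parents=True, exist_ok=True)

for path in Path("textures/wood").glob("*.png"):
    hires = Image.open(path).convert("RGB")
    # The network learns to invert this degradation, so the downsampling
    # should mimic what you expect to see at inference time.
    lowres = hires.resize((hires.width // SCALE, hires.height // SCALE),
                          Image.BICUBIC)
    lowres.save(low_dir / path.name)
    hires.save(high_dir / path.name)
```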
I've seen a number of neural network approaches to super-resolution like waifu2x, but I haven't seen something general-purpose that's better than bicubic/Fourier/nearest neighbor.
(Author here.) My biggest insight from this project is that super-resolution with neural networks benefits significantly from being domain-specific. If you train on broader datasets, it does pretty well but has to make compromises. Many recent papers do their comparisons in terms of pixel similarity (PSNR/SSIM), and under those metrics the quality drops because high-frequency detail is penalized (even though it may look better perceptually). Reference: http://arxiv.org/abs/1609.04802
On GitHub, below each GIF there's a demo comparison, but on the site you can also submit your own image to try it out (click on the title or the restart button). It takes about 60s currently; it's running on CPU as the GPUs are busy training ;-)
> super-resolution with neural networks benefits significantly from being domain specific. If you train on broader datasets, it does pretty well but has to make compromises.
To what extent could the need for this trade-off be overcome with a larger network?
Train this using a huge facial database, such as the one US immigration holds, and you have the perfect human detector, able to identify you even from nighttime security camera footage.
(Author here.) Unlike most other non-GAN (generative adversarial network) approaches to super-resolution, it does try to inject high-frequency detail; see the faces example on GitHub. But I tuned that parameter down a bit in the released models so they perform better in general.
"Because my photos were used heavily in the dataset..."
Jury: So guilty