I used to roll my eyes at crime television shows whenever they said "Enhance" for a low-quality image.
Now it seems the possibility of that becoming realistic is increasing at a steady clip, based on this paper and other enhancement techniques I've seen posted here.
Except, and this is really the fundamental catch, it's not so much "enhance" as it is "project a believable substitute/interpretation".
You fundamentally can't get back information that has been destroyed or never captured in the first place.
What you can do is fill in the gaps with plausible values.
I don't know whether this sounds like I'm splitting hairs, but it's really important that the general public not think we're extracting information with these procedures; we're interpolating or projecting information that is not there.
Very useful for artificially generating skins for each shoe on a shoe rack in a computer game or simulation; potentially disastrous if the general public starts to think it's applicable to security camera footage or admissible as evidence...
To give specific examples from their test data, it added stubble to people who didn't have stubble, gave them a different shape of glasses, changed the color of cats, and changed the color and brand of a sports shoe.
And even then, I'm a little suspicious of how close some of the images got to the original without being given color information.
It appears that info was either hidden in the original in a way not apparent to humans or was implicit in their data set in some way that would make it fail on photos of people with different skin tones.
I haven't read the paper in full detail, but reading between the lines I'm guessing that there's a significant portion of manual processing and hand waving involved. From the abstract, emphasis mine:
> the second stage uses a pixel-wise nearest neighbor method to map the smoothed output to multiple high-quality, high-frequency outputs in a controllable manner.
My interpretation is that they select training data by hand and generate a bunch of outputs, repeating the process until they like the final result. From the paper:
> we allow a user to have an arbitrarily-fine level of control through on-the-fly editing of the exemplar set (e.g., "resynthesize an image using the eye from this image and the nose from that one").
There's nothing weak or negative about that; it's exactly what you'd expect. Obviously for a given input there will be multiple plausible outputs. With any such system it would make sense to allow some control in choosing among the outputs.
> Except, and this is really the fundamental catch, it's not so much "enhance" as it is "project a believable substitute/interpretation".
I would argue that this is a form of enhancement though, and in some cases it will be enough to completely reconstruct the original image. For example, if I give you a scanned PDF, and you know for a fact that it was size-12 black Arial text on a white background, this can feasibly let you reconstruct the original image perfectly. The 'prior' that has been encoded by the model from the large number of other images increases the mutual information between the grainy image and the high-res one. The catch is that uncertainty cannot be removed entirely, and you need to know that the target image comes from roughly the same distribution as the training set. But knowing this gives you information that is not encoded in the pixels themselves, so you can't necessarily argue that some enhancement is impossible. For example with celebrity images, if the model is able to figure out who is in the picture, this massively decreases the set of plausible outputs.
> The catch is that you need to know that the target image comes from roughly the same distribution as the training set.
When humans think about "enhance", they imagine extracting subtle details that were not obvious from the original, which implies that they know very little about what distribution the original image comes from. If they did, they wouldn't have a need for "enhance" 99% of the time -- the remaining 1% is for artistic purposes, which this is indeed suited for.
It'll be interesting to see how society copes with the removal of the "photographs = evidence" prior.
> when enhancing celebrity images, if the model is able to figure out who is in the picture this massively decreases the set of plausible outputs.
The benefit depends on how predictable the phenomenon is that you are interpolating from. Sometimes the result will be quantitatively better than a low-resolution version, sometimes not.
A good example is compression algorithms for media. They work because the sound or image is predictable, and they become ineffective when the input is more unpredictable. But if the compressed output is all you have, then running the decompression will probably be better than just reading the raw compressed data. You just have to be aware of the limitations.
> You fundamentally can't get back information that has been destroyed or never captured in the first place.
I love this cliché. I've seen it thousands of times, and probably written it myself a few times. We all repeat stuff like that ad nauseam, without ever thinking.
Because it's fundamentally flawed, especially in the context that it has usually been applied to, namely criticising the CSI:XYZ trope of "enhancing images".
The truth is that there is a lot more information in a low-res image than meets the eye.
Even if you can't read the letters on a license plate, they can sometimes be recovered by an algorithm. If the Empire State Building is in the background, it's likely to be a US license plate. Maybe only some letters would result in the photo's low-res pattern. If you only see part of a letter, knowing the font may allow you to rule out many letters or numbers, etc.
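As a toy illustration of that "rule candidates out" idea (a minimal sketch: the glyph templates here are random stand-ins and the threshold is arbitrary; a real system would render actual characters in the known font):

```python
import numpy as np

def downsample(img, factor):
    """Block-average an image by an integer factor (a crude camera model)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

# Hypothetical high-res glyph templates (random binary stand-ins for real characters).
rng = np.random.default_rng(0)
glyphs = {c: (rng.random((16, 8)) > 0.5).astype(float) for c in "ABCDEFGH"}

# The "photo": a heavily downsampled, slightly noisy view of one glyph.
truth = "E"
observed = downsample(glyphs[truth], 4) + rng.normal(0, 0.05, (4, 2))

# Rule out every candidate whose downsampled template is far from the observation.
scores = {c: np.abs(downsample(g, 4) - observed).mean() for c, g in glyphs.items()}
plausible = [c for c, s in sorted(scores.items(), key=lambda kv: kv[1]) if s < 0.12]
print("plausible characters:", plausible)  # a short list containing "E", maybe a few lookalikes
```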
It's similar to that guy who used Photoshop's swirl effect to hide his face, not knowing that the effect is deterministic, and can easily be undone.
The error mostly appears to be in assuming that the information has been destroyed, when in reality it's often just obscured. And neural nets are excellent at squeezing all the information out of noisy data.
> It's similar to that guy who used Photoshop's swirl effect to hide his face, not knowing that the effect is deterministic, and can easily be undone.
The effect does not only need to be deterministic, but also invertible.
A low-res image has multiple "inverses" (yikes), supposedly each with an associated probability (if you were to model it that way). So it would be more honest if the algorithm showed them all.
Showing them all seems a bit impossible because the number would blow up really quickly, wouldn't it? Maybe it could categorise them, but that could be misleading, too... I don't know.
>> You fundamentally can't get back information that has been destroyed or never captured in the first place.
> I love this cliché. I've seen it thousands of times, and probably written it myself a few times. We all repeat stuff like that ad nauseam, without ever thinking.
It is not a cliché; it is an absolute truth. Information not present cannot be retrieved. There may be more information present than is immediately obvious.
> Neural nets are excellent at squeezing all the information out of noisy data
Maybe, but they are also good at overfitting to noisy data (the original article is an example of such overfitting).
It's not a cliché, it's true. You fundamentally can't get back information that has been destroyed or never captured in the first place.
Yes, a low-res image has lots of information. You can process that information in many ways. Missing data can't just be magically blinked into existence though.
Copy/pasting bits of guessed data is NOT getting back information that has been destroyed or never captured. Obscured data is very different from non-existent data. Could the software recreate a destroyed painting of mine based on a simple sketch? Of course not, because it would have to invent details it knows nothing about.
I think it's almost dangerous to call this line of thinking cliché. It should be celebrated, not ridiculed.
For anyone put off by the .ps.gz, it's actually just a normal web page that links to the full article in HTML and PDF. Not sure what they were thinking with that URL. I almost didn't bother to look. (Maybe that's what they were thinking?)
I seem to remember from my computer vision class way back when that there's a fundamental theoretical limit to the amount of detail you can get out of a moving sequence. Recovering frequencies a little higher than the pixel sampling is definitely possible, but I feel like it was maybe something like a 10x theoretical maximum. I also get the feeling, from looking around at available software, that in practice achieving 2-3x is the most you can get in ideal conditions, and most video is far from ideal.
> I don't know whether this sounds like I'm splitting hairs
Somewhat no, but somewhat yes. Thing is, while there can be lots of input images that generate the same output, it could be that only one (or a handful) of them would occur in reality. If this happens to sometimes be the case, and if you could somehow guarantee this was the case in some particular scenario, it could very well make sense to admit it as evidence. Of course, the issue is that figuring this out may not be possible...
>we're interpolating or projecting information that is not there
But that's not fully accurate either. Sometimes the information in total will really be a more accurate representation of reality than the blurred image. Maybe it could be described as an educated guess, sometimes wrong, sometimes invaluable.
It would be interesting to see the results starting with higher-quality images. With camera quality increasing, there should often be more data to start with.
Exactly, this may be possible [0], but only if the NN has seen such images before; the output will match the training data but says nothing about reality.
No, but think of these blurred images as a "hash" - in an ideal situation, you only have one value that encodes to a certain hash value, right? So if you are given a hash X, you technically can work out that it was derived from value Y. You're not getting back information that was lost; in a way it was merely encoded into the blurred image, and it should be possible to produce a real image which, when blurred, will match what you have.
Don't get me wrong, I think we're still far, far off from a situation where we can do this reliably, but I can see how you could get the actual face out of a blurred image.
> you only have one value that encodes to a certain hash value, right?
Errr wrong. A perfect hash, yes. But they're never perfect. You have a collision domain and you hope that you don't have enough inputs to trigger a birthday paradox.
Look at the pictures on the article. It's an outline of the shoe. That's your hash. ANY shoe with that general outline resolves to that same hash.
If your input is objects found in the Oxford English Dictionary, you'll have few collisions. An elephant doesn't hash to that outline. But if your input is the Kohl's catalog, you'll have an unacceptable collision rate.
Hashes are attempts at creating a _truncated_ "unique" representation of an input. They throw away data (bits) they hope isn't necessary to uniquely identify between possible inputs. A perfect hash for all possible 32-bit values is 32 bits. You can't even have a collision-free 31-bit hash.
So back to the blurry security camera footage of a license plate or a face. Sure, that "hash" can reliably tell you that it wasn't a sasquatch that committed the robbery, but it literally doesn't contain the data necessary to _ever_ prove it was the suspect in question, even if the techs _can_ prove that the suspect hashes to the image in the footage.
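A tiny numpy sketch of that many-to-one point, with block averaging standing in for the camera's downsampling (purely illustrative):

```python
import numpy as np

def blur_hash(img, factor=4):
    """Downsample by block averaging -- the lossy 'hash' a low-res camera applies."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

rng = np.random.default_rng(1)
a = rng.random((16, 16))

# Build a different image that shares the same low-res "hash" by shuffling
# pixels *within* each 4x4 block: the block means (the hash) are unchanged.
b = a.copy()
for i in range(0, 16, 4):
    for j in range(0, 16, 4):
        block = b[i:i+4, j:j+4].ravel()
        rng.shuffle(block)
        b[i:i+4, j:j+4] = block.reshape(4, 4)

print("images identical?      ", np.allclose(a, b))                          # False
print("low-res hashes identical?", np.allclose(blur_hash(a), blur_hash(b)))  # True
```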
FYI (not because it’s particularly relevant to the sort of hashing that is being talked about, but because it’s a useful piece of info that might interest people, and corrects what I think is a misunderstanding in the parent comment): perfect hash functions are a thing, and are useful: https://en.wikipedia.org/wiki/Perfect_hash_function. So long as you’re dealing with a known, finite set of values, you can craft a useful perfect hash function. As an example of how this can be useful, there’s a set of crates in Rust that make it easy to generate efficient string lookup tables using the magic of perfect hash functions: https://github.com/sfackler/rust-phf#phf_macros. (A regular hash map for such a thing would be substantially less efficient.)
Crafting a perfect hash function with keys being the set of words from the OED is perfectly reasonable. It’ll take a short while to produce it, but it’ll work just fine. (rust-phf says that it “can generate a 100,000 entry map in roughly .4 seconds when compiling with optimizations”, and the OED word count is in the hundreds of thousands.)
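For the curious, a toy brute-force construction of a perfect hash over a small fixed key set (nothing like the clever constructions rust-phf or gperf use; the key list, salt scheme, and search ranges here are all made up):

```python
import zlib

# Toy perfect-hash search: find a salt and table size so that h(key) is
# collision-free for a *fixed, known* key set.
keys = ["apple", "banana", "cherry", "damson", "elder", "fig", "grape"]

def h(key, salt, m):
    # Salted CRC32, reduced modulo the table size.
    return zlib.crc32(f"{salt}:{key}".encode()) % m

def find_perfect_hash(keys):
    n = len(keys)
    for m in range(n, 4 * n):                 # allow a slightly sparse table
        for salt in range(10_000):
            if len({h(k, salt, m) for k in keys}) == n:   # no collisions: perfect for this set
                return salt, m
    raise RuntimeError("no parameters found in search range")

salt, m = find_perfect_hash(keys)
print(f"perfect hash with salt={salt}, table size m={m}")
print({k: h(k, salt, m) for k in keys})
```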
>So back to the blurry security camera footage of a license plate or a face. Sure, that "hash" can reliably tell you that it wasn't a sasquatch that committed the robbery, but it literally doesn't contain the data necessary to _ever_ prove it was the suspect in question, even if the techs _can_ prove that the suspect hashes to the image in the footage.
For a face, sure; for printed text/license plates there are effective deblurring algorithms that in some cases can rebuild a readable image.
A good piece of software (IMHO) is this one (it was freeware, now it is commercial; this is the last freeware version):
For the first, choose "Out of Focus Blur" and play with the values; you should get a decent image at roughly Radius 8, Smooth 40%, Correction Strength 0%, Edge Feather 10%.
For the second, choose "Motion Blur" and play with the values; you should get a decent image at roughly Length 14, Angle 34, Smooth 50%.
Fortunately there is a limit: the universe (in a practical sense). You cannot encode all of its states in a hash, as that would require more states than you want to encode, as you already mentioned (pigeonhole). But representing macroscopic data like text (or basically anything bigger than atomic scale) uniquely can be done with 128+ bits. Double that and you are likely safe from collisions, assuming the method you use is uniform and not biased toward some input.
If you want easy collision examples, you can take a look at people using CRC32 as hashes/digests. It is notoriously prone to collisions (since it is only 32 bits).
That won't work. A lot of people have tried to create systems that they claim always compress movies or files or something else. Yet none of those systems ever come to market. They get backers to give them cash, then they disappear. The reason they don't come to market is that they don't exist. Look up the pigeonhole principle. It's the very first principle of data compression.
You can't compress a file by repeatedly storing a series of hashes, then hashes of those hashes, down into smaller and smaller representations. The reason you cannot do this is that you cannot create a lossless file smaller than the original entropy. If you could, however, you would get down to ever-smaller files, until you had one byte left. But you could never decompress such a file, because there is no single correct interpretation of such a decompression. In other words, your decompression is not the original file.
Without getting too technical because I hate typing on a phone: you're technically right in the sense of a theoretical hash.
But in real life there are collisions.
And in real life, image or sound compression, blurs, artifacts, and low resolutions fundamentally destroy information in practice. It is no longer the comparatively difficult but theoretically possible task of reversing a perfect hash, but more like mapping a name to the characters/bucket RXXHXXXX, where X could be anything.
There are lots of values we can replace X with which are plausible, but without an outside source of information, we can't know what the real values in the original name were.
Out of sheer curiosity I had a go at manually enhancing the Roundhay Garden Scene by dramatically enlarging the frames, stacking them, aligning them, and erasing the most blurred ones and the obvious artifacts.
The funniest part was that the resolution really goes up if you make 1 px into 40 and align the frames accurately (then adjust opacity to the level of blur).
The crime television thing would be possible if you had enough frames of the gangster.
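For anyone curious what the stack-and-align step looks like in code, here's a minimal sketch on synthetic frames (circular shifts plus noise; real footage needs sub-pixel registration and is far messier):

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.random((64, 64))            # stand-in for the "true" scene

# Simulate noisy frames of the same scene, each shifted by a small amount.
frames, shifts = [], [(0, 0), (3, -2), (-5, 4), (1, 7), (-6, -3)]
for dy, dx in shifts:
    frames.append(np.roll(base, (dy, dx), axis=(0, 1)) + rng.normal(0, 0.3, base.shape))

def estimate_shift(ref, frame):
    """Integer shift of `frame` relative to `ref`, via FFT cross-correlation."""
    corr = np.fft.ifft2(np.conj(np.fft.fft2(ref)) * np.fft.fft2(frame)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = ref.shape
    # Map indices past the midpoint to negative shifts.
    return (dy - h if dy > h // 2 else dy, dx - w if dx > w // 2 else dx)

ref = frames[0]
aligned = [np.roll(f, [-d for d in estimate_shift(ref, f)], axis=(0, 1)) for f in frames]
stacked = np.mean(aligned, axis=0)     # noise drops roughly as 1/sqrt(number of frames)

print("mean error, single frame:", np.abs(frames[0] - base).mean())
print("mean error, stacked     :", np.abs(stacked - base).mean())
```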
Approaches like these are hallucinating the high-resolution images, though; not something that we'd ever want being used for police work. That said, I wonder if it would perform better than eyewitness testimony...
To play devil's advocate though, modern neuroscience and neuropsychology basically tell us that our brains reconstruct and recreate our memories every time we try to remember them. Our memories are highly malleable and prone to false implantation... and yet witness testimony is still the gold standard in courts.
I wouldn't want to see it used as evidence in court (and I doubt it would be allowed anyway, but IANAL), but I could see this being useful in certain circumstances for generating the photo-realistic equivalent of a police sketch, e.g. if you had low-res security footage of a suspect and an eyewitness to guide the output.
It would be useful to reduce the number of suspects... calculate possible combinations, match them against the mugshot database, and investigate/interrogate those people. Or if you're the NSA/KGB, you can match against the social media pictures database and then ask the social media company to tell you where these users were at the time of the crime (since the social media app on the phone tracks its users' location...)
You could, e.g., ostensibly produce possible license plates, which could be further reduced by matching the car color and model, to produce a small set of valid records.
Sure, but if we go by how the police work now, they will take a plate produced by the computer as 100% given and arrest/shoot the owner of that plate because "the computer said so".
This image from the article shows that the original image and the fantasy image are not alike at all. The faces look like they are different ages. The computer even fantasized a beauty mark.
> This image from the article shows that the original image and the fantasy image are not alike at all.
This is another avenue that could be further explored, which I quite like. That is, a non-artist can doodle images and create a completely new photo-realistic image based on the line drawings.
I was modifying a few images (from a link in another comment here: https://affinelayer.com/pixsrv/ ) and the end results were interesting.
The low-resolution-to-high-resolution image synthesis reminds me of the unblur tool that Adobe demoed during Adobe MAX in 2011. Here is the relevant clip if you're interested: https://www.youtube.com/watch?v=xxjiQoTp864
That demo was quite impressive, but the technique is completely different. Adobe uses deconvolution to recover information and details that are actually in the picture but not visible (unintuitively, blurring is a mathematically reversible transformation: if you know the characteristics of the blur, then you can reverse it. In fact most of the magic in Adobe's demo comes from knowing the blur kernel and path in advance; I'm not sure how it works in practice for real photos). But the neural net demoed in this post just "makes up" the missing info using examples from photos it learned from; there is no information recovery.
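To make that concrete, here's a bare-bones frequency-domain (Wiener-style) deconvolution that assumes the blur kernel is known exactly and the noise is low. It's a sketch of the general principle, not how Adobe's tool actually works:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64))                       # stand-in for the sharp original

# Known blur kernel: a 5x5 box, zero-padded to image size and centered.
psf = np.zeros_like(img)
psf[:5, :5] = 1.0 / 25
psf = np.roll(psf, (-2, -2), axis=(0, 1))

H = np.fft.fft2(psf)
blurred = np.fft.ifft2(np.fft.fft2(img) * H).real        # circular blur

# Wiener-style inverse: divide by H where it is strong, damp it where it is weak.
eps = 1e-3                                                # noise-vs-detail trade-off
G = np.fft.fft2(blurred)
restored = np.fft.ifft2(G * np.conj(H) / (np.abs(H) ** 2 + eps)).real

print("mean abs error, blurred :", np.abs(blurred - img).mean())   # large
print("mean abs error, restored:", np.abs(restored - img).mean())  # much smaller
```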
You'll get something that looks plausible for sure, but maybe not what was originally there. In the future, someone will be falsely convicted of a crime because a DNN "enhance" decided to put their picture in some fuzzy context.
You don't specify, but presumably you mean a true confession.
It could also be used to generate a false confession. If the prosecutor says "We have proof you were there at the scene" and shows you some generated image, then you as an innocent person have to weigh the chances of the jury being fooled by the image (and even if it's not admissible in court, it may be enough to convince the investigating team that you are responsible and stop looking for the real perpetrator) and the expected sentences if you maintain your innocence vs "admitting" your guilt.
Yup. In a court of law, the value as evidence is going to be weighted fairly low, even with expert testimony. It may be enough to get a warrant, or a piece in the process of deduction during the investigation phase.
To paraphrase Google Brain's Vincent Vanhoucke, this appears to be another example where using context prediction from neighboring values outperforms an autoencoder approach.
If 2017 was the year of GANs, 2018 will be the year of context prediction.
I hope some day this will generalize to video. I don't care about the exact shape of background trees in an action movie - with this approach, they could be compressed to just a few bytes, regardless of resolution.
Except that it can put trees somewhere where there were no trees but something similar to them. Or it can put the face of a more popular actor in place of an actual, less popular one because it was more often present in the training dataset. No, thanks.
No, today's compression is about compressing what's already in the one movie. But imagine that you run your training set over hundreds or thousands of films, and extract just enough to represent, say, different types of trees in a few bytes. You could 'compress' a film by replacing data with markers that essentially describe some properties of a tree, and those properties + the training set are then used during 'decompression' to recreate (an approximation of) the tree.
This would of course not give you any space savings when you want to distribute 1 movie. There would be some minimum number of movies where the training set + actual movies would be smaller than the sum of the sizes of the individual movies compressed.
I'm not saying this would be a net space saver, or necessarily a good technique at all, but the concept is intriguing.
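A toy version of that idea: a shared codebook of exemplar patches drawn from many 'movies', with each frame then stored only as indices into it. This is just a sketch of the concept (essentially crude vector quantization with random frames and a random codebook), nothing to do with real codecs:

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH, K = 8, 64                       # 8x8 patches, 64-entry shared codebook

# Stand-ins for frames taken from many different movies.
frames = [rng.random((64, 64)) for _ in range(20)]

def patches(frame):
    return np.array([frame[i:i+PATCH, j:j+PATCH].ravel()
                     for i in range(0, 64, PATCH) for j in range(0, 64, PATCH)])

# "Training": the shared codebook is just a random sample of patches from all frames.
all_patches = np.concatenate([patches(f) for f in frames])
codebook = all_patches[rng.choice(len(all_patches), K, replace=False)]

def encode(frame):
    """Each patch becomes one small integer: the index of its nearest codebook entry."""
    p = patches(frame)
    d = ((p[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def decode(indices):
    out = np.zeros((64, 64))
    for n, idx in enumerate(indices):
        i, j = divmod(n, 64 // PATCH)
        out[i*PATCH:(i+1)*PATCH, j*PATCH:(j+1)*PATCH] = codebook[idx].reshape(PATCH, PATCH)
    return out

codes = encode(frames[0])
approx = decode(codes)                 # only an approximation of the original frame
print("indices stored vs raw pixel count:", codes.size, "vs", frames[0].size)
```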
> I don't understand how the edges-to-faces can possibly work. The inputs seem to be black & white, and yet the output pictures have light skin tones.
The step you're missing is that an edge detector is run on the entire database of training images to produce a database of edge images. The input edge image is run against that corpus of edge images in order to find which edge images match, then sample the corresponding original color images and synthesize a new color image.
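In sketch form, that lookup goes roughly like this; note this toy version matches whole images globally, whereas the paper matches pixel-wise, and the "photos" here are random arrays:

```python
import numpy as np

def edges(img):
    """Crude edge map: gradient magnitude of the grayscale image."""
    gy, gx = np.gradient(img.mean(axis=-1))
    return np.hypot(gx, gy)

rng = np.random.default_rng(0)
train_color = rng.random((200, 32, 32, 3))          # stand-in training photos
train_edges = np.array([edges(im) for im in train_color])

def synthesize(query_edges, k=5):
    """Find training images whose edge maps best match the query edge map,
    then blend their *color* images to propose a colored output."""
    d = ((train_edges - query_edges) ** 2).sum(axis=(1, 2))
    nearest = np.argsort(d)[:k]
    return train_color[nearest].mean(axis=0)        # all color comes from the training set

query = edges(train_color[0]) + rng.normal(0, 0.01, (32, 32))
proposal = synthesize(query)
print(proposal.shape)                               # (32, 32, 3): a plausible colorization
```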
Thanks for that link, I'd never seen that before. In fact, the edges2shoes sample on that page exactly summarises the issue I have: you start with what effectively appears to be a rough line-drawing sketch of a shoe, and the algorithm 'fills in' a realistic shoe to fit the sketch. The sketch never had any colour information, so the algorithm has to pick one for it. In their example output, the algorithm has picked a black shoe, but it could just as realistically have chosen a red one. The colouring all comes from their training data (in their case, 50k shoe images from Zappos). So in short, the algorithm can't determine colour.
But shoes and cats are one thing; reconstructing people's faces is another. I know the paper & the authors are demonstrating a technology here, rather than directly saying "you can use this technology for purpose X", but the discussion in these comments has jumped straight into enhancing images and improving existing pictures/video. But there is a very big line between 'reconstituting' or 'reconstructing' an image and 'synthesising' or 'creating' an image, and it appears many people are blurring the two together. Again, in the authors' defence, they are clear that they talk about the 'synthesis' of images, but the difference is critical.
> So in short, the algorithm can't determine colour.
That's right. But with the caveat that a large training set can determine plausible colors and rule out implausible ones. This is more true for faces than for shoes! The point is that there is some correlation between shape and color in real life. The color comes from the context in the training set. This is what @cbr meant nearby re: "skin color is relatively predictable from facial features (ex: nose width), it should be able to do reasonably well."
> there is a very big line between 'reconstituting' or 'reconstructing' an image and 'synthesising' or 'creating' an image, and it appears many people are blurring the two together.
I had the same thought. Maybe it's not that there were only white people in the dataset; maybe it's actually taking the shape of the face into account, and it most closely matches those with white skin tones. I suggest this from looking at the cat one: it has the stripes coming off the eyes, which suggests one of the grey striped breeds rather than, e.g., an all-black or calico cat. It's probably more than pixel-by-pixel NN interpolation; it's also taking into account some of the actual structure of the edges.
Color comes from the initial neural network step. Since skin color is relatively predictable from facial features (ex: nose width), it should be able to do reasonably well.
Really? With what accuracy? This is the kind of assumption that will get research groups into very deep water...
Just imagine the kind of CCTV usage being discussed elsewhere in this thread. But the neural network happens to have a wrong bias towards skin colour...
You're absolutely right to be concerned about this stuff, but be aware that it is generally acknowledged as a problem and that the "ethics of machine learning" is quite an interesting and active research topic.
Image synthesis can't be used for up-rezzing CCTV imagery at all; the output is a fabrication, and the researchers have all said so. People imagining bad use cases shouldn't be relied on. ;) If an investigator used this to track down criminals, they are the ones getting into deep water and making assumptions.
I have a large collection of images, many of them accessible through Google image search.
I wonder if there could be a way to "index" those images so I can find them again without storing the whole image, using some kind of clever image histogram or hashing function.
I wonder if such a thing already exists. Since there are many images, and since most images differ a lot in their data, could it be possible to create some kind of function that describes an image in such a way that entering the histogram leads you back to the image it indexed (or the closest one)? I guess I'm lacking the math, but it sounds like some "averaging" hashing function.
This is the current approach for large-scale image retrieval: use some model to extract features and then perform distance calculations. This is usually done with hashing once speed and the size of the dataset become an issue.
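The "averaging" hashing function the parent is reaching for exists and is usually called an average hash (aHash). A minimal numpy version, operating on grayscale arrays rather than files:

```python
import numpy as np

def average_hash(img, hash_size=8):
    """Shrink to hash_size x hash_size by block averaging, then threshold each
    cell against the mean: a 64-bit fingerprint of the image's coarse structure."""
    h, w = img.shape
    small = img[:h - h % hash_size, :w - w % hash_size]
    small = small.reshape(hash_size, small.shape[0] // hash_size,
                          hash_size, small.shape[1] // hash_size).mean(axis=(1, 3))
    return (small > small.mean()).ravel()

def hamming(h1, h2):
    return int((h1 != h2).sum())       # 0 = same structure, ~32 = unrelated images

rng = np.random.default_rng(0)
img = rng.random((128, 128))
noisy_copy = img + rng.normal(0, 0.05, img.shape)
other = rng.random((128, 128))

print(hamming(average_hash(img), average_hash(noisy_copy)))  # small: near-duplicate found
print(hamming(average_hash(img), average_hash(other)))       # large: different image
```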
Is anyone in the FX business playing with this stuff? I'm thinking of generating backdrops with groups of people/stuff/animals in them without a lot of modelling input.
This is actually training a neural network on the Markov model, so it's very similar to core ideas behind the OP's paper. The core idea is to model the probability of a bit of sound by breaking it into the last note and everything that comes before the last note ("P(audio)=P(audio∣note)P(note)"). If you sample a bunch of audio and factor it that way for any given point in time, and accumulate that data somewhere, you can then sample the accumulated data randomly to generate new music.
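If it helps to see how small the conditional-table idea is, here's an order-1 Markov chain over note names (the corpus is made up and this has nothing to do with the project being discussed; the same machinery applied to words is what powers the Mark V. Shaney style text bots mentioned below):

```python
import random
from collections import defaultdict

# A tiny "corpus" of note sequences (stand-ins for real training scores).
sequences = [
    ["C", "E", "G", "E", "C", "G", "C"],
    ["C", "G", "A", "G", "E", "C"],
    ["E", "G", "C", "E", "G", "A", "G"],
]

# Build the conditional table P(next note | current note) as raw counts.
table = defaultdict(list)
for seq in sequences:
    for cur, nxt in zip(seq, seq[1:]):
        table[cur].append(nxt)

def sample(start="C", length=12):
    """Walk the chain: repeatedly sample the next note given only the current one."""
    out = [start]
    for _ in range(length - 1):
        out.append(random.choice(table[out[-1]]))
    return out

print(sample())
```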
There are other audio NN synthesis methods as well; pretty sure I've even seen one posted to Show HN before.
There kind of already is an audio equivalent: MIDI. It supplies low-resolution timing and pitch information, and it's up to the synthesizer to produce audio output matching those data.
I think the interesting part would be example based audio synthesis. Could you replace a synthesizer with a neural network which, when fed examples, would allow you to generate sounds / explore some latent space between the examples.
It more or less attempts to be what you describe. Not very polished yet, but I had some basic success in modeling the parameter space of a synth, and adding new latent spaces with regularization.
This is amazing. I especially like how the result can somewhat be interpreted by showing from what image the part of the generated image is copied (see Figure 5).
I spent too long trying to get RAISR to work when that paper came out. You can try it out from some GitHub repos, but no one has been able to recreate the results Google presented. I would be hard-pressed to say my hi-res photos looked any better than the originals when scaled up on my iPhone screen.
I do wish they would release the code AND any related training images they used to get those results.
All those examples are fairly low-resolution. Does this approach scale or can it be applied in some tiled fashion? Or would the artifacts get worse for larger images?
You just need to look at the picture of Fred Armisen to see that this technique can generate a picture of a plausibly real human who bears little to no resemblance to the original image.
We could also just pick a random person off the street and punish them - it would be similarly accurate and fair (actually probably fairer - if this is trained on pictures with a certain bias it will return pictures with that bias).
This paper does not demonstrate an enhancement technique but a phenomenon which those using inverse methods call "overfitting".
Hopefully never, but I'm sure someone will see this and try!
(Because these kinds of techniques aren't really enhancing the images in a way that gives you new and useful information: they are taking the low-res images as input and giving you a plausible high-res image as output, based on their training data. They are NOT, however, trying to say "this is the ACTUAL high-res image that generated this low-res image".)
I found the title somewhat misleading. I was expecting some clever application of the nearest-neighbor interpolation. But this seems to involve neural nets and appears far from "simple" to me (I'm not in the image processing field though).
> I was expecting some clever application of the nearest-neighbor interpolation. But this seems to involve neural nets and appears far from "simple" to me (I'm not in the image processing field though).
It's not that far off actually, but they are talking about nearest neighbor Markov chains, not interpolation. You probably already know nearest neighbor Markov chains because there are lots of text examples, and a ton of Twitter bots that are generating random text this way. The famous historical example was the usenet post that said "I spent an interesting evening recently with a grain of salt." https://en.m.wikipedia.org/wiki/Mark_V._Shaney
This paper does use a NN to synthesize an image, which is conceptually pretty simple, even if difficult to implement well. After that they use a nearest neighbor Markov chain to fill in high frequencies. The first paper referenced is also the simplest example: http://graphics.cs.cmu.edu/people/efros/research/EfrosLeung....
That paper fills missing parts of an image using a single example, by using a Markov chain built on the nearest neighboring pixels. That paper is also one of the only image synthesis papers (or perhaps the only paper) that can synthesize readable text from an image of text. That's really cool because the inspiration was text-based Markov chains.
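Here's a 1D toy in the Efros-Leung spirit, for anyone who wants to see the mechanism without the 2D bookkeeping: grow the output one symbol at a time by collecting every window in the sample that exactly matches the last few synthesized symbols and sampling one of their successors (the real method uses 2D pixel neighborhoods and a distance threshold rather than exact matches; the sample string here is made up):

```python
import random

sample = "the cat sat on the mat and the cat ate the rat "
WINDOW = 4                                # neighborhood (context) size

def synthesize(length=60, seed="the "):
    out = seed
    while len(out) < length:
        context = out[-WINDOW:]
        # All positions in the sample whose preceding window matches the context.
        candidates = [sample[i + WINDOW]
                      for i in range(len(sample) - WINDOW)
                      if sample[i:i + WINDOW] == context]
        if not candidates:                # dead end: fall back to a random character
            candidates = list(sample)
        out += random.choice(candidates)
    return out

print(synthesize())
```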
I don't think this method has anything to do with Markov Chains. The spatial structure isn't explicitly used at all, and the interpolation/regression is quite a vanilla nearest neighbor with some performance tricks.
Well, of course almost anything can be interpreted as a Markov process, but I don't think it's a very useful abstraction here.
> I don't think this method has anything to do with Markov Chains.
Oh, it absolutely does. I think it's fair to say that Efros launched the field of nearest neighbor texture synthesis, and his abstract states: "The texture synthesis process grows a new image outward from an initial seed, one pixel at a time. A Markov random field model is assumed, and the conditional distribution of a pixel given all its neighbors synthesized so far is estimated by querying the sample image and finding all similar neighborhoods.
This is the same Markov model that all subsequent texture synthesis papers are implicitly using, including the paper at the top of this thread. Efros' paper implemented directly is really slow, so a huge number of subsequent papers use the same conceptual framework, and are only adding methods for making the method performant and practical. (Sometimes, at the cost of some quality -- many cannot synthesize text, for example.)
Note the inspiration for text synthesis, Shannon's paper, also describes the "Markoff Process" explicitly. http://math.harvard.edu/~ctm/home/text/others/shannon/entrop... (Efros referenced Shannon, and noted on his web page: "Special thanks goes to Prof. Joe Zachary who taught my undergrad data structures course and had us implement Shannon's text synthesis program which was the inspiration for this project.")
> Well, of course almost anything can be interpreted as a Markov process, but I don't think it's a very useful abstraction here.
It's not an abstraction to build a conditional probability table and then sample from it repeatedly to synthesize a new output. That's what a Markov process is, and that's what the paper posted here is doing. I don't really understand why you feel it's distant and abstract, but if you want to elaborate, I am willing to listen!
Unless I horribly misread the paper, this is not based on Efros' quilting method, which indeed uses Markov fields. The method linked here seems to interpolate every pixel independently of its surroundings (neighbor means a close-by pixel from the training set in feature space, not a spatially close pixel).
And I didn't mean that Markov processes are abstract in any "distant" sense, but that they are an abstraction, ie a "perspective" from which to approach and formulate the problem.
I was referring to Efros' "non-parametric sampling" paper, not the quilting one. Efros defined "non-parametric sampling" as another name for "Markov chain" -- almost (see my edit below). This paper (PixelNN) refers directly to "non-parametric sampling" in the same sense as Efros, and it states that they are using "nearest neighbor" to mean "non-parametric sampling". This is talking rather explicitly about a Markov chain -like process.
"To address these limitations, we appeal to a classic learning architecture that can naturally allow for multiple outputs and user-control: non-parametric models, or nearest-neighbors (NN). Though quite a classic approach [11, 15, 20, 24], it has largely been abandoned in recent history with the advent of deep architectures. Intuitively, NN works by requiring a large training set of pairs of (incomplete inputs, high-quality outputs), and works by simply matching the an incomplete query to the training set and returning the corresponding output. This trivially generalizes to multiple outputs through K-NN and allows for intuitive user control through on-the-fly modification of the training set..."
Note the first reference #11 is Efros' non-parametric sampling, and that the authors state this is the "classic approach" that they apply here.
What you call "interpolate every pixel independently from its surroundings" could be another way to describe a Markov chain, because 1: it is sampled according to the conditional probability distribution (which is what you get by using the K nearest matches.) and 2: the process is repeated - one pixel (or patch) is added using the best match, then it becomes part of the neighborhood in the search for the pixel/patch next door. The name for that is "Markov process", or in the discrete case, "Markov chain", if you take an unbiased random sample from the conditional distribution. If you always choose the best sample, then it's the same as a Markov chain, but biased.
> (neighbor means a close-by pixel in the training set in the feature space, not a spatially close pixel)
That's right, and that's why it's misleading to talk about nearest neighbor interpolation, because that phrase is a graphics phrase that means interpolate from spatially close pixels. Hardly anyone else calls it interpolation, they call it sampling, point sampling, and other terms.
*EDIT:
I'm going to relax a little bit on this. "Non-parametric sampling" is a tiny bit different from a Markov process in that a Markov process attempts to simulate a distribution in an unbiased way. By using the best match instead of a random sample from the conditional distribution, the output may produce a biased version of the original distribution. This is why it's called non-parametric sampling instead of calling it a Markov chain, but the distinction is pretty small and subtle -- texture synthesis using non parametric sampling is extremely similar to a Markov chain, but not necessarily exactly the same.
Side note, it's really unfortunate they used the abbreviation "NN" to talk about "nearest neighbor" in a paper that also builds on "neural networks".
AFAIU it actually seems to be sort of "just" a clever application of the nearest-neighbor interpolation. The CNN is used to come up with the feature space for the pixels (weights of the CNN), and then each pixel is "copy-pasted" from the training set based on the nearest match.
It seems that this could be used in theory with any feature descriptors, such as local color histograms, although the results wouldn't probably be as good.
Edit: Being a nearest neighbor method, it probably also carries the usual computational complexity problems. If I understand correctly, they ease this by first finding just a subset of best-matching full images using the CNN features and then doing a local nearest-neighbor search only within those images.
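A rough sketch of that two-stage search as I read it, with toy data and cheap pooled features standing in for the CNN activations (so this is an illustration of the idea, not the paper's actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((500, 32, 32))                 # stand-in training images
query = train[42] + rng.normal(0, 0.02, (32, 32)) # noisy version of one of them

def global_feature(img):
    """Cheap stand-in for a CNN descriptor: 4x4 average pooling, flattened."""
    return img.reshape(4, 8, 4, 8).mean(axis=(1, 3)).ravel()

# Stage 1: coarse search over whole images to get a small shortlist.
feats = np.array([global_feature(im) for im in train])
shortlist = np.argsort(((feats - global_feature(query)) ** 2).sum(1))[:10]

# Stage 2: the expensive per-patch nearest-neighbor search, but only in the shortlist.
def best_patch(query_patch):
    best, best_d = None, np.inf
    for idx in shortlist:
        for i in range(0, 32, 8):
            for j in range(0, 32, 8):
                cand = train[idx, i:i+8, j:j+8]
                d = ((cand - query_patch) ** 2).sum()
                if d < best_d:
                    best, best_d = cand, d
    return best

patch = best_patch(query[:8, :8])
print("best matching patch shape:", patch.shape)
```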
> The CNN is used to come up with the feature space for the pixels (weights of the CNN), and then each pixel is "copy-pasted" from the training set based on the nearest match.
FWIW, what you just described is known as a "Markov process". It is sampling a known conditional probability distribution.
While some interpolation of the data happens because the output represents a mixture of the training images, this is not "interpolation" at the pixel level, it's picking best matches from a search space of image fragments. (And the pixel neighbors are usually synthesized - the best match depends on previous best matches!) This is distinctly different from the kind of nearest neighbor interpolation you'd do when resizing an image.
Note the phrase "nearest neighbor" in this paper has an overloaded double meaning. It is referring both to pixel neighbors and neighbors in the search space of images. The pixel neighbors provide spatial locality within a single image; this is how & why high frequencies are generated from the training set. Nearest neighbor is also referring to the neighborhood matches in the search space, the K nearest neighbors of a given pixel neighborhood are used to generate the next K pixel outputs in the synthesis phase.
Agree. This appears to be more a clever implementation of an algorithm generating "artistic" impressions. In some cases, creating artifacts which simply were not part of the original picture.
Take a low resolution input image, and hallucinate a higher resolution version by statistically assembling bits from similar images in a large data set of training images.
If anyone ever tries to use this in court I hope they call it "Face Hallucination" and not "Image Reconstruction". On the research side, I wonder what the point of this is. I find it interesting but of little practical value.
It's a way to refine their models. A systematic model-based representation of data is basically also a generator of that data.
Why is that? Blame Kolmogorov. There are deep connections between compression, serialization, and computation. An optimal compression scheme is a serialization and the Turing-complete program to decode it. For example: you can compress pi into a few lines of algorithm plus a starting constant like 4.
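That pi example, spelled out (the Leibniz series; the "constant 4" plus a couple of lines of algorithm regenerate as many digits as you have patience for, which is the point about description length):

```python
# "Decompress" pi from a tiny program: pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...)
def leibniz_pi(terms=1_000_000):
    total = 0.0
    for k in range(terms):
        total += (-1) ** k / (2 * k + 1)
    return 4 * total

print(leibniz_pi())   # 3.14159... (converges slowly, but the program is tiny)
```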
It almost looks like they mixed training and testing data in some of the examples. The bottom-left sample in the normals-to-faces is extremely suspicious.
I was looking at this as well, but I'm willing to suspend my disbelief because the normal vaguely looks like it has a good deal of information (in a basic fidelity sense).