I used to roll my eyes at crime television shows whenever they said "Enhance" for a low-quality image.
Now it seems the possibility of that becoming realistic is increasing at a steady clip, based on this paper and other enhancement techniques I've seen posted here.
Except, and this is really the fundamental catch, it's not so much "enhance" as it is "project a believable substitute/interpretation".
You fundamentally can't get back information that has been destroyed or never captured in the first place.
What you can do is fill in the gaps with plausible values.
I don't know whether this sounds like I'm splitting hairs, but it's really important that the general public not think we're extracting information with these procedures; we're interpolating or projecting information that is not there.
Very useful for artificially generating skins for each shoe on a shoe rack in a computer game or simulation; potentially disastrous if the general public starts to think it's applicable to security camera footage or admissible as evidence...
To give specific examples from their test data, it added stubble to people who didn't have stubble, gave them a different shape of glasses, changed the color of cats, and changed the color and brand of a sports shoe.
And even then, I'm a little suspicious of how close some of the images got to the original without being given color information.
It appears that info was either hidden in the original in a way not apparent to humans or was implicit in their data set in some way that would make it fail on photos of people with different skin tones.
I haven't read the paper in full detail, but reading between the lines I'm guessing that there's a significant portion of manual processing and hand waving involved. From the abstract, emphasis mine:
> the second stage uses a pixel-wise nearest neighbor method to map the smoothed output to multiple high-quality, high-frequency outputs in a controllable manner.
My interpretation is that they select training data by hand and generate a bunch of outputs, repeating the process until they like the final result. From the paper:
> we allow a user to have an arbitrarily-fine level of control through on-the-fly editing of the exemplar set (e.g., "resynthesize an image using the eye from this image and the nose from that one").
There's nothing weak or negative about that; it's exactly what you'd expect. Obviously for a given input there will be multiple plausible outputs. With any such system it would make sense to allow some control in choosing among the outputs.
> Except, and this is really the fundamental catch, it's not so much "enhance" as it is "project a believable substitute/interpretation".
I would argue that this is a form of enhancement though, and in some cases it will be enough to completely reconstruct the original image. For example, if I give you a scanned PDF, and you know for a fact that it was size-12 black Arial text on a white background, this can feasibly let you reconstruct the original image perfectly. The 'prior' that has been encoded by the model from the large number of other images increases the mutual information between the grainy image and the high-res one. The catch is that uncertainty cannot be removed entirely, and you need to know that the target image comes from roughly the same distribution as the training set. But knowing this gives you information that is not encoded in the pixels themselves, so you can't necessarily argue that some enhancement is impossible. For example with celebrity images, if the model is able to figure out who is in the picture, this massively decreases the set of plausible outputs.
> The catch is that you need to know that the target image comes from roughly the same distribution as the training set.
When humans think about "enhance", they imagine extracting subtle details that were not obvious from the original, which implies that they know very little about what distribution the original image comes from. If they did, they wouldn't have a need for "enhance" 99% of the time -- the remaining 1% is for artistic purposes, which this is indeed suited for.
It'll be interesting to see how society copes with the removal of the "photographs = evidence" prior.
> when enhancing celebrity images, if the model is able to figure out who is in the picture this massively decreases the set of plausible outputs.
The benefit depends on how predictable the phenomenon is that you are interpolating from. Sometimes the result will be quantitatively better than a low-resolution version, sometimes not.
A good example is compression algorithms for media. They work because the sound or image is predictable, and they become ineffective when the input is more unpredictable. But if the compressed output is all you have, then running the decompression will probably be better than just reading the raw compressed data. You just have to be aware of the limitations.
> You fundamentally can't get back information that has been destroyed or never captured in the first place.
I love this cliché. I've seen it thousands of times, and probably written it myself a few times. We all repeat stuff like that ad nauseam, without ever thinking.
Because it's fundamentally flawed, especially in the context that it has usually been applied to, namely criticising the CSI:XYZ trope of "enhancing images".
The truth is that there is a lot more information in a low-res image than meets the eye.
Even if you can't read the letters on a license plate, they can sometimes be recovered by an algorithm. If the Empire State Building is in the background, it's likely to be a US license plate. Maybe only some letters would result in the photo's low-res pattern. If you only see part of a letter, knowing the font may allow you to rule out many letters or numbers, etc.
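As a toy illustration of that "rule candidates out" idea (a minimal sketch: the glyph templates here are random stand-ins and the threshold is arbitrary; a real system would render actual characters in the known font):

```python
import numpy as np

def downsample(img, factor):
    """Block-average an image by an integer factor (a crude camera model)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

# Hypothetical high-res glyph templates (random binary stand-ins for real characters).
rng = np.random.default_rng(0)
glyphs = {c: (rng.random((16, 8)) > 0.5).astype(float) for c in "ABCDEFGH"}

# The "photo": a heavily downsampled, slightly noisy view of one glyph.
truth = "E"
observed = downsample(glyphs[truth], 4) + rng.normal(0, 0.05, (4, 2))

# Rule out every candidate whose downsampled template is far from the observation.
scores = {c: np.abs(downsample(g, 4) - observed).mean() for c, g in glyphs.items()}
plausible = [c for c, s in sorted(scores.items(), key=lambda kv: kv[1]) if s < 0.12]
print("plausible characters:", plausible)  # a short list containing "E", maybe a few lookalikes
```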
It's similar to that guy who used Photoshop's swirl effect to hide his face, not knowing that the effect is deterministic, and can easily be undone.
The error mostly appears to be in assuming that the information has been destroyed, when in reality it's often just obscured. And neural nets are excellent at squeezing all the information out of noisy data.
> It's similar to that guy who used Photoshop's swirl effect to hide his face, not knowing that the effect is deterministic, and can easily be undone.
The effect does not only need to be deterministic, but also invertible.
A low-res image has multiple "inverses" (yikes), supposedly each with an associated probability (if you were to model it that way). So it would be more honest if the algorithm showed them all.
Showing them all seems a bit impossible because the number would blow up really quickly, wouldn't it? Maybe it could categorise them, but that could be misleading, too... I don't know.
>> You fundamentally can't get back information that has been destroyed or never captured in the first place.
> I love this cliché. I've seen it thousands of times, and probably written it myself a few times. We all repeat stuff like that ad nauseam, without ever thinking.
It is not a cliché; it is an absolute truth. Information not present cannot be retrieved. There may be more information present than is immediately obvious.
> Neural nets are excellent at squeezing all the information out of noisy data
Maybe, but they are also good at overfitting to noisy data (the original article is an example of such overfitting).
It's not a cliché, it's true. You fundamentally can't get back information that has been destroyed or never captured in the first place.
Yes, a low-res image has lots of information. You can process that information in many ways. Missing data can't just be magically blinked into existence though.
Copy/pasting bits of guessed data is NOT getting back information that has been destroyed or never captured. Obscured data is very different from non-existent data. Could the software recreate a destroyed painting of mine based on a simple sketch? Of course not, because it would have to invent details it knows nothing about.
I think it's almost dangerous to call this line of thinking cliché. It should be celebrated, not ridiculed.
For anyone put off by the .ps.gz, it's actually just a normal web page that links to the full article in HTML and PDF. Not sure what they were thinking with that URL. I almost didn't bother to look. (Maybe that's what they were thinking?)
I seem to remember from my computer vision class way back when that there's a fundamental theoretical limit to the amount of detail you can get out of a moving sequence. Recovering frequencies a little higher than the pixel sampling is definitely possible, but I feel like it was maybe something like a 10x theoretical maximum. I also get the feeling, from looking around at available software, that in practice achieving 2-3x is the most you can get in ideal conditions, and most video is far from ideal.
> I don't know whether this sounds like I'm splitting hairs
Somewhat no, but somewhat yes. Thing is, while there can be lots of input images that generate the same output, it could be that only one (or a handful) of them would occur in reality. If this happens to sometimes be the case, and if you could somehow guarantee this was the case in some particular scenario, it could very well make sense to admit it as evidence. Of course, the issue is that figuring this out may not be possible...
>we're interpolating or projecting information that is not there
But that's not fully accurate either. Sometimes the information in total will really be a more accurate representation of reality than the blurred image. Maybe it could be described as an educated guess, sometimes wrong, sometimes invaluable.
It would be interesting to see the results starting with higher-quality images. With camera quality increasing, there should often be more data to start with.
Exactly, this may be possible [0], but only if the NN has seen such images before; the output will match the training data but says nothing about reality.
No, but think of these blurred images as a "hash" - in an ideal situation, you only have one value that encodes to a certain hash value, right? So if you are given a hash X, you technically can work out that it was derived from value Y. You're not getting back information that was lost; in a way it was merely encoded into the blurred image, and it should be possible to produce a real image which, when blurred, will match what you have.
Don't get me wrong, I think we're still far, far off from a situation where we can do this reliably, but I can see how you could get the actual face out of a blurred image.
> you only have one value that encodes to a certain hash value, right?
Errr wrong. A perfect hash, yes. But they're never perfect. You have a collision domain and you hope that you don't have enough inputs to trigger a birthday paradox.
Look at the pictures on the article. It's an outline of the shoe. That's your hash. ANY shoe with that general outline resolves to that same hash.
If your input is objects found in the Oxford English Dictionary, you'll have few collisions. An elephant doesn't hash to that outline. But if your input is the Kohl's catalog, you'll have an unacceptable collision rate.
Hashes are attempts at creating a _truncated_ "unique" representation of an input. They throw away data (bits) they hope isn't necessary to uniquely identify between possible inputs. A perfect hash for all possible 32-bit values is 32 bits. You can't even have a collision-free 31-bit hash.
So back to the blurry security camera footage of a license plate or a face. Sure, that "hash" can reliably tell you that it wasn't a sasquatch that committed the robbery, but it literally doesn't contain the data necessary to _ever_ prove it was the suspect in question, even if the techs _can_ prove that the suspect hashes to the image in the footage.
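A tiny numpy sketch of that many-to-one point, with block averaging standing in for the camera's downsampling (purely illustrative):

```python
import numpy as np

def blur_hash(img, factor=4):
    """Downsample by block averaging -- the lossy 'hash' a low-res camera applies."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

rng = np.random.default_rng(1)
a = rng.random((16, 16))

# Build a different image that shares the same low-res "hash" by shuffling
# pixels *within* each 4x4 block: the block means (the hash) are unchanged.
b = a.copy()
for i in range(0, 16, 4):
    for j in range(0, 16, 4):
        block = b[i:i+4, j:j+4].ravel()
        rng.shuffle(block)
        b[i:i+4, j:j+4] = block.reshape(4, 4)

print("images identical?      ", np.allclose(a, b))                          # False
print("low-res hashes identical?", np.allclose(blur_hash(a), blur_hash(b)))  # True
```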
FYI (not because it’s particularly relevant to the sort of hashing that is being talked about, but because it’s a useful piece of info that might interest people, and corrects what I think is a misunderstanding in the parent comment): perfect hash functions are a thing, and are useful: https://en.wikipedia.org/wiki/Perfect_hash_function. So long as you’re dealing with a known, finite set of values, you can craft a useful perfect hash function. As an example of how this can be useful, there’s a set of crates in Rust that make it easy to generate efficient string lookup tables using the magic of perfect hash functions: https://github.com/sfackler/rust-phf#phf_macros. (A regular hash map for such a thing would be substantially less efficient.)
Crafting a perfect hash function with keys being the set of words from the OED is perfectly reasonable. It’ll take a short while to produce it, but it’ll work just fine. (rust-phf says that it “can generate a 100,000 entry map in roughly .4 seconds when compiling with optimizations”, and the OED word count is in the hundreds of thousands.)
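For the curious, a toy brute-force construction of a perfect hash over a small fixed key set (nothing like the clever constructions rust-phf or gperf use; the key list, salt scheme, and search ranges here are all made up):

```python
import zlib

# Toy perfect-hash search: find a salt and table size so that h(key) is
# collision-free for a *fixed, known* key set.
keys = ["apple", "banana", "cherry", "damson", "elder", "fig", "grape"]

def h(key, salt, m):
    # Salted CRC32, reduced modulo the table size.
    return zlib.crc32(f"{salt}:{key}".encode()) % m

def find_perfect_hash(keys):
    n = len(keys)
    for m in range(n, 4 * n):                 # allow a slightly sparse table
        for salt in range(10_000):
            if len({h(k, salt, m) for k in keys}) == n:   # no collisions: perfect for this set
                return salt, m
    raise RuntimeError("no parameters found in search range")

salt, m = find_perfect_hash(keys)
print(f"perfect hash with salt={salt}, table size m={m}")
print({k: h(k, salt, m) for k in keys})
```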
>So back to the blurry security camera footage of a license plate or a face. Sure, that "hash" can reliably tell you that it wasn't a sasquatch that committed the robbery, but it literally doesn't contain the data necessary to _ever_ prove it was the suspect in question, even if the techs _can_ prove that the suspect hashes to the image in the footage.
For a face, sure; for printed text/license plates there are effective deblurring algorithms that in some cases can rebuild a readable image.
A good piece of software (IMHO) is this one (it was freeware, now it is commercial; this is the last freeware version):
For the first, choose "Out of Focus Blur" and play with the values; you should get a decent image at roughly Radius 8, Smooth 40%, Correction Strength 0%, Edge Feather 10%.
For the second, choose "Motion Blur" and play with the values; you should get a decent image at roughly Length 14, Angle 34, Smooth 50%.
Fortunately there is a limit: the universe (in a practical sense). You cannot encode all of its states in a hash, as that would require more states than you want to encode, as you already mentioned (pigeonhole). But representing macroscopic data like text (or basically anything bigger than atomic scale) uniquely can be done with 128+ bits. Double that and you are likely safe from collisions, assuming the method you use is uniform and not biased toward some input.
If you want easy collision examples, you can take a look at people using CRC32 as hashes/digests. It is notoriously prone to collisions (since it is only 32 bits).
That won't work. A lot of people have tried to create systems that they claim always compress movies or files or something else. Yet none of those systems ever come to market. They get backers to give them cash, then they disappear. The reason they don't come to market is that they don't exist. Look up the pigeonhole principle. It's the very first principle of data compression.
You can't compress a file by repeatedly storing a series of hashes, then hashes of those hashes, down into smaller and smaller representations. The reason you cannot do this is that you cannot create a lossless file smaller than the original entropy. If you could, however, you would get down to ever-smaller files, until you had one byte left. But you could never decompress such a file, because there is no single correct interpretation of such a decompression. In other words, your decompression is not the original file.
Without getting too technical because I hate typing on a phone: you're technically right in the sense of a theoretical hash.
But in real life there are collisions.
And in real life, image or sound compression, blurs, artifacts, and low resolutions fundamentally destroy information in practice. It is no longer the comparatively difficult but theoretically possible task of reversing a perfect hash, but more like mapping a name to the characters/bucket RXXHXXXX, where X could be anything.
There are lots of values we can replace X with which are plausible, but without an outside source of information, we can't know what the real values in the original name were.
Out of sheer curiosity I had a go at manually enhancing the Roundhay Garden Scene by dramatically enlarging the frames, stacking them, aligning them, and erasing the most blurred ones and the obvious artifacts.
The funniest part was that the resolution really goes up if you make 1 px into 40 and align the frames accurately (then adjust opacity to the level of blur).
The crime television thing would be possible if you had enough frames of the gangster.
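For anyone curious what the stack-and-align step looks like in code, here's a minimal sketch on synthetic frames (circular shifts plus noise; real footage needs sub-pixel registration and is far messier):

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.random((64, 64))            # stand-in for the "true" scene

# Simulate noisy frames of the same scene, each shifted by a small amount.
frames, shifts = [], [(0, 0), (3, -2), (-5, 4), (1, 7), (-6, -3)]
for dy, dx in shifts:
    frames.append(np.roll(base, (dy, dx), axis=(0, 1)) + rng.normal(0, 0.3, base.shape))

def estimate_shift(ref, frame):
    """Integer shift of `frame` relative to `ref`, via FFT cross-correlation."""
    corr = np.fft.ifft2(np.conj(np.fft.fft2(ref)) * np.fft.fft2(frame)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = ref.shape
    # Map indices past the midpoint to negative shifts.
    return (dy - h if dy > h // 2 else dy, dx - w if dx > w // 2 else dx)

ref = frames[0]
aligned = [np.roll(f, [-d for d in estimate_shift(ref, f)], axis=(0, 1)) for f in frames]
stacked = np.mean(aligned, axis=0)     # noise drops roughly as 1/sqrt(number of frames)

print("mean error, single frame:", np.abs(frames[0] - base).mean())
print("mean error, stacked     :", np.abs(stacked - base).mean())
```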
Approaches like these are hallucinating the high-resolution images, though; not something that we'd ever want being used for police work. That said, I wonder if it would perform better than eyewitness testimony...
To play devil's advocate though, modern neuroscience and neuropsychology basically tell us that our brains reconstruct and recreate our memories every time we try to remember them. Our memories are highly malleable and prone to false implantation... and yet witness testimony is still the gold standard in courts.
I wouldn't want to see it used as evidence in court (and I doubt it would be allowed anyway, but IANAL), but I could see this being useful in certain circumstances for generating the photo-realistic equivalent of a police sketch, e.g. if you had low-res security footage of a suspect and an eyewitness to guide the output.
It would be useful to reduce the number of suspects... calculate possible combinations, match them against the mugshot database, and investigate/interrogate those people. Or if you're the NSA/KGB, you can match against the social media pictures database and then ask the social media company to tell you where these users were at the time of the crime (since the social media app on the phone tracks its users' location...)
You could, e.g., ostensibly produce possible license plates, which could be further reduced by matching the car color and model, to produce a small set of valid records.
Sure, but if we go by how the police work now, they will take a plate produced by the computer as 100% given and arrest/shoot the owner of that plate because "the computer said so".
This image from the article shows that the original image and the fantasy image are not alike at all. The faces look like they are different ages. The computer even fantasized a beauty mark.
> This image from the article shows that the original image and the fantasy image are not alike at all.
This is another avenue that could be further explored, which I quite like. That is, a non-artist can doodle images and create a completely new photo-realistic image based on the line drawings.
I was modifying a few images (from a link in another comment here: https://affinelayer.com/pixsrv/ ) and the end results were interesting.
The low-resolution-to-high-resolution image synthesis reminds me of the unblur tool that Adobe demoed during Adobe MAX in 2011. Here is the relevant clip if you're interested: https://www.youtube.com/watch?v=xxjiQoTp864
That demo was quite impressive, but the technique is completely different. Adobe uses deconvolution to recover information and details that are actually in the picture but not visible (unintuitively, blurring is a mathematically reversible transformation: if you know the characteristics of the blur, then you can reverse it. In fact most of the magic in Adobe's demo comes from knowing the blur kernel and path in advance; I'm not sure how it works in practice for real photos). But the neural net demoed in this post just "makes up" the missing info using examples from photos it learned from; there is no information recovery.
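To make that concrete, here's a bare-bones frequency-domain (Wiener-style) deconvolution that assumes the blur kernel is known exactly and the noise is low. It's a sketch of the general principle, not how Adobe's tool actually works:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64))                       # stand-in for the sharp original

# Known blur kernel: a 5x5 box, zero-padded to image size and centered.
psf = np.zeros_like(img)
psf[:5, :5] = 1.0 / 25
psf = np.roll(psf, (-2, -2), axis=(0, 1))

H = np.fft.fft2(psf)
blurred = np.fft.ifft2(np.fft.fft2(img) * H).real        # circular blur

# Wiener-style inverse: divide by H where it is strong, damp it where it is weak.
eps = 1e-3                                                # noise-vs-detail trade-off
G = np.fft.fft2(blurred)
restored = np.fft.ifft2(G * np.conj(H) / (np.abs(H) ** 2 + eps)).real

print("mean abs error, blurred :", np.abs(blurred - img).mean())   # large
print("mean abs error, restored:", np.abs(restored - img).mean())  # much smaller
```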
You'll get something that looks plausible for sure, but maybe not what was originally there. In the future, someone will be falsely convicted of a crime because a DNN "enhance" decided to put their picture in some fuzzy context.
You don't specify, but presumably you mean a true confession.
It could also be used to generate a false confession. If the prosecutor says "We have proof you were there at the scene" and shows you some generated image, then you as an innocent person have to weigh the chances of the jury being fooled by the image (and even if it's not admissible in court, it may be enough to convince the investigating team that you are responsible and stop looking for the real perpetrator) and the expected sentences if you maintain your innocence vs "admitting" your guilt.
Yup. In a court of law, the value as evidence is going to be weighted fairly low, even with expert testimony. It may be enough to get a warrant, or a piece in the process of deduction during the investigation phase.
To paraphrase Google Brain's Vincent Vanhoucke, this appears to be another example where using context prediction from neighboring values outperforms an autoencoder approach.
If 2017 was the year of GANs, 2018 will be the year of context prediction.
I hope some day this will generalize to video. I don't care about the exact shape of background trees in an action movie - with this approach, they could be compressed to just a few bytes, regardless of resolution.
Except that it can put trees somewhere where there were no trees but something similar to them. Or it can put the face of a more popular actor in place of an actual, less popular one because it was more often present in the training dataset. No, thanks.
No, today's compression is about compressing what's already in the one movie. But imagine that you run your training set over hundreds or thousands of films, and extract just enough to represent, say, different types of trees in a few bytes. You could 'compress' a film by replacing data with markers that essentially describe some properties of a tree, and those properties + the training set are then used during 'decompression' to recreate (an approximation of) the tree.
This would of course not give you any space savings when you want to distribute 1 movie. There would be some minimum number of movies where the training set + actual movies would be smaller than the sum of the sizes of the individual movies compressed.
I'm not saying this would be a net space saver, or necessarily a good technique at all, but the concept is intriguing.
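A toy version of that idea: a shared codebook of exemplar patches drawn from many 'movies', with each frame then stored only as indices into it. This is just a sketch of the concept (essentially crude vector quantization with random frames and a random codebook), nothing to do with real codecs:

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH, K = 8, 64                       # 8x8 patches, 64-entry shared codebook

# Stand-ins for frames taken from many different movies.
frames = [rng.random((64, 64)) for _ in range(20)]

def patches(frame):
    return np.array([frame[i:i+PATCH, j:j+PATCH].ravel()
                     for i in range(0, 64, PATCH) for j in range(0, 64, PATCH)])

# "Training": the shared codebook is just a random sample of patches from all frames.
all_patches = np.concatenate([patches(f) for f in frames])
codebook = all_patches[rng.choice(len(all_patches), K, replace=False)]

def encode(frame):
    """Each patch becomes one small integer: the index of its nearest codebook entry."""
    p = patches(frame)
    d = ((p[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def decode(indices):
    out = np.zeros((64, 64))
    for n, idx in enumerate(indices):
        i, j = divmod(n, 64 // PATCH)
        out[i*PATCH:(i+1)*PATCH, j*PATCH:(j+1)*PATCH] = codebook[idx].reshape(PATCH, PATCH)
    return out

codes = encode(frames[0])
approx = decode(codes)                 # only an approximation of the original frame
print("indices stored vs raw pixel count:", codes.size, "vs", frames[0].size)
```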
> I don't understand how the edges-to-faces can possibly work. The inputs seem to be black & white, and yet the output pictures have light skin tones.
The step you're missing is that an edge detector is run on the entire database of training images to produce a database of edge images. The input edge image is run against that corpus of edge images in order to find which edge images match, then sample the corresponding original color images and synthesize a new color image.
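In sketch form, that lookup goes roughly like this; note this toy version matches whole images globally, whereas the paper matches pixel-wise, and the "photos" here are random arrays:

```python
import numpy as np

def edges(img):
    """Crude edge map: gradient magnitude of the grayscale image."""
    gy, gx = np.gradient(img.mean(axis=-1))
    return np.hypot(gx, gy)

rng = np.random.default_rng(0)
train_color = rng.random((200, 32, 32, 3))          # stand-in training photos
train_edges = np.array([edges(im) for im in train_color])

def synthesize(query_edges, k=5):
    """Find training images whose edge maps best match the query edge map,
    then blend their *color* images to propose a colored output."""
    d = ((train_edges - query_edges) ** 2).sum(axis=(1, 2))
    nearest = np.argsort(d)[:k]
    return train_color[nearest].mean(axis=0)        # all color comes from the training set

query = edges(train_color[0]) + rng.normal(0, 0.01, (32, 32))
proposal = synthesize(query)
print(proposal.shape)                               # (32, 32, 3): a plausible colorization
```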
Thanks for that link, I'd never seen that before. In fact, the edges2shoes sample on that page exactly summarises the issue I have: you start with what effectively appears to be a rough line-drawing sketch of a shoe, and the algorithm 'fills in' a realistic shoe to fit the sketch. The sketch never had any colour information, so the algorithm has to pick one for it. In their example output, the algorithm has picked a black shoe, but it could just as realistically have chosen a red one. The colouring all comes from their training data (in their case, 50k shoe images from Zappos). So in short, the algorithm can't determine colour.
But shoes and cats are one thing; reconstructing people's faces is another. I know the paper & the authors are demonstrating a technology here, rather than directly saying "you can use this technology for purpose X", but the discussion in these comments has jumped straight into enhancing images and improving existing pictures/video. But there is a very big line between 'reconstituting' or 'reconstructing' an image and 'synthesising' or 'creating' an image, and it appears many people are blurring the two together. Again, in the authors' defence, they are clear that they talk about the 'synthesis' of images, but the difference is critical.
> So in short, the algorithm can't determine colour.
That's right. But with the caveat that a large training set can determine plausible colors and rule out implausible ones. This is more true for faces than for shoes! The point is that there is some correlation between shape and color in real life. The color comes from the context in the training set. This is what @cbr meant nearby re: "skin color is relatively predictable from facial features (ex: nose width), it should be able to do reasonably well."
> there is a very big line between 'reconstituting' or 'reconstructing' an image and 'synthesising' or 'creating' an image, and it appears many people are blurring the two together.
I had the same thought. Maybe it's not that there were only white people in the dataset; maybe it's actually taking the shape of the face into account, and it most closely matches those with white skin tones. I suggest this from looking at the cat one: it has the stripes coming off the eyes, which suggests one of the grey striped breeds rather than, e.g., an all-black or calico cat. It's probably more than pixel-by-pixel NN interpolation; it's also taking into account some of the actual structure of the edges.
Color comes from the initial neural network step. Since skin color is relatively predictable from facial features (ex: nose width), it should be able to do reasonably well.
Really? With what accuracy? This is the kind of assumption that will get research groups into very deep water...
Just imagine the kind of CCTV usage being discussed elsewhere in this thread. But the neural network happens to have a wrong bias towards skin colour...
You're absolutely right to be concerned about this stuff, but be aware that it is generally acknowledged as a problem and that the "ethics of machine learning" is quite an interesting and active research topic.
Image synthesis can't be used for up-rezzing CCTV imagery at all; the output is a fabrication, and the researchers have all said so. People imagining bad use cases shouldn't be relied on. ;) If an investigator used this to track down criminals, they are the ones getting into deep water and making assumptions.
I have a large collection of images, many of them accessible through Google image search.
I wonder if there could be a way to "index" those images so I can find them again without storing the whole image, using some kind of clever image histogram or hashing function.
I wonder if such a thing already exists. Since there are many images, and since most images differ a lot in their data, could it be possible to create some kind of function that describes an image in such a way that entering the histogram leads you back to the image it indexed (or the closest one)? I guess I'm lacking the math, but it sounds like some "averaging" hashing function.
This is the current approach for large-scale image retrieval: use some model to extract features and then perform distance calculations. This is usually done with hashing once speed and the size of the dataset become an issue.
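The "averaging" hashing function the parent is reaching for exists and is usually called an average hash (aHash). A minimal numpy version, operating on grayscale arrays rather than files:

```python
import numpy as np

def average_hash(img, hash_size=8):
    """Shrink to hash_size x hash_size by block averaging, then threshold each
    cell against the mean: a 64-bit fingerprint of the image's coarse structure."""
    h, w = img.shape
    small = img[:h - h % hash_size, :w - w % hash_size]
    small = small.reshape(hash_size, small.shape[0] // hash_size,
                          hash_size, small.shape[1] // hash_size).mean(axis=(1, 3))
    return (small > small.mean()).ravel()

def hamming(h1, h2):
    return int((h1 != h2).sum())       # 0 = same structure, ~32 = unrelated images

rng = np.random.default_rng(0)
img = rng.random((128, 128))
noisy_copy = img + rng.normal(0, 0.05, img.shape)
other = rng.random((128, 128))

print(hamming(average_hash(img), average_hash(noisy_copy)))  # small: near-duplicate found
print(hamming(average_hash(img), average_hash(other)))       # large: different image
```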
Is anyone in the FX business playing with this stuff? I'm thinking of generating backdrops with groups of people/stuff/animals in them without a lot of modelling input.
This is actually training a neural network on the Markov model, so it's very similar to core ideas behind the OP's paper. The core idea is to model the probability of a bit of sound by breaking it into the last note and everything that comes before the last note ("P(audio)=P(audio∣note)P(note)"). If you sample a bunch of audio and factor it that way for any given point in time, and accumulate that data somewhere, you can then sample the accumulated data randomly to generate new music.
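If it helps to see how small the conditional-table idea is, here's an order-1 Markov chain over note names (the corpus is made up and this has nothing to do with the project being discussed; the same machinery applied to words is what powers the Mark V. Shaney style text bots mentioned below):

```python
import random
from collections import defaultdict

# A tiny "corpus" of note sequences (stand-ins for real training scores).
sequences = [
    ["C", "E", "G", "E", "C", "G", "C"],
    ["C", "G", "A", "G", "E", "C"],
    ["E", "G", "C", "E", "G", "A", "G"],
]

# Build the conditional table P(next note | current note) as raw counts.
table = defaultdict(list)
for seq in sequences:
    for cur, nxt in zip(seq, seq[1:]):
        table[cur].append(nxt)

def sample(start="C", length=12):
    """Walk the chain: repeatedly sample the next note given only the current one."""
    out = [start]
    for _ in range(length - 1):
        out.append(random.choice(table[out[-1]]))
    return out

print(sample())
```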
There are other audio NN synthesis methods as well; pretty sure I've even seen one posted to Show HN before.
There kind of already is an audio equivalent: MIDI. It supplies low-resolution timing and pitch information, and it's up to the synthesizer to produce audio output matching those data.
I think the interesting part would be example based audio synthesis. Could you replace a synthesizer with a neural network which, when fed examples, would allow you to generate sounds / explore some latent space between the examples.
It more or less attempts to be what you describe. Not very polished yet, but I had some basic success in modeling the parameter space of a synth, and adding new latent spaces with regularization.
This is amazing. I especially like how the result can somewhat be interpreted by showing from what image the part of the generated image is copied (see Figure 5).
I spent too long trying to get RAISR to work when that paper came out. You can try it out from some GitHub repos, but no one has been able to recreate the results Google presented. I would be hard-pressed to say my hi-res photos looked any better than the originals when scaled up on my iPhone screen.
I do wish they would release the code AND any related training images they used to get those results.
All those examples are fairly low-resolution. Does this approach scale or can it be applied in some tiled fashion? Or would the artifacts get worse for larger images?
You just need to look at the picture of Fred Armisen to see that this technique can generate a picture of a plausibly real human who bears little to no resemblance to the original image.
We could also just pick a random person off the street and punish them - it would be similarly accurate and fair (actually probably fairer - if this is trained on pictures with a certain bias it will return pictures with that bias).
This paper does not demonstrate an enhancement technique but a phenomenon which those using inverse methods call "overfitting".
Hopefully never, but I'm sure someone will see this and try!
(Because these kinds of techniques aren't really enhancing the images in a way that gives you new and useful information: they are taking the low-res images as input and giving you a plausible high-res image as output, based on their training data. They are NOT, however, trying to say "this is the ACTUAL high-res image that generated this low-res image".)
I found the title somewhat misleading. I was expecting some clever application of the nearest-neighbor interpolation. But this seems to involve neural nets and appears far from "simple" to me (I'm not in the image processing field though).
> I was expecting some clever application of the nearest-neighbor interpolation. But this seems to involve neural nets and appears far from "simple" to me (I'm not in the image processing field though).
It's not that far off actually, but they are talking about nearest neighbor Markov chains, not interpolation. You probably already know nearest neighbor Markov chains because there are lots of text examples, and a ton of Twitter bots that are generating random text this way. The famous historical example was the usenet post that said "I spent an interesting evening recently with a grain of salt." https://en.m.wikipedia.org/wiki/Mark_V._Shaney
This paper does use a NN to synthesize an image, which is conceptually pretty simple, even if difficult to implement well. After that they use a nearest neighbor Markov chain to fill in high frequencies. The first paper referenced is also the simplest example: http://graphics.cs.cmu.edu/people/efros/research/EfrosLeung....
That paper fills missing parts of an image using a single example, by using a Markov chain built on the nearest neighboring pixels. That paper is also one of the only image synthesis papers (or perhaps the only paper) that can synthesize readable text from an image of text. That's really cool because the inspiration was text-based Markov chains.
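Here's a 1D toy in the Efros-Leung spirit, for anyone who wants to see the mechanism without the 2D bookkeeping: grow the output one symbol at a time by collecting every window in the sample that exactly matches the last few synthesized symbols and sampling one of their successors (the real method uses 2D pixel neighborhoods and a distance threshold rather than exact matches; the sample string here is made up):

```python
import random

sample = "the cat sat on the mat and the cat ate the rat "
WINDOW = 4                                # neighborhood (context) size

def synthesize(length=60, seed="the "):
    out = seed
    while len(out) < length:
        context = out[-WINDOW:]
        # All positions in the sample whose preceding window matches the context.
        candidates = [sample[i + WINDOW]
                      for i in range(len(sample) - WINDOW)
                      if sample[i:i + WINDOW] == context]
        if not candidates:                # dead end: fall back to a random character
            candidates = list(sample)
        out += random.choice(candidates)
    return out

print(synthesize())
```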
I don't think this method has anything to do with Markov Chains. The spatial structure isn't explicitly used at all, and the interpolation/regression is quite a vanilla nearest neighbor with some performance tricks.
Well, of course almost anything can be interpreted as a Markov process, but I don't think it's a very useful abstraction here.
> I don't think this method has anything to do with Markov Chains.
Oh, it absolutely does. I think it's fair to say that Efros launched the field of nearest neighbor texture synthesis, and his abstract states: "The texture synthesis process grows a new image outward from an initial seed, one pixel at a time. A Markov random field model is assumed, and the conditional distribution of a pixel given all its neighbors synthesized so far is estimated by querying the sample image and finding all similar neighborhoods.
This is the same Markov model that all subsequent texture synthesis papers are implicitly using, including the paper at the top of this thread. Efros' paper implemented directly is really slow, so a huge number of subsequent papers use the same conceptual framework, and are only adding methods for making the method performant and practical. (Sometimes, at the cost of some quality -- many cannot synthesize text, for example.)
Note the inspiration for text synthesis, Shannon's paper, also describes the "Markoff Process" explicitly. http://math.harvard.edu/~ctm/home/text/others/shannon/entrop... (Efros referenced Shannon, and noted on his web page: "Special thanks goes to Prof. Joe Zachary who taught my undergrad data structures course and had us implement Shannon's text synthesis program which was the inspiration for this project.")
> Well, of course almost anything can be interpreted as a Markov process, but I don't think it's a very useful abstraction here.
It's not an abstraction to build a conditional probability table and then sample from it repeatedly to synthesize a new output. That's what a Markov process is, and that's what the paper posted here is doing. I don't really understand why you feel it's distant and abstract, but if you want to elaborate, I am willing to listen!
Unless I horribly misread the paper, this is not based on Efros' quilting method, which indeed uses Markov fields. The method linked here seems to interpolate every pixel independently of its surroundings (neighbor means a close-by pixel from the training set in feature space, not a spatially close pixel).
And I didn't mean that Markov processes are abstract in any "distant" sense, but that they are an abstraction, ie a "perspective" from which to approach and formulate the problem.
I was referring to Efros' "non-parametric sampling" paper, not the quilting one. Efros defined "non-parametric sampling" as another name for "Markov chain" -- almost (see my edit below). This paper (PixelNN) refers directly to "non-parametric sampling" in the same sense as Efros, and it states that they are using "nearest neighbor" to mean "non-parametric sampling". This is talking rather explicitly about a Markov chain -like process.
"To address these limitations, we appeal to a classic learning architecture that can naturally allow for multiple outputs and user-control: non-parametric models, or nearest-neighbors (NN). Though quite a classic approach [11, 15, 20, 24], it has largely been abandoned in recent history with the advent of deep architectures. Intuitively, NN works by requiring a large training set of pairs of (incomplete inputs, high-quality outputs), and works by simply matching the an incomplete query to the training set and returning the corresponding output. This trivially generalizes to multiple outputs through K-NN and allows for intuitive user control through on-the-fly modification of the training set..."
Note the first reference #11 is Efros' non-parametric sampling, and that the authors state this is the "classic approach" that they apply here.
What you call "interpolate every pixel independently from its surroundings" could be another way to describe a Markov chain, because 1: it is sampled according to the conditional probability distribution (which is what you get by using the K nearest matches.) and 2: the process is repeated - one pixel (or patch) is added using the best match, then it becomes part of the neighborhood in the search for the pixel/patch next door. The name for that is "Markov process", or in the discrete case, "Markov chain", if you take an unbiased random sample from the conditional distribution. If you always choose the best sample, then it's the same as a Markov chain, but biased.
> (neighbor means a close-by pixel in the training set in the feature space, not a spatially close pixel)
That's right, and that's why it's misleading to talk about nearest neighbor interpolation, because that phrase is a graphics phrase that means interpolate from spatially close pixels. Hardly anyone else calls it interpolation, they call it sampling, point sampling, and other terms.
*EDIT:
I'm going to relax a little bit on this. "Non-parametric sampling" is a tiny bit different from a Markov process in that a Markov process attempts to simulate a distribution in an unbiased way. By using the best match instead of a random sample from the conditional distribution, the output may produce a biased version of the original distribution. This is why it's called non-parametric sampling instead of calling it a Markov chain, but the distinction is pretty small and subtle -- texture synthesis using non parametric sampling is extremely similar to a Markov chain, but not necessarily exactly the same.
Side note, it's really unfortunate they used the abbreviation "NN" to talk about "nearest neighbor" in a paper that also builds on "neural networks".
AFAIU it actually seems to be sort of "just" a clever application of the nearest-neighbor interpolation. The CNN is used to come up with the feature space for the pixels (weights of the CNN), and then each pixel is "copy-pasted" from the training set based on the nearest match.
It seems that this could be used in theory with any feature descriptors, such as local color histograms, although the results wouldn't probably be as good.
Edit: Being a nearest neighbor method, it probably also carries the usual computational complexity problems. If I understand correctly, they ease this by first finding just a subset of best-matching full images using the CNN features and then doing a local nearest-neighbor search only within those images.
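A rough sketch of that two-stage search as I read it, with toy data and cheap pooled features standing in for the CNN activations (so this is an illustration of the idea, not the paper's actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((500, 32, 32))                 # stand-in training images
query = train[42] + rng.normal(0, 0.02, (32, 32)) # noisy version of one of them

def global_feature(img):
    """Cheap stand-in for a CNN descriptor: 4x4 average pooling, flattened."""
    return img.reshape(4, 8, 4, 8).mean(axis=(1, 3)).ravel()

# Stage 1: coarse search over whole images to get a small shortlist.
feats = np.array([global_feature(im) for im in train])
shortlist = np.argsort(((feats - global_feature(query)) ** 2).sum(1))[:10]

# Stage 2: the expensive per-patch nearest-neighbor search, but only in the shortlist.
def best_patch(query_patch):
    best, best_d = None, np.inf
    for idx in shortlist:
        for i in range(0, 32, 8):
            for j in range(0, 32, 8):
                cand = train[idx, i:i+8, j:j+8]
                d = ((cand - query_patch) ** 2).sum()
                if d < best_d:
                    best, best_d = cand, d
    return best

patch = best_patch(query[:8, :8])
print("best matching patch shape:", patch.shape)
```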
> The CNN is used to come up with the feature space for the pixels (weights of the CNN), and then each pixel is "copy-pasted" from the training set based on the nearest match.
FWIW, what you just described is known as a "Markov process". It is sampling a known conditional probability distribution.
While some interpolation of the data happens because the output represents a mixture of the training images, this is not "interpolation" at the pixel level, it's picking best matches from a search space of image fragments. (And the pixel neighbors are usually synthesized - the best match depends on previous best matches!) This is distinctly different from the kind of nearest neighbor interpolation you'd do when resizing an image.
Note the phrase "nearest neighbor" in this paper has an overloaded double meaning. It is referring both to pixel neighbors and neighbors in the search space of images. The pixel neighbors provide spatial locality within a single image; this is how & why high frequencies are generated from the training set. Nearest neighbor is also referring to the neighborhood matches in the search space, the K nearest neighbors of a given pixel neighborhood are used to generate the next K pixel outputs in the synthesis phase.
Agree. This appears to be more a clever implementation of an algorithm generating "artistic" impressions. In some cases, creating artifacts which simply were not part of the original picture.
Take a low resolution input image, and hallucinate a higher resolution version by statistically assembling bits from similar images in a large data set of training images.
If anyone ever tries to use this in court I hope they call it "Face Hallucination" and not "Image Reconstruction". On the research side, I wonder what the point of this is. I find it interesting but of little practical value.
It's a way to refine their models. A systematic model-based representation of data is basically also a generator of that data.
Why is that? Blame Kolmogorov. There are deep connections between compression, serialization, and computation. An optimal compression scheme is a serialization and the Turing-complete program to decode it. For example: you can compress pi into a few lines of algorithm plus a starting constant like 4.
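That pi example, spelled out (the Leibniz series; the "constant 4" plus a couple of lines of algorithm regenerate as many digits as you have patience for, which is the point about description length):

```python
# "Decompress" pi from a tiny program: pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...)
def leibniz_pi(terms=1_000_000):
    total = 0.0
    for k in range(terms):
        total += (-1) ** k / (2 * k + 1)
    return 4 * total

print(leibniz_pi())   # 3.14159... (converges slowly, but the program is tiny)
```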
It almost looks like they mixed training and testing data in some of the examples. The bottom-left sample in the normals-to-faces is extremely suspicious.
I was looking at this as well, but I'm willing to suspend my disbelief because the normal vaguely looks like it has a good deal of information (in a basic fidelity sense).