I found the title somewhat misleading. I was expecting some clever application of the nearest-neighbor interpolation. But this seems to involve neural nets and appears far from "simple" to me (I'm not in the image processing field though).
> I was expecting some clever application of the nearest-neighbor interpolation. But this seems to involve neural nets and appears far from "simple" to me (I'm not in the image processing field though).
It's not that far off actually, but they are talking about nearest neighbor Markov chains, not interpolation. You probably already know about nearest neighbor Markov chains, because there are lots of text examples and a ton of Twitter bots that generate random text this way. The famous historical example was the usenet post that said "I spent an interesting evening recently with a grain of salt." https://en.m.wikipedia.org/wiki/Mark_V._Shaney
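For anyone who hasn't seen one, a word-level Markov text generator of the Mark V. Shaney / Shannon variety is only a handful of lines. A rough Python sketch, not anyone's actual code, and "training_text.txt" is just a placeholder:

    import random
    from collections import defaultdict

    def build_chain(words, order=2):
        # The conditional "probability table": map each context of `order`
        # words to the words observed to follow it in the training text.
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def generate(chain, order=2, length=50):
        # Start from a random context, then repeatedly sample the next word
        # from the conditional distribution given the current context.
        out = list(random.choice(list(chain.keys())))
        for _ in range(length):
            followers = chain.get(tuple(out[-order:]))
            if not followers:
                break
            out.append(random.choice(followers))
        return " ".join(out)

    corpus = open("training_text.txt").read().split()  # placeholder corpus
    print(generate(build_chain(corpus)))

The image-synthesis version swaps words for pixels (or patches) and the exact context lookup for a nearest-neighbor search, but the sample-then-extend loop is the same.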
This paper does use a neural net to synthesize an image, which is conceptually pretty simple, even if difficult to implement well. After that they use a nearest neighbor Markov chain to fill in high frequencies. The first paper referenced is also the simplest example: http://graphics.cs.cmu.edu/people/efros/research/EfrosLeung....
That paper fills missing parts of an image using a single example, by using a Markov chain built on the nearest neighboring pixels. That paper is also one of the only image synthesis papers (or perhaps the only paper) that can synthesize readable text from an image of text. That's really cool because the inspiration was text-based Markov chains.
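If it helps, the core step of Efros & Leung's method can be sketched in a few lines. This is a heavily simplified reading of the paper; the fill order, window handling, and distance normalization are all glossed over, and the names are mine:

    import numpy as np

    def sample_pixel(sample, window, mask, eps=0.1):
        # sample, window, mask: 2D float arrays. `window` is the partially
        # filled neighborhood around the pixel being synthesized; `mask` is 1
        # where `window` is already known.
        h, w = window.shape
        H, W = sample.shape
        dists = []
        for y in range(H - h + 1):
            for x in range(W - w + 1):
                patch = sample[y:y + h, x:x + w]
                # Distance over the already-known pixels only.
                d = np.sum(((patch - window) * mask) ** 2)
                dists.append((d, patch[h // 2, w // 2]))
        d_min = min(d for d, _ in dists)
        # All patches within (1 + eps) of the best match form the estimated
        # conditional distribution; draw the new pixel uniformly from them.
        candidates = [c for d, c in dists if d <= d_min * (1 + eps)]
        return candidates[np.random.randint(len(candidates))]

The full algorithm just repeats this while growing outward from a seed, so each new pixel is conditioned on pixels that were themselves synthesized earlier, which is exactly the Markov-chain flavor being discussed here.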
I don't think this method has anything to do with Markov Chains. The spatial structure isn't explicitly used at all, and the interpolation/regression is quite a vanilla nearest neighbor with some performance tricks.
Well, of course almost anything can be interpreted as a Markov process, but I don't think it's a very useful abstraction here.
> I don't think this method has anything to do with Markov Chains.
Oh, it absolutely does. I think it's fair to say that Efros launched the field of nearest neighbor texture synthesis, and his abstract states: "The texture synthesis process grows a new image outward from an initial seed, one pixel at a time. A Markov random field model is assumed, and the conditional distribution of a pixel given all its neighbors synthesized so far is estimated by querying the sample image and finding all similar neighborhoods."
This is the same Markov model that all subsequent texture synthesis papers are implicitly using, including the paper at the top of this thread. Efros' paper implemented directly is really slow, so a huge number of subsequent papers use the same conceptual framework and only add techniques for making it performant and practical (sometimes at the cost of some quality; many cannot synthesize text, for example).
Note the inspiration for text synthesis, Shannon's paper, also describes the "Markoff Process" explicitly. http://math.harvard.edu/~ctm/home/text/others/shannon/entrop... (Efros referenced Shannon, and noted on his web page: "Special thanks goes to Prof. Joe Zachary who taught my undergrad data structures course and had us implement Shannon's text synthesis program which was the inspiration for this project.")
> Well, of course almost anything can be interpreted as a Markov process, but I don't think it's a very useful abstraction here.
It's not an abstraction to build a conditional probability table and then sample from it repeatedly to synthesize a new output. That's what a Markov process is, and that's what the paper posted here is doing. I don't really understand why you feel it's distant and abstract, but if you want to elaborate, I am willing to listen!
Unless I horribly misread the paper, this is not based on Efros' quilting method, which indeed uses Markov fields. The method linked here seems to interpolate every pixel independently from its surroundings (here "neighbor" means a pixel from the training set that is close in feature space, not a spatially close pixel).
And I didn't mean that Markov processes are abstract in any "distant" sense, but that they are an abstraction, i.e. a "perspective" from which to approach and formulate the problem.
I was referring to Efros' "non-parametric sampling" paper, not the quilting one. Efros defined "non-parametric sampling" as another name for "Markov chain" (almost; see my edit below). This paper (PixelNN) refers directly to "non-parametric sampling" in the same sense as Efros, and it states that they are using "nearest neighbor" to mean "non-parametric sampling". That is talking rather explicitly about a Markov chain-like process.
"To address these limitations, we appeal to a classic learning architecture that can naturally allow for multiple outputs and user-control: non-parametric models, or nearest-neighbors (NN). Though quite a classic approach [11, 15, 20, 24], it has largely been abandoned in recent history with the advent of deep architectures. Intuitively, NN works by requiring a large training set of pairs of (incomplete inputs, high-quality outputs), and works by simply matching the an incomplete query to the training set and returning the corresponding output. This trivially generalizes to multiple outputs through K-NN and allows for intuitive user control through on-the-fly modification of the training set..."
Note the first reference #11 is Efros' non-parametric sampling, and that the authors state this is the "classic approach" that they apply here.
What you call "interpolate every pixel independently from its surroundings" could be another way to describe a Markov chain, because 1) it is sampled according to the conditional probability distribution (which is what you get by using the K nearest matches), and 2) the process is repeated: one pixel (or patch) is added using the best match, and then it becomes part of the neighborhood in the search for the pixel/patch next door. The name for that is "Markov process" (or, in the discrete case, "Markov chain") if you take an unbiased random sample from the conditional distribution. If you always choose the best sample, then it's the same as a Markov chain, only biased.
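To make the biased/unbiased distinction concrete, a toy sketch (the flat feature arrays and names are mine, not from either paper):

    import numpy as np

    def next_output(query_feat, train_feats, train_outputs, k=5, greedy=False):
        # Distances from the current (partly synthesized) neighborhood's
        # features to every training example.
        d = np.linalg.norm(train_feats - query_feat, axis=1)
        knn = np.argsort(d)[:k]              # the K nearest matches
        if greedy:
            return train_outputs[knn[0]]     # always take the best match: biased
        # Sampling uniformly from the K nearest matches is a crude estimate of
        # the conditional distribution, i.e. one step of a Markov-like process.
        return train_outputs[np.random.choice(knn)]

Run that in a loop, feeding each new output back into the neighborhood for the next query, and you have the process described above.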
> (neighbor means a close-by pixel in the training set in the feature space, not a spatially close pixel)
That's right, and that's why it's misleading to talk about nearest neighbor interpolation: that phrase is a graphics term that means interpolating from spatially close pixels. Hardly anyone else calls this interpolation; they call it sampling, point sampling, and other terms.
*EDIT:
I'm going to relax a little bit on this. "Non-parametric sampling" is a tiny bit different from a Markov process, in that a Markov process attempts to simulate a distribution in an unbiased way. By using the best match instead of a random sample from the conditional distribution, the process may produce a biased version of the original distribution. This is why it's called non-parametric sampling rather than a Markov chain, but the distinction is pretty small and subtle: texture synthesis using non-parametric sampling is extremely similar to a Markov chain, but not necessarily exactly the same.
Side note, it's really unfortunate they used the abbreviation "NN" to talk about "nearest neighbor" in a paper that also builds on "neural networks".
AFAIU it actually does seem to be sort of "just" a clever application of nearest-neighbor interpolation. The CNN is used to come up with the feature space for the pixels (the weights of the CNN), and then each pixel is "copy-pasted" from the training set based on the nearest match.
It seems that this could in theory be used with any feature descriptors, such as local color histograms, although the results probably wouldn't be as good.
Edit: Being nearest neighbor based, it probably also carries the usual computational complexity problems of the method. If I understand it correctly, they ease this by first finding just a subset of best-matching full images using the CNN features, and then doing a local nearest neighbor search only within those images.
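A sketch of that two-stage search as I read it (shapes and names are illustrative, not the paper's API):

    import numpy as np

    def two_stage_nn(query_global, query_pixel_feats, db_global, db_pixel_feats,
                     n_images=10):
        # query_global:      global CNN feature of the input image, shape (D,)
        # query_pixel_feats: per-pixel features of the input, shape (P, d)
        # db_global:         global features of the training images, shape (N, D)
        # db_pixel_feats:    per-pixel features of those images, shape (N, P, d)

        # Stage 1: shortlist the training images whose global CNN features are
        # closest to the query, so we never search every pixel of every image.
        img_dist = np.linalg.norm(db_global - query_global, axis=1)
        shortlist = np.argsort(img_dist)[:n_images]

        # Stage 2: per-pixel nearest-neighbor search in feature space, but only
        # over the shortlisted images' pixels.
        cand = db_pixel_feats[shortlist].reshape(-1, db_pixel_feats.shape[-1])
        matches = np.array([np.argmin(np.linalg.norm(cand - q, axis=1))
                            for q in query_pixel_feats])
        return shortlist, matches  # which candidate pixel each output copies from

This keeps the brute-force part down to n_images * P candidates per query pixel instead of N * P.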
> The CNN is used to come up with the feature space for the pixels (weights of the CNN), and then each pixel is "copy-pasted" from the training set based on the nearest match.
FWIW, what you just described is known as a "Markov process". It is sampling a known conditional probability distribution.
While some interpolation of the data happens because the output represents a mixture of the training images, this is not "interpolation" at the pixel level; it's picking best matches from a search space of image fragments. (And the pixel neighbors are usually synthesized: the best match depends on previous best matches!) This is distinctly different from the kind of nearest neighbor interpolation you'd do when resizing an image.
Note the phrase "nearest neighbor" in this paper has an overloaded double meaning. It is referring both to pixel neighbors and neighbors in the search space of images. The pixel neighbors provide spatial locality within a single image; this is how & why high frequencies are generated from the training set. Nearest neighbor is also referring to the neighborhood matches in the search space, the K nearest neighbors of a given pixel neighborhood are used to generate the next K pixel outputs in the synthesis phase.
Agreed. This appears to be more a clever implementation of an algorithm for generating "artistic" impressions, in some cases creating artifacts which simply were not part of the original picture.
Take a low resolution input image, and hallucinate a higher resolution version by statistically assembling bits from similar images in a large data set of training images.
If anyone ever tries to use this in court I hope they call it "Face Hallucination" and not "Image Reconstruction". On the research side, I wonder what the point of this is. I find it interesting but of little practical value.
It's a way to refine their models. A systematic model-based representation of data is basically also a generator of that data.
Why is that? Blame Kolmogorov. There are deep connections between compression, serialization, and computation. An optimal compression scheme is a serialization and the Turing-complete program to decode it. For example: you can compress pi into a few lines of algorithm plus a starting constant like 4.
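Presumably the "starting constant like 4" refers to something like the Leibniz series; a minimal sketch of the idea:

    def pi_approx(n_terms=1_000_000):
        # Leibniz series: pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...)
        # A few lines of code plus the constant 4 "decompress" into as many
        # digits of pi as you have patience (and precision) for.
        total = 0.0
        for k in range(n_terms):
            total += (-1) ** k / (2 * k + 1)
        return 4 * total

    print(pi_approx())  # ~3.14159

The program is the compressed representation; running it is the decoder.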