Image-to-Image Translation with Conditional Adversarial Nets (phillipi.github.io)
317 points by cruisestacy on Nov 22, 2016 | 56 comments


The "sketches to handbags" example, which is buried toward the bottom, is really cool. It's basically an extension of the "edges to handbags," but with hand-drawn sketches.

Even though the sketches are fairly crude, with no shading and a low level of detail, many of the generated images look like they could, in fact, be real handbags. They still have the mark of a generated image (e.g. weird mottling) but they're totally recognizable as the thing they're meant to be.

The "sketches to shoes" example, on the other hand, reveals some of the limitations. Most of the sketches use poor perspective, so they wouldn't match up well with edges detected from an actual image of a shoe. Our brains can "get the gist" of the sketches and perform some perspective translation, but the algorithm doesn't appear to perform any translation of the input (e.g. "here's a sketch that appears to represent a shoe, here's what a shoe is actually shaped like, let's fit to that shape before going any further"), so you end up with images where a shoe-like texture is applied to something that doesn't look convincingly like a real shoe.


This could be a popular shopping website. Sketch your perfect handbag. See an image of the product. Click to buy.


"Sketch your perfect handbag" may be a bit much to ask of most people.


It is, but this sort of approach is easily applied to transformations (in fact, an earlier GAN HN submission was "Generative Visual Manipulation on the Natural Image Manifold" https://people.eecs.berkeley.edu/~junyanz/projects/gvm/ , Zhu et al 2016b ). So you can start with a lousy sketch and transform it until it works, or you can start with a pre-populated set of handbags, recognize the one closest to what you want, and tweak/sketch that.


Draw your perfect handbag to share with friends. You only need 10 buyers to have it created.


I doubt friends want the same handbag though! It's an anti-viral feature: personalization.


There was a paper at CVPR 2016 called "Sketch Me That Shoe," which basically converted hand sketches to images using tied embedding networks. https://www.eecs.qmul.ac.uk/~qian/Project_cvpr16.html


Truly impressive overall. Unfortunately, it looks like the training set was way too small. Look, for example, at the reconstruction of #13 here:

https://phillipi.github.io/pix2pix/images/index_facades2_los...

Notice the white triangles (image crop artifacts) present in the original image, yet completely absent from the net's input image. They reappear in the output of 3 (4, even?) out of 5 nets despite the lack of any corresponding cue in the input. It looks like the network cheated a bit here, i.e. took advantage of the small set size and memorized the image as a whole, then recognized and recalled this very image (already seen during training) rather than actually reconstructing it purely from the input.

The same (though less prominent) goes for other images where the "ground truth" image was cropped.
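
One way to make that suspicion concrete would be a nearest-neighbour lookup of each output against the training photos. A rough numpy sketch (not anything from the paper; train_photos and generated are hypothetical arrays of identical per-image shape):

    # Rough sketch: find the training photo closest to a generated output.
    # If artifacts absent from the input reappear and the nearest neighbour
    # is a near-exact match, memorization is a plausible culprit.
    import numpy as np

    def nearest_training_photo(generated, train_photos):
        flat = train_photos.reshape(len(train_photos), -1)
        dists = np.linalg.norm(flat - generated.reshape(1, -1), axis=1)
        i = int(np.argmin(dists))
        return i, float(dists[i])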


Just want to throw out that none of these applications is new. What is novel about their approach is that, instead of training the mapping function against a hand-picked function that quantifies accuracy for each problem, they also have a mechanism for learning the function that quantifies accuracy. I haven't grokked the paper to see how they do it, but that is pretty neat IMO.
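
If I understand the paper correctly, that mechanism is the adversarial discriminator: it acts as a learned loss on (input, output) pairs, combined with an L1 term. A rough PyTorch-style sketch of that objective (the released code is Torch/Lua; G, D, x, y here are placeholder generator, discriminator, input image and target image):

    # Sketch of the conditional GAN + L1 objective described in the paper.
    import torch
    import torch.nn.functional as F

    LAMBDA = 100.0  # weight on the L1 reconstruction term (the paper uses 100)

    def discriminator_loss(D, G, x, y):
        # D scores (input, target) pairs -- effectively a learned loss function.
        real = D(x, y)
        fake = D(x, G(x).detach())
        return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) +
                F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

    def generator_loss(D, G, x, y):
        y_hat = G(x)
        score = D(x, y_hat)
        adv = F.binary_cross_entropy_with_logits(score, torch.ones_like(score))
        return adv + LAMBDA * F.l1_loss(y_hat, y)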


Interesting.

What I like about the "Day to Night" example is that it clearly demonstrates that these sorts of networks lack common sense. The network expects light to be where there are clearly (to humans with common sense, at least) no things that can produce light, e.g. in the middle of a roof or in a tree. Of course, there can be, but it's fairly uncommon.

And the opposite as well: no lights where a human would totally expect one, e.g. on the front of buildings or on top of, well, lighting poles.


I'd guess the problem is that the daytime pictures allow for easy feature detection (tree, building, etc.) but the nighttime pictures are washed out. We humans look at the daytime picture first, then say "that nighttime picture must have a tree there," which involves feature detection across both pictures (in the training phase).

I suspect a neural network better specialized for this task (i.e. that has the data interlaced for both day and nighttime during training) would have no problem feature detecting trees and leaving them unlit.


This is awesome!

Makes me wonder how this can apply to image and video compression. You could send over the semantic segmentation version of an image or video, and the system on the other end would use this technique to reconstruct the original.
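
As a toy sketch of that idea (nothing to do with the paper's code; generator is a hypothetical pretrained label-to-photo network): transmit only the compressed label map and let the receiver hallucinate the pixels, so the reconstruction is plausible rather than pixel-exact.

    # Toy sketch of segmentation-based compression.
    import zlib
    import numpy as np

    def encode(label_map):
        # label_map: (H, W) uint8 array of class ids -- far smaller and more
        # compressible than the original RGB frame.
        return zlib.compress(label_map.astype(np.uint8).tobytes())

    def decode(payload, shape, generator):
        labels = np.frombuffer(zlib.decompress(payload), dtype=np.uint8).reshape(shape)
        return generator(labels)  # generator fills in texture and detail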


You can perform extremely good compression this way, but the computational and energy cost would be prohibitive.

There are even more traditional tricks that don't make it into things like H.265 because they are too costly.


Here is my work, where I use semantic information to achieve compression (or rather, to improve JPEG). This is not end-to-end compression like Google's work, just incorporating semantic knowledge into compression. I am still trying to clean up the code before I make an arXiv/GitHub submission, but since you are interested, here is the link: http://gpgpu.cs-i.brandeis.edu/semantic_jpeg.pdf



I understood that to be the tech behind "Silicon Valley".


Does anyone else have the feeling that, on the current trajectory, thought will just emerge from something exactly like this, but with perhaps a million times the amount of feedback and data? Yes, this is all 2D and abstract/selective training sets etc., but what if AI is the ultimate fake-it-until-you-make-it?


Networks with selective attention already exist, but what if they could learn about themselves? Right now they cannot create any notion of self or "body" (defined as the boundary between the environment that can be predicted vs. that which cannot), because their outputs have no causal effect on their inputs. There are no differences within the network that make a difference to itself; there is no intrinsic perspective.

Could this change if, for example:

- inputs are augmented with the network state (or derived version thereof)

- previous outputs of the network / external memory are fed back?

This seems to be the kind of self-reference that self-awareness requires.
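
A minimal sketch of what that feedback loop might look like (the dimensions and the GRU cell are arbitrary illustrative choices, not a reference design): the network's previous output is concatenated onto the next input, and its hidden state persists across steps.

    import torch
    import torch.nn as nn

    obs_dim, out_dim, hidden_dim = 16, 4, 64
    cell = nn.GRUCell(obs_dim + out_dim, hidden_dim)
    readout = nn.Linear(hidden_dim, out_dim)

    h = torch.zeros(1, hidden_dim)        # persistent internal state
    prev_out = torch.zeros(1, out_dim)    # the network's own previous output
    for _ in range(10):
        obs = torch.randn(1, obs_dim)             # stand-in for external input
        x = torch.cat([obs, prev_out], dim=1)     # input augmented with own output
        h = cell(x, h)                            # state carries a notion of "self"
        prev_out = readout(h)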

Also, do asynchronous networks have fundamental advantages over synchronous networks? What about static vs dynamic networks?


I don't see this happening. What I do see happening is it figuring us out. Somewhere out there, there's a function that explains exactly how our society is organized in every way.

From that, the AI could generate books, movies, and do a lot of things.


Reminds me of the novel-rewriting-apparatus from 1984, except with more friggin' superheroes and remakes.


Hoo boy, buddy, do I got some news for you: https://www.marxists.org/archive/marx/works/1867-c1/


I don't think it's going to emerge without significant effort to make it happen. I think most of the 'intelligence' we desire will be attainable without sentience. Sentience itself will require a lot of specific research directed at the goal. It's certainly a risk though.


Didn't it just emerge with humans? I don't see why it couldn't happen again. There may be a specific structure or wiring that facilitates thought, but I suspect any large enough net with enough training data can do it.


If you can define thought and it can be implemented, sure.


Biological nature didn't define thought before implementing it.


The Aerial-to-Map example makes it look like this may be useful for automatic map/satellite rectification/georeferencing, but I'm not sure how efficient it'd be if it has to compare against a large area.

Does anyone have any experience in this area?


I feel this can potentially revolutionize creative processes, for example in the clothing industry. You just draw up a purse or a shoe, let the machines generate dozens of variants (with pictures), and then you only have to filter and rank them.

You can pipe these product sketches directly into focus groups who tell you which product is most likely to sell. You don't need massive staff to come up with product variants any more.



I wonder whether, if you were to average the designs of the bicycles, it would actually produce something that works.


I would have thought that, if you are smart enough to find a "bicycle vector space" in which averaging sketches of a bicycle produces another valid sketch of a bicycle, then you probably already know enough about bicycles to design one without the input of imperfect sketches.
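
For what it's worth, "averaging" would presumably happen in a learned latent space rather than in pixel space. A hypothetical sketch (encoder and generator are placeholder callables, not anything from the paper):

    # Hypothetical: average designs in a learned latent space instead of
    # averaging pixels (which just gives a blurry mess).
    import numpy as np

    def average_design(sketches, encoder, generator):
        codes = np.stack([encoder(s) for s in sketches])  # (N, latent_dim)
        return generator(codes.mean(axis=0))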


It has the potential to redefine what we think of as 'creativity', as happened with what we consider intelligence and what we think of as "AI Hard" problems.

Perhaps what these networks are generating can be labeled better as "Guided/constrained imitation" rather than real creativity.


> real creativity

What is real creativity? Creativity is just random noise converted into patterns. Is the computer variety of creativity not real enough?


Whenever this topic comes up, someone with a very engineering-centric attitude jumps on the opportunity to downplay the complexity of the concept of "creativity", often using the word "just". I would argue that something like creativity is far too difficult to even define, let alone model and synthesize, to ever be adorned with the modifier "just".

While this is a highly philosophical topic, what I can say on the difficulty of defining creativity from a scientific point of view is that, more and more, I think what is considered "true creativity" is fundamentally a social concept. As such, even though computers are good at reproducing patterns and variations on patterns, they will never be "truly creative" until they become independent members of society, whatever that means. So as cool as machine learning is getting these days, this would require a leap in AI that is still rather far away (imho).

That's why I think it's a little optimistic to label something as complex as creativity as being "just" anything. Creativity is social, insofar as it is defined and recognized by human society, and machines are... not. Yet.


> Creativity is just random noise converted into patterns.

This is not a consensus definition. Creativity doesn't actually seem to be very random at all according to the people who study it.


Creativity isn't magic.

Humans are not magically creative, however much they'd like to be.

If I ask you to think of a random number, you don't just pull it out of thin air. It can be based on tens to hundreds of things:

- Should I do a really low or high number?

- People always use round numbers that end in 0 or 5, maybe I shouldn't do that, or should I, to make it seem truer?

- What other large "random" numbers have I heard?

- I remember seeing a number recently, maybe try a modification of that.

- You used {x} as a random number last time, go similar to that?

All this adds up in the split second of thought you have when I ask you to think of a random number. The literal same thing goes into all creative works: the output is a function of the input.


Yes, but this function you defined, that applies tens of judgements to select a "random" number, is based on a random number input itself. The random part is just the seed, it then passes through various neural nets that expand on it and turn it into a plausible answer.

Randomness is injected into all brain processes on account that biological neurons are stochastic. So there is an amount of randomness mixed into everything the brain does.

Some neural nets can map real images into a Gaussian, and back. That means they disentangle the factors of the image into a mix of independent factors that map into the standard deviation. Any set of random numbers could be converted back into an image, by the reverse process.
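
For the curious, the "map images into a Gaussian and back" claim is roughly what a VAE-style model does. A sketch with placeholder encoder/decoder (illustrative only, not this paper's model):

    import torch

    def image_to_code(encoder, image):
        mu, logvar = encoder(image)                      # Gaussian parameters
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def noise_to_image(decoder, latent_dim=128):
        z = torch.randn(1, latent_dim)                   # pure noise...
        return decoder(z)                                # ...decoded into structure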


> Yes, but this function you defined, that applies tens of judgements to select a "random" number, is based on a random number input itself. The random part is just the seed, it then passes through various neural nets that expand on it and turn it into a plausible answer.

On what basis do you make this claim? Humans are empirically terrible random number generators. If you ask someone to pick a random number, the result is very not random. Our biases are large and obvious, so it seems faulty to claim that our "seed" number is in any way truly random.

> Randomness is injected into all brain processes on account that biological neurons are stochastic. So there is an amount of randomness mixed into everything the brain does.

There's also some amount of randomness in what happens if you drop a rock but the net result is largely the same: it falls down. The fact that there is some randomness to a process does not mean that the randomness is actually driving the process.

> Some neural nets can map real images into a Gaussian, and back. That means they disentangle the factors of the image into a mix of independent factors that map into the standard deviation. Any set of random numbers could be converted back into an image, by the reverse process.

I don't see how this is relevant.


Do you think humans are good at anything, or just generally useless?


That's a really bizarre question. I think humans are good at a lot of things. I also think this is utterly irrelevant.


What's your point actually? Honest question.

You're acting like the brain is some kind of simple algorithm; more goes into a painting or a composition than just a bit of simple logic. A composer is not sitting there at 2am at her piano going "Hm, I like round numbers, so I might make this note an F because it's the fourth note in the C major scale".

I might be wrong but from my understanding, we don't even really understand how neural nets are able to make certain decisions or generate certain pictures yet, correct?


> You're acting like the brain is some kind of binary computer

It's not binary but it is a computer. The alternative is to believe in magic.

> we don't even really understand how a computer is able to make certain decisions or generate certain pictures yet, correct?

I don't think that's correct, no. We understand how the process works. We may not understand the weights a specific neural net ends up with, but that's an issue with just having so much data to deal with. Similarly we don't "understand" how a web page ends up with a specific PageRank. We understand the process, but we can't manually reproduce the result because it's just too much data.


While I do not accept ragebol's notion of real creativity, I very much doubt your definition is correct. The use of 'just' is too strong, excluding the possibility of influence from other inputs. When humans learn, they do so by learning a large set of arbitrary relationships which further trigger complex associations (this is blue, it is used for X, it is like Y, which is like Z). These relationships are further broken down in terms of human meaningful components that are updateable with new knowledge with almost surgical precision.

It is the interaction between the relations and components that drives the kind of analogical reasoning humans exhibit: traversing networks of relations and thinking up new ways of combining various components.

Without a large base of knowledge to draw from, a creator simply does not have enough data to do anything more than copying. They would not have learned the rich relations nor a diverse enough set of components and operations on them. They would also be unable to know which combinations are truly novel nor have enough experience to predict what might be meaningful or impactful. This is why good creators must continually sample a large set of works.

The limit to this is that humans will generally struggle to wander far from experience. Computers, on the other hand, can cover more of a space to make connections humans would find difficult. And the less random the decisions, the more interesting the found structures are likely to be.

Where computers struggle and humans shine is the selection process. Ashby frames it this way: "Problem solving is largely, perhaps entirely, a matter of appropriate selection. Take, for instance, any popular book of problems and puzzles. Almost every one can be reduced to the form: out of a certain set, indicate one element."

In this work, for example, the excellent results come from a better grasp of structure induced by having the GAN's networks condition on an input image. This architecture does have the downside of essentially limiting the generative capacity. Humans can generate many variations conditioned on an input, but of these networks the authors write: "Despite the dropout noise, we observe very minor stochasticity in the output of our nets. ...GANs that produce stochastic output, and... capture the full entropy of the conditional distributions they model, is an important question left open by the present work." AGI more and more looks like it will be about striking the right balance between generation and selection.

For an artist, selection ability also plays a large part in taste. Having not just a good enough understanding to create novel combinations, but also a good enough model of people to predict what they are likely to find somehow compelling.

Computers are good at generating, humans at selection. It is for this reason that I disagree with those who believe tools like these herald the end of creativity, just as I cannot agree with those who believe they herald the end of creators. The quality floor of average art will rise, but there will be less evidence of selection talent, as the ceiling of what counts as great art must also rise.

Joule for joule, it is unlikely that anything in the foreseeable future will beat teams of machine and man.


Thanks. I approve of your observations and especially like this part:

> AGI more and more looks like it will be about striking the right balance between generation and selection.

I was thinking AGI is essentially reinforcement learning on top of rich, predictive models of the world. But you could see it as the balance between generation and selection.


Why are some people better at it than others then if it's purely noise and patterns?


Some people do a worse job of converting the noise into patterns, relative to some common judge of how good the result is. See any GAN in its early stages of training for an example.


Besides being a cool new application of GANs, I don't see how this architecture is much different from normal GANs. Anyone else have thoughts?


I wonder how well this scales to a larger domain of interest. So, e.g., if the neural net needs to know not only about cars and nature, but also about more topics such as people, faces, computers, gastronomy, Santa Claus, Halloween, et cetera, how does the neural net scale? And how should its topology be extended under such scaling?


It's being researched with great interest. Building models from text and images, describing internal structure and relations between objects, building rich prior knowledge about the world in order to do inference and guide behavior.

I see lots of papers that go in this direction, of creating a rich, semantic, predictive representation of images, video and text and then using it as the basis for reinforcement learning. Learning to understand the world and to act based on that understanding.


Kudos for providing proper examples of the network doing its thing, both good and bad. This is what all researchers ought to do. Too many papers these days handpick a couple of the coolest-looking results and stop at that.


I get a feeling this could be used in game design to do some really cool stuff with map and texture generation.


It could cut the game size way down if it can generate textures on the fly.


I'm enrolled in Efros' computational photography course this semester, and Tinghui and Jun-Yan are the GSIs. It's fantastic to experience the bridge between teaching and cutting-edge research!


This is an absolutely incredible result. All of this stuff would have been considered insanely advanced AI ten years ago, but now we look at it and say "this is just stuff computers can do".

We've got the pieces of visual processing and imagination here and the pieces of language input/output as part of Google's work. It feels like we just need to make some progress on an "AI executive" before we can get a real, interactive, human-like machine.


I'm interested in having a play. As an out-and-out ML newbie, is there such a thing as an AWS image I could run on a GPU instance and then just git clone and go?


Try one of the bitfusion AMIs on a g2.2xlarge instance.


Thanks very much. If anyone else is interested, I can confirm that the Bitfusion Boost Ubuntu 14 Torch 7 AMI on a g2.2xlarge instance does offer a relatively painless way to get going with this, although I couldn't get the Python image combiner to work, so I had to prepare those separately. I have just trained my first neural net; most exciting!


Neural nets! Is there anything they can't do?



