
It's still odd what the new AI models are good at, or not. Strangely to me, AI still struggles with hands. Faces are mostly good, and all sorts of odd details, such as musculature, are usually decent, but hands, of all things, seem to be the toughest. I'd have thought faces would be.



I tried for ages to get DALLE to draw me a cartoon spider, but gave up in the end. All the other cartoon animals that I asked it to create were perfect, but it could not draw a spider with eight legs. It's like the one thing that every child knows about spiders, but DALLE just wasn't able to do it, no matter what prompt I tried.

It reminded me of https://27bslash6.com/overdue.html so much that it just started to make me laugh with each new attempt.


I think it's because hands have the dual properties of being extremely important to us and not primarily visual. Our hands are our primary mechanism for interacting with the world in a conscious and directed fashion. We devote a lot of mental attention to them, how to use them, how other people are using them. That's true subjectively, but you can also read it off of the Cortical Homunculus findings [1]. In short, though, we're extremely sensitive to whether hands are rendered properly and meaningfully.

And then, unlike faces, there is relatively little visual data in the world showing exactly how hands work. Unlike faces, they're not often the focal point of an image. Unlike faces, they don't present mostly forward, so in any particular image their visualization is only partial. Unlike faces, hands are often defined by how they interact with other complex objects in a scene.

So we're tough critics of hands, and image models have relatively less training data for them.

For what it's worth, as well, it's evident that image models are only good at depicting many things gesturally. At the same time, so are painters. If you're a photographer, you can often spot fake images if you notice that the exposure, focus, or lighting is implausible. If you're a mathematician, you'll notice every chalkboard full of equations is nonsense in both AI images and most Hollywood movies. If you're a botanist, I'm sure you think every AI image with a background of trees looks weird.

And then it turns out that nearly every human being is a hand-ologist to a large degree.

For another interesting experience, take a look at the Clone synthetic hand [2] which is quite obviously artificial but also, from time to time, looks surprisingly human. We're quite clearly sensitive to exactly the musculature and range of motion of our hands and know exactly what's feasible, what's painful, what feels natural and unnatural given the exact constraints of how our hand is constructed. When those limits are probed it's immediately obvious.

[1] https://en.wikipedia.org/wiki/Cortical_homunculus
[2] https://www.youtube.com/watch?v=A4Gp8oQey5M&t=20s


You’re partially correct, but this isn’t an explanation for why they’re rendered wrong

Hands are extremely complicated mechanically. They are the most complex creation evolution has come up with and part of the reason humans are able to do what they do.

Hands are like the chess game of anatomy: each segment of a hand has so many permutations that an AI simply doesn’t have enough reference info to animate it properly.


I don't think we disagree, and I do think what I argue is sufficient to explain why image generation models fail to render hands well. What you add---that they are very complex---is true, I believe, but I avoided arguing from it because I'm not sure it's either sufficient or necessary.

Generative models, arguably, have little trouble with complexity given enough training data. Faces are a perfect example. We both agree that image models, at least, lack that data for hands.

But there are many complex things that image models render with sparse training data which don't set off our perception as strongly. Hands fall into the uncanny valley: we are deeply familiar with them.

This is why I mention lighting and focus. They are subtle and complex. Additionally, image models have tons of training examples of each. That's still not enough for generative image models to consistently represent photography in a way that would fool a person who has spent the time to build an accurate model of how camera images look. But it fools most people.

The complexity of handling good lighting and focus involves both the generation of the entire scene that the photograph is taking place within and an accurate model of both the design of the camera and how it's been configured for the shot. Both of these are large spaces full of hidden variables that popular image models are not presently trained on.

Many people know you can look at the background of a generated image to identify irregularities. Checking that the lighting has a consistent angle (or multiple angles indicative of a cogent set of scene lights) is another good check. Additionally, if you have an eye for bokeh, then when it appears in an image you can often detect whether it's faked. Finally, even smooth blurs often do not reflect either a physically plausible background being blurred or a consistent focal plane cutting through the 3D scene. These are all additional complexities that image-generating models often don't have mastery over (for now). But many judges of their outputs don't either, so it's easy to miss these "mistakes".
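
As a toy illustration of the lighting-angle check (a heuristic sketch only, not a real forensic tool; using mean intensity gradients per patch is my own simplification, and all the names here are made up): split the image into patches, estimate a dominant shading direction in each, and see how much they disagree.

    import numpy as np

    def dominant_gradient_angle(patch):
        """Angle (radians) of the mean intensity gradient in a grayscale patch."""
        gy, gx = np.gradient(patch.astype(float))
        return float(np.arctan2(gy.mean(), gx.mean()))

    def patch_light_angles(gray, grid=4):
        """Collect one dominant-gradient angle per cell of a grid x grid split."""
        h, w = gray.shape
        angles = []
        for i in range(grid):
            for j in range(grid):
                cell = gray[i * h // grid:(i + 1) * h // grid,
                            j * w // grid:(j + 1) * w // grid]
                angles.append(dominant_gradient_angle(cell))
        return np.array(angles)

    # Wide spread between patch angles *can* hint at inconsistent lighting,
    # but many legitimate photos (multiple lights, busy textures) will also
    # disagree, so treat this strictly as a heuristic.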


I think 4 fingers (of which there can only be 4 and not 5 or 3) that all look similar (but aren’t) plus a thumb that looks much more different (and yet not) plus what happens when you simply rotate your hand in space (fingers become obstructed and then revealed… changing the visible finger count and possibly loosening the reinforcement of 4 finger prevalence) might be the reason


Perhaps the hand is an example of something for which you really need an actual model of the anatomy to draw convincingly, whereas lots of other things (e.g. faces) a very simple proximity model will generate just fine. A diffusion model doesn't have an actual conceptual model of the things it generates: it literally just learns to remove noise from training images and then eventually "removes the noise" from a totally random bitstream until it generates your image.
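
To make that concrete, here's a rough DDPM-style sampling loop (a minimal sketch with illustrative hyperparameters and a placeholder denoiser network, not any particular model's actual code). Notice there's no explicit model of anatomy anywhere in it: the image emerges purely from repeatedly subtracting predicted noise.

    import torch

    def sample(denoiser, steps=1000, shape=(1, 3, 64, 64)):
        # Linear noise schedule (illustrative values)
        betas = torch.linspace(1e-4, 0.02, steps)
        alphas = 1.0 - betas
        alpha_bar = torch.cumprod(alphas, dim=0)

        x = torch.randn(shape)  # start from a totally random "bitstream"
        for t in reversed(range(steps)):
            eps = denoiser(x, t)  # network's guess at the noise present at step t
            # DDPM mean update: subtract the predicted noise component
            x = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
            if t > 0:
                x = x + torch.sqrt(betas[t]) * torch.randn(shape)  # re-inject sampling noise
        return x  # the fully "denoised" result is the generated image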

It's interesting that lots of artists practise sketching using wooden models of a hand that they can pose in different ways.[1]

[1] This type of thing can be found in most art shops https://www.quickdrawsupplies.com/product/8-20cm-artists-pos...



There’s a scale here. People may mess up details, but they are unlikely to draw the kind of mutant deformities you get in otherwise seemingly reasonable images. I regularly see stuff like 7 fingers, or a single finger on a hand, or a finger several times the size of the others that curves bizarrely.


The problem with hands is that more fingers are between fingers than to left or right of fingers.

So, a finger should probably be portrayed as between other fingers.

You can see where this is going. It can't.


Hands are perfectly solved using ControlNet. Hands haven’t been a problem for image generators since last year.
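
For anyone curious what that looks like in practice, here's roughly how ControlNet conditioning is wired up with the Hugging Face diffusers library (a sketch; the model IDs, the pose image, and the file names are placeholders, and the checkpoints people actually use for hands vary). The pose comes in as an extra conditioning image, so the model no longer has to invent finger placement on its own.

    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # Pose-conditioned ControlNet on top of a Stable Diffusion base model
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # The conditioning image is a skeleton/pose map with the fingers laid out
    # where you want them (placeholder file name).
    pose = Image.open("hand_pose.png")

    image = pipe("photo of a hand holding an apple", image=pose).images[0]
    image.save("hand.png")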


Maybe it's just that people's hands are all such different shapes and proportions, in odd positions, not fully visible.... but something like "this muscle runs between the elbow and wrist" is just easier for the model to pick up on... it has "anchor points".

Facial features and fingers just.... are hanging off the body in extremely non-uniform ways w/ no real set proportions. It isn't totally intuitive to me why it's so bad at them, but faces especially are so unique and the musculature of the face is so fine.... learning a representation must just be really, really difficult.



