
They love saying things like "generative AI doesn't know physics". But the constraint that both eyes should have consistent reflection patterns is just another statistical regularity that appears in real photographs. Better training, larger models, and larger datasets, will lead to models that capture this statistical regularity. So this "one weird trick" will disappear without any special measures.


> Better training, larger models, and larger datasets, will lead to models that

Hypothetically, with enough information, one could predict the future (barring truly random events like radioactive decay). Generative AI is also constrained by economic forces - how much are GenAI companies willing to invest to get eyeball reflections right? Would they earn adequate revenue to cover the increase in costs to justify that feature? There are plenty of things that humanity can technically achieve that don't get done because the incentives are not aligned. For instance, there is enough food grown to feed every human on earth and the technology to transport it, and yet we have hunger, malnutrition and famines.


> how much are GenAI companies willing to invest to get eyeball reflections right

This isn't how it works. As the models are improved, they learn more about reality largely on their own. Except for glaringly obvious problems (like hands, deformed limbs, etc) the improvements are really just giving the models techniques for more accurately replicating features from reasoning data. There's nobody that's like "today we're working on fingernails" or "today we're making hair physics work better": it's about making the model understand and replicate the features already present in the training dataset.


No, it’s a valid point, which I didn’t interpret as literally “we’re working on eyeballs today” but rather “we’re scaling up these imperfect methods to a trillion dollar GPU cluster”, the latter of which is genuinely something people talk about. The models will learn to mimic more and more of the long tail of the distribution of training data, which to us looks like an emergent understanding. So there’s a theoretical amount of data you could provide for them to memorize physical laws.

The issue is practical. There isn’t enough data out there to learn the long tail. If neural nets genuinely understood the world they would be getting 100% on ARC.


You don't need a trillion-dollar GPU cluster to accomplish this stuff. Models from two years ago look incredibly different than the ones today, not just because they're bigger but because they're more sophisticated. And the data from two years ago looks a lot like the data today: messy and often poorly annotated.

Even if we added no new compute and capped the resources used for inference and training, models would produce higher fidelity results over time due to improvements in the architecture.


I don't know where you get this opinion from as it doesn't match the landscape that I'm witnessing. Around the huge names are many companies and business units fine-tuning foundational models on their private datasets for the specific domains they are interested in. I can think of scenarios where someone is interested in training models to generate images with accurate reflections in specific settings.


> I don't know where you get this opinion from as it doesn't match the landscape that I'm witnessing.

I'm an engineer at an AI company that makes text/image to video models, but go off


I may be in agreement, and I was an idiot to misunderstand your comment and reply based on it.

I especially agree with the last sentence that models largely learn features in the dataset, but I don't understand why you would describe it as

> There's nobody that's like "today we're working on fingernails" or "today we're making hair physics work better"

If there were a business case for that, I would characterize curating a dataset of fingernails and fine-tuning or augmenting a model based on that as "today we're working on fingernails".

And the same too with eye reflections. So with the right dataset you can get eye reflections right, albeit in a limited domain. (E.g. deepfakes in a similar setting as the training data). In fact you can look at the community that sprung up around SD 1.5 (?) that fine tunes SD with relevant datasets to improve its abilities in exactly a "today we're going to improve its ability to produce these faces" kind of fashion.

Where did I misunderstand your comment? I seem to arrive at the completely opposite response from the same fact.

I also noticed that you say

> the improvements are really just giving the models techniques for more accurately replicating features from reasoning data.

You seem to refer to aspects of a model unrelated to dataset quality. But fine-tuning on a curated dataset may be sufficient and necessary for improving eye reflections and fingernails.


There could be edge cases, but fine tuning doesn't normally concentrate on a single specific feature. With positive and negative examples you could definitely train the eyes, but it's not what people usually do. Fine tuning is widely used to provide a specific style, clothes, or other larger scale elements.


> There could be edge cases, but fine tuning doesn't normally concentrate on a single specific feature.

"Normally" is being strained here: yes, most fine-tuning isn't for things like this, but quite a substantial minority is for more accurate rendering of some narrow specific feature, especially ones typically identified as signature problems of AI image gen; publicly identifying this as a way to visually distinguish AI gens makes it more likely that fine-tuning effort will be directed at addressing it.


> This isn't how it works. As the models are improved, they learn more about reality largely on their own.

AI models aren't complete blackboxes to the people who develop them: there is careful thought behind the architecture, dataset selection and model evaluation. Assuming that taking an existing model and simply throwing more compute at it will automatically result in higher-fidelity illumination modeling takes almost religious levels of faith. If moar hardware is all you need, Nvidia would have the best models in every category right now. Perhaps someone ought to write the sequel to Fred Brooks' book and name it "The Mythical GPU-Cluster-Month".

FWIW, Google has AI-based illumination adjustment in Google Photos where one can add virtual lights - so specialized models for lighting already exist. However, I'm very cynical about a generic mixed model incidentally gaining those capabilities without specific training for it. When dealing with exponential requirements (training data, training time, GPUs, model weight size), you'll run out of resources in short order.


What you're refuting isn't what I said. I'm making the point that nobody is encoding all of the individual features of the human form and reality into their models through code or model design. You build a model by making it capable of observing details and then letting it observe the details of your training data. Nobody is spending time getting the reflections in the eyeballs working well, that comes as an emergent property of a model that's able to identify and replicate that. That doesn't mean it's a black box, it means that it's built in a general way so the researchers don't need to care about every facet of reality.


> If moar hardware is all you need, Nvidia would have the best models in every category right now.

Nvidia is making boatloads of money right now selling GPUs to companies that think they will be making boatloads of money in the future.

Nvidia has the better end of things at this very moment in time.


> Assuming that you can take an existing model and simply throw more compute at it will automatically result in higher fidelity illumination modeling takes almost religious levels of faith.

Seems an odd response to a poster who said “as the models are improved...”; the way the models are improved isn't just additional training of existing models, it's updated model architectures.


Getting the eyeballs correct will correlate with other very useful improvements.

They won’t train a better model just for that reason. It will just happen along the way as they seek to broadly improve performance and usefulness.


I’m far from an expert on this, but these are often trained in conjunction with a model that recognizes deep fakes. Improving one will improve the other, and it’s an infinite recursion.


Popper disagrees

https://en.wikipedia.org/wiki/The_Poverty_of_Historicism

"Individual human action or reaction can never be predicted with certainty, therefore neither can the future"

See the death of Archduke Franz Ferdinand - perhaps it could be predicted once it was known that he would go to Sarajevo. But before?

If you look at SciFi, some things have been predicted, but many obvious things haven't.

What if Trump had been killed?

And Kennedy?


I could see state actors being willing to invest to be able to make better propaganda or counter intelligence.


Yeah, every person is constantly predicting the future, often even scarily accurately. I don't see how this is a hot take at all.


> how much are GenAI companies willing to invest to get eyeball reflections right?

Willing to? Probably not much. Should? A WHOLE LOT. It is the whole enchilada.

While this might not seem like a big issue, and truthfully most people don't notice, getting this right (consistently) requires getting a lot more right. It doesn't require the model knowing physics (because every training-sample face will have realistic lighting). But what underlies this issue is the model understanding subtleties. No model to date accomplishes this, from image generators to language generators (LLMs). There is a Pareto efficiency issue here too. Remember that it is orders of magnitude easier to get a model to be "80% correct" than to be "90% correct".

But recall that the devil is in the details. We live in a complex world, and what that means is that the subtleties matter. The world is (mathematically) chaotic, so small things have big effects. You can start solving problems without worrying about these, but eventually you need to move into tackling them. If you don't, you'll just generate enshittification. In fact, I'd argue that the difference between an amateur and an expert is knowledge of subtleties and nuance. This is both why amateurs can trick themselves into thinking they're more expert than they are and why experts can recognize when they're talking to other experts (I remember a thread a while ago where many people were shocked that most industries don't give tests or whiteboard problems when interviewing candidates and that hiring managers can identify good hires from bad ones).


That still won’t make them understand physics.

This all reminds me of “fixing” mis-architected software by adding extra conditional code for every special case that is discovered to work incorrectly, instead of fixing the architecture (because no one understands it).


Maybe it will. It really depends whether it's "easier" for the network to learn an intuitive physics, versus a laundry list of superficial hacks that let it minimise loss all the same. If the list of hacks grows so long that gradient descent finds it easier to learn the actual physics, then it'll learn the physics.

Hinton argues that the easiest way to minimise loss in next token prediction is to actually understand meaning. An analogous thing may hold true in vision modelling wrt physics.


If your entire existence was constrained to seeing 2d images, not of your choosing, could a perplexity-optimizing process "learn the physics"?

Basic things that are not accessible to such a learning process:

- moving around to get a better view of a 3d object

- see actual motion

- measure the mass of an object participating in an interaction

- set up an experiment and measure its outcomes

- choose to look at a particular sample at a closer resolution (e.g. microscopy)

- see what's out of frame from a given image

I think we have at this point a lot of evidence that optimizing models to understand distributions of images is not the same thing as understanding the things in those images. In 2013 that was 'DeepDream' dog worms, in 2018 that was "this person does not exist" portraits where people's garments or hair or jewelry fused together or merged with their background. In 2022 it was diffusion images of people with too many fingers, or whose hands melted together if you asked for people shaking hands. In the Sora announcement earlier this year it was a woman's jacket morphing while the shot zoomed into her face.

I think in the same way that LLMs do better at some reasoning tasks by generating a program to produce the answer, I suspect models which are trained to generate 3D geometry and scenes, and run a simulation -> renderer -> style transfer process may end up being the better way to get to image models that "know" about physics.


Indeed. It will be very interesting when we start letting models choose their own training data. Humans and other animals do this simply by interacting with the world around them. If you want to know what is on the back of something, you simply turn it over.

My guess is that the models will come up with much more interesting and fruitful training sets than what a bunch of researchers can come up with.


They're being trained on video; 3D patches are fed into the ViT (the 3rd dimension is time) instead of just 2D patches. So they should learn about motion. But they can't interact with the world, so maybe they can't have an intuitive understanding of weight yet. Until embodiment, at least.
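
For a sense of what that looks like in practice, the 2D patch embedding just becomes a 3D one. A minimal PyTorch sketch (sizes and patch shape are illustrative, not from any particular production model):

  import torch
  import torch.nn as nn

  class VideoPatchEmbed(nn.Module):
      """Turn a video clip into a sequence of spatiotemporal patch tokens."""
      def __init__(self, embed_dim=768, patch=(2, 16, 16), in_ch=3):
          super().__init__()
          # A Conv3d with stride == kernel size slices the clip into
          # non-overlapping (time x height x width) blocks and projects
          # each one to a single token of size embed_dim.
          self.proj = nn.Conv3d(in_ch, embed_dim, kernel_size=patch, stride=patch)

      def forward(self, x):                     # x: (batch, channels, frames, H, W)
          x = self.proj(x)                      # (batch, embed_dim, T', H', W')
          return x.flatten(2).transpose(1, 2)   # (batch, num_tokens, embed_dim)

  clip = torch.randn(1, 3, 16, 224, 224)        # 16 RGB frames at 224x224
  tokens = VideoPatchEmbed()(clip)
  print(tokens.shape)                           # torch.Size([1, 1568, 768])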


I mean, the original article doesn't say anything about video models (where, frankly, spotting fakes is currently much easier), so I think you're shifting what "they" are.

But still:

- input doesn't distinguish what's real vs constructed nonphysical motion (e.g. animations, moving title cards, etc)

- input doesn't distinguish what's motion of the camera versus motion of portrayed objects

- input doesn't distinguish what changes are unnatural filmic techniques (e.g. change of shot, fade-in/out) vs what are in footage

Some years ago, I saw a series of results about GANs for image completion, and they had an accidental property of trying to add points of interest. If you showed it the left half of a photo of just the ocean, horizon and sky, and asked for the right half, it would try to put a boat, or an island, because generally people don't take and publish images of just the empty ocean -- though most chunks of the horizon probably are quite empty. The distribution on images is not like reality.


> It really depends whether it's "easier" for the network to learn an intuitive physics, versus a laundry list of superficial hacks that let it minimise loss all the same.

Human innate understanding of physics is a laundry list of superficial hacks. People need education and mental effort to go beyond that innate but limited understanding.


When it is said that humans innately understand physics, no one means that people innately understand the equations and can solve physics problems. I think we all know how laughable such a claim would be, given how much people struggle when learning physics and how few people even get to a moderate level (not even Goldstein, but at least calculus-based physics with partial derivatives).

What people mean when saying people innately understand physics is that they have a working knowledge of many of the implications. Things like: gravity is uniformly applied from a single direction, and that direction is towards the ground; objects move in arcs or "ballistic trajectories"; straight lines are uncommon; wires hang in hyperbolic-cosine shapes even if they don't know that word; snow comes from cold; the sun creates heat; many lighting effects (which is also how we form many illusions); and so on.

Essentially, humans know that things do not fall up. One could argue that this is based on a "laundry list of superficial hacks" and they wouldn't be wrong, but they also wouldn't be right. Even when wrong, the human formulations are (more often than not) causally formulated. That is, explainable _and_ rational (rational does not mean correct, but that it follows some logic. The logic doesn't need to be right. In fact, no logic is, just some are less wrong than others).


> It really depends whether it's "easier" for the network to learn an intuitive physics, versus a laundry list of superficial hacks that let it minimise loss all the same

The latter is always easier. Not to mention that the architectures are fundamentally curve fitters. There are many curves that can fit data, but not all curves are causally related to the data. The history of physics itself is a history of becoming less wrong, and many of the early attempts at problems (which you probably never learned about, fwiw) were pretty hacky approximations.

> Hinton argues

Hinton is only partially correct. It entirely depends on the conditions of your optimization. If you're trying to generalize and understand causality, then yes, this is without a doubt true. But models don't train like this and most research is not pursuing these (still unknown) directions. So if we aren't conditioning our model on those aspects, then consider how many parameters they have (and aspects like superposition). Without a doubt the "superficial hacks" are a lot easier and will very likely lead to better predictions on the training data (and likely test data).


The grokking papers show that after sufficient training, models can transition into a regime where both training and test error get arbitrarily small.

Yes, this is out of reach of how we train most models today. But it demonstrates that even current models are capable of building circuits that perfectly predict the data (meaning they capture its actual dynamics) given sufficient exposure.
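
For reference, the standard setup in those papers is tiny: modular arithmetic, a small network, heavy weight decay, and training far past the memorization point. A rough sketch (hyperparameters are illustrative; whether and when the test accuracy jumps depends heavily on them):

  import torch
  import torch.nn as nn

  p = 97                                         # modulus for (a + b) mod p
  pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
  labels = (pairs[:, 0] + pairs[:, 1]) % p
  perm = torch.randperm(len(pairs))
  split = len(pairs) // 2                        # 50% train / 50% held out
  train_idx, test_idx = perm[:split], perm[split:]

  class ModAdder(nn.Module):
      def __init__(self, dim=128):
          super().__init__()
          self.emb = nn.Embedding(p, dim)
          self.mlp = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, p))
      def forward(self, ab):
          return self.mlp(self.emb(ab).flatten(1))

  model = ModAdder()
  # Heavy weight decay is the ingredient usually credited with pushing the
  # model from memorization toward the general (grokked) solution.
  opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
  loss_fn = nn.CrossEntropyLoss()

  for step in range(100_000):                    # keep going far past memorization
      opt.zero_grad()
      loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
      loss.backward()
      opt.step()
      if step % 5_000 == 0:
          with torch.no_grad():
              acc = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
          print(f"step {step}: train loss {loss.item():.4f}, test acc {acc:.3f}")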


I have some serious reservations about the grokking papers, and there's the added complication that test performance is not a great proxy for generalization performance. It is naive to assume the former begets the latter, because there are many underlying assumptions there that I think many would not accept once you work them out. (Not to mention the common usage of t-SNE style analysis... but that's a whole other discussion)

It is important to remember that there are plenty of alternative explanations for why the "sudden increase" in performance happens. I believe that if people had a deeper understanding of how metrics work, the phenomenon would become less surprising and make one less convinced that scale (of data and/or model) will be sufficient to create general intelligence. But this does take quite a bit of advanced education (atypical even for an ML PhD) and you're going to struggle to obtain it "in a few weekends".


It really isn't easier at a sufficient complexity threshold.

Truth and reality cluster.

So hyperdimensional data compression that is organized around truthful modeling, versus a collection of approximations, will be increasingly efficient as complexity and dimensionality approach uncapped limits.

We've already seen toy models do world modeling far beyond what was being expected at the time.

This is a trend likely to continue as people underestimate modeling advantages.


> that gradient descent finds it easier to learn the actual physics, then it'll learn the physics.

I guess it really depends on what the meaning of gradient descent learning the physics is.

Maybe you define it to mean that the actually correct equations appear encoded in the computation of the net. But this would still be tacit knowledge. It would be kind of like a math software being aware of physics at best.


This is more a comment to the word "understand" than "physics".

Yes, the models' output will converge to being congruent with the laws of physics by virtue of deriving that as a latent variable.


Isn't that just what neural networks do? The way light falls on an object is physically deterministic, but the neural network in the brain of a human painter doesn't actually calculate rays to determine where highlights should be. A center fielder knows where to run to catch a fly ball without having to understand the physics acting on it. Similarly, we can spot things that look wrong, not because we're referring to physical math but because we have endless kludged-together rules that supersede other rules. Like: Heavy objects don't float. Except for boats which do float. Except for boats that are leaking, which don't. To then explain why something is happening we refer to specialized models, and these image generation models are too general for that, but there's no reason they couldn't refer to separate physical models to assist their output in the future.


  > doesn't actually calculate rays 
  > without having to understand the physics acting on it
I believe you are confusing physics with mathematics, or more specifically mathematical computation.


Boats are mostly air by volume, which isn't heavy at all compared to water.


>That still won’t make them understand physics

I would assume that larger models working with additional training data will eventually allow them to understand physics to the same extent as humans inspecting the world - i.e. to capture what we call Naive Physics [0]. But the limit isn't there; the next generation of GenAI could model the whole scene and then render it with ray tracing (no special casing needed).

[0] https://en.wikipedia.org/wiki/Na%C3%AFve_physics


There seems to be little basis for this assumption, as current models don’t exhibit understanding. Understanding would allow applying it to situations that don’t match existing patterns in the training data.


That’s not large models “understanding physics.” Better to say: giving output “statistically consistent” with real physical measurements. And no one, to my knowledge, has yet succeeded in a general AI app that reverts to a deterministic calculation in response to a prompt.


ChatGPT has had the ability to generate and call out to deterministic Python scripts for a year now


> And no one, to my knowledge, has yet succeeded in a general AI app that reverts to a deterministic calculation in response to a prompt.

They will all do this with a fixed seed. They just don't do that because nobody wants it.


> This all reminds me of “fixing” mis-architected software by adding extra conditional code for every special case that is discovered to work incorrectly...

Isn't that what AI training is in general? It has worked pretty well so far.

I don't think img-gen AI is ever going to "understand physics", but that isn't the task at hand. I don't think it is necessary to understand physics to make good fake pictures. For that matter, I don't think understanding physics would even be a good approach to the fake-picture problem.


> That still won’t make them understand physics.

They don't have to. They just have to understand what makes a realistic picture. The author of the article isn't really employing physics either; he's comparing the eyes to each other.


Most "humans" don't understand physics to a Platonic level and act in much the same way as a model, finding best fits among a set of parameters that produce a result that fits some correctness check.


But we don’t know how much larger the models will have to be, how large the data sets, or how much training is needed, do we? They could have to be inconceivably large.

If you want to correct for this particular problem you might be better off training a face detector, an eye detector, and a model that takes two eyes as input and corrects the reflections. The process would then be:

- generate image

- detect faces

- detect eyes in each face

- correct reflections in eyes

That is convoluted, though, and would get very convoluted when you want to correct for multiple such issues. It also might be problematic in handling faces with glass eyes, but you could try to ‘detect’ those with a model that is trained on the prompt.
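
Very roughly, with off-the-shelf OpenCV Haar cascades for the detection stages and a hypothetical fix_reflections() model standing in for the correction stage, it might look like:

  import cv2

  face_det = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
  eye_det = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

  def fix_reflections(left_eye, right_eye):
      # Placeholder: a real implementation would rewrite both crops with
      # consistent highlights; here we just return them unchanged.
      return left_eye, right_eye

  img = cv2.imread("generated.png")              # illustrative filename
  gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

  for (fx, fy, fw, fh) in face_det.detectMultiScale(gray, 1.1, 5):
      face = img[fy:fy+fh, fx:fx+fw]
      eyes = eye_det.detectMultiScale(cv2.cvtColor(face, cv2.COLOR_BGR2GRAY), 1.1, 5)
      if len(eyes) == 2:                         # only handle the simple two-eye case
          (x1, y1, w1, h1), (x2, y2, w2, h2) = sorted(eyes, key=lambda e: e[0])
          left = face[y1:y1+h1, x1:x1+w1]
          right = face[y2:y2+h2, x2:x2+w2]
          face[y1:y1+h1, x1:x1+w1], face[y2:y2+h2, x2:x2+w2] = fix_reflections(left, right)

  cv2.imwrite("generated_fixed.png", img)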


> They could have to be inconceivably large.

The opposite might also be true. Just having better, well curated data goes a long way. LAION worked for a long time because it's huge, but what if all the garbage images were filtered out and the annotations were better?

The early generations of image and video models used middling data because it was the only data. Since then, literally everyone with data has been working their butts off to get it cleaned up to make the next generation better.

Better data, more intricate models, and improvements to the underlying infrastructure could mean these sorts of "improvements" come mostly "for free".


ADetailer does exactly that. Feels like much of this large thread above is from non-practitioners.

There’s no eyes module in it by default, but it’s trivial-ish to add, and a hi-res eyes dataset isn’t hard to collect either.

Just found an eyes model at https://civitai.com/models/150925/eyes-detection-adetailer (seems anime-only)


I feel like a GAN method might work better, building a detector, and training the model to defeat the detector.
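
Roughly, one adversarial step looks like this (a bare-bones sketch; G and D are placeholder networks):

  import torch
  import torch.nn as nn

  bce = nn.BCEWithLogitsLoss()

  def gan_step(G, D, opt_G, opt_D, real_images, z_dim=128):
      # G maps noise to an image; D outputs one "realness" logit per image.
      b = real_images.size(0)
      z = torch.randn(b, z_dim)

      # 1) Train the detector (discriminator): real -> 1, generated -> 0.
      opt_D.zero_grad()
      d_loss = bce(D(real_images), torch.ones(b, 1)) + \
               bce(D(G(z).detach()), torch.zeros(b, 1))
      d_loss.backward()
      opt_D.step()

      # 2) Train the generator to defeat the detector: generated -> 1.
      opt_G.zero_grad()
      g_loss = bce(D(G(z)), torch.ones(b, 1))
      g_loss.backward()
      opt_G.step()
      return d_loss.item(), g_loss.item()

Any cue the detector can exploit, eyeball highlights included in principle, becomes a gradient signal for the generator.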


Just like the 20 fingers disappeared


Those early diffusion generators sure managed to make the flesh monster in The Witcher look sane sometimes.


/s, right? I haven’t actually seen any models really make this disappear yet


This was my first thought too... You can already see in the examples in the article that the model understands that photos of people tend to have reflections of lights in their eyes. It understands that both eyes tend to reflect the same number of lights. It's already modelling that there's a similarity relationship between these areas of the image (nobody has heterochromia in these pictures).

I can remember when it was hard for image generators to model the 3D shape of a table; now they can very easily display 4 very convincing legs.

I don't have technical expertise here but it just seems like a natural starting point to assume that this reflection thing is a transient shortcoming.


Or the times when the model would create six fingers. No longer.


Sometimes this is addressed not by fixing one model, but instead by running post-processing models that are specialized to fix particular known defects, like oddities with fingers.


Ah!


Here's the link about neural scaling law: https://en.wikipedia.org/wiki/Neural_scaling_law

Can you make a napkin calculation of how much better the training should be, and how much larger the models and datasets should be, to capture the relationship between widely separated but related pixels of the image?

I did that, and the results are not promising: a 200-billion-parameter model will not perform much better than a 100-billion-parameter one. The loss is already too small.
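
Here's the shape of that napkin math, using one common Chinchilla-style parametric loss (the constants are roughly the fit reported in the Chinchilla paper; treat them as illustrative, since image models have their own fits):

  # L(N, D) = E + A / N**alpha + B / D**beta   (Chinchilla-style parametric loss)
  E, A, B = 1.69, 406.4, 410.7
  alpha, beta = 0.34, 0.28

  def loss(n_params, n_tokens):
      return E + A / n_params**alpha + B / n_tokens**beta

  D = 10e12                                      # assume a fixed 10T-token dataset
  l100, l200 = loss(100e9, D), loss(200e9, D)
  print(f"100B: {l100:.3f}  200B: {l200:.3f}  improvement: {l100 - l200:.3f}")
  # Doubling parameters only shrinks the reducible A / N**alpha term by a
  # factor of 2**alpha ~ 1.27, so the headline loss barely moves.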

Also, the phenomenon exemplified in the article is a problem of relations between distant entities in generated media. The same can be seen in LLMs, where there are inconsistencies between the beginnings and ends of sentences.

Once generated eyes learn to lie, there will be other separated but related objects to rely on.


Wouldn't the adversarial model training also have to take "physics correctness" into account? As long as the image detects as "<insert celebrity> in a blue dress", why would it care about correct details in the eyes if nothing in the "checker" cares about that?


Current image generators don’t use an adversarial model. Though the ones that do would have eventually encoded that as well; the details to look for aren’t hard-coded.


Interesting. Apparently, I have much to learn.


GP told you how they don't work, but not how they do:

Current image generators work by training models to remove artificial noise added to the training set. Take an image, add some amount of noise, and feed it with its description as inputs to your model. The closer the output is to the original image, the higher the reward.

Using some tricks (a big one is training simultaneously on large and small amounts of noise), you ultimately get a model that can remove 99% of the noise based only on the description you feed it, and that means you can just swap out the description for what you want the model to generate and feed it pure noise, and it'll do a good job.
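
In rough PyTorch pseudocode, one training step of that objective looks something like this (the model, text encoder, and noise schedule are placeholders for whatever a particular system uses):

  import torch
  import torch.nn.functional as F

  def training_step(model, text_encoder, images, captions, num_steps=1000):
      b = images.size(0)
      cond = text_encoder(captions)                  # embed the image descriptions

      # Pick a random noise level per image, so the model sees both lightly
      # and heavily noised versions during training.
      t = torch.randint(0, num_steps, (b,))
      alpha_bar = torch.cos(t / num_steps * torch.pi / 2) ** 2   # toy noise schedule
      alpha_bar = alpha_bar.view(b, 1, 1, 1)

      noise = torch.randn_like(images)
      noisy = alpha_bar.sqrt() * images + (1 - alpha_bar).sqrt() * noise

      # Low loss = the model recovered the injected noise (equivalently, the
      # original image) from the noisy version plus the text description.
      pred_noise = model(noisy, t, cond)
      return F.mse_loss(pred_noise, noise)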


I read this description of the algorithm a few times and I find it fascinating because it's so simple to follow. I have a lot of questions, though, like "why does it work?", "why did nobody think of this before?", and "where is the extra magical step that moves this from 'silly idea' to 'wonder work'?"


The answer to the 2nd and 3rd questions is mostly "vastly more computing power available", especially the kind that CUDA introduced a few years back


Shouldn't a GAN be able to use this fact immediately in its adversarial network?


Unfortunately no. The discriminator always needs to be in balance and contention with the generator. You can swap out the discriminator later, but you've also got to make sure your discriminator is able to identify these errors. And ML models aren't the best at noticing small details. And since they too don't understand physics, there is no reason to believe that they will encode such information, despite every image in real life requiring consistency. Also remember that there is a learning trajectory, and most certainly these small details are not learned early on in networks. The problem is that these errors are trivial to identify post hoc, but not a priori. It is also easy for you because you know physics innately and can formulate causal explanations.


> So this "one weird trick" will disappear without any special measures.

> Better training, larger models, and larger datasets

But "better training" here is a special measure. It would take a lot of training effort to defeat this check. For example, you'd need a program or group of people who would be able to label training data as realistic/not based on the laws of physics as reflected in subjects' eyeballs.


I know there are murmurs that synthetic data (i.e. using rendering software with 3D models) was used to train some generative models, including OpenAI Sora; seems like it's the only plausible way right now to get the insane amounts of data needed to capture such statistical regularities.


Exactly. Notably, in my experiments, diffusion models based on U-Nets (e.g. SD1.4, SD2) are worse at capturing "correlations at a distance" like this in comparison to newer, DiT-based methods (e.g. SD3, PixArt).


A simpler process is to automatically post process the image to “fix” the eyes. Similar techniques are used to address deformities with hands and other localized issues.


>will lead to models that capture this statistical regularity.

That's not guaranteed; AI does find statistical regularities we miss, but it also misses some we find.


Did anybody prompt a GenAI to get this output?


It wouldn't work. The models could put stuff in the eyes, but they wouldn't be able to do so realistically or consistently, even a fraction of the time. The text describing the images does not typically annotate tiny details like correct reflections in the eyes, so prompting for it is useless.


Treating knowing (or understanding) as binary is a common failing in discussions about AI.


> But the constraint that both eyes should have consistent reflection patterns is just another statistical regularity that appears in real photographs

Hi, author here of a model that does really well on this[0]. My model is SOTA and has undergone a third-party user study that shows it generates convincing images of faces[1]. AND my undergrad is in physics. I'm not saying this to brag, I'm giving my credentials: I have deep knowledge in both generating realistic human faces and in physics. I've seen hundreds of thousands of generated faces from many different models and architectures.

I can assure you, these models don't know physics. What you're seeing is the result of attention. Go ahead and skip the front matter in my paper and go look at the appendix where I show attention maps and go through artifacts.

Yes, the work is GANs, but the same principles apply to diffusion models; diffusion models are just typically MUCH bigger and have way more training data (sure, I had access to an A100 node at the time, but even one node makes you GPU poor these days, so best to explore on GANs).

I'll point out flaws in images in my paper, but remember that these fool people and you're now primed to see errors, and if you continue reading you'll be even further informed. In Figures 8-10 you can see the "stars" that the article talks about. You'll see mine does a lot better. But the artifact exists in all images. You can also see these errors in all of the images in the header, but they are much harder to see. But I did embed the images as large as I could into the paper, so you can zoom in quite a bit.

Now there are ways to detect deep fakes pretty readily, but it does take an expert eye. These aren't the days of StyleGAN-2, where monsters were common (well... at least on GANs; diffusion is getting there). Each model and architecture has a different unique signature, but there are key things that you can look for if you want to get better at this. Here are the things that I look for, and I've used these to identify real-world fake profiles; you will see them across Twitter and elsewhere:

- Eyes: Eyes are complex in humans, with lots of texture. Look for "stars" (inconsistent lighting), pupil dilation, pupil shape, heterochromia (can be subtle; see Figure 2, last row, column 2 for example), and the texture of the iris. Also make sure to look at the edges of the eyes (Figs 8-10). A crude automated version of the "stars" check is sketched after this list.

- Glasses: look for aberrations, inconsistent lighting/reflections, and pay very close attention to the edges where new textures can be created

- Necks: These are just never right. The skin wrinkles, shape, angles, etc

- Ears: These always lose detail (as seen in TFA and my paper), lose symmetry in shape, and are often not lit correctly; if there are earrings then watch for the same things too (see TFA).

- Hair: Dear fucking god, it is always the hair. But I think most people might not notice this at first. If you're having trouble, start by looking at the strands. Start with Figure 8. Patches are weird, color changes, texture, direction, and more. Then try Fig 9 and TFA.

- Backgrounds: I make a joke that the best indicator of a good quality image is how much it looks like a LinkedIn headshot. I have yet to see a generated photo where things happening in the background are free of errors, both long-range and local. Look at my header image with care: the bottom image in row 2 is pretty good but has errors, row 2 column 4 has them, and even the shadow in row 1, column 4 doesn't make sense.

- Phase Artifacts: This one is discussed back in StyleGAN2 paper (Fig 6). These are still common today.

- Skin texture: Without fail, unrealistic textures are created on faces. These are hard to use in the wild though because you're typically seeing a compressed image and that creates artifacts too and you frequently need to zoom to see. They can be more apparent with post processing though.
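
As for the promised sketch of the "stars" check: if you want to automate it rather than eyeball it, a crude version is to threshold each eye crop for specular highlights and compare blob counts and positions (thresholds are arbitrary, and the eye crops are assumed to come from some upstream detector):

  import cv2

  def highlight_blobs(eye_bgr, thresh=230):
      # Bright specular "stars" in an eye crop, as (cx, cy) centroids.
      gray = cv2.cvtColor(eye_bgr, cv2.COLOR_BGR2GRAY)
      _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
      n, _, stats, centroids = cv2.connectedComponentsWithStats(mask)
      return [tuple(centroids[i]) for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] > 2]

  def eyes_look_consistent(left_eye, right_eye, max_offset=0.15):
      # Flag mismatched highlight counts, or highlights sitting in very
      # different positions relative to each crop (normalized coordinates).
      l, r = highlight_blobs(left_eye), highlight_blobs(right_eye)
      if len(l) != len(r):
          return False
      lh, lw = left_eye.shape[:2]
      rh, rw = right_eye.shape[:2]
      l_norm = sorted((x / lw, y / lh) for x, y in l)
      r_norm = sorted((x / rw, y / rh) for x, y in r)
      return all(abs(a - c) < max_offset and abs(b - d) < max_offset
                 for (a, b), (c, d) in zip(l_norm, r_norm))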

There's more, but all of these are a result of models not knowing physics. If you are just scrolling through Twitter you won't notice many of these issues. But if you slow down and study an image, they become apparent. If you practice looking, you'll quickly learn to find the errors with little effort. I can be more specific about model differences but this comment is already too long. I can also go into detail about how we can't determine these errors from our metrics, but that's a whole other lengthy comment.

[0] https://arxiv.org/abs/2211.05770

[1] https://arxiv.org/abs/2306.04675


One does not get Newton by adding more epicycles.


Agreed, but the tricks are still useful.

When there are no more tricks remaining, I think we must be pretty close to AGI.



