I already incorporated this into Houdini to generate 3D distance meshes and it works wonders. If anyone is interested in experimenting with this, I really recommend the Hugging Face diffusers library.
I wonder when we will be able to generate studio-quality 3D models from a single 2D image. I know there are solutions that can do this, but they are still nowhere near professional quality.
Really obvious use that took too long to occur to me: Make a bunch of textures for your game. Label them and use them to train a specialized model file. Use that to make more textures for your game.
It’ll get you 80% of the way there for each texture really fast. It would work really well if you are making a ton of NPCs and have made a head model and a separate clothes model.
This is really cool, and I can immediately see a lot of ways to take this further and improve some of its weaknesses.
To solve the issue with hidden faces (right now it just projects out copies of the visible textures), you could combine this with inpainting: rotate the scene / model to expose untextured faces, which get masked out and inpainted, then rinse and repeat until the model is fully textured.
Another interesting (albeit much harder) thing to try would be to take the image output from SD, run it through MiDaS to get a new depth map with additional details. Diff the depth maps in 3d space and update the geometry to match before projecting the image. Combine it with the first suggestion and you would have a process for progressively refining a detailed, fully textured model.
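If anyone wants to try the MiDaS half of that, here is a minimal sketch, assuming the torch.hub entry points from the MiDaS README ("intel-isl/MiDaS" with the "DPT_Large" variant); aligning and diffing against the depth map rendered from the existing mesh is left out:

```python
# Minimal sketch of the MiDaS step (model/transform names assumed from the
# MiDaS torch.hub README); output is relative depth, not metric.
import cv2
import torch

midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform

img = cv2.cvtColor(cv2.imread("sd_output.png"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    prediction = midas(transform(img))
    # Resize the raw prediction back to the input resolution.
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

# `depth` still has to be scale/shift aligned to the depth map rendered from
# the existing mesh before the two can be diffed in 3D.
```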
> Another interesting (albeit much harder) thing to try would be to take the image output from SD, run it through MiDaS to get a new depth map with additional details. Diff the depth maps in 3d space and update the geometry to match before projecting the image. Combine it with the first suggestion and you would have a process for progressively refining a detailed, fully textured model.
I made a go at this exact approach last night. I'm new to working with 3d data but at least got a mesh rendered from the depth map. Spent most of my time fighting with tooling. Let me know if you want to help out.
I've been running through the problem in my head from a theoretical perspective but am in a similar situation in regards to familiarity with tooling.
The depth map encodes an array of relative positions, but without knowing the camera settings, field of view, etc., mapping them back to coordinates is guesswork. Luckily, if you are rendering your initial depth map from a 3D model, you can use the camera settings to get the information you need to convert the depth map back into a correct array of 3D pixel coordinates.
You can use those pixel coordinates, which you know lie along existing surfaces, to subdivide the mesh to add complexity where needed. Then when you generate the new image and corresponding depth map, you can project those into 3d space using the same camera settings. You will need to filter out coordinates that are past some margin of error from the existing mesh and then use those to push and pull the existing mesh faces.
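As a rough sketch of that back-projection (assuming a simple pinhole camera whose intrinsics fx, fy, cx, cy you pull from the render camera; the function name is just illustrative):

```python
# Back-project a rendered depth map into camera-space 3D points using known
# pinhole intrinsics. depth is an (H, W) array of distances along the Z axis.
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    # (H*W, 3) array of camera-space coordinates, one per pixel.
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```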
Map your textures onto your updated mesh and then also generate a confidence texture. The more each face points towards the camera the more confident you can be about the texture there.
Then move / rotate the camera and repeat, but this time also render the model with the confidence texture to generate an inpainting mask image. Continue from different positions / angles until the scene or model has been fully textured with a high degree of confidence across the entire texture.
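For the confidence texture, something like this per face (just a sketch, names illustrative); thresholding the baked values then gives you the inpainting mask:

```python
# Per-face confidence from how directly each face points at the camera:
# 1.0 for faces facing the camera head-on, 0.0 for faces seen edge-on or
# from behind. Bake these values into a texture and threshold for the mask.
import numpy as np

def face_confidence(face_normals, face_centers, cam_pos):
    to_cam = cam_pos - face_centers
    to_cam /= np.linalg.norm(to_cam, axis=1, keepdims=True)
    n = face_normals / np.linalg.norm(face_normals, axis=1, keepdims=True)
    return np.clip(np.sum(n * to_cam, axis=1), 0.0, 1.0)
```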
It is commercial code, so I can’t share it. However, the algorithm is really simple to implement in Houdini:
* Generate a depth image from the 3D meshes.
* Send the depth image to a local HTTP server running the Hugging Face diffusers library and generate a textured image (see the sketch after this list).
* Cast the textured image onto a 2D plane surface that covers the camera extents.
* Traverse the UV surface of the 3D meshes and find the closest point on the 2D Stable Diffusion image plane. Copy the RGB value to the UV texture and you’re done.
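The server side of the second step is roughly this (a sketch, assuming the StableDiffusionDepth2ImgPipeline in diffusers and the stabilityai/stable-diffusion-2-depth checkpoint; the HTTP plumbing is omitted and the expected depth_map shape may differ between diffusers versions):

```python
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
    torch_dtype=torch.float16,
).to("cuda")

def texture_from_depth(depth_png, init_png, prompt):
    # Depth image rendered out of Houdini; the pipeline normalizes it
    # internally, so relative depth values are fine.
    depth = np.array(Image.open(depth_png).convert("L"), dtype=np.float32)
    init = Image.open(init_png).convert("RGB")
    return pipe(
        prompt=prompt,
        image=init,
        depth_map=torch.from_numpy(depth)[None],  # assumed shape (1, H, W)
        strength=1.0,  # regenerate fully, conditioned on depth + prompt
    ).images[0]
```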
The pace and extent of innovation in this space is breathtaking. Almost every day a genuinely original idea shows up that builds on something we simply didn't have four months ago.
The open sourcing of these models was of course instrumental to this. So thanks Huggingface et al., I guess!
It's very clear how the discussion goes from here, so it's worth short-circuiting it and providing the details.
The models are openly available but the license places some restrictions on the use.
> You agree not to use the Model or Derivatives of the Model:
> - In any way that violates any applicable national, federal, state, local or international law or regulation;
> - For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
> - To generate or disseminate verifiably false information and/or content with the purpose of harming others;
> - To generate or disseminate personal identifiable information that can be used to harm an individual;
> - To defame, disparage or otherwise harass others;
> - For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation;
> - For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics;
> - To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
> - For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories;
> - To provide medical advice and medical results interpretation;
> - To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment (e.g. by text profiling, drawing causal relationships between assertions made in documents, indiscriminate and arbitrarily-targeted use).
For many, this distinction is either an academic one or one where we are OK with those kinds of restrictions. Where I live, if I did many of these things I would be either criminally or civilly (is that the right phrasing?) liable, so having a license that tells me I can't break the law is a little redundant. I think the fourth one is possibly legal but an edge case, and the last one is nation-state level.
You are probably technically correct, but the context here is very important IMO.
Another piece of the puzzle worth noting: it is unclear whether the model itself is copyrightable. There's a reason they require you to affirmatively agree to the license before sharing the model instead of just attaching the license.
That way, if a court rules that generated models don't qualify for copyright, they can still enforce the terms under contract law.
Beyond what you described, there is a weird clause in paragraph 7:
7. Updates and Runtime Restrictions. To the maximum extent permitted by law, Licensor reserves the right to restrict (remotely or otherwise) usage of the Model in violation of this License, update the Model through electronic means, or modify the Output of the Model based on updates. *You shall undertake reasonable efforts to use the latest version of the Model.*
I think this means it is now illegal to use SD 1 if it is at all reasonably possible to use SD 2?
“Licensor reserves the right”… the licensor has not invoked this clause and SD1.x is still listed on the huggingface repos of the respective licensors (the situation is a little complicated by the fact that 1.4, 1.5 and 2.x were each released by different licensors, so a revocation would most likely require agreement from all those parties).
Still worth keeping in mind from an open source perspective…
So, for a start, not being compliant with a license is not the same as something being illegal. Then I'm not sure whether the last sentence is tied to the rest of the clause, i.e. whether it only applies to usage of the model in violation of the license. I'm also not sure whether v1 and v2 are considered the same model or not. Arguable either way.
It does add risks for a company sure.
edit - thanks for adding that in, it's an important part of the picture.
depth2img is an underrated part of the SDv2 release, probably because the AUTOMATIC1111 integration was only done a week or two ago.
Later this week, Draw Things will have a release with the depth2img model as well, initially only supporting iOS-provided depth maps, but it will do MiDaS-inferred depth maps soon. It's going to be exciting to see what people come up with.
I wish Stable Diffusion had focused more on fine art and professional photography in its training corpus, like DALL-E did. All outputs I've seen have a decidedly low-quality or amateurish look, which must have been picked up from somewhere.
Thing is, the primary power of large models is not in the styles they memorized, but in the ability to be fine-tuned and guided by reference content. The focus is your choice: you can shift the model's bias with any available fine-tuning method and a certain amount of reference images. 99% of the training has already been done for you, so all you need is a relatively subtle push, achievable on a single GPU (or a few GPUs).
Take a look at Analog Diffusion outputs, for example.
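If you want to try that route with diffusers, loading a community fine-tune is the same one-liner as loading base SD (a sketch; I believe the Hub id for Analog Diffusion is wavymulder/Analog-Diffusion and its trigger phrase is "analog style", but double-check the model card):

```python
# Swap the base checkpoint for a community fine-tune (repo id and trigger
# phrase assumed from the Analog Diffusion model card).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "wavymulder/Analog-Diffusion", torch_dtype=torch.float16
).to("cuda")

image = pipe("analog style portrait of an old fisherman, film grain").images[0]
image.save("analog_portrait.png")
```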
With patience the guidance can even be done simply, without training, with enough prompts, blending of prompts, input images to img2img, etc. I'm not an expert on art history, but those who are seem able to get just about any style on any image with just the base stable diffusion model. Adding specific photography terms can emulate some of the results from the analog diffusion model in regular SD as well.
It has nothing to do with "style". DALL-E 2 consistently astonishes with its composition, framing, variety of visual elements, and so on, which tend to also be good in oil paintings and professional photographs, but not at all in the sort of content which predominates on art sharing apps with user-generated content.
If you mean porn (which is censored in DALL-E), SD also can't make it as it's mostly filtered out from the training set. All of it is made by custom finetuned models, which is my point. Same with hands, which SD can struggle with. Styles are not the only thing that you can transfer, but subjects and concepts as well.
Well... That's what predominates as user-generated content, I think :)
Anyway, none of the existing image-gen models can be used for content production in their vanilla state. Prompt-to-image is fine as a hobby, but style/concept transfer is what makes them usable in a professional setting. Or rather will make them usable in the near future, as this is all still highly experimental. SD in particular is quite small and is not a ready-to-use product intended for direct usage. It's a middleware model to build products upon. Such as Midjourney.
You're just prompting it wrong. Try adding qualifiers that have well-known associations with pro-level photography. For example, these two make your prompts much better: append either "Canon EOS 5D Mark IV, Sigma 85mm f/1.4" or "photography by Annie Leibovitz" (for a bit more artsy photos) to the end of your prompt and see the difference.
If you're not doing portrait photography, replace Sigma 85mm f/1.4 with something else, for example Sigma 24mm for more wide-angle photography.
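Concretely, something like this (a sketch, assuming the base SD 1.5 checkpoint on the Hub):

```python
# Plain prompt plus the photography qualifiers from above.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = (
    "portrait of a lighthouse keeper at dusk, "
    "photography by Annie Leibovitz, Canon EOS 5D Mark IV, Sigma 85mm f/1.4"
)
pipe(prompt, guidance_scale=7.5).images[0].save("portrait.png")
```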
Note that in my extensive testing, specific lenses and cameras tend to make broad, random changes to the output unless they have a specific aesthetic associated with them.
E.g. 85mm vs 24mm make no specific changes to a photo. SD appears to just interpret these as "make a photo look realistic", and any changes to the photo as you switch between them are simply incidental.
Yeah, the particular focal length is not as important as hinting in the prompt that you want a photorealistic pic in general. Still, the conditioning guidance has a lot of terms, and the more details you add, the more chances there are to tip the process in a good direction at some point during diffusion.
You need to look at a better source. SD 2.x is trained on a lot of professional photography, and there is plenty of output which is not low-quality/amateurish. There is also the Analog Diffusion model, an SD 1.5 fine-tune for film photography.
On a similar note, has anyone had any luck using these models to try different paint colourings in their house (from uploading photos of their rooms)? I've struggled to do this with SD, and the non-AI apps aren't great at it either.
When you give SD img2img a high enough strength parameter that it can change the colors of a photo, you are also letting it have enough influence to redesign the room at random rather than just changing the colors.
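You can see the tradeoff directly by sweeping the strength parameter (a sketch, assuming a recent diffusers version and the SD 1.5 img2img pipeline):

```python
# img2img over a photo of the room at several strengths: low keeps the
# geometry but barely repaints, high repaints but also redesigns the room.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

room = Image.open("living_room.jpg").convert("RGB").resize((512, 512))
for strength in (0.3, 0.6, 0.9):
    out = pipe("living room with sage green walls", image=room, strength=strength)
    out.images[0].save(f"repaint_{strength}.png")
```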
Dreambooth trained on the ground truth photos might help a bit
This seems much easier in an app like photoshop/affinity if you want to keep everything really static. Or by using a combination of photoshop and then SD to clean up.
Take the depth field from life-size furniture instead, all covered in a chroma-key surface, and make "we never use the same crazy virtual studio twice, not even the arrangement" the identity of your fixed-camera show?
I personally prefer to just read it on Twitter directly these days if someone links to Twitter, as long as Twitter doesn’t pop up the thing asking me to sign in which it sometimes does.
But I do secretly wish that everyone would migrate away from Twitter altogether and use Mastodon etc instead.
Mastodon is a cult though. I can't get over how irritatingly homogenous in opinion it is despite being federated, thanks to the draconian moderation policy that the large instances impose.
There are hundreds if not thousands of instances which federate with each other. The entire point of Mastodon is to avoid giving too much power to one instance. Join a smaller one.