I already incorporated this into Houdini to generate 3D distance meshes and it works wonders. If anyone is interested in experimenting with this, I really recommend the Hugging Face diffusers library.
I wonder when we will be able to generate studio-quality 3D models from a single 2D image. I know there are solutions that can do this, but they are still nowhere near professional quality.
Really obvious use that took too long to occur to me: Make a bunch of textures for your game. Label them and use them to train a specialized model file. Use that to make more textures for your game.
It’ll get you 80% of the way there for each texture really fast. It would work really well if you are making a ton of NPCs and have made a head model and a separate clothes model.
This is really cool, and I can immediately see a lot of ways to take this further and improve some of its weaknesses.
To solve the issue with hidden faces (right now it just projects out copies of the visible textures), you could combine this with inpainting: rotate the scene / model to expose untextured faces, which get masked out and inpainted, then rinse and repeat until the model is fully textured.
Another interesting (albeit much harder) thing to try would be to take the image output from SD, run it through MiDaS to get a new depth map with additional details. Diff the depth maps in 3d space and update the geometry to match before projecting the image. Combine it with the first suggestion and you would have a process for progressively refining a detailed, fully textured model.
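If anyone wants to try the MiDaS half of that, here is a minimal sketch, assuming the torch.hub entry points from the MiDaS README ("intel-isl/MiDaS" with the "DPT_Large" variant); aligning and diffing against the depth map rendered from the existing mesh is left out:

```python
# Minimal sketch of the MiDaS step (model/transform names assumed from the
# MiDaS torch.hub README); output is relative depth, not metric.
import cv2
import torch

midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform

img = cv2.cvtColor(cv2.imread("sd_output.png"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    prediction = midas(transform(img))
    # Resize the raw prediction back to the input resolution.
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

# `depth` still has to be scale/shift aligned to the depth map rendered from
# the existing mesh before the two can be diffed in 3D.
```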
> Another interesting (albeit much harder) thing to try would be to take the image output from SD, run it through MiDaS to get a new depth map with additional details. Diff the depth maps in 3d space and update the geometry to match before projecting the image. Combine it with the first suggestion and you would have a process for progressively refining a detailed, fully textured model.
I made a go at this exact approach last night. I'm new to working with 3d data but at least got a mesh rendered from the depth map. Spent most of my time fighting with tooling. Let me know if you want to help out.
I've been running through the problem in my head from a theoretical perspective but am in a similar situation in regards to familiarity with tooling.
The depth map encodes an array of relative positions, but without knowing the camera settings, field of view, etc., mapping them back to coordinates is guesswork. Luckily, if you are rendering your initial depth map from a 3D model, you can use the camera settings to get the information you need to convert the depth map back into a correct array of 3D pixel coordinates.
You can use those pixel coordinates, which you know lie along existing surfaces, to subdivide the mesh to add complexity where needed. Then when you generate the new image and corresponding depth map, you can project those into 3d space using the same camera settings. You will need to filter out coordinates that are past some margin of error from the existing mesh and then use those to push and pull the existing mesh faces.
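As a rough sketch of that back-projection (assuming a simple pinhole camera whose intrinsics fx, fy, cx, cy you pull from the render camera; the function name is just illustrative):

```python
# Back-project a rendered depth map into camera-space 3D points using known
# pinhole intrinsics. depth is an (H, W) array of distances along the Z axis.
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    # (H*W, 3) array of camera-space coordinates, one per pixel.
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```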
Map your textures onto your updated mesh and then also generate a confidence texture. The more each face points towards the camera the more confident you can be about the texture there.
Then move / rotate the camera and repeat, but this time also render the model with the confidence texture to generate an inpainting mask image. Continue from different positions / angles until the scene or model has been fully textured with a high degree of confidence across the entire texture.
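For the confidence texture, something like this per face (just a sketch, names illustrative); thresholding the baked values then gives you the inpainting mask:

```python
# Per-face confidence from how directly each face points at the camera:
# 1.0 for faces facing the camera head-on, 0.0 for faces seen edge-on or
# from behind. Bake these values into a texture and threshold for the mask.
import numpy as np

def face_confidence(face_normals, face_centers, cam_pos):
    to_cam = cam_pos - face_centers
    to_cam /= np.linalg.norm(to_cam, axis=1, keepdims=True)
    n = face_normals / np.linalg.norm(face_normals, axis=1, keepdims=True)
    return np.clip(np.sum(n * to_cam, axis=1), 0.0, 1.0)
```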
It is commercial code, so I can’t share it. However, the algorithm is really simple to implement in Houdini:
* Generate a depth image from the 3D meshes.
* Send the depth image to a local HTTP server running the Hugging Face diffusers library and generate a textured image (see the sketch after this list).
* Cast the textured image onto a 2D plane surface that covers the camera extents.
* Traverse the UV surface of the 3D meshes and find the closest point on the 2D Stable Diffusion image plane. Copy the RGB value to the UV texture and you’re done.
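The server side of the second step is roughly this (a sketch, assuming the StableDiffusionDepth2ImgPipeline in diffusers and the stabilityai/stable-diffusion-2-depth checkpoint; the HTTP plumbing is omitted and the expected depth_map shape may differ between diffusers versions):

```python
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
    torch_dtype=torch.float16,
).to("cuda")

def texture_from_depth(depth_png, init_png, prompt):
    # Depth image rendered out of Houdini; the pipeline normalizes it
    # internally, so relative depth values are fine.
    depth = np.array(Image.open(depth_png).convert("L"), dtype=np.float32)
    init = Image.open(init_png).convert("RGB")
    return pipe(
        prompt=prompt,
        image=init,
        depth_map=torch.from_numpy(depth)[None],  # assumed shape (1, H, W)
        strength=1.0,  # regenerate fully, conditioned on depth + prompt
    ).images[0]
```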
The pace and extent of innovation in this space is breathtaking. Almost every day a genuinely original idea shows up that builds on something we simply didn't have four months ago.
The open sourcing of these models was of course instrumental to this. So thanks Huggingface et al., I guess!
It's very clear how the discussion goes from here, so it's worth short-circuiting it and providing the details.
The models are openly available but the license places some restrictions on the use.
> You agree not to use the Model or Derivatives of the Model:
> - In any way that violates any applicable national, federal, state, local or international law or regulation;
> - For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
> - To generate or disseminate verifiably false information and/or content with the purpose of harming others;
> - To generate or disseminate personal identifiable information that can be used to harm an individual;
> - To defame, disparage or otherwise harass others;
> - For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation;
> - For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics;
> - To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
> - For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories;
> - To provide medical advice and medical results interpretation;
> - To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment (e.g. by text profiling, drawing causal relationships between assertions made in documents, indiscriminate and arbitrarily-targeted use).
For many, this distinction is either an academic one or one where we are OK with those kinds of restrictions. Where I live, if I did many of these things I would be either criminally or civilly (is that the right phrasing?) liable, so having a license that tells me I can't break the law is a little redundant. I think the fourth one is possibly legal but an edge case, and the last one is nation-state level.
You are probably technically correct, but the context here is very important IMO.
Another piece of the puzzle worth noting: it is unclear whether the model itself is copyrightable. There's a reason they require you to affirmatively agree to the license before sharing the model instead of just attaching the license.
That way, if a court rules that generated models don't qualify for copyright, they can still enforce the terms under contract law.
Beyond what you described, there is a weird clause in paragraph 7:
7. Updates and Runtime Restrictions. To the maximum extent permitted by law, Licensor reserves the right to restrict (remotely or otherwise) usage of the Model in violation of this License, update the Model through electronic means, or modify the Output of the Model based on updates. *You shall undertake reasonable efforts to use the latest version of the Model.*
I think this means it is now illegal to use SD 1 if it is at all reasonably possible to use SD 2?
“Licensor reserves the right”… the licensor has not invoked this clause and SD1.x is still listed on the huggingface repos of the respective licensors (the situation is a little complicated by the fact that 1.4, 1.5 and 2.x were each released by different licensors, so a revocation would most likely require agreement from all those parties).
Still worth keeping in mind from an open source perspective…
So, for a start, not being compliant with a license is not the same as something being illegal. Then I'm not sure whether the last sentence is tied to the rest of the clause, i.e. whether it only applies to usage of the model in violation of the license. I'm also not sure whether v1 and v2 are considered the same model or not. Arguable either way.
It does add risks for a company sure.
edit - thanks for adding that in, it's an important part of the picture.
depth2img is an underrated part of the SDv2 release, probably because the AUTOMATIC1111 integration was only done a week or two ago.
Later this week, Draw Things will have a release with the depth2img model as well, initially only supporting iOS-provided depth maps, but it will do MiDaS-inferred depth maps soon. It's going to be exciting to see what people come up with.
I wish Stable Diffusion had focused more on fine art and professional photography in its training corpus, like DALL-E did. All outputs I've seen have a decidedly low-quality or amateurish look, which must have been picked up from somewhere.
Thing is, the primary power of large models is not in the styles they memorized, but in the ability to be fine-tuned and guided by reference content. The focus is your choice: you can shift the model's bias with any available fine-tuning method and a certain amount of reference images. 99% of the training has already been done for you, so all you need is a relatively subtle push, achievable on a single GPU (or a few GPUs).
Take a look at Analog Diffusion outputs, for example.
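If you want to try that route with diffusers, loading a community fine-tune is the same one-liner as loading base SD (a sketch; I believe the Hub id for Analog Diffusion is wavymulder/Analog-Diffusion and its trigger phrase is "analog style", but double-check the model card):

```python
# Swap the base checkpoint for a community fine-tune (repo id and trigger
# phrase assumed from the Analog Diffusion model card).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "wavymulder/Analog-Diffusion", torch_dtype=torch.float16
).to("cuda")

image = pipe("analog style portrait of an old fisherman, film grain").images[0]
image.save("analog_portrait.png")
```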
With patience the guidance can even be done simply, without training, with enough prompts, blending of prompts, input images to img2img, etc. I'm not an expert on art history, but those who are seem able to get just about any style on any image with just the base stable diffusion model. Adding specific photography terms can emulate some of the results from the analog diffusion model in regular SD as well.
It has nothing to do with "style". DALL-E 2 consistently astonishes with its composition, framing, variety of visual elements, and so on, which tend to also be good in oil paintings and professional photographs, but not at all in the sort of content which predominates on art sharing apps with user-generated content.
If you mean porn (which is censored in DALL-E), SD also can't make it as it's mostly filtered out from the training set. All of it is made by custom finetuned models, which is my point. Same with hands, which SD can struggle with. Styles are not the only thing that you can transfer, but subjects and concepts as well.
Well... That's what predominates as user-generated content, I think :)
Anyway, none of the existing image-gen models can be used for content production in their vanilla state. Prompt-to-image is fine as a hobby, but style/concept transfer is what makes them usable in a professional setting. Or rather will make them usable in the near future, as this is all still highly experimental. SD in particular is quite small and is not a ready-to-use product intended for direct usage. It's a middleware model to build products upon. Such as Midjourney.
You're just prompting it wrong. Try adding qualifiers that have well-known associations with pro-level photography. For example, these two make your prompts much better: append either "Canon EOS 5D Mark IV, Sigma 85mm f/1.4" or "photography by Annie Leibovitz" (for a bit more artsy photos) to the end of your prompt and see the difference.
If you're not doing portrait photography, replace Sigma 85mm f/1.4 with something else, for example Sigma 24mm for more wide-angle photography.
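Concretely, something like this (a sketch, assuming the base SD 1.5 checkpoint on the Hub):

```python
# Plain prompt plus the photography qualifiers from above.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = (
    "portrait of a lighthouse keeper at dusk, "
    "photography by Annie Leibovitz, Canon EOS 5D Mark IV, Sigma 85mm f/1.4"
)
pipe(prompt, guidance_scale=7.5).images[0].save("portrait.png")
```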
Note that in my extensive testing, specific lenses and cameras tend to make broad, random changes to the output unless they have a specific aesthetic associated with them.
E.g. 85mm vs 24mm make no specific changes to a photo. SD appears to just interpret these as "make a photo look realistic", and any changes to the photo as you switch between them are simply incidental.
Yeah, the particular focal length is not as important as hinting in the prompt that you want a photorealistic pic in general. Still, the conditioning guidance has a lot of terms, and the more details you add, the more chances there are to tip the process in a good direction at some point during diffusion.
You need to look at a better source. SD 2.x is trained on a lot of professional photography, and there is plenty of output which is not low-quality/amateurish. There is also the Analog Diffusion model, an SD 1.5 fine-tune for film photography.
On a similar note, has anyone had any luck using these models to try different paint colourings in their house (from uploading photos of their rooms)? I've struggled to do this with SD, and the non-AI apps aren't great at it either.
When you give SD img2img a high enough strength parameter that it can change the colors of a photo, you are also letting it have enough influence to redesign the room at random rather than just changing the colors.
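You can see the tradeoff directly by sweeping the strength parameter (a sketch, assuming a recent diffusers version and the SD 1.5 img2img pipeline):

```python
# img2img over a photo of the room at several strengths: low keeps the
# geometry but barely repaints, high repaints but also redesigns the room.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

room = Image.open("living_room.jpg").convert("RGB").resize((512, 512))
for strength in (0.3, 0.6, 0.9):
    out = pipe("living room with sage green walls", image=room, strength=strength)
    out.images[0].save(f"repaint_{strength}.png")
```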
Dreambooth trained on the ground truth photos might help a bit
This seems much easier in an app like photoshop/affinity if you want to keep everything really static. Or by using a combination of photoshop and then SD to clean up.
Take the depth field from life-size furniture instead, all covered in a chroma-key surface, and make "we never use the same crazy virtual studio twice, not even the arrangement" the identity of your fixed-camera show?
I personally prefer to just read it on Twitter directly these days if someone links to Twitter, as long as Twitter doesn’t pop up the thing asking me to sign in which it sometimes does.
But I do secretly wish that everyone would migrate away from Twitter altogether and use Mastodon etc instead.
Mastodon is a cult though. I can't get over how irritatingly homogenous in opinion it is despite being federated, thanks to the draconian moderation policy that the large instances impose.
There are hundreds if not thousands of instances which federate with each other. The entire point of Mastodon is to avoid giving too much power to one instance. Join a smaller one.