A lot of people mentioned this! The "dreamlike" comparison is common as well. In both cases, you have a network of neurons rendering an image approximating the real world :) so it sort of makes sense.
Regarding the specific boiling-textures effect: there's a tradeoff in recurrent world models between jittering (constantly regenerating fine details to avoid accumulating error) and drifting (propagating fine details as-is, even when that leads to accumulating error and a simplified/oversaturated/implausible result). The forest trail world is tuned way towards jittering (you can pause with `p` and step frame-by-frame with `.` to see this). So if the effect resembles LSD, it's possible that LSD applies some similar random jitter/perturbation to the neurons within your visual cortex.
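If you want a feel for that tradeoff without any real model, here's a toy sketch (not the forest-trail model's actual code; the dynamics and numbers are made up): a recurrent predictor with a small systematic error either drifts steadily, or stays bounded-but-flickery if you re-perturb its state every step.

```python
# Toy illustration of jitter vs drift in a recurrent world model.
# Everything here is hypothetical; it just shows the shape of the tradeoff.
import numpy as np

rng = np.random.default_rng(0)

def model_step(z):
    # Learned dynamics with a small systematic bias (the source of accumulating error).
    return 0.99 * z + 0.01

def rollout(steps=500, jitter=0.0):
    """jitter=0.0 -> pure drift; jitter>0 -> re-perturb the state every step."""
    z = np.zeros(8)
    for _ in range(steps):
        z = model_step(z)
        if jitter > 0:
            # Re-noising pulls the state back toward the training distribution,
            # at the cost of constantly regenerating ("boiling") fine details.
            z = (1 - jitter) * z + jitter * rng.normal(size=z.shape)
    return float(np.abs(z).mean())

print("drift only :", rollout(jitter=0.0))   # error accumulates toward ~1.0
print("with jitter:", rollout(jitter=0.2))   # stays smaller, but the state flickers
```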
Yup, similar concepts! Just at two opposite extremes of the compute/scaling spectrum.
- That forest trail world is ~5 million parameters, trained on 15 minutes of video, scoped to run on a five-year-old iPhone through a twenty-year-old API (WebGL GPGPU, i.e. OpenGL fragment shaders). It's the smallest '3D' world model I'm aware of.
- Genie 3 is (most likely) ~100 billion parameters trained on millions of hours of video and running across multiple TPUs. I would be shocked if it's not the largest-scale world model available to the public.
These are extremely impressive from a technological progression standpoint, and at the same time not at all compelling, in the same way AI images and LLM prose are and are not.
It's neat I guess that I can use a few words and generate the equivalent of an Unreal 5 asset flip and play around in it. Also I will never do that, much less pay some ongoing compute cost for each second I'm doing it.
Exactly. People are getting so excited that all this stuff is possible, and forgetting that we are burning through innumerable finite resources just to prove something is possible.
They were too concerned with whether or not they could, they never stopped to think if they should.
Yeah, the future I see from this is just shitty walking video games that maybe look nice but have ridiculous input lag, stuttery frame rates, and no compelling gameplay loop or story. Oh, and another tool to fill up Facebook with more fake videos to make people angry. Oh well, I guess this is what we've decided to direct all our energy towards.
I was lucky enough to be an early tester. Here's a brief video walking through the process of creating worlds, showing examples: walking on the moon (with a NASA photo as part of the prompt), being in 221B Baker Street with Holmes and Watson, wandering through a night market in Taipei as a giant boba milk tea (note how the stalls are different and sell different foods), and exploring the setting of my award-nominated tabletop RPG.
On a technical level, this looks like the same diffusion transformer world model design that was shown in the Genie 3 post (text/memory/d-pad input, video output, 60s max context, 720p, sub-10 FPS control rate due to 4-frame temporal compression). I expect the public release uses a cheaper step-distilled / quantized version. The limitations seen in Genie 3 (high control latency, gradual loss of detail and drift towards videogamey behavior, 60s max rollout length) are still present. The editing/sharing tools, latency, cost, etc. can probably improve over time with this same model checkpoint, but new features like audio input/output, higher resolution, precise controls, etc. likely won't happen until the next major version.
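For concreteness, here's the back-of-the-envelope behind that control-rate point (the 24 fps output rate is my assumption, not a published spec; only the 4x compression is inferred from the visuals):

```python
# Rough arithmetic for why 4-frame temporal compression caps control responsiveness.
fps = 24                     # assumed generated frame rate
temporal_compression = 4     # frames folded into each latent timestep
control_rate_hz = fps / temporal_compression        # -> 6 control updates/sec (< 10)
min_latency_ms = 1000 * temporal_compression / fps   # -> ~167 ms before any infra overhead
print(control_rate_hz, round(min_latency_ms))
```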
From a product perspective, I still don't have a good sense of what the market for WMs will look like. There's a tension between serious commercial applications (robotics, VFX, gamedev, etc. where you want way, way higher fidelity and very precise controllability), vs current short-form-demos-for-consumer-entertainment application (where you want the inference to be cheap-enough-to-be-ad-supported and simple/intuitive to use). Framing Genie as a "prototype" inside their most expensive AI plan makes a lot of sense while GDM figures out how to target the product commercially.
On a personal level, since I'm also working on world models (albeit very small local ones https://news.ycombinator.com/item?id=43798757), my main thought is "oh boy, lots of work to do". If everyone starts expecting Genie 3 quality, local WMs need to become a lot better :)
Z-Image is another open-weight image-generation model by Alibaba [1]. Z-Image Turbo was released around the same time as (non-Klein) FLUX.2 and received a generally warmer community response [2], since Z-Image Turbo was faster, also high-quality, and reportedly better at generating NSFW material. The base (non-Turbo) version of Z-Image is not yet released.
Z-Image is roughly as censored as Flux 2, from my very limited testing. It got popular because Flux 2 is just really big and slow. It is, however, great at editing, has an amazing breadth of built-in knowledge, and has great prompt adherence.
Z-Image got popular because people stuck with 12GB video cards could still use it, and hell, probably train on it, at least once the base version comes out. I think most people disparaging Flux 2 never tried it, since they wouldn't want to deal with how slowly it would run on their systems, if they even realized they could run it at all.
Ahh I see, and Klein is basically a response to Z-Image Turbo, i.e. another 4-8B sized model that fits comfortably on a consumer GPU.
It’ll be interesting to see how the NSFW catering plays out for the Chinese labs. I was joking a couple months ago to someone that Seedream 4's talent at undressing was an attempt to sow discord, and it was interesting that it flew under the radar.
Post-Grok going full gooner pedo, I wonder if Grok will take the heat alone moving forward.
They are underselling Z-Image Turbo somewhat. It's arguably the best overall model for local image generation for several reasons including prompt adherence, overall output quality and realism, and freedom from censorship, even though it's also one of the smallest at 6B parameters.
ZIT is not far short of revolutionary. It is kind of surreal to contemplate how much high-quality imagery can be extracted from a model that fits on a single DVD and runs extremely quickly on consumer-grade GPUs.
Hold on now. Z-Image Turbo has gotten a lot of hype, but it's worse than Qwen Image and Flux 2 (the full-sized version) at all of those things, other than perhaps making images look like they were shot on a cell phone camera. Once you get away from photographic portraits of people, it quickly shows just how little it can do.
Not in my experience. Flux 2 is much larger and heavily censored, and Qwen-Image is just plain not as good. You can fool me into thinking that Z-Image Turbo output isn't AI, while that's rarely the case with Qwen.
Look at the images I posted elsewhere in this section. They are crappy excuses for pogo sticks, but they absolutely do NOT look like they came from a cell phone.
Also see vunderba's page at https://genai-showdown.specr.net/ . Even when Z-Image Turbo fails a test, it still looks great most of the time.
Edit re: your other comment -- don't make the mistake of confusing censorship with lack of training data. Z-Image will try to render whatever you ask for, but at the end of the day it's a very small model that will fail once you start asking for things it simply wasn't trained on. They didn't train it with much NSFW material, so it has some rather... unorthodox anatomical ideas.
However, I’m already expecting the blowback when a Z-Image release doesn’t wow people like the Turbo finetune does. SDXL has only been out a couple of years, but it seems like a decade.
We’ll see. I’m hopeful that Z works as expected and sets the new high-water mark. I’m just not sure it does it right out of the gate.
Almost afraid to ask, but anytime Grok or X or Musk comes up, I am never sure if there is some reality-based thing, or some “I just need to hate this” thing. Sometimes they’re the same thing, other times they aren’t.
I can guess here that because Grok likely uses WAN that someone wrote some gross prompts and then pretended this is an issue unique to Grok for effect?
A few days ago people were replying to every image on Twitter saying "Grok, put him/her/it in a bikini" and Grok would just do it. It was minimum effort, maximum damage trolling and people loved it.
Ah. So, see, this is exactly why I need to check apparently.
Personally, I go between “I don’t care at all” and “well it’s not ideal” on AI generations. It’s already too late, but the barrier of entry is a lot lower than it was.
But I’m applying a good faith argument where GP does not seem to have intended one.
Reducing it to "some people put people in bikinis for a couple days for the lulz" is... not quite what happened.
You may note I am no shrinking violet, nor do I lack perspective, as evidenced by my notes on Seedream. And fortuitously, I only mentioned it before being dismissed as bad faith: I could not have foreseen needing to call it out as credentials until now.
I don't think it's kind to accuse others of bad faith, as evidenced by my not passing judgement on the description given by the person you are replying to.
I do admit it made my stomach churn a little bit to see how quickly people will other. Not on you, I'm sure I've done this too. It's stark when you're on the other side of it.
Nah it's been happening for months and involved kids, over and over, albeit for the same reasoning, lulz & totally based. I am a bit surprised that you thought this was just a PG-rated stunt on X for a couple days, it's been in the news for weeks, including on HN.
I see absolutely no citations. Can you point to anything that shows a specific Grok issue vs generally people doing icky things with photo generation software?
You can Google whatever you need yourself at this point, you told the world I was operating in bad faith based off one sentence from a stranger. You ignored my reply to you. And now you are engaging with me on another reply as if my claim was Grok is uniquely capable of this, when I in fact said the opposite, and the interesting part of the discussion was me pointing out all can do this. Have a good day!
I am of no party or clique; why would Elon be doing moderation anyway? He has better things to do. If anything, it sounded understaffed and thus taken advantage of by ne’er-do-wells. You can check whether I’m pivoting by noting that I said in my original post that every model can do this, and that Grok being the focus was a strange aberration.
I feel pathetic defending myself to someone who keeps reading my mind in the blandest way possible, then accuses me of wrongthought I must have had, based on things I never said. Hard to believe you’re living up to your ideals in this moment if you’re a fellow advocate for truth seekers and great men. I respect interlocution, but not repeated personal attacks based on thoughts projected and things unsaid. That’s not truth seeking behavior.
Yeah the issue reads as if someone asked Claude Code "find the most serious performance issue in the VSCode rendering loop" and then copied the response directly into GitHub (without profiling or testing anything).
- Infra/systems: I was able to connect to a server within a minute or two. Once connected, the displayed RTT (roundtrip time?) was around 70ms, but actual control-to-action latency was still ~600-700ms, vs. the ~30ms I'd expect from an on-device model or a game streaming service.
- Image-conditioning & rendering: The system did a reasonable job animating the initial (landscape photo) image I provided and extending it past the edges. However, the video rendering style drifted back to "contrast-boosted video game" within ~10s. This style drift shows up in their official examples as well (https://x.com/DynamicsLab_AI/status/1958592749378445319).
- Controls: Apart from the latency, control-following was relatively faithful once I started holding down Shift. I didn't notice any camera/character drift or spurious control issues, so I guess they are probably using fairly high-quality control labels.
- Memory: I did a bit of memory testing (basically, swinging the view side to side and seeing which details got regenerated), and it looks like the model can retain maybe ~3-5s of visual memory plus the prompt (but not the initial image).
This is very encouraging progress, and probably what Demis was teasing [1] last month. A few speculations on technical details based on staring at the released clips:
1. You can see fine textures "jump" every 4 frames, which means they're most likely using a 4x-temporal-downscaling VAE with at least 4-frame interaction latency (unless the VAE is also control-conditional). Unfortunately I didn't see any real-time footage to confirm the latency (at one point they intercut screen recordings with "fingers on keyboard" b-roll? hmm).
2. There's some 16x16 spatial blocking during fast motion which could mean 16x16 spatial downscaling in the VAE. Combined with 1, this would mean 24x1280x720/(4x16x16) = 21,600 tokens per second, or around 1.3 million tokens per minute.
3. The first frame of each clip looks a bit sharper and less videogamey than later stationary frames, which suggests this could be a combined text-to-image + image-to-world system (where the t2i system is trained on general data but the i2w system is finetuned on game data with labeled controls). Noticeable in e.g. the dirt/textures in [2]. I still noticed some trend towards more contrast/saturation over time, but it's not as bad as in other autoregressive video models I've seen.
Regarding latency, I found a live video of gameplay here [1] and it looks like closer to 1.1s keypress-to-photon latency (33 frames @ 30fps) based on when the onscreen keys start lighting up vs when the camera starts moving. This writeup [2] from someone who tried the Genie 3 research preview mentions that "while there is some control lag, I was told that this is due to the infrastructure used to serve the model rather than the model itself" so a lot of this latency may be added by their client/server streaming setup.
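To make the arithmetic in points 1-2 and the latency estimate explicit (all inputs are guesses from the visuals, not confirmed specs):

```python
# Sanity-checking the numbers above. All inputs are inferred/assumed, not official.
fps, width, height = 24, 1280, 720        # assumed output rate and resolution
t_down, s_down = 4, 16                     # guessed VAE temporal / spatial downscaling
tokens_per_sec = fps * width * height / (t_down * s_down * s_down)
print(tokens_per_sec, tokens_per_sec * 60)   # 21600.0 tokens/s, ~1.3M tokens/min

frames_of_delay, recording_fps = 33, 30      # from counting frames in the gameplay video
print(frames_of_delay / recording_fps)       # ~1.1 s keypress-to-photon
```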
You know that thing in anxiety dreams where you feel very uncoordinated and your attempts to manipulate your surroundings result in unpredictable consequences? Like you try to slam on the brake pedal but your car doesn’t slow down, or you’re trying to get a leash on your dog to lead it out of a dangerous situation and you keep failing to hook it on the collar? Maybe that’s extra latency because your brain is trying to render the environment at the same time as it is acting.
Firstly, the brain can render environments in detail. I'm (mostly) aphantasic even in dreams, so this wasn't obvious to me. But most people literally get visual renderings in their mind.
Secondly, it's fairly clear now that our sensory inputs are not being experienced as sensory inputs. We experience a reconstruction. An obvious basic sign of this is that we fill in the gap in vision where the optic nerve is. But generally, we're making an integrated world model all the time out of the senses, and we are conscious of that world model.
You're right, though: both of the above are rendering the experience and can take shortcuts to do so. It's sufficiently detailed in each case that it kind of is rendering the world too, in some sense.
I guess I mean that when we are awake we experience the input from our senses, and that in a dream only the replication of the experience of seeing or hearing etc. is needed, not a replication of the sensory input that would then lead to the experience.
> I found a live video of gameplay here [1] and it looks like closer to 1.1s keypress-to-photon latency (33 frames @ 30fps) based on when the onscreen keys start lighting up vs when the camera starts moving.
I think the most likely explanation is that they trained a diffusion WM (like DIAMOND) on video rollouts recorded from within a 3D scene representation (like NeRF/GS), with some collision detection enabled.
This would explain (see the rough pipeline sketch after this list):
1. How collisions / teleportation work and why they're so rigid (the WM is mimicking hand-implemented scene-bounds logic)
2. Why the scenes are static and, in the case of should-be-dynamic elements like water/people/candles, blurred (the WM is mimicking artifacts from the 3D representation)
3. Why they are confident that "There's no map or explicit 3D representation in the outputs. This is a diffusion model, and video in/out" https://x.com/olivercameron/status/1927852361579647398 (the final product is indeed a diffusion WM trained on videos, they just have a complicated pipeline for getting those training videos)
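If that's right, the data side would look something like the sketch below. Every function here is a trivial stand-in I made up to name the steps, not anyone's actual API; the point is just that the 3D representation and collision logic live upstream, in data generation, and the shipped model only ever sees (video, action) pairs.

```python
# Hypothetical sketch of the speculated training-data pipeline (all stubs, not real APIs).
import random

def reconstruct_3d(video):
    # Stand-in for fitting a NeRF / Gaussian-splatting scene from a capture video.
    return {"bounds": (0.0, 10.0)}

def sample_constrained_walk(bounds, steps=60):
    # Random camera walk, clamped to the scene bounds (the "collision detection").
    lo, hi = bounds
    x, actions = 5.0, []
    for _ in range(steps):
        a = random.choice([-0.1, 0.0, 0.1])
        x = min(hi, max(lo, x + a))
        actions.append(a)
    return actions

def render_rollout(scene, actions):
    # Stand-in for rendering frames of the walk from the reconstructed scene.
    return [f"frame_{i}" for i in range(len(actions))]

def build_training_set(capture_videos, rollouts_per_scene=4):
    clips = []
    for video in capture_videos:
        scene = reconstruct_3d(video)
        for _ in range(rollouts_per_scene):
            actions = sample_constrained_walk(scene["bounds"])
            clips.append((render_rollout(scene, actions), actions))
    return clips   # a diffusion WM (a la DIAMOND) would then train on these pairs

print(len(build_training_set(["walkthrough.mp4"])))   # 4 labeled rollouts from one scene
```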