Hacker News | dnnssl2's comments

MirageLSD: The First Live-Stream Diffusion (LSD) Model - a vid2vid model running in real time, with infinite generation and zero latency. Available now in a live-hosted unlimited demo at https://mirage.decart.ai!

Please check out our Wired article on this model: https://www.wired.com/story/decart-artificial-intelligence-m...


Looks sick! Is the goal to make a set of models that would allow for complete immersive VR/AR? Like Ready-Player-One vibes?


What can this handle? Code? Browser? Computer Use?


Oasis is playable, and therefore:

1. Is non-cherrypicked in its consistency (if you look at the demonstrations in the Oasis blog post, the cases of strong consistency are the anomaly rather than the norm)

2. Is live-inferenced at 20fps. If you use Runway v3, which is a comparably larger and higher-quality model (in resolution and consistency), it might take a minute or two to generate 10 seconds of video.

3. Is served (relatively) reliably at consumer scale (with queues of 5-10k concurrent players), which means that in order to save on GPU cost, you increase batch size and decrease model size to "fit" more players on 1 GPU.
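The "fit more players on 1 GPU" point above can be sketched with back-of-envelope arithmetic. All numbers here are hypothetical (card size, model size, per-player state), just to show why shrinking the model buys concurrency:

```python
# Hypothetical sketch of the batching trade-off described above: a smaller
# model leaves more VRAM for per-player state, so more players per GPU.

def players_per_gpu(vram_gb, model_gb, state_gb_per_player):
    """How many concurrent players fit on one GPU, given model size."""
    return int((vram_gb - model_gb) // state_gb_per_player)

# On a hypothetical 80 GB card, halving the model from 20 GB to 10 GB
# frees room for 5 more players' worth of per-stream state.
print(players_per_gpu(80, 20, 2))  # → 30
print(players_per_gpu(80, 10, 2))  # → 35
```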



What is the upper bound on the level of improvement (high performance networking, memory and compute) you can achieve with ternary weights?


What’s the difference between this and all of the other query optimization startups? Bluesky, etc.


We have better tech. For our customers, this translates directly into more savings.

We also have less setup and overhead than most of the other companies in the space. Many of them come in with recommendations for system changes that you need to implement, and which they then charge you for; we take about ten minutes to set up and then generate savings automatically.


How does one select a good candidate for the draft model in speculative decoding? I imagine there's some better intuition than just selecting the next parameter count down (e.g. 70B -> 13B, 13B -> 7B).

Also how does that interact with MoE models? Do you have a mini version of the MoE, with smaller experts?


This is indeed a bit of a dark art. Essentially, you want a balance between "is significantly faster than base model" and "generates similar stuff to the base model".

Anecdotally, folks often seem to use say, 70B base + 7B as verifier. But I think there's a lot of room for experimentation and improvement here.

You could... say, take a 70B model and maybe just chop off the last 90% of layers and then fine-tune. Or perhaps you could use a model that's trained to generate 8 tokens at once. Or perhaps you could just use a statistical "n-gram" predictor.


If you were to serve this on a datacenter server, is the client-to-server network round trip the slowest part of inference? Curious whether it would be faster to run this on cloud GPUs (better hardware, but farther away) or locally on worse hardware.


Surprisingly, no. And part of this is that text generation is really expensive. Unlike traditional ML inference (like with ResNets), you don't just pass your data through your model once. You need to pass it over and over again (once for each token you generate).

So, in practice, a full "text completion request" can often take on the order of seconds, which dwarfs the client <-> server roundtrip.
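A quick sketch of that comparison, with made-up but plausible numbers (per-token latency, round-trip time, completion length), shows why the round trip barely registers:

```python
# Hypothetical timings: one forward pass per generated token, but the
# client pays the network round trip only once per request.

per_token_ms = 30    # forward-pass latency per token (made up)
roundtrip_ms = 50    # client <-> server round trip (made up)
tokens = 200         # a typical completion length

generation_ms = tokens * per_token_ms
total_ms = roundtrip_ms + generation_ms
print(generation_ms, total_ms)  # → 6000 6050
```

With these numbers the round trip is under 1% of the request time, so moving to worse local hardware to save 50 ms would be a bad trade.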


Is this still the case for sliding-window attention / streaming LLMs, where you have a fixed-length attention window rather than quadratic scaling from endlessly appending new tokens? You can even get better performance by purposely downsampling non-meaningful attention-sink tokens.


I cover it a bit in the blog post, but unless you have a really long context length (like 32k+), your primary computational cost doesn't come from attention but rather from loading your weights from VRAM into registers.
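The weight-loading point can be made concrete with rough arithmetic. All numbers below are hypothetical (model size, precision, memory bandwidth), just to show the bandwidth-bound lower bound on per-token latency:

```python
# At batch size 1, every generated token requires streaming all the
# weights from VRAM once, so memory bandwidth sets a latency floor
# regardless of how cheap attention is at short context.

params = 13e9            # hypothetical 13B-parameter model
bytes_per_param = 2      # fp16 weights
bandwidth_bytes_s = 2e12 # hypothetical ~2 TB/s datacenter GPU

weight_bytes = params * bytes_per_param
ms_per_token = weight_bytes / bandwidth_bytes_s * 1000
print(round(ms_per_token, 1))  # → 13.0
```

So even with attention made free by a sliding window, this sketch says ~13 ms/token, which is why techniques like batching and speculative decoding (which amortize or skip weight loads) matter more than attention tricks at short context.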

I mean, practically speaking, completions from say, ChatGPT or Claude take seconds to finish :)


What are some of the better use cases for fast inference? From my experience using ChatGPT, I don't need it to generate faster than I can read, but waiting for code generation is painful because I'm waiting for the whole code block to finish formatting correctly and become available to copy or execute (in the case of Code Interpreter). Anything else fall under this pattern?


The main thing is that chat is just one application of LLMs. Other applications are much more latency sensitive. Imagine, for instance, an LLM-powered realtime grammar checker in an editor.


Most LLM use shouldn't be 'raw' but part of a smart, iterative pipeline. Ex:

* reading: If you want it to do inference over a lot of context, you'll need to do multiple inferences. If each inference is faster, you can 'read' more in the same time on the same hardware

* thinking: a lot of analytical approaches essentially use writing as both memory & thinking. Imagine iterative summarization, or automatically iteratively refining code until it's right

For louie.ai sessions, that's meant a fascinating trade-off when doing the above:

* We can use smarter models like gpt-4 to do fewer iterations...

* ... or a faster but dumber model to get more iterations in the same amount of time

It's entirely non-obvious which wins. For example, the HumanEval leaderboard shows GPT-4 being beaten on code by GPT-3.5 when both are run by a LATS agent: https://paperswithcode.com/sota/code-generation-on-humaneval . This highlights that the agent framework is really responsible for final result quality, so the ability to run many iterations in the same time window matters.
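The "iteratively refining until it's right" pattern above can be sketched as a small loop. `generate` and `passes` here are hypothetical stand-ins for a model call and a test harness; a faster model simply affords more trips through this loop per unit time:

```python
# Hedged sketch of a generate -> check -> revise loop: regenerate with
# feedback until the check passes or the iteration budget runs out.

def refine(task, generate, passes, max_iters=5):
    attempt = generate(task, feedback=None)
    for _ in range(max_iters):
        ok, feedback = passes(attempt)
        if ok:
            return attempt
        attempt = generate(task, feedback=feedback)  # revise with feedback
    return attempt

# Stub "model": only produces a good answer after seeing feedback twice,
# standing in for a weak-but-fast model that improves with iteration.
def make_stub():
    seen = {"n": 0}
    def generate(task, feedback):
        if feedback is not None:
            seen["n"] += 1
        return "good" if seen["n"] >= 2 else "bad"
    return generate

checker = lambda attempt: (attempt == "good", "try again")
print(refine("demo", make_stub(), checker))  # → good
```

Under this framing, a dumber model that converges in 3 cheap iterations can beat a smarter model that only gets 1 expensive attempt in the same time budget.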


Programmatic and multi-step use cases. If you need chain-of-thought or similar, tool use, etc. Generating data.

Most use cases outside of classic chat.

For example, I made an on-demand educational video project, and the slowest part was by far the content generation. RAG, TTS, Image generation, text rendering, and video processing were all a drop in the bucket, in comparison.

The gap would be even wider now: TTS is faster than realtime, and image generation can be single-step.


Perhaps this is naive, but in my mind it can be useful for learning.

- Hook LLM to VMs

- Ask for code that [counts to 10]

- Run code on VM

- Ask different LLM to Evaluate Results.

- Repeat for sufficient volume.

- Train.

The faster it can generate results the faster those results can be tested against the real world, e.g. a VM, users on X, other models with known accuracies.
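The steps above can be sketched as a small pipeline. Everything here is a stub: `llm_code`, `run_in_vm`, and `llm_judge` are hypothetical stand-ins for the LLM calls and the sandboxed VM, just to show the shape of the loop:

```python
# Toy generate -> run -> evaluate loop: faster generation means more
# (prompt, code, verdict) training rows per unit of wall-clock time.

def collect_examples(prompts, llm_code, run_in_vm, llm_judge):
    dataset = []
    for p in prompts:
        code = llm_code(p)               # ask an LLM for code
        result = run_in_vm(code)         # execute it in a sandboxed VM
        verdict = llm_judge(p, result)   # a second model grades the result
        dataset.append((p, code, verdict))
    return dataset

# Stub plumbing for the "counts to 10" example.
llm_code = lambda p: "print(sum(range(1, 11)))"
run_in_vm = lambda code: 55
llm_judge = lambda p, result: result == 55

rows = collect_examples(["sum 1 to 10"], llm_code, run_in_vm, llm_judge)
print(rows[0][2])  # → True
```

The loop body is embarrassingly parallel across prompts, so inference speed translates almost directly into dataset volume.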


One obvious use case is that it makes per-token generation much cheaper.


That's not so much a use case, but I get what you're saying. It's nice that you can find optimizations that shift down the Pareto frontier across the cost and latency dimensions. The hard tradeoffs are for cases like inference batching, where it's cheaper and higher throughput but slower for the end consumer.
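That batching tradeoff can be sketched with back-of-envelope numbers. The timing model below is made up (a fixed per-step cost for loading weights plus a small per-sequence cost), but it captures why throughput and per-user speed pull in opposite directions:

```python
# Hypothetical sketch: larger batches amortize the weight load across
# streams (server throughput up) but lengthen each step (per-user speed down).

def throughput_and_latency(batch, base_ms=20, per_seq_ms=2):
    step_ms = base_ms + per_seq_ms * batch   # one batched forward pass
    tokens_per_sec = batch * 1000 / step_ms  # server-wide, all streams
    per_stream_tps = 1000 / step_ms          # what one user sees
    return round(tokens_per_sec, 1), round(per_stream_tps, 1)

print(throughput_and_latency(1))   # → (45.5, 45.5)
print(throughput_and_latency(32))  # → (381.0, 11.9)
```

With these made-up numbers, batch 32 serves ~8x more tokens per second overall while each individual stream runs ~4x slower, which is exactly the "cheaper but slower for the end consumer" case.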

What's a good use case for an order of magnitude decrease in price per token? Web scale "analysis" or cleaning of unstructured data?


Under the same conditions, enterprise versions of the API have significantly less latency and better reliability than personal ones. OpenAI can change anything about the underlying infrastructure.

