Hacker News | Vetch's comments

I'm not sure that's quite the right mental model to use. They're not searching randomly with unbounded compute, nor selecting from arbitrary strategies in this example. They are both using LLMs, and likely the same ones, so they will likely uncover overlapping possible solutions. Avoiding that overlap depends on exploring more of the tail of distributions that are highly correlated, possibly identical.

It's a subtle difference from what you said: it's not that everything has to go right in a sequence for the defensive side. Defenders just have to hope they committed enough into searching that the offensive side has a significantly lowered chance of finding solutions they did not find. Both the attackers and defenders are attacking a target program and sampling the same distribution for attacks; it's just that the defender is also iterating on patching any found exploits until their budget is exhausted.
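A toy Monte Carlo of that framing (the Zipf-like exploit distribution and all budgets are invented for illustration): both sides draw from the same heavy-tailed pool, the defender patches what it finds, and the attacker wins only if it draws something still unpatched.

```python
import random

# Both sides sample exploits from the SAME heavy-tailed distribution
# (both use similar LLMs); the defender patches everything it finds
# within its budget, then the attacker wins if any of its finds are
# still unpatched. All numbers are invented for illustration.
random.seed(0)

N_EXPLOITS = 100
# Zipf-like weights: a few exploits are easy to find, most live in the tail.
WEIGHTS = [1.0 / (i + 1) for i in range(N_EXPLOITS)]

def sample_exploits(budget):
    return set(random.choices(range(N_EXPLOITS), weights=WEIGHTS, k=budget))

def attacker_success_rate(defender_budget, attacker_budget=20, trials=300):
    wins = 0
    for _ in range(trials):
        patched = sample_exploits(defender_budget)
        found = sample_exploits(attacker_budget)
        if found - patched:          # any exploit the defender missed?
            wins += 1
    return wins / trials

for budget in (0, 100, 1000, 5000):
    print(budget, attacker_success_rate(budget))
```

The attacker's success rate falls as the defender's search budget pushes coverage further into the shared tail, which is the dynamic described above.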


Then the point still stands. This makes things even worse, given that it's adding its own hallucinations on top instead of simply relaying the content or, ideally, identifying issues in the reporting.


Being tall doesn't automatically make you good at, or dominant in, basketball; you can even be too tall. Wemby might be right at that threshold, but the unusual thing about him is his dexterity despite his height; such maneuverability and flexibility is trainable. I hear he also spent the summer training, likely harder than most.


No, but being short is completely disqualifying, so being tall is certainly a component of the physical traits that make you good at basketball. If you're 5'2", it doesn't matter what other gifts you have -- you will not be a pro male basketball player today.

In tennis, being too tall is clearly net bad, but being too short is also definitely bad. 80% of male pro tennis players are 5'10" - 6'4", which is certainly not the statistics of the general population.


Absolutely, it's a combination of many factors. However, height is undeniably very important. A 5'5" Wemby wouldn't be as impressive a player, no matter how much he trained.


Dennis Rodman is a famous counterexample (tall, no particular talent, became an All-Star for rebounding and shot-blocking)


It's an artifact of the post-training approach. Models like kimi k2 and gpt-oss don't utter such phrases and are quite happy to start sentences with "No" or something to the tune of "Wrong".

Diffusion also won't help the way you seem to think it will. That the outputs occur in a sequence is not relevant; what matters is the underlying computation class backing each token output, and there diffusion, as typically done, does not improve things. The argument is subtle, but the key is that the output dimension and the number of iterations in diffusion do not scale arbitrarily large with problem complexity.


You are right and the idea of LLMs as lossy compression has lots of problems in general (LLMs are a statistical model, a function approximating the data generating process).

Compression artifacts (which are deterministic distortions in reconstruction) are not the same as hallucinations (plausible samples from a generative model; even with greedy decoding, this is still sampling from the conditional distribution). A better analogy is super-resolution: if we use a generative model, the result will be sharper than a naive blotchy resize, but many details of the image will have changed, as the model supplies its best guesses at what the missing information could have been. LLMs aren't meant to reconstruct a source, even though we can sample their distribution for snippets that are reasonable facsimiles of the original data.

An LLM provides a way to compute the probability of a given string. Paired with entropy coding and on-line learning on the target data, this yields the correct MDL-based lossless-compression view of LLMs.
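A toy sketch of that view (the tiny adaptive bigram model here is a stand-in of my own for an actual LLM): an ideal entropy coder spends -log2 p(symbol | context) bits per symbol, and because the model updates on-line as it reads the data, a decoder can replay the same updates, making the scheme lossless.

```python
import math
from collections import defaultdict

# The "model" is a Laplace-smoothed, on-line adaptive byte bigram model.
# An ideal entropy coder (e.g. arithmetic coding) would achieve the total
# code length computed here to within a couple of bits.
ALPHABET = 256  # byte-level

counts = defaultdict(lambda: defaultdict(int))

def prob(prev, cur):
    # Laplace smoothing so every symbol has nonzero probability.
    total = sum(counts[prev].values()) + ALPHABET
    return (counts[prev][cur] + 1) / total

def code_length_bits(data: bytes) -> float:
    prev = 0
    bits = 0.0
    for cur in data:
        bits += -math.log2(prob(prev, cur))  # ideal entropy-coded cost
        counts[prev][cur] += 1               # on-line update AFTER coding
        prev = cur
    return bits

text = b"the quick brown fox jumps over the lazy dog " * 50
bits = code_length_bits(text)
print(f"{bits / len(text):.2f} bits/byte vs 8.00 raw")
```

Swap the bigram counts for an LLM's conditional distribution and the same accounting gives the MDL compressed size of the data under that model.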


Unless you're also writing your own graphics and game engine from scratch, a truly novel and balanced game should not be something you can just crank out code for with AI. When working in engines, the bulk of the work is usually gameplay programming, so the fact that its code is so predictable should be concerning (unless the programming is effectively in natural language). Not spending most of your time testing introduced mechanics, re-balancing, and iterating should be setting off alarm bells. If you're working on an RPG, narrative design, reactivity, and writing will eat up most of your time.

If you're working as part of a team large enough to have dedicated programmers, the majority of the roles will usually be in content creation, design, and QA.


Why would the proportion of high-quality games increase? The number, yes, but I expect not the proportion. Lowering the entry barrier means more people who have spent less time honing their skills can release something lacking in polish, narrative design, fun mechanics, and balance. Among new entrants, such developers should outnumber those already able to make a fun game. Not a value judgement, just an observation.

Think of the negative reputation the Unity engine gained among gamers, even though a lot of excellent games and even performant games (DSP) have been made with it.

More competition also raises the bar required for novelty, so it is possible that standards are rising in parallel.


We had shovelware games 25+ years ago (and probably 40 years ago, though I suspect the lack of microcomputers limited that). There were bargain-bin selections (literally bins full of CDs) that cost a few bucks and were utterly shite. I suspect the target audience was tech-unaware relatives who would be "little Johnny likes video games, I'll get him one of these...". Most of them were bad takes on popular games of the time.

Unity + Steam just makes this process a bit easier and more streamlined. I think the new thing is that as well as the dickwads who are trying to rip people off, there are well-intentioned newbie or indie developers releasing their unpolished attempts. These folks couldn't publish their work in the old days, because making CDs costs money, while now they can.


But why isn't this merely papering over a more fundamental issue with how these models are "aligned"? LLMs are, for example, not inherently sycophantic. kimi k2 and o3 are not, and Sydney, mentioned in the blog post, was most decidedly not.

In my experience, sycophancy has been present longest in the Anthropic models, so it might be most deeply rooted there. It's only recently, perhaps with the introduction of user A/B preference tests such as those run by lmarena and by the providers themselves, that this has become a major issue for most other LLMs.

Thinking that simple actions like adding an anti-evil vector to the residual stream will improve behavior sounds naively dangerous. It would not surprise me if unexpected and unwanted downstream effects resulted from this, which a future paper will no doubt address too. Not unlike what happened with tuning for user preference.
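For concreteness, here's roughly what such an intervention looks like mechanically (the shapes, scale, and contrastive construction are illustrative assumptions of mine, not Anthropic's actual recipe): a steering vector is derived from the difference in mean activations on contrastive prompt sets and added to the residual stream at one layer.

```python
import numpy as np

# Minimal sketch of activation steering; all dimensions and data are fake.
rng = np.random.default_rng(0)
d = 16                              # residual-stream width (toy)
h = rng.normal(size=d)              # activation at some layer, some token
h_evil = rng.normal(size=(32, d))   # activations on "evil" prompts
h_good = rng.normal(size=(32, d))   # activations on benign prompts

# Contrastive "anti-evil" direction: difference of the two means.
v = h_good.mean(axis=0) - h_evil.mean(axis=0)

# Intervention: nudge the residual stream along v with strength alpha.
alpha = 4.0
h_steered = h + alpha * v / np.linalg.norm(v)
```

The worry expressed above is that v is a single global direction, so adding it shifts every downstream computation that reads this layer, not just the behavior you intended to change.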


The brain is certainly vastly more energy efficient at inference than LLMs on GPUs. But it looks like you're trying to make a different argument, that an LLM can spend less energy than a human to complete a given task. Unfortunately, you have not made that argument and I won't be reading unverified LLM output that might contain hallucinated steps or claims.

> V3/R1 scale models as a baseline, one can produce 720,000 tokens

On what hardware? At how many tokens per second? But most importantly, at what quality? I can use a PRNG to generate 7 billion tokens at a fraction of the energy use of an LLM, but those tokens are not going to be particularly interesting. Simply counting how many tokens can be generated in a given time frame is still not a like-for-like comparison. To be complete, the cost required to match human-level quality, if that is even possible, also needs accounting for.

> Deeply thinking humans expend up to a third of their total energy on the brain

Where did you get this from? A 70B LLM? It's wrong or, at best, does not make sense. The brain barely spends any more energy above its baseline when thinking hard (often not much more than 5%). This is because most of its energy is spent on upkeep, such as maintaining resting membrane potential. Ongoing background activity like the DMN also means the brain is always actively computing something interesting.


> > V3/R1 scale models as a baseline, one can produce 720,000 tokens

> On what hardware? At how many tokens per second? But most importantly, at what quality?

The hardware is the GB200 NVL72 by NVidia. This is for the class of 671B DeepSeek models, e.g. R1-0528 or V3, with their full-accuracy setup (i.e. reproducing the quality of the reported DeepSeek benchmarks). Here is the writeup (by humans; the second figure shows the tokens per second per GPU as a function of batch size, which emphasizes the advantages of centralized decoding compared to current hacks at home): https://lmsys.org/blog/2025-06-16-gb200-part-1/

And here are the instructions to replicate the particular benchmark: https://github.com/sgl-project/sglang/issues/7227

The LLM text I linked in my original answer carries out the math using the energy consumption of the NVidia hardware setup (120kW) and rather simple arithmetic, which you can reproduce.


I agree with you that quality is the most important question, for similar reasons.

I don't think current models are at expert level, but they do seem to be reliably good enough to be useful, pass standardised tests, and sit solidly in the "good enough that you have to pay close attention for a while before you notice the stupid mistake" area that makes them very irritating for anyone running job interviews or publishing books, etc.

And worse, I also think the numbers you're replying to are, at best, off by a few decimal places.

If I take the 0.36 bananas (which was already suspicious) and USD 0.1/kWh, I get 0.004 USD. If I scale that up by 1/0.72 to get a megatoken, that's still only 5/9ths of a cent.
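Spelling that arithmetic out (the ~105 kcal per banana is my guess at the energy unit being used upthread, not a stated value; the 1/0.72 factor converts 720k tokens to a megatoken):

```python
# Back-of-envelope check of the figures above.
KCAL_TO_KWH = 1.163e-3
banana_kwh = 105 * KCAL_TO_KWH      # ~0.12 kWh per banana (my assumption)
energy_kwh = 0.36 * banana_kwh      # energy claimed for 720k tokens
cost_usd = energy_kwh * 0.10        # at USD 0.10 / kWh
per_megatoken = cost_usd / 0.72     # scale 720k tokens -> 1M tokens
print(f"${cost_usd:.4f} per 720k tokens, ${per_megatoken:.4f} per megatoken")
```

Either way, the result lands well under a cent per megatoken, which is the figure being questioned.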

If I make the plausible but not necessarily correct assumption that OpenAI's API prices reflect the cost of electricity, none of their models are even remotely that cheap. It's close enough to the cost of their text-embedding-3-small (per megatoken) to be within the fudge-factor of my assumption about how much of their prices are electricity costs, but text-embedding models are much, much weaker than generative transformer models, to the point they're not worth considering in the same discussion unless you're making an academic point.

> It's wrong or at best, does not make sense. The brain barely spends any more energy above its baseline when thinking hard (often not much more than 5%).

Indeed.

Now I'm wondering: how much power does the human brain use during an epileptic fit? That seems like it could plausibly be 70% of calories for the few seconds of the seizure? But I've only got a GCSE grade C in biology, so even with what I've picked up over the subsequent 25 years of general geeking, my idea of "plausible" is very weak.


> If I make the plausible but not necessarily correct assumption that OpenAI's API prices reflect the cost of electricity, none of their models are even remotely that cheap

This assumption is very wrong. The primary cost factor in inference is the GPU itself. NVidia's profit margins are very high; so are OpenAI's margins on API usage, even after taking into account the cost of the GPUs. You can understand their margins if you read about inference at scale, and the lmsys blog in my parallel answer is a decent eye-opener if you thought companies sell tokens close to the price of electricity.


An alternative and perhaps easier way to estimate the relative importance of GPU cost vs electricity cost is to estimate how many years of constant use of a GPU at full power it takes for the cost of industrial-scale electricity to catch up with the industrial-scale price of the GPU. The H200 has a 700 W max power draw and costs about 40k USD (prices vary a lot); the typical lowest rental price a year ago was 2 USD/h, possibly a bit lower by now. In an hour you could not even spend 1 kWh of electricity on one under optimal compute conditions, yet, at scale, you can negotiate prices below 0.05 USD per kWh in some parts of the US. Alternatively, assume 0.05 USD per kWh and use the GB200 NVL72, which draws 120 kW at peak. That is a cost of 6 USD/hour, or $52.6k per year. Even if one were to run the hardware for 10 years straight at peak performance without problems, the cost of electricity is far cheaper than the cost of the hardware itself (you have to ask NVidia for a quote, but expect a multi-million-dollar tag, and they have no shortage of customers ready to pay).
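The same estimate in code (all prices are the rough assumptions stated above, not quotes):

```python
# GPU cost vs electricity cost, using the comment's assumed prices.
H200_WATTS = 700
H200_COST_USD = 40_000
ELEC_USD_PER_KWH = 0.05
HOURS_PER_YEAR = 24 * 365

# H200: years of 24/7 max-power draw before electricity matches the card price
kwh_per_year = H200_WATTS / 1000 * HOURS_PER_YEAR        # ~6132 kWh/year
elec_per_year = kwh_per_year * ELEC_USD_PER_KWH          # ~$307/year
years_to_match = H200_COST_USD / elec_per_year           # ~130 years

# GB200 NVL72 rack: electricity cost at a 120 kW peak draw
rack_kw = 120
rack_elec_per_hour = rack_kw * ELEC_USD_PER_KWH          # $6/hour
rack_elec_per_year = rack_elec_per_hour * HOURS_PER_YEAR # ~$52.6k/year
print(round(years_to_match), rack_elec_per_hour, round(rack_elec_per_year))
```

At these prices, an H200 would need over a century of peak-power operation for electricity to equal its purchase price, which is why hardware, not power, dominates inference cost.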


That math is for random projections? Note that the JL lemma is a worst-case guarantee, and in practice there's a lot more distortion tolerance than the stated bounds would suggest. Concepts tend to live in a space of much lower intrinsic dimensionality than the data's, and we often care more about neighbor and rank information than precise pairwise distances.

Also, JL is only a part of the story for the transformers.
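A quick empirical illustration of that gap (the dimensions, point counts, and Gaussian projection are arbitrary choices of mine): random projections of typical high-dimensional data usually land well inside the worst-case JL distortion bounds.

```python
import numpy as np

# Project 200 points from 10,000 dims down to 128 and measure how much
# pairwise squared distances get distorted.
rng = np.random.default_rng(42)
n, d, k = 200, 10_000, 128
X = rng.normal(size=(n, d))

# Gaussian random projection, scaled so squared norms are preserved in
# expectation.
P = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ P

def pairwise_sq_dists(Z):
    sq = (Z ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2 * Z @ Z.T

orig = pairwise_sq_dists(X)
proj = pairwise_sq_dists(Y)
mask = ~np.eye(n, dtype=bool)      # ignore the zero diagonal
ratios = proj[mask] / orig[mask]
print(f"distortion ratios: mean={ratios.mean():.3f}, "
      f"max deviation={np.abs(ratios - 1).max():.3f}")
```

On data like this, the typical distortion is on the order of sqrt(2/k), far smaller than what the worst-case bound has to allow for every possible point set.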


Johnson-Lindenstrauss is an example of a probabilistic existence argument: the probability of a random projection having low error is nonzero, therefore a low-error projection must exist. That doesn't mean any given random projection can be expected to have low error, although if you keep rerolling often enough, you'll eventually find one.

The existence argument only provides a lower bound on the number of dimensions that can be represented with low error, but there's not necessarily much room for improvement left.

