Hacker News | runarberg’s comments

Don’t know about your parent, but I am certainly one of those “AI can’t make anyone more productive” people.

Well, at least that’s what I would say, while being a bit hyperbolic. But folks like us prefer to see claims by corporations trying to sell you stuff backed by behavioral research before we start taking the corporation’s word for it.


When I searched for "its in the tfa meaning" this was my third result on Duck Duck Go:

https://news.ycombinator.com/item?id=19781756

When I searched for "tfa internet meaning", the fifth result looked helpful, so I clicked it, and it was:

https://www.noslang.com/search/tfa

Searching the internet wasn’t hard before AI, and it isn’t hard today.


I just googled "what is tfa", and none of the results on the first page were related to the current topic.

Try “TFA acronym Internet forums”.

But surely your search engine must have given you the answer within your first three clicks, if not, perhaps you should consider a better search engine.

> Engineers turn their nose at this, but look who has tapped into this wealthy revenue stream.

This may be one of the most tone-deaf, American-imperialist sentiments I’ve heard on HN in a while.

Engineers who have any sense of morality have a pretty good reason to turn their nose at this, and there is no “but” needed to follow that sentence.


If you read the comment a little more closely, it is very obvious that the "this" engineers turn their noses up at is the flexible model full of glue code, à la Salesforce, as opposed to "good architecture".

It's more or less in the same vein as pointing out that WordPress powered a massive chunk of the Internet despite violating almost every good coding practice you can name, and that getting things done is what makes money, not building ivory towers.

The fact that you turned that argument into some sort of anti-American screed says much more about you than about the parent.


To be fair, I had the same interpretation as OP here. One cannot have an earnest discussion of Palantir without at least implicitly including the privacy and military-industrial-complex associations of this company.

That is why I called it tone deaf. I admit the part about American imperialism may have been unwarranted (“may” is emphasized for a reason).

This engineer turned their nose at the bad architecture and glue code, but neglected to mention Palantir’s total lack of morality. I would argue that abandoning morality and aiding the American imperialist machine in its war against human rights and dignity has been a much bigger reason for Palantir’s success than their lack of good engineering practices. They are willing to get paid for something most people morally object to. Lots of engineers are willing to abandon their craftsmanship if it pays well enough; few will abandon their morals.

Perhaps I read too much into this absence, in which case the post is only tone deaf, but I favor the read where this absence was intentional, in which case it is both tone deaf and American imperialist.


I don't think it's at all fair to make this kind of inference from what was written; you'd have to make huge assumptions, and take an ideological perspective as well. It might be a perfectly valid critique ... but it can't be inferred from the comment at all.

I'm really puzzled. I frequently post scathing criticisms of government spying on HN.

I'm one of the top HN commenters for the string "1984" and accounted for 10% of the mentions last year (as someone else blogged about).

I was just admiring the operations and scaling of it. It's pretty impressive to grow to such a scale.


A lot of people, especially outside the US, are going to look through a cynical geopolitical lens, which is not entirely unreasonable, so it's not 'surprising' at all that people would jump on this.

For example, I think Musk is a horrible person and I view all of his statements through the 'lens' of the fact he is lying, confabulating and he's a jerk.

But - I mean, SpaceX does work; it's by all means a pretty good company (work-life balance notwithstanding).

It's really hard to separate these issues.


That admiration is the tone deafness I perceived. It comes across as “we gotta hand it to ISIS” in its best interpretation.

Palantir has been on Amnesty International’s list of companies aiding human rights violations since 2020, in particular for aiding DHS and ICE in illegal deportations and family separations. In 2023 the company provided tech to the IDF which was then used in the Gaza genocide; the company prided itself on it (and consequently a lot of their staff resigned, as their morality did not allow them to keep working there). This is just to say we are not talking merely about government spying here: Palantir is a major participant in many of the world’s worst human rights abuses of the past decade.

Palantir is probably the company on the planet right now that is perceived by the general public as the most evil, and I for one think it deserves this reputation. One does not, in fact, gotta hand it to Palantir.


There are levels to things. In a professional context (including product design and documentation/instructions), don’t use machine translation[†].

For your personal hobby site or for general online communication, you probably shouldn’t use machine translation either, but it is probably useful if you have B1 language skills and are checking your grammar, vocabulary, etc. As for using LLMs to help you write, I certainly prefer people use the traditional models over LLMs, as the traditional models still require you to think and force you to actually learn more about the output language.

For reading somebody else’s content in a language you don’t understand, machine translation is fine up to a point, as long as you are aware that it may not be accurate.

---

† In fact, I personally think the EU should mandate translator qualifications, and probably would have only 20 years ago, when consumer protection was still a thing they pretended to care about.


I suspect the majority of these illegal moves happen in blitz or bullet tournaments, in game 12 of the third day, when the player touches one piece but moves another, or hits the clock with the hand that didn’t make the move, or hits the clock without making a move. I don’t think any expert-level chess player grabs a captured rook and places it on the board, or moves a light-squared bishop to a dark square, unless they are hustling at the park, in which case (it can be argued) moves like this with a sleight of hand are part of the game.

> It's a proxy for generalized reasoning.

And so far I am only convinced that they have succeeded in appearing to have generalized reasoning. That is, when an LLM plays chess it is performing Searle’s Chinese room thought experiment while claiming to pass the Turing test.


Wait, I may be missing something here. These benchmarks are gathered by having models play each other, and the second illegal move forfeits the game. This seems like a flawed method, as the models that are more prone to illegal moves are going to bump up the ratings of the models that are less prone.

Additionally, how do we know the model isn’t benchmaxxed to eliminate illegal moves?

For example, here is the list of games by gemini-3-pro-preview. In 44 games it performed 3 illegal moves (if I counted correctly) but won 5 because the opponent forfeited due to illegal moves.

https://chessbenchllm.onrender.com/games?page=5&model=gemini...

I suspect the ratings here may be significantly inflated due to a flaw in the methodology.

EDIT: I want to suggest a better methodology here (I am not gonna do it; I really really really don’t care about this technology): have the LLMs play rated engines and rated humans, where the first illegal move forfeits the game (same rules apply to humans).


The LLMs do play rated engines (maia and eubos). They provide the baselines. Gemini e.g. consistently beats the different maia versions.

The rest is taken care of by Elo. That is, they then play each other as well, but it is not really possible for Gemini to have a higher Elo than maia with such a small sample size (and such weak other LLMs).

Elo doesn't let you inflate your score by playing low-ranked opponents if there are known baselines (rated engines), because the rated engines will promptly crush your Elo.
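To make the self-correction concrete, here is a sketch of the standard Elo update. (The K-factor of 32 is an assumption for illustration; I don't know what K this particular benchmark uses.)

```python
# Standard Elo update. The point: beating much weaker opponents yields
# almost no rating points, while losing to a fixed-rating anchor engine
# costs many, so a rating inflated by weak opposition gets pulled back down.
def expected_score(r_a, r_b):
    # Expected score of player A against player B (between 0 and 1).
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32):
    # score_a is 1 for a win, 0.5 for a draw, 0 for a loss.
    return r_a + k * (score_a - expected_score(r_a, r_b))

# A 2000-rated model beating a 1400-rated one gains about 1 point,
# but losing to a 1600-rated anchor costs it about 29 points.
```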

You could add humans into the mix, the benchmark just gets expensive.


I did indeed miss something. I learned after posting (but before my EDIT) that there are anchor engines that they play.

However these benchmarks still have flaws. The two illegal moves = forfeit is an odd rule which the authors of the benchmarks (which in this case was Claude Code) added[1] for mysterious reasons. In competitive play if you play an illegal move you forfeit the game.

Second (and this is a minor one), Maia 1900 is currently rated 1774 on lichess[2] but is 1816 on the leaderboard; to the author’s credit, they do admit this in their methodology section.

Third, and this is a curiosity: gemini-3-pro-preview seems to have played the same game twice against Maia 1900[3][4], and in both cases Maia 1900 blundered (quite suspiciously, might I add) mate in one from a winning position with Qa3?? Another curiosity about this game: Gemini consistently played the top 2 moves on lichess. Until 16. ...O-O! (which has never been played on lichess), Gemini had played the most popular lichess move 14 times and the second most popular twice. That said, I’m not gonna rule out that this game being listed twice stems from an innocent data-entry error.

And finally, apart from Gemini (and Survival bot, for some reason?), LLMs seem unable to beat Maia-1100 (rated 1635 on lichess). The only anchor bot below that is random bot, and predictably the LLMs cluster on both sides of it, meaning they play about as well as random (apart from the illegal moves). This smells like benchmaxxing from Gemini. I would guess that the entire lichess repertoire features prominently in Gemini’s training data, and the model has memorized it really well, and is able to play extremely well if it only has to play 5-6 novel moves (especially when its opponent blunders checkmate in 1).

1: https://github.com/lightnesscaster/Chess-LLM-Benchmark/commi...

2: https://lichess.org/@/maia9

3: https://chessbenchllm.onrender.com/game/6574c5d6-c85a-4cb3-b...

4: https://chessbenchllm.onrender.com/game/4af82d60-8ef4-47d8-8...


> The two illegal moves = forfeit is an odd rule which the authors of the benchmarks (which in this case was Claude Code) added[1] for mysterious reasons. In competitive play if you play an illegal move you forfeit the game.

This is not true. It is clearly spelled out in the FIDE rules and upheld at tournaments. The first illegal move is a warning and a reset; the second illegal move is a forfeit. See here: https://rcc.fide.com/article7/

I doubt GDM is benchmarkmaxxing on chess. Gemini is a weird model that acts very differently from other LLMs so it doesn't surprise me that it has a different capability profile.


>> 7.5.5 After the action taken under Article 7.5.1, 7.5.2, 7.5.3 or 7.5.4 for the first completed illegal move by a player, the arbiter shall give two minutes extra time to his/her opponent; for the second completed illegal move by the same player the arbiter shall declare the game lost by this player. However, the game is drawn if the position is such that the opponent cannot checkmate the player’s king by any possible series of legal moves.

I stand corrected.

I’ve never actually played competitive chess; I’ve just heard this from people who do. And I thought I remembered a case in the Icelandic championship where a player touched one piece but moved another, and was subsequently made to forfeit the game.


Replying in a split thread to clearly separate where I was wrong.

If Gemini is so good at chess because of a non-LLM feature of the model, then it is kind of disingenuous to rate it as an LLM and claim that LLMs are approaching 2000 Elo. But the fact that it still plays illegal moves sometimes, is biased towards popular moves, etc. makes me think that chess is still handled by the LLM, and makes me suspect benchmaxxing.

But even if there is no foul play, and Gemini is truly a capable chess player with nothing but an LLM underneath, then all we can conclude is that Gemini can play chess well, and we cannot generalize to other LLMs, which play at about the level of random bot. My fourth point above was my strongest one. There are only 4 anchor engines: one beats all LLMs, the second beats all except Gemini, the third beats all LLMs except Gemini and Survival bot (what is Survival bot even doing there?), and the fourth is random bot.


Gemini is an LLM. It playing chess is not relying on a non-LLM module of some sort. I'm just saying that as an LLM, Gemini has a peculiar profile compared to other LLMs (likely an artifact of its post-training process). In particular Gemini is very capable, but also quite misaligned (it will more often actively sabotage users).

> then all we can conclude is that Gemini can play chess well, and we cannot generalize to other LLMs who play about the level of random bot

That's overly reductive. That would be true if we didn't see improvement over time from the other LLMs but we clearly do. In particular, even if Gemini is benchmarkmaxxing, this means that LLMs from other labs will eventually get there as well. Benchmarkmaxxing can be thought of as "premature" reaching of benchmarks. But I can't think of a single benchmark that was benchmarkmaxxed that wasn't eventually saturated by every single LLM provider (because being able to benchmarkmaxx serves as an existence proof that there is an LLM capable of it and as more training gets done on the LLMs the other ones get there).


The problem with benchmaxxing is that it lies about the capabilities of the technology. If all we wanted was a machine that plays chess, we would just use a chess engine, which we have known how to make for decades. If Google wanted Gemini to be able to play chess, it would be much easier (and better; and a helluva lot cheaper) to stick a traditional chess engine into their product and defer all chess to that engine.

The claim here (way up thread) was: “we have the technology to train models to do anything that you can do on a computer, only thing that's missing is the data”, and the implication is that logic and reasoning are emergent properties of these models, given enough data and enough parameters. However, the evidence seems to suggest otherwise. Logic and reasoning have to be specifically programmed into these models, and even with a dataset as vast as online chess games (lichess alone has 7.1 billion games), if the claim above were true, chess should be easy for LLMs, but it obviously isn’t. And that tells us something about the limitations of the technology.


That’s a devastating benchmark design flaw. Sick of these bullshit benchmarks designed solely to hype AI. AI boosters turn around and use them as ammo, despite not understanding them.

Relax. Anyone who's genuinely interested in the question will see with a few searches that LLMs can play chess fine, although the post-trained models mostly seem to have regressed. Problem is, people are more interested in validating their own assumptions than anything else.

https://arxiv.org/abs/2403.15498

https://arxiv.org/abs/2501.17186

https://github.com/adamkarvonen/chess_gpt_eval


I like this game between grok-4.1-fast and maia-1100 (engine, not LLM).

https://chessbenchllm.onrender.com/game/37d0d260-d63b-4e41-9...

This exact game has been played 60 thousand times on lichess. The piece sacrifice Grok performed on move 6 has been played 5 million times on lichess. Every single move Grok made is also the top played move on lichess.

This reminds me of Stefan Zweig’s The Royal Game, where the protagonist survives Nazi torture by memorizing every game in a chess book his torturers dropped (excellent book, btw; and yes, I am aware I just invoked Godwin’s law, and aware of the irony). The protagonist became “good” at chess simply by memorizing a lot of games.


The LLMs that can play chess (i.e. that don’t make an illegal move every game) do not play simply from memorized games.

> That’s a devastating benchmark design flaw

I think parent simply missed until their later reply that the benchmark includes rated engines.


This claim sounds plausible, but it is also testable. Do you know whether this has actually been tested in an experimental setting?

I have still not been convinced that LLMs are anything other than super fancy (and expensive) curve-fitting algorithms.

I don’t like to throw the word intelligence around, but when we talk about intelligence we are usually talking about human behavior. And there is nothing human about being extremely good at curve fitting in a multi-parameter space.
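To make “curve fitting” concrete, here is the simplest instance of it: ordinary least squares for a straight line. The claim above is that LLM training is, in essence, this idea scaled up to billions of parameters.

```python
# Ordinary least squares for a line y = a*x + b, in closed form:
# slope = covariance(x, y) / variance(x), intercept from the means.
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Points sampled from y = 2x + 1 recover the line exactly:
# fit_line([0, 1, 2, 3], [1, 3, 5, 7]) → (2.0, 1.0)
```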


At the risk of explaining the insult:

https://en.wikipedia.org/wiki/Not_even_wrong

Personally I think “not even wrong” is the perfect description of this argumentation. Intelligence is extremely scientifically fraught. We have been doing intelligence research for over a century, and to date we have very little to show for it (and a lot of it ended up being garbage race science anyway). Most attempts to provide a simple (and often any) definition or description of intelligence end up being “not even wrong”.

