Hacker News | michaelcampbell's comments

"This is AI" is the new "This is 'shopped, I can tell by the pixels."

I can tell by the em dashes

Total tangent, but what vagary of HTML (or the Brave Browser, which I'm using here) causes words to be split in very odd places? The "inspect" devtools certainly didn't show anything unusual to me. (Edit: Chrome, MS Edge, and Firefox do the same thing. I also notice they're all links; wonder if that has something to do with it.)

https://i.imgur.com/HGa0i3m.png


CSS on the <a> tags:

    word-break: break-all;


It's an error in the site's CSS. CSS has much better mechanisms, like breaking words correctly for the page's language and hyphenating them.

Although I can never remember the correct incantation, it should be easy for an LLM.
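For reference, a sketch of that incantation (property names per the CSS Text spec; the bare `a` selector is an assumption about this site's markup):

```css
/* Instead of word-break: break-all, which splits words anywhere: */
a {
  overflow-wrap: break-word; /* break only when a word can't fit on the line */
  hyphens: auto;             /* hyphenate according to the declared language */
}
```

Note that `hyphens: auto` only works when the page declares a language, e.g. `<html lang="en">`.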


CSS word-break property

Ask Claude?

Going to give this one a try; I'm still partial to Atkinson Hyperlegible, but this one looks fun.

"some of you may die, but it's a risk I'm willing to take"

> give reasons why

Because it'll be an LLM guided bot handing out bans, so no one will actually KNOW why.


"I will never be a world class athlete, so I play for the love of the sport."

Helps me.


> What problems (besides the obvious) have been found in which "memory-safe languages" can help.

Why isn't that enough?


Probably too early to tell, but the tech industry is rife with magic incantations and long-held practices we follow because we've always done them, not because they "work".

I also wonder: it's a normal scraper mechanism doing the scraping, right? Not necessarily an LLM in the first place, so the wholesale data-sucking isn't going to "read" the file even if it IS accessed?

Or is this file meant to be "read" by an LLM long after the entire site has been scraped?


Yes. It's a basic scraper that fetches the document, parses it for URLs using a regex, then fetches all of those, and repeats forever.

I've done honeypot tests with links in HTML comments, links in JavaScript comments, routes that only appear in robots.txt, etc. All of them get hit.
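A minimal sketch of one such honeypot (the trap path and the ban logic are hypothetical; the point is that a link hidden in an HTML comment is invisible to browsers but gets extracted by regex-based scrapers):

```python
# Hypothetical honeypot: a path that is never linked anywhere a human would follow.
HONEYPOT_PATH = "/trap-8f3a"

def render_page(body: str) -> str:
    # The link lives inside an HTML comment: browsers ignore it,
    # but a naive regex-based scraper extracts and fetches it anyway.
    return (
        "<html><body>" + body +
        f"<!-- <a href='{HONEYPOT_PATH}'>x</a> -->"
        "</body></html>"
    )

banned: set = set()

def handle_request(ip: str, path: str) -> int:
    # Anything that follows the hidden link is assumed to be a bot.
    if path == HONEYPOT_PATH or ip in banned:
        banned.add(ip)
        return 403
    return 200
```
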


What about scripted transformations? Or just add a simple timestamp to the query and only allow it to be used up to a week later? (Whether it works without the parameter could be tested too)
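The timestamp idea can be sketched as an HMAC-signed query parameter that the server refuses to honor after a week (the secret, parameter names, and expiry window here are all illustrative assumptions):

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"   # hypothetical server-side key
MAX_AGE = 7 * 24 * 3600          # links expire after one week

def sign(path: str, ts: int) -> str:
    """Sign the path plus its timestamp so the parameter can't be forged."""
    msg = f"{path}?ts={ts}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def make_url(path: str) -> str:
    """Emit a link carrying a signed, dated token."""
    ts = int(time.time())
    return f"{path}?ts={ts}&sig={sign(path, ts)}"

def is_valid(path: str, ts: int, sig: str, now: float = None) -> bool:
    """Reject stale or forged links; scrapers replaying old URLs get nothing."""
    now = time.time() if now is None else now
    if now - ts > MAX_AGE:
        return False
    return hmac.compare_digest(sig, sign(path, ts))
```

Whether the site also serves the page without the parameter (for testing, as suggested above) is a separate policy decision.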

We need to update robots.txt for the LLM world, help them find things more efficiently (or not at all I guess). Provide specs for actions that can be taken. Etc.
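A sketch of what such a file might look like, loosely following the proposed llms.txt convention (a markdown file at the site root; the URLs and sections here are hypothetical):

```
# Example Site

> One-line summary of what this site offers and who it is for.

## Docs

- [Getting started](https://example.com/docs/start.md): setup and first steps
- [API reference](https://example.com/docs/api.md): endpoints and parameters

## Optional

- [Changelog](https://example.com/changelog.md)
```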

If current behaviour is anything to go by, they will ignore all such assistance, and instead insist on crawling infinite variations of the same content accessed with slightly different URL patterns, plus hallucinate endless variations of non-existent but plausible-looking URLs to hit as well until the server burns down - all on the off-chance that they might see a new unique string of text which they can turn into a paperclip.

There's no LLM in the loop at all, so any attempt to solve it by reasoning with an LLM is missing the point. They're not even "ignoring" assistance as sibling supposes. There simply is no reasoning here.

This is what you should imagine when your site is being scraped:

    import re
    import requests

    def crawl(url):
        r = requests.get(url).text
        store(r)  # save the raw page ('store' is a placeholder)
        for link in re.findall(r'https?://[^\s<>"\']+', r):
            crawl(link)

Sure, but at some point the idea is to train an LLM on these downloaded files, no? I mean, what is the point of getting them if you don't use them? So sure, this won't be interpreted during the crawling, but it will become part of the knowledge of the LLM.

Training is not inference; there is no reasoning happening then either.

Even if it did have some effect down the line it wouldn't help sites like AA with their scraping problem, which is the issue at hand.


You mean to add bad Monte-Carlo generated slop pages which are only advertised as no-go in the robots.txt file, right?

Absolutely.

I assume that there are data brokers, or AI companies themselves, constantly scraping the entire internet through non-AI crawlers and then processing the data in some way for training. But even through that pipeline, there are too few requests for llms.txt to conclude that anyone actually uses it.


I assume this might be changing. Anecdotally, from what I've read here, I think we're starting to see headless browsers driven by LLMs for the purposes of scraping (to get around some of the content blocks we're seeing). Perhaps this is a solution that won't work now, but might in the future.

I think it depends. LLMs now can look up things on the fly to bypass the whole "this model was last updated in December 2025" issue of having dated information. I've literally told Claude before to look up something after it accused me of making up fake news.

Good on him, then. Much luck and hopes of prosperity.

