
michaelcampbell · 2026-03-01T17:42:47 1772386967

"This is AI" is the new "This is 'shopped, I can tell by the pixels."

tingletech · 2026-03-01T18:02:56 1772388176

I can tell by the em dashes

michaelcampbell · 2026-03-01T17:38:53 1772386733

Total tangent, but what vagary of HTML (or the Brave Browser, which I'm using here) causes words to be split in very odd places? The "inspect" devtools certainly didn't show anything unusual to me. (Edit: Chrome, MS Edge, and Firefox do the same thing. I also notice they're all links; wonder if that has something to do with it.)

https://i.imgur.com/HGa0i3m.png

werdnapk · 2026-03-01T17:51:16 1772387476

CSS on the <a> tags:

word-break: break-all;

knallfrosch · 2026-03-01T19:03:38 1772391818

It's an error in the site's CSS. CSS has far better methods, like breaking words correctly for the language and hyphenating them.

I can never remember the correct incantation, though it should be easy for LLMs.

fancy_pantser · 2026-03-01T17:53:05 1772387585

CSS word-break property

rosstex · 2026-03-01T19:29:16 1772393356

Ask Claude?

michaelcampbell · 2026-03-01T17:29:03 1772386143

Going to give this one a try; I'm still partial to Atkinson Hyperlegible, but this one looks fun.

michaelcampbell · 2026-03-01T13:40:43 1772372443

"some of you may die, but it's a risk I'm willing to take"

michaelcampbell · 2026-02-23T12:39:12 1771850352

> give reasons why

Because it'll be an LLM guided bot handing out bans, so no one will actually KNOW why.

michaelcampbell · 2026-02-23T12:33:04 1771849984

"I will never be a world class athlete, so I play for the love of the sport."

Helps me.

michaelcampbell · 2026-02-23T12:31:49 1771849909

> What problems (besides the obvious) have been found in which "memory-safe languages" can help.

Why isn't that enough?

michaelcampbell · 2026-02-18T12:45:19 1771418719

Probably too early to tell, but the tech industry is rife with magic incantations and long-held beliefs that we follow because we've always followed them, not because they "work".

michaelcampbell · 2026-02-18T12:42:25 1771418545

I also wonder: it's a normal scraper mechanism doing the scraping, right? Not necessarily an LLM in the first place, so the wholesale data-sucking isn't going to "read" the file even if it IS accessed?

Or is this file meant to be "read" by an LLM long after the entire site has been scraped?

hamdingers · 2026-02-18T15:26:29 1771428389

Yes. It's a basic scraper that fetches the document, parses it for URLs using regex, then fetches all those, repeat forever.

I've done honeypot tests with links in html comments, links in javascript comments, routes that only appear in robots.txt, etc. All of them get hit.
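A minimal sketch of why honeypots like these get hit, assuming the scraper pulls URLs out of the raw response with a regex instead of parsing the HTML. The page content and the example.com URLs are made up for illustration, not taken from any real test:

```python
import re

# Hypothetical page: one visible link plus honeypot links hidden in comments.
PAGE = """
<a href="https://example.com/visible">visible</a>
<!-- <a href="https://example.com/trap-html-comment">hidden</a> -->
<script>
// see https://example.com/trap-js-comment
</script>
"""

# Matches anything that looks like a URL, with no notion of HTML structure.
URL_RE = re.compile(r'https?://[^\s<>"\']+')

def extract_links(text):
    """Return every URL-looking string found in the raw text."""
    return URL_RE.findall(text)

links = extract_links(PAGE)
print(links)
```

A regex has no idea that two of those links sit inside comments, so all three come back as crawl targets.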

efreak · 2026-02-18T19:45:56 1771443956

What about scripted transformations? Or just add a simple timestamp to the query and only allow it to be used up to a week later? (Whether it works without the parameter could be tested too)
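A rough sketch of the timestamp idea, assuming the server also signs the timestamp so a scraper cannot simply mint fresh ones. The signing step, the SECRET value, and the function names are all assumptions added for illustration, not part of the suggestion above:

```python
import hashlib
import hmac
import time

SECRET = b"change-me"        # hypothetical server-side secret
MAX_AGE = 7 * 24 * 3600      # one week, in seconds

def sign(path, ts):
    """Return a hex HMAC-SHA256 over the path and timestamp."""
    msg = f"{path}?ts={ts}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def make_url(path, now=None):
    """Build a link carrying the issue time and its signature."""
    ts = int(time.time() if now is None else now)
    return f"{path}?ts={ts}&sig={sign(path, ts)}"

def is_valid(path, ts, sig, now=None):
    """Accept only untampered links that are at most MAX_AGE old."""
    now = time.time() if now is None else now
    fresh = 0 <= now - ts <= MAX_AGE
    return fresh and hmac.compare_digest(sig, sign(path, ts))
```

Links harvested in one crawl stop working after a week, so a scraper replaying an old URL list gets rejected. Whether requests without the parameter should still be served is the open question raised above.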

dumbfounder · 2026-02-18T18:33:36 1771439616

We need to update robots.txt for the LLM world, help them find things more efficiently (or not at all I guess). Provide specs for actions that can be taken. Etc.

gamesieve · 2026-02-18T19:38:13 1771443493

If current behaviour is anything to go by, they will ignore all such assistance, and instead insist on crawling infinite variations of the same content accessed with slightly different URL-patterns, plus hallucinate endless variations of non-existent but plausible looking URLs to hit as well until the server burns down - all on the off-chance that they might see a new unique string of text which they can turn into a paperclip.

hamdingers · 2026-02-18T22:24:17 1771453457

There's no LLM in the loop at all, so any attempt to solve it by reasoning with an LLM is missing the point. They're not even "ignoring" assistance as sibling supposes. There simply is no reasoning here.

This is what you should imagine when your site is being scraped:

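    import re
    import requests

    def crawl(url):
        text = requests.get(url).text
        store(text)  # store() stands in for whatever the scraper does with the page
        for link in re.findall(r'https?://[^\s<>"\']+', text):
            crawl(link)  # no dedup, no robots.txt check, repeat forever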

flaburgan · 2026-02-18T23:09:28 1771456168

Sure, but at some point the idea is to train an LLM on these downloaded files, no? I mean, what is the point of getting them if you don't use them? So sure, this won't be interpreted during the crawling, but it will become part of the knowledge of the LLM.

hamdingers · 2026-02-19T16:24:08 1771518248

Training is not inference, there is no reasoning happening then either.

Even if it did have some effect down the line it wouldn't help sites like AA with their scraping problem, which is the issue at hand.

boothby · 2026-02-18T22:58:07 1771455487

You mean to add bad Monte-Carlo generated slop pages which are only advertised as no-go in the robots.txt file, right?

reconnecting · 2026-02-18T12:56:05 1771419365

Absolutely.

I assume that there are data brokers, or AI companies themselves, that are constantly scraping the entire internet through non-AI crawlers and then processing the data in some way to use it in the learning process. But even with all that crawling, there aren't enough requests for LLMs.txt to suggest that anyone actually uses it.

olivia-banks · 2026-02-18T21:16:24 1771449384

I assume this might be changing. Anecdotally, from what I've read here, I think we're starting to see headless browsers driven by LLMs for the purposes of scraping (to get around some of the content blocks we're seeing). Perhaps this is a solution that doesn't work now but might in the future.

giancarlostoro · 2026-02-18T14:38:33 1771425513

I think it depends. LLMs now can look up things on the fly to bypass the whole "this model was last updated in December 2025" issue of having dated information. I've literally told Claude before to look up something after it accused me of making up fake news.

michaelcampbell · 2026-02-17T20:23:35 1771359815

Good on him, then. Much luck and hopes of prosperity.