They're already doing that https://docs.perplexity.ai/guides/bots There's *Perpl...

avallach · 2025-08-05T13:44:05 1754401445

And then, once they see that the website operator has blocked perplexity-user, they apparently don't respect that: they not only ignore robots.txt, but actively try to bypass the security measures established for the explicit purpose of limiting their access. If this were about bypassing DRM rather than an AI-WAF, it would be plainly illegal.

To me this invalidates their whole claim that Cloudflare fails to tell the difference between a scraper and a user-driven agent. On the contrary, distinguishing them is trivial, and the block is intentional.
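
To illustrate how simple the robots.txt side of this is, here's a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `is_allowed` helper and the example.com URL are made up for illustration, and the exact spelling of the user-agent token is an assumption based on the name mentioned above. This only shows how a well-behaved agent would evaluate the rule; it says nothing about how Cloudflare or Perplexity actually work.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish to block one agent
# while leaving everything else allowed.
ROBOTS_TXT = """\
User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"  # placeholder URL
    for agent in ("Perplexity-User", "SomeBrowser"):
        print(agent, "allowed" if is_allowed(agent, url) else "blocked")
```

An agent that honours robots.txt would run a check like this before fetching and stop when it gets a "blocked" answer.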

skeledrew · 2025-08-05T15:19:19 1754407159

I use Perplexity regularly for research because it does a good job of accessing, preprocessing and citing relevant resources. Which do you think is better: the service respects my desire for it to do a good job and ignores site owners who block agent access because they "don't like automated agents", or the service respects those site owners' desires, which I consider unreasonable, and doesn't do a good job for me? Now extend that to the inevitably growing LLM-for-research user base.

avallach · 2025-08-05T16:29:16 1754411356

I can totally see your point. It's a bit like that fight of news agencies against the free snippets and aggregations on 3rd party websites. The Internet is supposed to be open after all.

But it also feels like essentially "pirating" the webpages while erasing their brand. Maybe it's even a tolerable transitional situation, but you can't argue it's beneficial in the way some claim game piracy can be. In the long term, we need an incentive for content creators to willingly allow such processing. Otherwise, a lot of high-quality content will eventually become members-only with DRM-like anti-agent protections.

The incentive doesn't have to be monetary. For example, I could imagine some website owners allowing AI agents that commit to repeating verbatim, up front, some sort of mandatory headers/messages/acknowledgements from the content authors before copying or summarizing, and that are known to stick to this commitment.

You can also bypass the problem right now by accessing and copying the content manually, and then putting it in the context of a tool like NotebookLM. Nobody's hurt, because you have actually seen the source yourself, and that's all the website owners can reasonably demand.

TL;DR: why even post quality content in the open if the audience won't see your ads, your donation button, or even your name? What do you think?

viraptor · 2025-08-05T20:58:47 1754427527

This kind of makes sense for ChatGPT and others. But Perplexity links to your content directly. In practice I end up clicking more Perplexity sources than search results. I don't know how well that generalises, but the traffic is not just going away.

skeledrew · 2025-08-05T17:21:25 1754414485

> In the long term, we need an incentive for the content creators to willingly allow such processing. Otherwise, a lot of high quality content will eventually become members-only with DRM-like anti agent protections.

I partially agree with this. Yes, some incentive is OK in some cases. I wouldn't be OK with a mandatory header/message showing up in my output, for example, unless it had some very direct relevance to the content. But there could be some kind of tipping markup/code embedded in the site metadata that my agent abstracts away as content-rating feedback options, with tips made automatically on my behalf if I have that configured and select the "useful" option. Of course, source citation should also be a mandatory part of the output, both for that branding and in case there's a desire to go beyond the output.

However, there will also always be content authors out there who share quality content freely with no expectation of any kind of return. The "problem" is that such content usually isn't SEO-optimized, and so likely won't be in the top results. Little will be lost if those optimizing for return start blocking their content: they'll also be automatically deranked by virtue of content access issues, and the non-optimized content will then rise to the surface.

TL;DR: suggested configurable creator-tipping system abstracted behind feedback options, and the likely case that those who block access will be deranked in favor of those maintaining open access.