robots.txt's main purpose back in the day was curtailing penalties in the search engines when you got stuck maintaining a badly-built dynamic site that had tons of dynamic links and effectively got penalized for duplicate content. It was basically a way of saying "Hey search engines, these are the canonical URLs, ignore all the other ones with query parameters or whatever that give almost the same result."
It could also help keep 'nice' crawlers from getting stuck crawling an infinite number of pages on those sites.
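For instance, a minimal robots.txt along those lines might have looked something like this (the paths here are made up for illustration, and wildcard patterns are a later search-engine extension rather than part of the original convention):

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /search
    # wildcard matching like the next line was only honored by later crawlers
    Disallow: /*?sessionid=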
Of course it never did anything for the 'bad' crawlers that would hammer your site! (And there were a lot of them, even back then.) That's what IP bans and such were for. You certainly wouldn't base it on something like User-Agent, which the user agent itself controlled! And you wouldn't expect the bad bots to play nicely just because you asked them.
That's about as naive as the Do-Not-Track header, which was basically kindly asking companies whose entire business is tracking people to just not do that thing that they got paid for.
Or the Evil Bit proposal, which suggested that malware should identify itself in the headers. "The Request for Comments recommended that the last remaining unused bit, the "Reserved Bit" in the IPv4 packet header, be used to indicate whether a packet had been sent with malicious intent, thus making computer security engineering an easy problem – simply ignore any messages with the evil bit set and trust the rest."
While we're at it, it should be noted that Do Not Track was not, apparently, a joke.
It's the same as a noreply email: if you can get away with sticking your fingers in your ears and humming when someone is telling you something you don't want to hear, and you have a computer to hide behind, then it's all good.
It is ridiculous, but it is what you get when you have conflicting interests and broken legislation. The rule is that tracking has to be opt-in, so websites do it in the way that's most likely to get people to opt in, and that is a cookie banner before you can access the content.
Do-not-track is opt-out, not opt-in, and in fact, it is not opt-anything since browsers started to set it to "1" by default without asking. There is no law forcing advertisers to honor that.
I guess it could work the other way: if you set do-not-track to 0 (meaning "do-track"), which no browser does by default, sites could auto-accept cookies and not show the banner. But then the law says that it should require no more actions to refuse consent than to consent (to counter those ridiculous "accept or uncheck 100 boxes" popups), so it would mean they would also have to honor do-not-track=1, which they don't want to.
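For reference, the whole mechanism amounts to a single request header that the receiving server is free to read or ignore (example.com is just a placeholder):

    GET /article HTTP/1.1
    Host: example.com
    DNT: 1

"1" means the user objects to tracking, "0" means they consent; nothing on the wire forces the server to act on either value.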
I don't know how the legislation could be unbroken. Users don't want ads, don't want tracking; they just want the service they ask for and don't want to pay for it. Service providers want exactly the opposite. Also, people need services and services need users. There is no solution that will satisfy everyone.
Labor laws are not set to satisfy everyone; they are set such that a company cannot use its outsized power to exploit its workers, and that workers have a fair chance at negotiating a fair deal, despite holding less power.
Similarly, consumer protection laws (which the cookie banners are) are not set to satisfy everyone; they are set such that companies cannot use their outsized power to exploit their customers. A good consumer protection law will simply ban harmful behavior regardless of whether the companies that engage in said harmful behavior are satisfied with that ban or not. A good consumer protection law will satisfy the user (or rather the general public), but it may not satisfy the companies.
Good consumer protection laws are things like disclosure requirements or anti-tying rules that address information asymmetries or enable rather than restrict customer choice.
Bad consumer protection laws try to pretend that trade offs don't exist. You don't want to see ads, that's fine, but now you either need to self-host that thing or pay someone else money to do it because they're no longer getting money from ads.
There is no point in having an opt-in for tracking. If the user can be deprived of something for not opting in (i.e. you can't use the service), then it's useless; and if they can't, then the number of people who would purposely opt in is entirely negligible, and you ought to stop beating around the bush and do a tracking ban. But don't pretend that's not going to mean less "free stuff".
The problem is legislators are self-serving. They want to be seen doing something without actually forcing the trade off that would annihilate all of these companies, so instead they implement something compromised to claim they've done something even though they haven't actually done any good. Hence obnoxious cookie banners.
That whole argument assumes that you as a consumer can always find a product with exactly the features you want. Because that's a laughable fiction, there need to be laws with teeth to punish bad behaviors that nearly every product would indulge in otherwise. That means things like requiring sites to get permission to track, and punishing those that track users without permission. It's a good policy in theory, but it needs to be paired with good enforcement, and that's where things are currently lacking.
> That whole argument assumes that you as a consumer can always find a product with exactly the features you want. Because that's a laughable fiction
There are very many industries where this is exactly what happens. If you want a stack of lumber or a bag of oranges, it's a fungible commodity and there is no seller who can prevent you from buying the same thing from someone else if you don't like their terms.
If this is ever not the case, the thing you should be addressing is that, instead of trying to coerce an oligopoly that shouldn't exist into behaving under the threat of government penalties rather than competitive pressure. Because an uncompetitive market can screw you in ten thousand different ways regardless of whether you've made a dozen of them illegal.
> That means things like requiring sites to get permission to track, and punishing those that track users without permission. It's a good policy in theory, but it needs to be paired with good enforcement, and that's where things are currently lacking.
It's not a good policy in theory because the theory is ridiculous. If you have to consent to being tracked in exchange for nothing, nobody is going to do that. If you want a ban on tracking then call it what it is instead of trying to pretend that it isn't a ban on the "free services in exchange for tracking data" business model.
I think you might be misunderstanding the purpose of consumer protection. It is not about consumer choice, but rather it is about protecting consumers from the inherent power imbalance that exists between the company and their customers. If there is no way of providing a service for free without harming the customers, this service should be regulated such that no vendor is able to provide it for free. It may seem punishing for the customers, but it is not. It protects the general public from this harmful behavior.
I actually agree with you that cookie banners are a bad policy, but for a different reason. As I understand it there are already requirements that the same service should also be available to opt-out users, however as your parent noted, enforcement is an issue. I, however, think that tracking users is extremely consumer hostile, and I think a much better policy would be a simple ban on targeted advertising.
> I think you might be misunderstanding the purpose of consumer protection. It is not about consumer choice, but rather it is about protecting consumers from the inherent power imbalance that exists between the company and their customers.
There isn't an inherent power imbalance that exists between the company and their customers, when there is consumer choice. Which is why regulations that restrict rather than expand consumer choice are ill-conceived.
> If there is no way of providing a service for free without harming the customers, this service should be regulated such that no vendor is able to provide it for free.
But that isn't what those regulations do, because legislators want to pretend to do something while not actually forcing the trade off inherent in really doing the thing they're only pretending to do.
> I, however, think that tracking users is extremely consumer hostile, and I think a much better policy would be a simple ban on targeted advertising.
Which is a misunderstanding of the problem.
What's actually happening in these markets is that we a) have laws that create a strong network effect (e.g. adversarial interoperability is constrained rather than required), which means that b) the largest networks win, and the networks available for free then become the largest.
Which in turn means you don't have a choice, because Facebook is tracking everyone but everybody else is using Facebook, which means you're stuck using Facebook.
If you ban the tracking while leaving Facebook as the incumbent, two things happen. First, those laws are extremely difficult to enforce because neither you nor the government can easily tell what they do with the information they inherently get from the use of a centralized service, so they aren't effective. And second, they come up with some other business model -- which will still be abusive because they still have market power from the network effect -- and then get to blame the new cash extraction scheme on the law.
Whereas if you do what you ought to do and facilitate adversarial interoperability, that still sinks their business model, because then people are accessing everything via user agents that block tracking and ads, but it does it while also breaking their network effect by opening up the networks so they can't use their market power to swap in some new abusive business model.
I am not a legislator, nor an expert in consumer law, and there is no way I could think of a regulation against targeted advertising, but that doesn't mean it is impossible. I think claiming it to be impossible demonstrates a lack of imagination. And I would even think some consumer protection or privacy advocacy groups have already drafted some legislation outlines for regulating targeted ads (as I said, I'm not an expert, and wouldn't even know where to begin looking for one [maybe the EFF?]).
> There isn't an inherent power imbalance that exists between the company and their customers
That is very simplistic, and maybe idealistic from an unrealistic view of free-market capitalism. But there is certainly an inherent power imbalance. Before leaded gasoline was banned, it was extremely hard for an environmentally conscious consumer to make the ethical choice and buy unleaded gasoline. Before seatbelts were required, a safety aware consumer might still have bought a car without one simply because the cars with seatbelts were either unavailable or unaffordable. Those aren't real choices, but rather choices which are forced onto the consumer as a result of the competitive environment where the consumer hostile option generates much more revenue for the company.
> I am not a legislator, nor an expert in consumer law, and there is no way I could think of a regulation against targeted advertising, but that doesn’t mean it is impossible.
The hard part isn't the rule, it's the enforcement.
To begin with, banning targeted advertising isn't really what you want to do anyway. If you have a sandwich shop in Pittsburgh and you put up billboards in Pittsburgh but not in Anchorage, you're targeting people in Pittsburgh. If you sell servers and you buy ads in a tech magazine, you're targeting tech people. I assume you're not proposing to require someone who wants to buy ads for their local independent pet store to have nearly all of them shown to people who are on the other side of the country?
What you're really trying to do is ban the use of individualized tracking data. But that's extremely difficult to detect, because if you tell Facebook "show this ad to people in Miami", how do you know if it's showing them to someone because they're viewing a post likely to be popular with people in Miami in general vs. because the company is keeping surveillance dossiers on every individual user?
The only thing that actually works is for them not to have the data to begin with. Which is the thing where you have to empower user agents to provably constrain what information services have about their users, i.e. adversarial interoperability.
> That is very simplistic, and maybe idealistic from an unrealistic view of free-market capitalism.
It's a factual description of competitive markets.
> Before leaded gasoline was banned, it was extremely hard for an environmentally conscious consumer to make the ethical choice and buy unleaded gasoline.
The ban on leaded gasoline isn't a consumer protection regulation, it's an environmental regulation. Gas stations weren't selling leaded gasoline in spite of customers preferring unleaded, they were selling it because it was cheaper to make and therefore what customers preferred in the absence of a ban. It's a completely different category of problem and results from an externality in which the seller and the buyer both want the same thing but that thing harms some third party who isn't participating in the transaction.
> Before seatbelts were required, a safety aware consumer might still have bought a car without one simply because the cars with seatbelts were either unavailable or unaffordable.
This is how safety features evolve.
Seat belts were invented in the 19th century but we didn't start getting strong evidence of their effectiveness until the 1950s and 60s. Meanwhile that's the same period of time the US started building the interstate system with the corresponding increase in vehicle ownership, and therefore accidents.
So into the 1960s there was an increasing concern about vehicle safety, the percentage of cars offered with seat belts started increasing, and then Congress decided to mandate them -- which is what the market was already doing, because the customers (who are largely the same people as the voters) were demanding it.
That is a consistent trend. Things like that get mandated just as the majority of the market starts offering them, and then Congress swoops in to take credit for the benefit of what was already happening regardless.
What those laws really do is a) increase compliance costs (and therefore prices), and b) prohibit the minority of customers who have specific reasons from buying something different than what the majority wants, because it's banned. For example, all cars are now required to have anti-lock brakes, but ABS can increase stopping distances on certain types of terrain. A professional driver who is buying a vehicle specifically for use on those types of terrain is now prohibited from buying a vehicle without ABS on purpose, even though it's known to cause safety problems for them.
> Those aren’t real choices, but rather choices which are forced onto the consumer as a result of the competitive environment where the consumer hostile option generates much more revenue for the company.
That type of choice is the thing that specifically doesn't happen in a competitive market, because then the consumer goes to a competitor.
Where it does happen is in uncompetitive markets, but in that case what you need is not to restrict the customer's choices, it's to increase competition.
> since browsers started to set it to "1" by default without asking
IIRC IE10 did that, to much outcry because it upended the whole idea of DNT being an explicit choice; no other browser (including Edge) set it as a default.
There have been thoughts about using DNT (the technical communication mechanism for expressing consent/objection) in conjunction with the GDPR (the legal framework to enforce consent/objection compliance).
The GDPR explicitly mentions objection via technical means:
> In the context of the use of information society services, and notwithstanding Directive 2002/58/EC, the data subject may exercise his or her right to object by automated means using technical specifications.
I myself consider DNT as what it means at face value: I do not want to be tracked, by anyone, ever. I don't know what's "confusing" about that.
The only ones that are "confused" are the ones it would be detrimental to i.e the ones that perform and extract value from the tracking, and make people run in circles with contrived explanations.
It would be perfectly trivial for a browser to pop up a permission request per website like there is for webcams or microphone or notifications, and show no popup should I elect to blanket deny through global setting.
For one, Do Not Track is on the client side and you just hope and pray that the server honors it, whereas cookie consent modals are something built by and placed in the website.
I think you can reasonably assume that if a website went through the trouble of making such a modal (for legal compliance reasons), the functionality works (also for legal compliance reasons). And, you as the client can verify whether it works, and can choose not to store them regardless.
I would assume most websites would still set cookies even if you reject the consent, because the consent is only about cookies that aren't technically necessary. Just because the website sets cookies doesn't tell you whether it respects your selection. Only if it doesn't set any cookies can you be sure, and I would assume that's a small minority of websites.
The goal with Do Not Track was legal (get governments to recognize it as the user declining consent for tracking and forbidding additional pop-ups) and not technological.
Unfortunately, the legal part of it failed, even in the EU.
So it did the same work that a sitemap does? Interesting.
Or maybe more like the opposite: robots.txt told bots what not to touch, while sitemaps point them to what should be indexed. I didn’t realize its original purpose was to manage duplicate content penalties though. That adds a lot of historical context to how we think about SEO controls today.
> I didn’t realize its original purpose was to manage duplicate content penalties though.
That wasn't its original purpose. It's true that you didn't want crawlers to read duplicate content, but it wasn't because search engines penalised you for it – WWW search engines had only just been invented and they didn't penalise duplicate content. It was mostly about stopping crawlers from unnecessarily consuming server resources. This is what the original 1994 robots exclusion document says:
> In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).
> It was mostly about stopping crawlers from unnecessarily consuming server resources.
Very much so.
Computation was still expensive, and http servers were bad at running cgi scripts (particularly compared to the streamlined amazing things they can be today).
SEO considerations came way way later.
They were also used, and still are, by sites that have good reasons to not want results in search engines. Lots of court files and transcripts, for instance, are hidden behind robots.txt.
I think this is still relevant today in cases where there are not many resources available: think free tiers, smallest fixed cost/fixed allocation scenarios, etc.
I always consider a bot "good" if it doesn't disguise itself and follows the robots.txt rules. I may not consider the bot's final intent, or the company behind it, good, but the crawler behaviour is fundamentally good.
Especially considering the fact that it is super easy to disguise a crawler and not follow the robots conventions.
Well, you as the person running a website can unilaterally define what you consider good and bad. You may want bots to crawl everything, nothing, or (most likely) something in between. Then you judge bots based on those guidelines. You know, like a solicitor who rings your bell despite the sign above it saying "No solicitors": certain assumptions can be made about those who ignore it.
> That's about as naive as the Do-Not-Track header, which was basically kindly asking companies whose entire business is tracking people to just not do that thing that they got paid for.
It's usually a bad default to assume incompetence on the part of others, especially when many experienced and knowledgeable people have to be involved to make a thing happen.
The idea behind the DNT header was to back it up with legislation -- and sure, you can't catch and prosecute all tracking, but there are limits to the scale at which you can criminally "move fast and break things" before someone rats you out. :P
I admit I'm one of those people. After decades where I should perhaps be a bit more cynical, from time to time I am still shocked or saddened when I see people do things that benefit themselves over others.
But I kinda like having this attitude and expectation. Makes me feel healthier.
I like that Veritasium vid a lot, I've watched it a couple times. The thing is, there's no way to retaliate against a crawler ignoring robots.txt. IP bans don't work, user agent bans don't work, and there's no human to shame on social media either. If there's no way to retaliate or provide some kind of meaningful negative feedback, then the whole thing breaks down. Back to the Veritasium video: if a crawler defects they reap the reward, but there's no way for the content provider to defect, so the crawler defects 100% of the time and gets 100% of the defection points. I can't remember when I first read the spec for robots.txt, but I do remember finding it strange that it was a "pretty please" request against a crawler that has a financial incentive to crawl as much as it can. Why even go through the effort to type it out?
EDIT: I thought about it for a minute; I think in the olden days a crawler crawling every path through a website could yield an inferior search index. So robots.txt gave search engines a hint on what content was valuable to index. The content provider gained because their SEO was better (and CPU utilization lower) and the search engine gained because their index was better. So there was an advantage to cooperation then, but with crawlers feeding LLMs that isn't the case.
This is a really cool tool. I haven't seen it before. Thank you for sharing it!
On their README.md they state:
> This program is designed to help protect the small internet from the endless storm of requests that flood in from AI companies. Anubis is as lightweight as possible to ensure that everyone can afford to protect the communities closest to them.
> Trust by default, also by default, never ignoring suspicious signals.
While I absolutely love the intent of this idea, it quickly falls apart when you're dealing with systems where you only get the signals after you've already lost everything of value.
It's easy to believe, though, and most of us do it every day. For example, our commute to work is marked by the trust that other drivers will cooperate, following the rules, so that we all get to where we are going.
There are varying degrees of this through our lives, where the trust lies not in the fact that people will just follow the rules because they are rules, but because the rules set expectations, allowing everyone to (more or less) know what's going on and decide accordingly. This also makes it easier to single out the people who do not think the rules apply to them so we can avoid trusting them (and, probably, avoid them in general).
In Southern Europe, and in countries with similar cultures, we don't obey rules just because someone says so; we obey them when we see that it is actually reasonable to do so. Hence my remark regarding culture, as I also experienced living in countries where everyone mostly blindly follows the rules, even if they happen to be nonsense.
Naturally I am talking about cultures where that decision has not been taken away from their citizens.
> I also experienced living in countries where everyone mostly blindly follows the rules, even if they happen to be nonsense.
The problem with that is that most people are not educated enough to judge what makes sense and what doesn’t, and the less educated you are, the more likely you are to believe you know what makes sense when you’re actually wrong. These are exactly the people that should be following the rules blindly, until they actually put in the effort to learn why those rules exist.
I believe there is a difference between education and critical thinking. One may not have a certain level of education, but could exercise a great degree of critical thinking. I think that education can help you understand the context of the problem better. But there are also plenty of people who are not asking the right questions or not asking questions - period - who have lots of education behind them. Ironically, sometimes education is the path that leads to blind trust and lack of challenging the status quo.
> the less educated you are, the more likely you are to believe you know what makes sense
It actually frightens me how true this statement is.
To reinforce my initial position about how important rules are for setting expectations, I usually use cyclists as an example. Many follow the rules, understanding that they are traffic, and that right of way is not automagically granted based on the choice of vehicle; it has more to do with direction and the flow of said traffic.
But there's always a bad apple, a cyclist who assumes themselves to be exempt from the rules and rides against the flow of traffic, then wonders why they got clipped because a right-turning driver wasn't expecting a vehicle to be coming from the direction traffic is not supposed to come from.
In the end, it's not really about what we drive or how we get around, but whether we are self-aware enough to understand that the rules apply to us, and collectively so. Setting the expectation of what each of our behaviors will be is precisely what creates the safety that comes with following them, and only the dummies seem to be the ones who think they are exempt.
As a French person, being passed on the right by Italian drivers on the highway really makes me feel the superiority of Southern European judgment over my puny habit of blindly following rules. Or does it?
But yes, I do the same. I just do not come here to pretend this is a virtue.
The rules in France are probably different, but passing on the right is legal on Italian highways in one circumstance: if one keeps driving in the right lane and somebody slower happens to be driving in the left lane. The rationale is that this normally happens when traffic is packed, but it's ok even if there is little traffic. Everybody keeps driving straight and there is no danger.
It's not legal if somebody is following the slower car in the left lane and steers to the right to pass. However, some drivers stick to the left at a speed slower than the limit, and if they don't yield, what happens is that eventually they get passed on the right.
The two cases have different names. The normal pass is "sorpasso", the other one (passing by not steering) is "superamento", which is odd but they had to find a word for it.
Not sure if it is a virtue, but standing as a pedestrian at an empty street at 3 AM waiting for a traffic light to turn green doesn't make much sense either; it isn't as if a ghost car is coming out of nowhere.
It should be a matter of judgement and not following rules just because.
I kind of agree. The rules for safety should be simple, straightforward, and protect you in the "edge cases", i.e. protect you when you're following them without paying 100% attention, protect you with a malicious actor in mind (aka a reckless driver), etc. Ideally, in a system like that, breaking the rules should be difficult and intentional, rather than the easy default compared to following them.
I agree. I mostly mean that it is good to strive towards a system of rules that will be easy to follow and difficult to break by default. That is an ideal case. In reality, it is never that simple.
> For example, our commute to work is marked by the trust that other drivers will cooperate, following the rules, so that we all get to where we are going.
That trust comes from the knowledge that it's likely that those drivers also don't want to crash, and would rather prefer to get where they're going.
I apologize for that. I try to mitigate my US-centricness in my comments as much as possible, understanding completely that I am speaking with a global audience, but I am definitely not perfect at it :D
I suppose the same goes if you take the tube, ride a bike, walk, etc? There's still rules in terms of behavior, flow of traffic (even foot traffic), etc, that helps set a number of expectations so everyone can decide and behave accordingly. Happy to hear different thoughts on this!
The scenario I remember was that the underfunded math department had an underpowered server connected via a wide and short pipe to the overfunded CS department and webcrawler experiments would crash the math department's web site repeatedly.
What everybody is missing is that AI inference (not training) is a route out of the enshittification economy. One reason why Cloudflare is harassing you all the time to click on traffic lights and motorcycles is to slam the door on some of the exit routes.
It is so interesting to track this technology's origin back to the source. It makes sense that it would come from a background of limited resources where things would break if you overwhelm it. It didn't take much to do so.
I still see the value in robots.txt and DNT as a clear, standardised way of posting a "don't do this" sign that companies could be forced to respect through legal means.
The GDPR requires consent for tracking. DNT is a very clear "I do not consent" statement. It's a very widely known standard in the industry. It would therefore make sense that a court would eventually find companies not respecting it are in breach of the GDPR.
Would robot traffic be considered tracking in light of GDPR standards? As far as I know there are no regulatory rules in relation to enforcing robots behaviors outside of robots.txt, which is more of an honor system.
DNT and GDPR was just an example. In a court case about tracking, DNT could be found to be a clear and explicit opt-out. Similarly, in a case about excessive scraping or the use of scraped information, robots.txt could be used as a clear and explicit signal that the site operator does not want their pages harvested. It all but certainly gets rid of the "they put it on the public web so we assumed we could scrape it, we can't ask everyone for permission" argument. They can't claim it was "in good faith" if there's a widely-accepted standard for opting out.