Bypassing website anti-scraping protections (apify.com)
235 points by jardah on June 1, 2018 | 118 comments


For example, for google.com, you can typically make only around 300 requests per day, and if you reach this limit, you will see a CAPTCHA instead of search results.

300 is pretty easy to achieve if you're "Googling hard enough" (make 5 slightly different queries, go through the 20 pages of results it's willing to show you, repeat 3 times...), and I've seen it trigger far before that if you are searching for more obscure things. It seems almost hostile to those searching for IC part numbers, specific and very exact phrases, and just "non mainstream" content in general.

How sad it is then, that we are told and have internalised the notion that we should use search engines like Google to find things, and yet it prevents us from "trying too hard" to find what we're looking for...


From my experience at blekko, 99.9% of the "people" who go deep into the results pages for a single query are actually bots. You're a very unusual user, and there are a lot of bots.


There's a difference between going deep into the results, and progressively refining a query. The former is pretty indicative of bot behavior -- humans rarely go past even the first page of results. I do the latter all the time, and this frequently gets me Google's captcha, especially if I'm doing something like using site: and inurl: operators.


I've managed to trigger Google's bot detector too while trying to find documentation for a certain bank API (for legitimate reasons: we were supposed to integrate with it and their docs didn't make sense).


I run up against this all the time. My browser is fast because it blocks all third-party trackers and scripts. My searches are faster as a result, so Google thinks they're automated.


Use DDG. It's fine for most things. Use Google as a fallback.


>Use Google as a fallback.

To do this from DDG, prefix your search with !g to search Google instead.


Minor FYI: the bang doesn't have to come at the beginning of the query. When refining a search and falling back to Google, you can save the keystrokes and just throw !g at the end of the query.


This is major, thx.


Yes it is, and since it's IP-based, it's even easier to hit if you are, for example, working from an office where multiple people are using Google.

But that is why they only show a reCAPTCHA: you fill it in and you get an exemption cookie for 30 more requests :D


> But that is why they only show a reCAPTCHA: you fill it in and you get an exemption cookie for 30 more requests :D

Does that actually work? Whenever I've been searching for obscure things and gotten the captcha after 6-10 pages, it just goes into a loop where it keeps showing the captcha constantly. Though it stops if I change the search terms.


With a VPN on Brave on iOS, Google will only show me infinite captchas.


Instagram is the worst I have come across. If you are on a page with 1000+ pictures trying to find something near the bottom, you have to let it load each new group sequentially, and after a while it starts timing you out for 60 seconds or longer every couple of times you load more. God forbid you accidentally navigate away while scrolling; you have to start all over again from the top.

Due to recent events it seems they got scared, locked down their API, and tightened the request limit to prevent scraping to the point that it is hardly usable on desktop anyway.


LinkedIn is even worse, imo. Go to any random company page and it'll show you a wall asking you to log in. Refresh it and it'll show you the page without asking you to log in.


That is probably because LinkedIn's authwall algorithm is being A/B tested. They do lots of machine learning to stop bots, and I would say scraping LinkedIn is mostly impossible, even at small volumes.


This sounds like an issue that is specific to JavaScript-controlled browsers. With a traditional, non-JavaScript TCP/TLS/HTTP client it is trivial to extract the image URLs and other information from the page using a single HTTP request (and from each successive page using more HTTP requests over a single connection, if "has_next_page" is "true"). No "API" needed. Can you provide an example of a single page with 1000+ images?


https://www.instagram.com/ryuji513

It looks like it just hits https://www.instagram.com/graphql/query/... every time you scroll down, so if you scroll too fast it hammers that endpoint and your requests to it get throttled.


1. Fetch 1st page.

Note the id of the user (e.g. 1954202703). This is the "id": value in the URL.

Note end_cursor. This is used for the "after": value in the URL.

Note rhx_gis. This is used to create the "X-Instagram-GIS:" header.

Looking at archive.org, it seems as recently as last year, end_cursor was once all that was needed.

2. Fetch js from ProfilePageContainer url in 1st page (e.g., https://www.instagram.com/static/bundles/base/ProfilePageCon...)

Note queryId (e.g. 42323d64886122307be10013ad2dcc44)

This is used for "query_hash" in the url.

3. Create header "X-Instagram-GIS:"

Apparently this is some MD5 hash of rhx_gis and the query string variables according to this source:

https://www.diggernaut.com/blog/how-to-scrape-pages-infinite...

However a little experimentation revealed that generation of rhx_gis or this hash must also incorporate the user-agent string -- change any character in the user-agent string and the request will fail.

They also put IP address and a Unix time value in a cookie but the cookie can be deleted and the request still succeeds.

For example the final url for the first 12 photos is:

https://www.instagram.com/graphql/query/?query_hash=42323d64...

Overall, seems not too much work for someone who really wants to automate retrieval of Instagram photos. These requests for successive groups of 12 can be RFC 2616 pipelined over a single connection. Not long ago and for some number of years, it was even easier (e.g. just use end_cursor value as "max_id" in url).
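
For anyone who wants to see the steps above end to end, here is a rough Node.js sketch (the node-fetch dependency, the helper name, the ":"-joined hash input and the response field names are assumptions based on the steps above and the linked Diggernaut post, and Instagram has changed this scheme repeatedly):

    // Sketch of steps 1-3 above; all specifics are assumptions and likely stale.
    const crypto = require('crypto');
    const fetch = require('node-fetch'); // assumed HTTP client
    async function fetchNext12(userId, endCursor, rhxGis, queryHash, userAgent) {
      // The JSON blob that ends up URL-encoded as the "variables" query parameter.
      const variables = JSON.stringify({ id: userId, first: 12, after: endCursor });
      // X-Instagram-GIS: MD5 over rhx_gis and the decoded variables string.
      const signature = crypto.createHash('md5')
        .update(`${rhxGis}:${variables}`)
        .digest('hex');
      const url = 'https://www.instagram.com/graphql/query/' +
        `?query_hash=${queryHash}&variables=${encodeURIComponent(variables)}`;
      const res = await fetch(url, {
        headers: {
          'User-Agent': userAgent,      // must match the UA used to fetch page 1
          'X-Instagram-GIS': signature,
        },
      });
      const json = await res.json();
      // Field names as they appeared at the time; treat these as assumptions too.
      const media = json.data.user.edge_owner_to_timeline_media;
      return {
        photos: media.edges.map(e => e.node.display_url),
        nextCursor: media.page_info.has_next_page ? media.page_info.end_cursor : null,
      };
    }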


They recently removed the user agent and CSRF token from the signature creation process. Right now only the rhx_gis parameter and the URL-decoded variables from the query string are used to generate the MD5 signature. However, your findings about user agents look interesting. I assume they may use the user agent to generate rhx_gis. That would explain why auth doesn't work if you change a single character in the user agent.


The rule set must be more complex. I often use a VPN, which results in captchas on many pages, but I never get one on Google. I guess the 300 queries per IP only count if other parameters indicate crawling.


But it's a bit clunky. I was running searches through an embedded web browser in a C# application (which is really an embedded Internet Explorer) and was very quickly presented with a captcha. A human was viewing the results, but a script was constructing the query string, and that was enough to be labelled a crawler.


Often all it takes is to be using WebBrowser. I would love to use WebBrowser for small search utilities because it's so easy to use, but it seems to be a magnet for problems.


Yeah, it was a very general example, since there is at least one rule that is based on rate limiting too, and this 300/IP limit is what I have seen on average.


With a search as you type feature (Google instant?) that limit could be hit in a few searches...


Often when I fire up my VPN I get CAPTCHAed on the first search.


Just be a proud false positive.


There is an irony in google preventing web scraping given that their business is pretty much built on web scraping.


Why is there irony in that? Anyone can go build a crawler and scrape the web the way Google scrapes it so they can compete with Google. Google protecting its site from scraping means you can't compete with Google using Google's own resources.

That said, automated research fascinates me, I wouldn't want to scrape Google to make my own Google, but rather to make private repositories of information that I can then query efficiently. I would love to find any kind of scriptable search engine access, paid or free. Not entirely sure how to look though.


Think different. Try Bing; it has an API.

I think Bing is close to Google in quality. Some people might even like it better. On the other hand, I think DDG is the Sprint of search engines.

Google used to have a search API and they discontinued it because they said most of the people who used it were SEO people.

People who do pay-per-click are into A/B testing and other quantitative testing. Google is all for you doing that if you pay for advertising. Their mainstay of anti-SEO is doing arbitrary and random things to make it impossible for SEOs to go at it quantitatively. (They have patents on this!)

One reason so many sites go to a harvesting business model is that once a site is established you can make the slightest change and then your search rankings plummet. If you depend on search engine traffic it is a huge risk that you can't do anything about unless you are about.com (bought a 'competitive' search engine and just might be able to make an antitrust case against Google.)


Can you elaborate more on this statement? "On the other hand I think DDG is the Sprint of search engines."

I've been interested in switching to DDG for a while, but as a former Sprint customer that statement scares me. Maybe some explanation would help me understand your opinion better.


I'm not sure about the comparison itself... I've tried DDG several times, I search for technical things in generic ways a lot. DDG almost never gives me what I want in the first page. Google almost always does.


Same here. It’s hard to blame DDG though; Google’s search index of Stack Overflow is better than SO’s own.


It's not a matter of blame at all... I'd love to see some challengers. In the end, google knows a lot about me and is really good at delivering personalized results because of it.


Is there any site that indexes itself better than Google?


Maybe if I used DDG more I would learn to parse the results better but the first thing I see are many results that have a "dark pattern" appearance to me.

Often I do get a good result on the first page but often results #1-#N vary from third rate to non-sequitur and then result #N+1 is the one that should be at #1, where maybe N is drawn from Uniform(3,6). I see this so much I can't imagine it is an accident. If anything it seems to be 70% more evil than Google.


I find DDG to be slightly better for technical things. It is pretty similar though.

The real difference is non-technical things. Google filters out unflattering results and one side of anything even remotely political. It's a nerfed world, kind of like a Disney theme park. I'm an adult and I don't need to be led with blinders to the googly viewpoint.


> I think bing is close to Google in quality. Some people might even like it better. On the other hand I think DDG is the Sprint of search engines.

Isn't DDG just a Bing wrapper with a few frills in the results?


The irony is the "do as I say, not as I do".


Google's web scrapers obey robots.txt; you can stop Google from crawling your website if you want. Google doesn't want you crawling their website.

That word, I don't think it means what you think it means.


Google supports consensual scraping, and respects sites which opt-out (using robots.txt) just like they have. It's no more ironic than someone selling a product they don't happen to use themselves.
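
For anyone who hasn't used it, the opt-out really is just a couple of lines in /robots.txt (the two groups below are alternatives; block only Googlebot, or every well-behaved crawler):

    # Tell Google's crawler to stay out entirely
    User-agent: Googlebot
    Disallow: /

    # Or opt out of every well-behaved crawler
    User-agent: *
    Disallow: /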


I think there's a credible argument that it's not purely consensual. Websites are forced to allow search engines with a lot of market share to scrape them or they won't be found.

No matter how well-intentioned you are, if you write your own scraper and have it abide by robots.txt, you'll never get nearly as many resources as Google or Bing. Many websites approve only their scrapers and ban everything else outright.

I don't have anything against the large search engines, it's just not really easy to say no to their scrapers for most websites.


I didn't consent to all this debt. It was just not really easy to say no to all these great credit cards.


The irony is that if every site protected themselves from web scraping, there would be no Google.


>Anyone can go build a crawler and scrape the web the way Google scrapes it so they can compete with Google.

Unfortunately that is not the case. Many paywalled sites will let googlebot index their content but block other crawlers.

They may have good reasons for doing that in some cases, but as a consequence the level playing field you're talking about no longer exists.

Also, the purpose of using Google as part of some automated process is usually not to compete with Google's search engine, but to complete some specific and limited task.

I don't understand why Google does not have a general search API offering. I'm sure many people would happily pay for it.


I have had web crawlers from China crawl my site multiple times a day but never send me traffic. Same with Yandex. I like the Bing search engine, but often it does not like my site. If a crawler doesn't send any traffic, why let it run up my AWS bill?


I understand that, but I think there are good reasons why we shouldn't always act in the narrowest sense of our self interest (provided we have enough financial wiggle room).

A search monopoly is not good for website owners. It makes us very dependent on the whims of that monopolist.

If you block all crawlers that don't already have a large market share and send back a lot of traffic, you're killing any possibility for new competitors to get a foot in the door.

Also, you're killing any chance for something unexpected to happen, such as someone having a great idea based on crawled data that could change all our lives for the better without ever sending traffic to your site.

Now, I'm not telling you what you can and cannot afford. If crawlers cost me a ton of money that I don't have I would certainly act exactly like you suggested.


> It makes us very dependent on the whims of that monopolist.

Very true. Only allow Google and you are helping them build their monopoly. And once they have a full monopoly they can do what they want, including asking you for money to be included in the search results.


I can't even imagine how many businesses would be ecstatic about the ability to do this. Might as well cut out the SEO middleman.


>Many paywalled sites will let googlebot index their content but block other crawlers.

Doesn't that infringe upon Google's own rules? I always thought Google didn't like it when sites served its crawler content that's different from what users get when they follow Google's link.


That's why many paywalled sites give you a few free articles per month if you're coming from a Google search results page.

But it no longer works on all sites. Maybe the rule has been dropped now that paywalls are becoming more popular (with publishers, that is).


But as he describes it, the GP is not trying to "compete" with Google (whatever that means); he is only trying to do some comprehensive searches.

He is not selling advertising.

He is not even running a public website.

Google is preventing you from using automation to create private (i.e. personal) repositories of information, even when that information is public and (ironically) Google itself relied on automation ("bots") to collect it.


> I wouldn't want to scrape Google to make my own Google, but rather to make private repositories of information that I can then query efficiently.

That's what Apify in the original post does, including a public database of scrapers, so there is a high chance you could use an already finished scraper :).


> Anyone can go build a crawler and scrape the web the way Google scrapes it so they can compete with Google.

I don't think that making a scraper will make you competitive with Google. If you can make a site-ranking algorithm that competes with Google's, on the other hand, you might have a chance.


The site ranking algorithm is a solved problem.

The one reason Google is competitive is that they take advantage of the cheap labour that keeps track of ranking manipulation.

Luckily most of the search problems have nothing to do with ranking manipulation.


Site ranking is not a “solved problem”. Google tries to solve it all the time, and yet finding anything other than trending or popular stuff still takes more than several attempts (and often doesn't even yield the best results).


Google has a set of contradictory requirements for the interface they've got on their website.

On one side it's a natural-language interface along the lines of Alexa; on the other side it's a search interface for people who just need access to information.

If Google exposed interfaces similar to Elasticsearch, search quality would no longer be an issue, but it would not be easy for ordinary users to use.


That’s precisely why they don’t want you cheating the hard part and just storing the results. It makes sense to me. Work on your own machine learning if you want good results.


There is no such thing as cheating, only staying within boundaries that don't land you in jail or sued in your own jurisdiction. If you can get an edge by using Google's own data, do so.


Bing bing bing!

Er, ughm. I mean,

Ding ding ding!


You could argue that Google should work on their own knowledge database instead of learning from other people's content and/or presenting other people's content in their own frontends (shopping etc)...


This is what Common Crawl does: http://commoncrawl.org/. I think more people should know about it.


> Anyone can go build a crawler and scrape the web the way Google scrapes it so they can compete with Google.

They cannot. Googlebot and some other search engine bots (like Bing's and Yandex's) get special treatment on various websites. This includes things like bans on non-whitelisted scrapers and bypassing paywalls. If you are not already an established player in the field, you will not be able to scrape the same websites that the established players can.


As I understand it, this was the rationale behind the courts' decision to prohibit LinkedIn from banning people from scraping public profiles.

Basically it was anti competitive to grant certain privileges to major players around 'public data,' but to block smaller players.

No telling if or when ramifications from that decision (last year) will hit existing anti-scraping measures, though.


Google crawls the web, they don't scrape it. There's a big difference.


> to make private repositories of information that I can then query efficiently

You and me both :)

I still haven't gotten around to doing much about it, but for example one thing I've been thinking about is to have my system integrated with my desktop so that it has some situational context.

For example, it would look at the programs that I have currently running.

Let's say that it saw that I had PyCharm open where I was editing some Python 3 files. Furthermore I also had Vim open where I was editing some HTML, CSS and JavaScript files.

It would maintain a list of all items that had been in focus during the previous 30 minutes or something.

When I then searched for, let's say, "sort list", it would look at that list and see that most recently I had been editing a Python file in PyCharm, so result number 1 would be how to sort a list in Python 3. Before that I had also focused Vim with a JS file, so sorting arrays in JS would be result number 2.

Results:

1. Python 3. Sort list "a_list". In-place: a_list.sort(). Build new sorted list from iterable: b_list = sorted(a_list).

2. JavaScript. Sort array "an_array". In-place: an_array.sort(). Create a sorted shallow copy: let another_array = an_array.concat().sort().

And if the system was even smarter, it would also be able to know details about what I'd been doing. For example it could see that while editing a JavaScript file I had most recently been writing code that was doing some operations with WebGL, and before that I was editing code that was changing style properties and before that something that was working with Canvas, so if I then search for blend, it would use this information.

Results:

1. WebGL Lesson 8 – the depth buffer, transparency and blending. http://learningwebgl.com/blog/?p=859

2. Basics of CSS Blend Modes. https://css-tricks.com/basics-css-blend-modes/

3. CanvasRenderingContext2D.globalCompositeOperation. https://developer.mozilla.org/en-US/docs/Web/API/CanvasRende...

Something like that.

And because it's for the limited amount of things that I am interested in and developed for myself only (as opposed to trying to give super relevant information for every person in the world), it might be doable to some extent.

Here is a book that might be of interest to you; Relevant Search. https://www.manning.com/books/relevant-search. I bought a copy myself but have yet to read it.
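
A toy JavaScript sketch of the focus-history idea above, just to make it concrete (every name here is made up, and actually observing window focus and classifying file types is left out entirely):

    // Toy sketch: keep a history of recently focused contexts and rank
    // search results so the most recently used context wins.
    const FOCUS_WINDOW_MS = 30 * 60 * 1000; // "previous 30 minutes or something"
    const focusHistory = []; // entries like { context: 'python3', at: 1527861000000 }

    function recordFocus(context) {
      focusHistory.push({ context, at: Date.now() });
    }

    function rankResults(results) {
      const recent = focusHistory
        .filter(f => Date.now() - f.at < FOCUS_WINDOW_MS)
        .reverse(); // most recently focused first
      const priority = ctx => {
        const i = recent.findIndex(f => f.context === ctx);
        return i === -1 ? recent.length : i; // unknown contexts sink to the bottom
      };
      return [...results].sort((a, b) => priority(a.context) - priority(b.context));
    }

    // recordFocus('javascript'); // editing JS in Vim
    // recordFocus('python3');    // then Python in PyCharm (most recent)
    // rankResults(...) would now put the Python 3 answer first, the JS one second.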


Shameless plug: it’s why we made SerpApi! (https://serpapi.com)


I find SerpApi very interesting, but the big data plan is still very expensive for medium-size companies. Does it really work for Google?


just scrape startpage.com


Google indexers respect robots.txt, so there goes the irony.


They also provide something of value to the operators of the websites they scrape, namely search traffic.


I've heard that they [sometimes] visit but don't index, is that true?


A company I consulted for was using a paid API to handle search.

Despite the fact that the entire site was available in an easy to scrape XML format, scrapers kept using the search feature.

They were trying very hard to overcome my countermeasures--they had a seemingly limitless pool of IPs, they were rotating user agent strings, and they tried to randomize search behavior.

Every time I implemented a new countermeasure they'd try to find a way around it. It was maddening because we made everything available to them through the XML feed. They just wouldn't use it.


An explicit message to use that feed as part of the countermeasures might be useful. Did/do you do this?


That is kinda sad to hear. The approach should always be to go through the path of least resistance and smallest effect on the website. So for example, if a company has API that can be used instead of scraping their website, then it's always preferred to use the API. Same would go for the XML you mentioned.

It's bad that not everyone works like this; there are quite a lot of people who would rather brute-force a solution than think about it.


The path of least resistance for the bots appears to be that they have a tool that scrapes search results, and nothing to talk to an API.


Interestingly, XML is not easy to parse with a lot of these scraping tools that rely on JavaScript... sure, the tools can easily parse HTML and convert it to JSON or CSV, but taking XML in an arbitrary format and doing the same is rather difficult.

It may have been better to just publish the site in HTML format with an easy-to-find link on the front page to access it.


We have kind of the same thing. All the data is in an API that costs less than a cheap VPS, and we display messages to them about using the API if they get blocked, but they just come back with a new IP every time.


You had a paid API, and people wanted the information for free....

Not unexpected I guess.


After searching "algolia" mentioned below, I figured out the misunderstanding. The company was paying somebody else per search made on their web site. So every time a scraper called the website's search function, it cost the website money.


You've misunderstood: the site used something like Algolia (which the site paid for) to handle search. The scrapers were hitting that service (which was costing the site money) rather than parsing the XML (which already had everything).


The paid API was just to handle the search of the company database (couldn't change that for political reasons).

They weren't getting any information that they couldn't get through the XML.


Maybe it was sabotage.


> we have developed a solution which removes the property from the web browser and thus prevents these kind of protections from figuring out that the browser is automated

Eli Grey and I have bypassed your "hideWebDriver()" function[1] in a single line of code:

    if (navigator.webdriver || (Navigator.prototype && Object.getOwnPropertyDescriptors(Navigator.prototype)["webdriver"])) {
        // Chrome headless detected - navigator.webdriver exists or was redefined
    }
[1]: https://github.com/apifytech/apify-js/blob/262a2e604b1adb3d8...


Good point. I haven't seen a single detection library do this, but at least now I know that I still need to work on an alternative solution. Thanks.
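
One direction that might handle the prototype check too (untested sketch; whether the property stays configurable in a given Chrome build isn't guaranteed) is to delete it at the prototype level before any page script runs, via Puppeteer's evaluateOnNewDocument:

    // Sketch: remove navigator.webdriver at the prototype level so both the
    // value check and the property-descriptor check above come back empty.
    const puppeteer = require('puppeteer');
    (async () => {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.evaluateOnNewDocument(() => {
        // Runs before any page script on every navigation.
        delete Object.getPrototypeOf(navigator).webdriver; // i.e. Navigator.prototype.webdriver
      });
      await page.goto('https://example.com');
      await browser.close();
    })();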


Since people are asking "why would you do such a thing" or insinuating that scraping need only be to compete somehow with Google, I'll present a use I've found quite interesting, that doesn't seek to replicate or replace Google search, and which hasn't been readily attainable other than by scraping Google search results, in part. The tool I've used (crude, but reasonably effective) has applied numerous attempts to work around bot-detection, some modestly effective. (Rate-limiting most especially.)

I've found the practice of looking at search-term frequency across a domain or set of domains (using the "site:<domain>" Google search filter) to be useful, for example the "Top 100 Global Thinkers" report linked below.

It uses 100 search terms -- "global thinkers" identified by Foreign Policy magazine -- searched across a set of about 100 domains and TLDs, largely social media, various journalism (newspaper / magazine), and a few institutional sites, as well as selected national and other top-level domains. The result is an interesting profile of where more robust online discussion or commentary might be found.

https://www.reddit.com/r/dredmorbius/comments/3hp41w/trackin...

The full report requires running roughly 100 x 100, or 10,000, Google searches. I'm finding that it's necessary to space these ~5-10 minutes apart, which means that the full analysis takes over a month of wall-clock time, from a single IP.

I've considered several possible follow-ups to this study, including more or alternate domains, different keywords, and various other variants, but both the run time and the coding to bypass bot-detection put me off.

I've tried reaching out to Googlers I know to see if there's any possible alternative means of acquiring this information, to no avail. I've also looked for various research interfaces or APIs, with no joy.

DuckDuckGo and other search sites don't have the rate-limiting (I've used them for other purposes), but also don't have the (granted, often very inaccurate / imprecise) match-counts which Google offers.

Putting this out there both as an example and a request for suggestions as to how I might improve or modify the process.
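
For concreteness, the whole run boils down to a spaced-out loop like the sketch below (the '#resultStats' selector, the exact query format, and the placeholder term/domain lists are assumptions and parameters, not a hardened scraper):

    // Rough sketch of the 100 terms x 100 domains run described above.
    const puppeteer = require('puppeteer');
    const terms = ['first thinker', 'second thinker' /* ... ~100 terms */];
    const domains = ['twitter.com', 'nytimes.com' /* ... ~100 domains */];
    const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));
    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      const counts = {};
      for (const term of terms) {
        for (const domain of domains) {
          const q = `"${term}" site:${domain}`;
          await page.goto('https://www.google.com/search?q=' + encodeURIComponent(q));
          // '#resultStats' held the "About N results" text at the time (assumption).
          counts[`${term} | ${domain}`] = await page
            .$eval('#resultStats', el => el.textContent)
            .catch(() => null); // captcha page or changed markup
          await sleep((5 + 5 * Math.random()) * 60 * 1000); // space requests 5-10 min apart
        }
      }
      await browser.close();
      console.log(JSON.stringify(counts, null, 2));
    })();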


Have you considered some sort of "crowdsourcing" / voluntary botnet type approach?

The ArchiveTeam[1] have a simple VM image that anyone can use to schedule and coordinate large site-archival jobs, which might already address some of the issues.

Might be tricky to find people willing to provide resources, but with even a smallish group it might work out. May need to consider abuse and run multiple queries and compare results, which might add to the overall request cost.

[1] https://www.archiveteam.org/


The thought's occurred.

My approach is sufficiently fluid that this would mean pushing pretty crude code to a bunch of hosts frequently and on an irregular basis. The runs themselves are fairly ad hoc.

Being able to directly query a corpus (IA, DDG, Bing, etc.) is another option.

Search across large corpora remains fairly expensive, so I can understand the hesitancy here.

Nonstandardisation of search APIs across sites is another frustration.


I find it odd how little a basic principle enters into this: don't do something to someone when they make it clear they don't want you to do it.


Eh, the principle might be good (though it's not odd that not everyone shares the same principles), but one can hold it and still have exceptions. For example, what about a governmental institution or a public company¹? What about a semi-public company, like a monopolist utility? What if the uploader of the data is OK with it, but the site hoster prevents it?

¹ in the sense of owned by the State, not listed on the stock market


I'm not saying there aren't legitimate reasons for writing scrapers. I've written plenty myself. It was just odd to see this disregarded entirely.

As for the commonality of principles, game theory explains most of them, so it isn't more surprising than that we all work with the same prime numbers, say. A simple principle of reciprocity will produce something along the lines of "respect other people's wishes".


It was just odd to see this disregarded entirely.

I can't say I agree. I mean, a serious, interesting essay can certainly be written on the ethics of scraping. But these short preludes on technical posts just end up sounding either like a disingenuous legal disclaimer or a preachy paternalistic tirade.


Move fast and break things, like norms


And I get downvoted as a norm breaker for mentioning norm breaking. Hacker News becomes Fight Club.


Has there been precedent established whether bypassing anti-scraping does or does not violate the CFAA?


The generally-accepted precedent is that yes, unwanted scraping violates the CFAA. Given the age of the problem, the case law is still developing around it, but there have been many high-profile scraping cases, and the scraper almost always loses.

The reality is that the CFAA is extremely broad and if we want to protect "scraping", better termed something like "data preservation" or "data recovery", we need to change the CFAA, copyright, and the applicability of EULAs (which effectively work to plug any tiny leak that someone may've found through the CFAA-copyright combo).

Copyright itself makes it effectively illegal to read a web page without the owner's consent, even if a) there is no trespass/unauthorized access (CFAA); and b) there is no infringement in the actual content extracted. This occurs because the markup and other necessary supporting material around a page is a copyrighted work, and just reading it into memory and then immediately discarding is considered sufficiently tangible to infringe on the copyrighted work.

This is called the "RAM Copy Doctrine", and it has been [mis]applied to scrapers many times. In Facebook v. Power Ventures, it was used to stop a startup from helping Facebook users extract their own content. That founder was left owing $3M in damages to Facebook.

hiQ Labs v. LinkedIn is the most notable recent exception, but those rulings seem to be pure judicial activism unsupported by precedent or really any legal underpinnings, and will surely be overturned on appeal.

For every high-profile hiQ-style success, there are a good number of losses. It is fairly routine now after 3Taps.

IANAL, but my SaaS business, which depended on a key piece of scraped data, was destroyed by a legal threat from a Fortune 100.


It's a complex subject.

For example, the Linkedin case : https://arstechnica.com/tech-policy/2017/08/court-rejects-li...

Craigslist sued some companies too.

To my understanding, scraping can be legal if it's done properly, meaning not sending too many requests at the same time and not affecting the underlying infrastructure.

It seems like in the US or in Europe, even if there is an anti-bot / anti-scraping section in the website's TOS, public data can be scraped. Sometimes even "private" data can be extracted using bots. For example, lots of "bank account aggregators" have won lawsuits against banks.


The issue is that if you allowed all web scraping, you could DDoS websites and get a get-out-of-jail-free card by saying "oh, we were simply scraping some data and it glitched out".


That sounds like a possibly-less-time-in-jail-card.


> there are already anti-scraping solutions on the market that can detect its usage based on a variable it puts into the browser's window.navigator property. Thankfully, we have developed a solution which removes the property from the web browser

Does anyone know what exactly the property in question is?



That's why I was wondering. Last time I checked headless Chrome could be pretty reliably detected in a number of ways, as you say. That they mention just one variable seems quite odd, given that they position themselves as specialists in the field.


The webdriver property is, as far as we know, the only one that stays different if you use non-headless Chrome with Puppeteer. The rest can be handled by using non-headless Chrome, as mentioned in the article.

But you are right; after reading through it again, this section of the article should be improved.


Is there anything like this to reliably detect Firefox headless?


On one hand, it does make a lot of sense that many web publishers want to keep people from scraping content, given the way that it's often used nefariously, to violate copyright, or for spam purposes.

But there are totally legitimate reasons to scrape as well. Altmetric (https://www.altmetric.com), which is the company I work for, tracks links to scientific research. So when someone on e.g. Twitter links to a page on nature.com, we want to scrape the page they linked to and figure out which paper they are talking about (if any). Academic publishers can be particularly sensitive to scraping, making the endeavour much more work than it needs to be.

It's a real shame that the web has moved to be so closed off in many ways.


The web is not becoming closed off from users. It's becoming hostile to bots. Not the same.


At this point you should just consider your HTML/HTTP interface an API because when you use headless browser technology readily available with any programming language it becomes exactly that.


The HTML really is the API.

Writing a site-specific browser has always been a fun project for me. It just pulls the information I want directly from my favorite websites. Maximum signal-to-noise ratio and I get ad blocking for free.

People think Javascript-based sites are safer, but it's in fact even easier to access the content because there's usually a programmatic interface available.


Rate limiting and proxying through Tor is pretty much all this article needed to say. Sometimes you need to fake some cookie data or get a session set first. At least for static data.

For dynamic content, sure, use Puppeteer if you have to, but my god, the exceptions and stack traces need some work.

But most websites don't enact protections because it's generally not worth the opportunity cost. So you really just scrape with your LOC.

If you move money or can't code, then Mozenda.
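
On the Tor point: for the headless case it's usually just a matter of pointing Chrome at the local SOCKS proxy the Tor daemon exposes (sketch only; assumes Tor is already installed and listening on its default port 9050):

    // Sketch: route headless Chrome through a locally running Tor daemon.
    const puppeteer = require('puppeteer');
    (async () => {
      const browser = await puppeteer.launch({
        args: ['--proxy-server=socks5://127.0.0.1:9050'], // Tor's default SOCKS port
      });
      const page = await browser.newPage();
      await page.goto('https://check.torproject.org/'); // sanity check for the exit node
      console.log(await page.title());
      await browser.close();
    })();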


I'm glad eBay never implemented that; I wrote a scraper that hit them 5-6 billion times over a couple of years.


What was your objective?


To find a particular product that could be repaired and resold


Completely wild guess: Good/(profitable) product offers?


I was playing around with the idea of using Tor to get around IP blocks. I played around a bit with code, but the Tor binary dependency was a bit much for my use case. Curious to know if anyone else has tried this?


Everyone else has the same idea which is why it often makes sense to block Tor outright.


Yep. Tor gateways are usually included along with VPN and AWS IPs in most basic IP blocklists.


I don't like those attempts. Tor is easily detectable, and those attempts just get the Tor network banned from websites, hurting legitimate Tor users.


Scraping any SERP at high volume is a pain if your business relies on it; most services out there either don't work at high volume, or they do but are crazy expensive.

I have checked a few solutions out there, and I am now using ProxyCrawl. The developers of their API helped me get a very high volume of SERP data from different search engines like Yandex, Google, Yahoo and Bing. I also use them for JavaScript crawling, as our project needs lots of content that is rendered via JavaScript. I am amazed at how their API endpoint works: it is basically sending a URL to their API and you are good to start. Make sure to contact them for some sites, as they do not let you crawl the world by default unless you prove your use case; they liked my product and that is how it got started. I've really had a successful experience with it, so I totally recommend it. You basically communicate with the developers, who do lots of work to make it happen. As I am mainly in JS, I asked for a Node.js package and they built it and open-sourced it: https://github.com/proxycrawl/proxycrawl-node


It would be interesting to know what technologies they use to scrape at high volume for 0.005 US cents per successful request. I checked the package and it looks decent; I like dependency-free libraries. I'll check their API for Bing. Thanks.



