Since people are asking "why would you do such a thing", or insinuating that scraping could only be an attempt to compete with Google, I'll present a use I've found quite interesting: one that doesn't seek to replicate or replace Google search, and that hasn't been readily attainable other than, in part, by scraping Google search results. The tool I've used (crude, but reasonably effective) applies a number of workarounds for bot-detection, some modestly effective -- self-imposed rate-limiting most especially.
I've found the practice of looking at search-term frequency across a domain or set of domains (using the "site:<domain>" Google search filter) to be useful -- for example, the "Top 100 Global Thinkers" report linked below.

It uses 100 search terms -- "global thinkers" identified by Foreign Policy magazine -- searched across a set of about 100 domains and TLDs: largely social media, various journalism (newspaper / magazine) outlets, and a few institutional sites, as well as selected national and other top-level domains. The result is an interesting profile of where more robust online discussion or commentary might be found.

https://www.reddit.com/r/dredmorbius/comments/3hp41w/trackin...
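For concreteness, here's a minimal sketch of the term x domain matrix the report is built from, assuming Python; the term and domain lists below are illustrative placeholders, not the actual ones used:

    from itertools import product

    # Placeholders only -- the real lists are ~100 entries each.
    terms = ["Thomas Piketty", "Malala Yousafzai"]           # "global thinkers"
    domains = ["twitter.com", "nytimes.com", "harvard.edu"]  # domains / TLDs

    # One Google query per (term, domain) cell, using the site: filter.
    queries = ['"{}" site:{}'.format(term, domain)
               for term, domain in product(terms, domains)]

    for q in queries:
        print(q)    # e.g.  "Thomas Piketty" site:nytimes.com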
The full report requires running roughly 100 x 100, or 10,000, Google searches. I'm finding that it's necessary to space these ~5-10 minutes apart, which means that the full analysis takes over a month of wall-clock time, from a single IP.
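The wall-clock figure is just arithmetic: 10,000 queries spaced 5 minutes apart is roughly 35 days; spaced 10 minutes apart, roughly 69. A sketch of the spacing loop, with `fetch` standing in for whatever actually performs the search (a hypothetical name, not part of the real tool):

    import random
    import time

    # 10,000 queries x  5 min  ~=  50,000 min  ~= 35 days
    # 10,000 queries x 10 min  ~= 100,000 min  ~= 69 days
    def run_spaced(queries, fetch, min_gap=300, max_gap=600):
        """Issue queries one at a time with a 5-10 minute gap between them."""
        results = {}
        for q in queries:
            results[q] = fetch(q)
            time.sleep(random.uniform(min_gap, max_gap))  # jitter, not a fixed interval
        return results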
I've considered several possible follow-ups to this study, including more or alternate domains, different keywords, and other variations, but both the run time and the coding needed to bypass bot-detection have put me off.
I've tried reaching out to Googlers I know to see if there's any possible alternative means of acquiring this information, to no avail. I've also looked for various research interfaces or APIs, with no joy.
DuckDuckGo and other search sites don't impose the same rate-limiting (I've used them for other purposes), but they also don't provide the (granted, often very inaccurate / imprecise) match counts that Google offers.
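For what it's worth, the counts I'm relying on come from Google's "About N results" line. A minimal sketch of extracting it, assuming the count still appears as plain text in that form -- the page markup changes often, so treat this as illustrative only:

    import re

    def extract_count(page_text):
        """Pull the rough match count out of a Google result page (assumed format)."""
        m = re.search(r"About ([\d,]+) results", page_text)
        if m is None:
            m = re.search(r"([\d,]+) results", page_text)   # short form, small counts
        return int(m.group(1).replace(",", "")) if m else None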
Putting this out there both as an example and a request for suggestions as to how I might improve or modify the process.
Have you considered some sort of "crowdsourcing" / voluntary botnet type approach?
The ArchiveTeam[1] has a simple VM image that anyone can run to schedule and coordinate large site-archival jobs, which might already address some of the issues.
It might be tricky to find people willing to provide resources, but even with a smallish group it could work out. You may also need to guard against abuse by running the same query on multiple hosts and comparing results, which would add to the overall request cost; see the sketch below.
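A minimal sketch of that reconciliation step, assuming each volunteer reports a match count for the same query (function and parameter names are hypothetical):

    from statistics import median

    def reconcile(counts, tolerance=0.25):
        """Accept a query's result only if the counts reported by different
        volunteers roughly agree; otherwise signal that it should be re-run."""
        mid = median(counts)
        if mid == 0:
            return 0 if all(c == 0 for c in counts) else None
        if all(abs(c - mid) / mid <= tolerance for c in counts):
            return mid
        return None   # reports disagree: re-queue or discard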
My approach is sufficiently fluid that this would mean pushing pretty crude code to a bunch of hosts frequently and on an irregular basis. The runs themselves are fairly ad hoc.
Being able to directly query a corpus (IA, DDG, Bing, etc.) is another option.
Search across large corpora remains fairly expensive, so I can understand hesitancy here.

The lack of standardisation among search APIs across sites is another frustration.
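One way I can imagine papering over that: a thin adapter interface that each backend (IA, DDG, Bing, etc.) would implement, so the frequency-matrix code doesn't care which corpus it's querying. Purely a sketch -- the real endpoints, auth, and result formats differ per service and aren't shown here:

    from typing import Protocol

    class SearchBackend(Protocol):
        def match_count(self, term: str, domain: str) -> int:
            """Return (an estimate of) how many pages on `domain` match `term`."""
            ...

    def frequency_matrix(backend: SearchBackend, terms, domains):
        # terms and domains are plain lists; one backend call per (term, domain) pair.
        return {(t, d): backend.match_count(t, d)
                for t in terms for d in domains}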