If you're interested in a publicly queryable index of the web, you could try running a search server such as ElasticSearch on the Common Crawl[1] corpus.
ElasticSearch powers the search backend of WordPress, handling 600 million+ documents in total[2], so extending it to a Common Crawl archive seems feasible.
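As a rough sketch of what that pipeline could look like: the snippet below reads a Common Crawl WET (extracted plain text) file with the warcio library and bulk-indexes the records into a local Elasticsearch node. The index name, file path, and local-node URL are placeholders, not anything Common Crawl or WordPress actually use.

    from elasticsearch import Elasticsearch, helpers
    from warcio.archiveiterator import ArchiveIterator

    es = Elasticsearch("http://localhost:9200")  # assumes a local ES node

    def wet_records(path):
        """Yield bulk-index actions for each plain-text record in a WET file."""
        with open(path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type == "conversion":  # WET text records
                    url = record.rec_headers.get_header("WARC-Target-URI")
                    text = record.content_stream().read().decode("utf-8", errors="replace")
                    yield {"_index": "commoncrawl", "_source": {"url": url, "text": text}}

    # Placeholder filename; grab a real WET path from the Common Crawl index.
    helpers.bulk(es, wet_records("CC-MAIN-example.warc.wet.gz"))

In practice you'd shard this across many WET files and many workers, but the per-record structure is that simple: URL plus extracted text.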
n.b. I'm a data scientist at Common Crawl, so have a vested interest!
Also, whatever experiment you end up pursuing, remember to use spot instances if your setup allows for transient nodes - it'll substantially decrease your burn rate (usually around 1/10th the on-demand price), allowing for even larger and more insane experiments :)
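For what it's worth, requesting a one-time spot instance from Python is only a few lines with boto3; the AMI ID, key pair, instance type, and region below are placeholders you'd swap for your own.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

    # Request a single one-time spot instance; without SpotPrice the bid
    # defaults to the on-demand price as a cap.
    response = ec2.request_spot_instances(
        InstanceCount=1,
        Type="one-time",
        LaunchSpecification={
            "ImageId": "ami-0123456789abcdef0",  # placeholder AMI
            "InstanceType": "r5.xlarge",          # placeholder instance type
            "KeyName": "my-key",                  # placeholder key pair
        },
    )
    print(response["SpotInstanceRequests"][0]["SpotInstanceRequestId"])

Transient nodes can be reclaimed with short notice, so checkpoint your work somewhere durable (e.g. S3) as you go.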
I had a crawling project where I wanted to get a sense of a few ad-related things on the internet, and I came across Common Crawl. I was initially excited since I thought it would have incidentally captured the data I wanted, but I was disappointed to find that they don't do any kind of JS execution, which limited its usefulness for me pretty drastically.
[1]: http://commoncrawl.org/
[2]: http://gibrown.com/2014/01/09/scaling-elasticsearch-part-1-o...