Hacker News

If you're interested in a publicly queryable index of the web, you could try running a search server such as Elasticsearch over the Common Crawl[1] corpus. Elasticsearch powers the search backend of WordPress, 600 million+ documents in total[2], so extending it to a Common Crawl archive seems feasible.

n.b. I'm a data scientist at Common Crawl, so have a vested interest!

Also, whatever experiment you end up pursuing, remember to use spot instances if your setup can tolerate transient nodes - it'll substantially decrease your burn rate (usually 1/10th the on-demand price), allowing for even larger and more ambitious experiments :)
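To make the savings concrete, here's a back-of-the-envelope sketch. The hourly price and cluster size below are made-up placeholders, not real AWS quotes; only the "1/10th the price" ratio comes from the comment above:

```python
# Hypothetical numbers for illustration only.
on_demand_price = 0.40   # $/hour per node (placeholder, not a real quote)
spot_fraction = 0.10     # spot is "usually 1/10th the price"

nodes = 20
hours = 48

on_demand_cost = on_demand_price * nodes * hours
spot_cost = on_demand_price * spot_fraction * nodes * hours

print(f"on-demand: ${on_demand_cost:.2f}, spot: ${spot_cost:.2f}")
# Put differently: the same budget buys ~10x the node-hours on spot.
```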

[1]: http://commoncrawl.org/

[2]: http://gibrown.com/2014/01/09/scaling-elasticsearch-part-1-o...
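For a sense of what feeding Common Crawl records into Elasticsearch could look like, here's a minimal sketch that builds the NDJSON body expected by Elasticsearch's bulk API. The index name and record fields are hypothetical stand-ins; a real pipeline would parse documents out of the WARC files in the Common Crawl buckets rather than use inline sample data:

```python
import json

def to_bulk_ndjson(records, index_name="commoncrawl"):
    """Convert crawl records into an Elasticsearch bulk-API body.

    Each record becomes an action line followed by a document line;
    the whole body must be newline-delimited JSON ending in a newline.
    """
    lines = []
    for rec in records:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": rec["url"]}}))
        lines.append(json.dumps({"url": rec["url"], "content": rec["content"]}))
    return "\n".join(lines) + "\n"

# Hypothetical records standing in for parsed WARC entries.
records = [
    {"url": "http://example.com/", "content": "Example Domain"},
    {"url": "http://commoncrawl.org/", "content": "Common Crawl"},
]
body = to_bulk_ndjson(records)
# `body` could then be POSTed to an Elasticsearch cluster's /_bulk endpoint.
```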




I had a crawling project where I wanted to get a sense of a few ad-related things on the internet. I came upon Common Crawl and was initially excited, since I thought it would have incidentally captured the data I wanted, but I was disappointed to find that they don't do any kind of JS execution, which drastically limited its usefulness for me.


I'd never heard of Common Crawl before, but it looks like an awesome project! Keep up the good work!


How up-to-date is Common Crawl's data?



