
1.) Don't do it yourself. Use Amazon's Alexa Web Search service (aws.amazon.com). Through it you can access Alexa's 10-billion-page index, including the pages themselves, and run complex queries against it. It plays nicely with EC2.

2.) If you must do it yourself, Heritrix is the most sophisticated crawler out there (crawler.archive.org).

3.) Nutch is an option, but nowhere near as powerful as Heritrix.

Don't try to reinvent the wheel. Writing a robust crawler is a lot of work, as there are endless edge cases to take care of (at least if you are looking at a general-purpose web crawler).
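
To give a feel for the edge cases, here is a minimal sketch of just one small piece: turning a raw href into a canonical URL so the same page is not crawled twice. The function name `normalize_link` and the exact rules are my own assumptions for illustration, not how Heritrix or Nutch do it, and a real crawler needs far more (robots.txt, politeness delays, redirects, encodings, spider traps).

```python
from urllib.parse import urljoin, urldefrag, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_link(base_url, href):
    """Return a canonical absolute URL for href, or None if it should be skipped.

    Hypothetical helper: the rules here are illustrative, not exhaustive.
    """
    # Resolve relative links ("../c.html", "?q=1") against the page URL.
    absolute = urljoin(base_url, href.strip())
    # Drop the fragment: "page#top" and "page" are the same document.
    absolute, _fragment = urldefrag(absolute)
    parts = urlsplit(absolute)
    # Skip mailto:, javascript:, ftp: and anything else that is not plain web.
    if parts.scheme not in DEFAULT_PORTS:
        return None
    host = parts.hostname  # already lowercased by urlsplit
    if not host:
        return None
    try:
        port = parts.port
    except ValueError:
        return None  # malformed port such as "example.com:abc"
    # Keep the port only when it is not the default for the scheme.
    # Any user:password@ prefix is dropped on purpose.
    netloc = host
    if port is not None and port != DEFAULT_PORTS[parts.scheme]:
        netloc = f"{host}:{port}"
    # An empty path and "/" refer to the same resource.
    path = parts.path or "/"
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))

seen = set()
for href in ["../c.html#top", "/c.html", "mailto:a@b.com"]:
    url = normalize_link("http://Example.com/a/b.html", href)
    if url is not None:
        seen.add(url)
```

Even this toy version ignores trailing slashes, percent-encoding differences, session IDs in query strings and so on, which is rather the point.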




Nutch is good and I would second it, but I would suggest NOT building a crawler. It's not trivial, and it's ill-advised in a startup unless your startup is just about building a crawler.



