
I concur with the Nutch vote. More specifically, take a look at the crawler code in the src trunk that was written for use with Hadoop; that is probably a good place to start. Also worth a look is Heritrix (the crawler for archive.org): http://sourceforge.net/projects/archive-crawler Sadly, this too is written in Java.

The only Python crawler I am aware of with code available is: http://sourceforge.net/projects/ruya/

Edit: You might also want to take a look at http://wiki.apache.org/hadoop/AmazonEC2

Edit2: Polybot is another Python-based crawler, but no code is available. However, the paper has some interesting ideas:

Design and Implementation of a High-Performance Distributed Web Crawler. V. Shkapenyuk and T. Suel. IEEE International Conference on Data Engineering, February 2002. http://cis.poly.edu/westlab/polybot/




Good response. We've created a basic crawler in Python, but are looking for something more powerful too. Heritrix above looks good.



