I concur with the Nutch vote. More specifically, take a look at the crawler code in the src trunk that is written for use with Hadoop; that is probably a good place to start. Also worth a look is Heritrix (the crawler for archive.org): http://sourceforge.net/projects/archive-crawler Sadly, this too is written in Java.

The only Python crawler I am aware of for which code is available is: http://sourceforge.net/projects/ruya/

Edit: You might also want to take a look at http://wiki.apache.org/hadoop/AmazonEC2

Edit2: Polybot is another Python-based crawler, but no code is available. However, the paper has some interesting ideas:

Design and Implementation of a High-Performance Distributed Web Crawler. V. Shkapenyuk and T. Suel. IEEE International Conference on Data Engineering, February 2002. http://cis.poly.edu/westlab/polybot/