
Crawlers are one of those projects that are honestly best left to someone else: fun as a hobby, a nightmare to get right, and someone has already done the work for you. The exception is limited-use tools like Wget, which can give you practical results for small-domain retrieval but will kill you on CPU and memory and won't scale; if you need to support large-scale crawls, use a better tool or customize an existing one.
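For the small-domain case, a rough sketch of the kind of pull Wget handles well (the target URL is a placeholder, and the flags shown are just one reasonable, polite configuration):

    import subprocess

    # Mirror a small site with wget, staying polite and bounded.
    subprocess.run([
        "wget",
        "--recursive",            # follow links
        "--level=2",              # limit crawl depth
        "--wait=1",               # pause between requests
        "--no-parent",            # stay under the start path
        "--domains=example.com",  # never leave the target domain
        "https://example.com/",
    ], check=True)

Past a single domain or a few thousand pages, this approach falls apart: one process, blocking fetches, and no shared frontier.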

Some of the "little things" matter much more than your content analyzer or HTTP parsing - DNS performance and multi-homing being just two examples that can have drastic effects.
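To make the DNS point concrete, here's a minimal sketch of two easy wins: caching lookups and overlapping them, so blocking resolution doesn't serialize the whole crawl. The hostnames are placeholders, and a real crawler would use an async resolver with TTL-aware caching rather than the standard library:

    import socket
    from concurrent.futures import ThreadPoolExecutor
    from functools import lru_cache

    @lru_cache(maxsize=65536)
    def resolve(host: str) -> str:
        # socket.gethostbyname blocks; caching amortizes repeated
        # lookups for hosts the crawler hits many times.
        return socket.gethostbyname(host)

    hosts = ["example.com", "example.org", "example.net"]

    # Overlap lookups for distinct hosts instead of resolving serially.
    with ThreadPoolExecutor(max_workers=16) as pool:
        for host, addr in zip(hosts, pool.map(resolve, hosts)):
            print(host, "->", addr)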

Just as an example of how complex it gets, here's a brief overview of the concerns every crawler has to take into account: http://en.wikipedia.org/wiki/Web_crawler
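To pick one item from that list, even basic politeness means checking robots.txt before every fetch. A minimal sketch, assuming a hypothetical user-agent string "ExampleBot"; a real crawler would cache the parsed rules per host and layer rate limits and revisit policies on top:

    from urllib import robotparser
    from urllib.parse import urlparse

    def allowed(url: str, agent: str = "ExampleBot") -> bool:
        parts = urlparse(url)
        rp = robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()  # fetches and parses robots.txt (network call)
        return rp.can_fetch(agent, url)

    if __name__ == "__main__":
        print(allowed("https://en.wikipedia.org/wiki/Web_crawler"))

And that's before selection policy, URL canonicalization, duplicate detection, and everything else on that page.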


