1.) Don't do it yourself. Use Amazon's Alexa Web Search service (aws.amazon.com). It gives you access to Alexa's 10-billion-page index, including the pages themselves, and lets you run complex queries against it. It also plays nicely with EC2.

2.) If you must do it yourself, Heritrix is the most sophisticated crawler out there (crawler.archive.org).

3.) Nutch is an option, but nowhere near as powerful as Heritrix.

Don't try to reinvent the wheel. Writing a robust crawler is a lot of work, because there are endless edge cases to take care of (at least if you are looking into a general-purpose web crawler).
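
To give a feel for what those edge cases look like, here is a minimal Python sketch of just one small piece of a crawler: turning the links found on a page into canonical URLs so the same page is not fetched twice. The function name `normalize_url` and the rules it applies are my own illustration, not taken from Heritrix or Nutch, and a real crawler needs many more rules than this (robots.txt, redirects, session IDs in URLs, crawler traps, and so on).

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Ports that can be dropped because they are implied by the scheme.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(base, href):
    """Resolve href against base and return a canonical URL.

    Returns None for links a web crawler should not follow.
    Hypothetical helper for illustration only.
    """
    absolute = urljoin(base, href.strip())
    parts = urlsplit(absolute)

    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        return None  # skip mailto:, javascript:, ftp:, ...

    host = (parts.hostname or "").lower()  # host names are case-insensitive
    if not host:
        return None

    try:
        port = parts.port
    except ValueError:
        return None  # malformed port such as "example.com:abc"

    if ":" in host:
        host = f"[{host}]"  # restore brackets around IPv6 literals

    # Drop the port when it is the default one for the scheme.
    # Note: any user:password part of the URL is deliberately discarded.
    if port in (None, DEFAULT_PORTS[scheme]):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    path = parts.path or "/"  # "http://example.com" and ".../" are the same page

    # The fragment is dropped: it never reaches the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def new_links(base, hrefs, seen):
    """Return normalized links not seen before, and record them in seen."""
    fresh = []
    for href in hrefs:
        url = normalize_url(base, href)
        if url is not None and url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

Even this tiny piece has to deal with relative links, mixed-case host names, default ports, fragments and non-HTTP schemes, and it still ignores things like equivalent percent-encodings or reordered query parameters. Multiply that by every other part of a crawler and you see why using an existing one is usually the better choice.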