1. 10 years ago, at least 99% of web pages failed validation, and the majority still fail today. You could validate first and fall through to tag-soup processing for everything that doesn't pass (a rough sketch of that fallback follows this list).
2. 10 years ago, the conventional wisdom ( http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTO... ) was to use a compiled language, such as C, for spidering ( http://www.tbray.org/ongoing/When/200x/2003/12/03/Robots ). Given that memory grows faster than processing power, which in turn grows faster than bandwidth, that advice may no longer hold: spidering is mostly waiting on the network anyway (see the second sketch below).
3. That's the meta problem. Solve that and you may find that a search engine is easier.
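
For point 1, here is a minimal sketch of the "validate, then fall through" idea in Python: attempt a strict XML parse (which only well-formed, XHTML-ish markup survives) and drop down to a forgiving tag-soup parser for everything else. The parser choices and the extract_text helper are illustrative assumptions, not a claim about how any particular engine does it.

    # Strict parse first; fall through to tolerant tag-soup handling on failure.
    # Both parsers are from the standard library and chosen for illustration.
    import xml.etree.ElementTree as ET
    from html.parser import HTMLParser

    class SoupTextExtractor(HTMLParser):
        """Tolerant fallback: collects text even from badly nested markup."""
        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

    def extract_text(html: str) -> str:
        try:
            # Strict path: only well-formed (XHTML-ish) documents get here.
            root = ET.fromstring(html)
            return " ".join(root.itertext())
        except ET.ParseError:
            # Tag-soup path: the vast majority of real pages end up here.
            soup = SoupTextExtractor()
            soup.feed(html)
            return " ".join(soup.chunks)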
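
And for point 2, a rough sketch of why an interpreted language can keep up nowadays: crawling is dominated by network waits, so even a plain stdlib thread pool of blocking fetches tends to saturate bandwidth long before the interpreter becomes the bottleneck. The seed URL and worker count below are placeholders, not tuned values.

    # I/O-bound crawling with nothing but the standard library.
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    def fetch(url: str) -> tuple[str, int]:
        """Fetch one page; returns (url, byte count). Network time dominates."""
        with urlopen(url, timeout=10) as resp:
            return url, len(resp.read())

    def crawl(urls: list[str], workers: int = 32) -> None:
        # Dozens of concurrent fetches: the GIL barely matters because each
        # thread spends almost all of its time blocked on I/O.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for url, size in pool.map(fetch, urls):
                print(f"{size:>8} bytes  {url}")

    if __name__ == "__main__":
        crawl(["https://example.com/"] * 8)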