1. 10 years ago, at least 99% of web pages failed validation, and the majority still fail today. You could validate first and fall through to tag-soup processing for everything that doesn't pass (a rough sketch of that fallback follows this list).
2. 10 years ago, the conventional wisdom ( http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTO... ) was to use a compiled language, such as C, for spidering ( http://www.tbray.org/ongoing/When/200x/2003/12/03/Robots ). Given that memory grows faster than processing power, which in turn grows faster than bandwidth, that advice may no longer hold: spidering is mostly waiting on the network anyway (see the second sketch below).
3. That's the meta problem. Solve that and you may find that a search engine is easier.
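
For point 1, here is a minimal sketch of the "validate, then fall through" idea in Python: attempt a strict XML parse (which only well-formed, XHTML-ish markup survives) and drop down to a forgiving tag-soup parser for everything else. The parser choices and the extract_text helper are illustrative assumptions, not a claim about how any particular engine does it.

    # Strict parse first; fall through to tolerant tag-soup handling on failure.
    # Both parsers are from the standard library and chosen for illustration.
    import xml.etree.ElementTree as ET
    from html.parser import HTMLParser

    class SoupTextExtractor(HTMLParser):
        """Tolerant fallback: collects text even from badly nested markup."""
        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

    def extract_text(html: str) -> str:
        try:
            # Strict path: only well-formed (XHTML-ish) documents get here.
            root = ET.fromstring(html)
            return " ".join(root.itertext())
        except ET.ParseError:
            # Tag-soup path: the vast majority of real pages end up here.
            soup = SoupTextExtractor()
            soup.feed(html)
            return " ".join(soup.chunks)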
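
And for point 2, a rough sketch of why an interpreted language can keep up nowadays: crawling is dominated by network waits, so even a plain stdlib thread pool of blocking fetches tends to saturate bandwidth long before the interpreter becomes the bottleneck. The seed URL and worker count below are placeholders, not tuned values.

    # I/O-bound crawling with nothing but the standard library.
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    def fetch(url: str) -> tuple[str, int]:
        """Fetch one page; returns (url, byte count). Network time dominates."""
        with urlopen(url, timeout=10) as resp:
            return url, len(resp.read())

    def crawl(urls: list[str], workers: int = 32) -> None:
        # Dozens of concurrent fetches: the GIL barely matters because each
        # thread spends almost all of its time blocked on I/O.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for url, size in pool.map(fetch, urls):
                print(f"{size:>8} bytes  {url}")

    if __name__ == "__main__":
        crawl(["https://example.com/"] * 8)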