In addition to porn not being filtered out and the result ordering not prioritizing 'quality' sources, some of the indexed site data is at least 4-6 months old and has changed heavily since the last crawl. I even got 404 errors. That makes it hard to find much use for the project beyond academic interest.
Here is some input based on my experience building a similar project at my former company (we did not quite get to 2B pages, but were close to ~300M):
For a really viable (alternative) search engine, the freshness of your index is going to be a fairly important factor. Obviously, re-crawling a massive index frequently and regularly is going to consume huge amounts of bandwidth and CPU cycles. Here is how we optimized resource utilization:
For each indexed URL, store a 'Last Crawled' timestamp.
For each indexed URL, also store a sort of 'crawl history' (if space is a constraint, don't store every version of the page, only the latest one). On each re-crawl, record two data fields: the timestamp and a boolean indicating whether the URL's content has changed since the last crawl. As more re-crawl cycles run, you will be able to estimate the 'update frequency' of each URL. Then prioritize the re-crawls based on the update-frequency score (i.e. re-crawl those with higher scores more frequently and the others less frequently); a rough sketch of this scheduling is below.
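For what it's worth, here is a minimal sketch of what that scheduling logic could look like. This is illustrative only, not our actual code: the names (CrawlRecord, update_frequency, pick_urls_to_recrawl) are made up for the example, it assumes the changed/unchanged check is done by hashing the fetched content, and it keeps everything in an in-memory dict where a real crawler would use the index's own datastore.

    import heapq
    from dataclasses import dataclass

    # Illustrative sketch: per-URL crawl history plus a changed/unchanged flag,
    # used to estimate an 'update frequency' and spend a limited re-crawl budget
    # on the URLs that change fastest and are most overdue.

    @dataclass
    class CrawlRecord:
        last_crawled: float = 0.0  # 'Last Crawled' timestamp (epoch seconds)
        crawls: int = 0            # number of re-crawls observed so far
        changes: int = 0           # how many of those re-crawls saw changed content
        latest_hash: str = ""      # keep only the latest version if space is tight

    def record_recrawl(rec: CrawlRecord, content_hash: str, now: float) -> None:
        """After each re-crawl, store the timestamp and whether the content changed."""
        changed = content_hash != rec.latest_hash
        rec.crawls += 1
        rec.changes += int(changed)
        rec.latest_hash = content_hash
        rec.last_crawled = now

    def update_frequency(rec: CrawlRecord) -> float:
        """Rough update-frequency score: fraction of re-crawls that found changes,
        smoothed so newly indexed URLs don't start at zero."""
        return (rec.changes + 1) / (rec.crawls + 2)

    def pick_urls_to_recrawl(index: dict[str, CrawlRecord], budget: int, now: float) -> list[str]:
        """Spend a limited crawl budget on the most 'overdue' URLs, scored by
        (time since last crawl) * (estimated update frequency)."""
        scored = ((update_frequency(rec) * (now - rec.last_crawled), url)
                  for url, rec in index.items())
        return [url for _, url in heapq.nlargest(budget, scored)]

A real implementation would also want to decay or cap old observations so a page whose behaviour changes (suddenly static, or suddenly very active) gets re-scored within a few crawl cycles, but the basic prioritization idea is the same.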
If you need any more help/input, let me know and I'll be happy to do what I can.
We also (obviously) built a (proprietary) ranking algo that took some 60+ individual factors into account. If it would be of any help, I'll create a list and send it to you.
Good idea. However, I'll need to really exercise the gray cells to put the list together, so it might take me a couple of days. Once it's done, I'll post it here.