In addition to porn not being filtered out and the result ordering not prioritizing 'quality' sources, some of the indexed site data is at least 4-6 months old and has changed heavily since the last crawl. I even got 404 errors. That makes it hard to find much use for the project beyond academic interest.
Here is some input based on my experience building a similar project at my former company (we did not quite get to 2B pages, but were close to ~300M):
For a really viable (alternative) search engine, the freshness of your index is going to be a fairly important factor. Obviously, re-crawling a massive index frequently and regularly is going to consume huge amounts of bandwidth and CPU cycles. Here is how we optimized resource utilization:
For each indexed URL, store a 'Last Crawled' timestamp.
For each indexed URL, also store a sort of 'crawl history' (if space is a constraint, don't store every version of the page, only the latest one). On each re-crawl, record two data fields: the timestamp and a boolean indicating whether the URL's content has changed since the last crawl. As more re-crawl cycles run, you will be able to estimate the 'update frequency' of each URL. Then prioritize the re-crawls based on the update-frequency score (i.e. re-crawl those with higher scores more frequently and the others less frequently); a rough sketch of this scheduling is below.
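For what it's worth, here is a minimal sketch of what that scheduling logic could look like. This is illustrative only, not our actual code: the names (CrawlRecord, update_frequency, pick_urls_to_recrawl) are made up for the example, it assumes the changed/unchanged check is done by hashing the fetched content, and it keeps everything in an in-memory dict where a real crawler would use the index's own datastore.

    import heapq
    from dataclasses import dataclass

    # Illustrative sketch: per-URL crawl history plus a changed/unchanged flag,
    # used to estimate an 'update frequency' and spend a limited re-crawl budget
    # on the URLs that change fastest and are most overdue.

    @dataclass
    class CrawlRecord:
        last_crawled: float = 0.0  # 'Last Crawled' timestamp (epoch seconds)
        crawls: int = 0            # number of re-crawls observed so far
        changes: int = 0           # how many of those re-crawls saw changed content
        latest_hash: str = ""      # keep only the latest version if space is tight

    def record_recrawl(rec: CrawlRecord, content_hash: str, now: float) -> None:
        """After each re-crawl, store the timestamp and whether the content changed."""
        changed = content_hash != rec.latest_hash
        rec.crawls += 1
        rec.changes += int(changed)
        rec.latest_hash = content_hash
        rec.last_crawled = now

    def update_frequency(rec: CrawlRecord) -> float:
        """Rough update-frequency score: fraction of re-crawls that found changes,
        smoothed so newly indexed URLs don't start at zero."""
        return (rec.changes + 1) / (rec.crawls + 2)

    def pick_urls_to_recrawl(index: dict[str, CrawlRecord], budget: int, now: float) -> list[str]:
        """Spend a limited crawl budget on the most 'overdue' URLs, scored by
        (time since last crawl) * (estimated update frequency)."""
        scored = ((update_frequency(rec) * (now - rec.last_crawled), url)
                  for url, rec in index.items())
        return [url for _, url in heapq.nlargest(budget, scored)]

A real implementation would also want to decay or cap old observations so a page whose behaviour changes (suddenly static, or suddenly very active) gets re-scored within a few crawl cycles, but the basic prioritization idea is the same.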
If you need any more help/input, let me know and I'll be happy to do what I can.
We also (obviously) built a (proprietary) ranking algo that took some 60+ individual factors into account. If it would be of any help, I'll create a list and send it to you.
Good idea. However, I'll need to really exercise the gray cells to put the list together, so it might take me a couple of days. Once it's done, I'll post it here.