
Besides the unfiltered porn and a result ordering that does not prioritize "quality" sources, some of the indexed site data is at least 4-6 months old, and the sites have changed heavily since the last crawl. I even got 404 errors. That makes it very hard to find real use in the project beyond academic interest.


A fresh recrawl is currently running. Should take about 2-3 months. Newly crawled data will gradually replace older data during that time.


Great work, congrats. :-)

Here is some input based on my experience building a similar project at my former company. (We did not get to 2B pages, but we came close to 300M.)

For creating a really viable (alternative) search engine, the freshness of your index is going to be a fairly important factor. Obviously, re-crawling a massive index regularly is going to consume huge amounts of bandwidth and CPU cycles. Here is how we optimized resource utilization:

Corresponding to each indexed URL, store a 'Last Crawled' time-stamp.

Corresponding to each indexed URL, also store a sort of 'crawl history'. (If space is a constraint, don't store every version of the page content; store only the latest one.) On each re-crawl, store two data fields: a time-stamp and a boolean indicating whether the URL's content has changed since the last crawl. As more re-crawl cycles run, you will be able to calculate/predict the 'update frequency' of each URL. Then prioritize the re-crawls based on the update frequency score (i.e. re-crawl those with higher scores more frequently and the others less frequently).
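
To make the two steps above concrete, here is a rough sketch in Python. It is not our original code: the names (UrlRecord, recrawl_interval, due_for_recrawl), the 1-day and 30-day bounds, and the choice to score a URL as "fraction of re-crawls that saw a change" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

DAY = 86400.0
MIN_INTERVAL = 1 * DAY    # assumed: fastest re-crawl cadence
MAX_INTERVAL = 30 * DAY   # assumed: slowest re-crawl cadence


@dataclass
class UrlRecord:
    """Per-URL crawl state (hypothetical structure)."""
    url: str
    last_crawled: float = 0.0  # 'Last Crawled' time-stamp, seconds
    # Crawl history: one (time-stamp, changed?) pair per re-crawl.
    history: List[Tuple[float, bool]] = field(default_factory=list)

    def record_crawl(self, timestamp: float, changed: bool) -> None:
        self.history.append((timestamp, changed))
        self.last_crawled = timestamp

    def update_frequency(self) -> float:
        """Score in [0, 1]: share of recorded crawls where content changed."""
        if not self.history:
            return 1.0  # assumption: treat never-crawled URLs as high priority
        changes = sum(1 for _, changed in self.history if changed)
        return changes / len(self.history)


def recrawl_interval(record: UrlRecord) -> float:
    """Higher score gives a shorter interval, i.e. more frequent re-crawls."""
    score = record.update_frequency()
    return MAX_INTERVAL - score * (MAX_INTERVAL - MIN_INTERVAL)


def due_for_recrawl(records: List[UrlRecord], now: float) -> List[UrlRecord]:
    """URLs whose interval has elapsed, highest update frequency first."""
    due = [r for r in records if now - r.last_crawled >= recrawl_interval(r)]
    return sorted(due, key=lambda r: r.update_frequency(), reverse=True)
```

With this, a page that changed on every visit gets revisited about daily, while one that never changed waits the full 30 days. A real crawler would probably weight recent history more heavily and cap the stored history length, but the idea is the same.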

If you need any more help/input, let me know and I'll be happy to do what I can.

HTH and all the best moving forward.


We also (obviously) built a (proprietary) ranking algo that took into account some 60+ individual factors. If it can be of any help, I'll create a list and send it to you.


Why not write that list here?


Good idea. However, I'll need to really exercise the gray cells to put together the list so it might take me a couple of days. Once done, I'll post it here.



