
I wanna mention Nutch here: http://nutch.apache.org/, since it has been around for a while and a lot of thought was put into its design. For instance, while people here are discussing data stores, Nutch uses Hadoop.

The web is probably bigger than you think. In July 2008, Google announced that "our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!"

You might consider just crawling certain parts of the web, or using a search engine API (like Yahoo! BOSS) to gather relevant links and crawl from there with a depth limit. Just an idea.
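
Something like this is what I have in mind. It's only a sketch in Python: fetch and extract_links are hypothetical stand-ins for whatever HTTP client and HTML parser you end up using, MAX_DEPTH is a placeholder, and the seeds would be the links gathered from the search API.

    from collections import deque

    MAX_DEPTH = 3  # placeholder; tune to how far from your seeds you care to go

    def crawl(seed_urls, fetch, extract_links):
        # fetch(url) -> html text; extract_links(base_url, html) -> list of absolute URLs.
        # Both are hypothetical stand-ins, not a particular library's API.
        frontier = deque((url, 0) for url in seed_urls)
        seen = set(seed_urls)
        while frontier:
            url, depth = frontier.popleft()   # FIFO queue, so this crawls breadth-first
            if depth >= MAX_DEPTH:
                continue                      # the depth limit keeps the crawl bounded
            try:
                html = fetch(url)
            except Exception:
                continue                      # dead links, timeouts, etc.
            for link in extract_links(url, html):
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))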



I used to be on the Google indexing team. Disregarding limits on the length of URLs, the size of the visible web is already infinite. For instance, there are many calendar pages out there that will happily give you month after month ad infinitum if you keep following the "next" link.

Now, depending on how you prune your crawl to get rid of "uninteresting" content (such as infinite calendars) and how you deduplicate the pages you find, you'll come up with vastly varying estimates of how big the visible web is.
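
To make that concrete, even something as crude as the following changes your count a lot, depending on how aggressively you normalize URLs and fingerprint page content. This is only an illustration of the idea, not how any real indexing pipeline does it, and it only catches exact duplicates.

    import hashlib
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def normalize_url(url):
        # Drop the fragment and sort query parameters so trivially different
        # URLs for the same page collapse into one.
        parts = urlsplit(url)
        query = urlencode(sorted(parse_qsl(parts.query)))
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path or "/", query, ""))

    def content_fingerprint(html):
        # Exact-duplicate detection only; near-duplicate detection is much harder.
        return hashlib.sha1(html.encode("utf-8")).hexdigest()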

Edit: on a side note, don't crawl the web using a naive depth-first search. You'll get stuck in some uninteresting infinitely deep branch of the web.
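
In terms of a frontier of (url, depth) pairs, the whole difference is which end you pop from. A stack keeps diving down one branch (which may be one of those infinite calendars); a queue fans out level by level, so no single branch can swallow the crawl.

    from collections import deque

    frontier = deque([("http://example.com/a", 0), ("http://example.com/b", 0)])

    # Depth-first: pop from the same end you append to (a stack).
    url, depth = frontier.pop()

    # Breadth-first: pop from the opposite end (a queue).
    url, depth = frontier.popleft()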


You're right. I forgot to write it explicitly in the article, but if someone follows the instructions (extract all <a> tags and add them to the index), that method is implied.
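
For what it's worth, with just the Python standard library that step might look something like this. It's a minimal sketch; real pages need more care with encodings, malformed markup, and so on.

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collects the href of every <a> tag, resolved against the page URL."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    extractor = LinkExtractor("http://example.com/")
    extractor.feed('<a href="/about">about</a> <a href="http://example.org/">other</a>')
    print(extractor.links)  # ['http://example.com/about', 'http://example.org/']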


1 trillion in 2008... while that's not that long ago, I wouldn't be surprised if this number has exploded since then.

Any numbers?



