
I wanna mention Nutch here: http://nutch.apache.org/, since it has been around for a while and a lot of thought was put into its design. For instance, while people here are discussing data stores, Nutch uses Hadoop.

The web is probably bigger than you think. In July 2008, Google announced that "our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!"

You might consider just crawling certain parts of the web, or using a search engine API (like Yahoo! BOSS) to gather relevant links and crawl from there with a depth limit. Just an idea.
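
Something like this is what I have in mind. It's only a sketch in Python: fetch and extract_links are hypothetical stand-ins for whatever HTTP client and HTML parser you end up using, MAX_DEPTH is a placeholder, and the seeds would be the links gathered from the search API.

    from collections import deque

    MAX_DEPTH = 3  # placeholder; tune to how far from your seeds you care to go

    def crawl(seed_urls, fetch, extract_links):
        # fetch(url) -> html text; extract_links(base_url, html) -> list of absolute URLs.
        # Both are hypothetical stand-ins, not a particular library's API.
        frontier = deque((url, 0) for url in seed_urls)
        seen = set(seed_urls)
        while frontier:
            url, depth = frontier.popleft()   # FIFO queue, so this crawls breadth-first
            if depth >= MAX_DEPTH:
                continue                      # the depth limit keeps the crawl bounded
            try:
                html = fetch(url)
            except Exception:
                continue                      # dead links, timeouts, etc.
            for link in extract_links(url, html):
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))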



I used to be on the Google indexing team. Disregarding limits on the length of URLs, the size of the visible web is already infinite. For instance, there are many calendar pages out there that will happily give you month after month ad infinitum if you keep following the "next" link.

Now, depending on how you prune your crawl to get rid of "uninteresting" content (such as infinite calendars) and how you deduplicate the pages you find, you'll come up with vastly varying estimates of how big the visible web is.
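
To make that concrete, even something as crude as the following changes your count a lot, depending on how aggressively you normalize URLs and fingerprint page content. This is only an illustration of the idea, not how any real indexing pipeline does it, and it only catches exact duplicates.

    import hashlib
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def normalize_url(url):
        # Drop the fragment and sort query parameters so trivially different
        # URLs for the same page collapse into one.
        parts = urlsplit(url)
        query = urlencode(sorted(parse_qsl(parts.query)))
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path or "/", query, ""))

    def content_fingerprint(html):
        # Exact-duplicate detection only; near-duplicate detection is much harder.
        return hashlib.sha1(html.encode("utf-8")).hexdigest()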

Edit: on a side note, don't crawl the web using a naive depth-first search. You'll get stuck in some uninteresting infinitely deep branch of the web.
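
In terms of a frontier of (url, depth) pairs, the whole difference is which end you pop from. A stack keeps diving down one branch (which may be one of those infinite calendars); a queue fans out level by level, so no single branch can swallow the crawl.

    from collections import deque

    frontier = deque([("http://example.com/a", 0), ("http://example.com/b", 0)])

    # Depth-first: pop from the same end you append to (a stack).
    url, depth = frontier.pop()

    # Breadth-first: pop from the opposite end (a queue).
    url, depth = frontier.popleft()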


You're right. I forgot to write it explicitly in the article, but if someone follows the instructions (extract all <a> tags and add them to the index), that method is implied.
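
For what it's worth, with just the Python standard library that step might look something like this. It's a minimal sketch; real pages need more care with encodings, malformed markup, and so on.

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collects the href of every <a> tag, resolved against the page URL."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    extractor = LinkExtractor("http://example.com/")
    extractor.feed('<a href="/about">about</a> <a href="http://example.org/">other</a>')
    print(extractor.links)  # ['http://example.com/about', 'http://example.org/']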


1 trillion in 2008... while that's not that long ago, I wouldn't be surprised if this number has exploded since then.

Any numbers?



