True, but then Google doesn't just download the page source and index that. They run JavaScript in some cases to get to the actual content. This must come at a significant cost. Their index is enormous as well:
"The Google Search index contains hundreds of billions of web pages and is well over 100,000,000 gigabytes in size."
Sure, download and run the JavaScript, but then you can snapshot the DOM, grab the text, and discard all the rest. The HTML and JS are of little practical value for the index after that point.
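Something like this, as a minimal sketch (Puppeteer; the function name and wait strategy are arbitrary choices):

    import puppeteer from 'puppeteer';

    async function extractText(url: string): Promise<string> {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      // Let the page's own JavaScript run before snapshotting the DOM.
      await page.goto(url, { waitUntil: 'networkidle2' });
      // innerText is the rendered text; the HTML and JS can be thrown away after this.
      const text = await page.evaluate(() => document.body.innerText);
      await browser.close();
      return text;
    }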
Google's index is likely very large because they don't have any real economic incentive to keep it small.
>... but then you can snapshot the DOM, grab the text, and discard all the rest
Yes, absolutely, I didn't mean to imply otherwise. But first you have to figure out what you can discard beyond the HTML tags themselves to avoid indexing all the garbage that is on each and every page.
When I tried to do this I came to the conclusion that I needed to actually render the page to find out where on the page a particular piece of text was, what font size it had, if it was even visible, etc. And then there's JavaScript of course.
So what I'm saying is that storing a couple of kilobytes is probably not the most costly part of indexing a page.
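Concretely, the post-render information I mean looks roughly like this (a Puppeteer sketch; the selector list and returned fields are just for illustration):

    import puppeteer from 'puppeteer';

    // Ask the rendered page where each element sits, what font size it has,
    // and whether it is visible at all -- none of this exists in the raw HTML.
    async function inspectBlocks(url: string) {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'networkidle2' });
      const blocks = await page.evaluate(() =>
        Array.from(document.querySelectorAll('p, h1, h2, h3, li')).map(el => {
          const style = window.getComputedStyle(el);
          const rect = el.getBoundingClientRect();
          return {
            text: (el as HTMLElement).innerText,
            fontSize: parseFloat(style.fontSize),
            visible: style.display !== 'none' && style.visibility !== 'hidden'
              && rect.width > 0 && rect.height > 0,
            top: rect.top,
          };
        })
      );
      await browser.close();
      return blocks;
    }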
> When I tried to do this I came to the conclusion that I needed to actually render the page to find out where on the page a particular piece of text was, what font size it had, if it was even visible, etc. And then there's JavaScript of course.
Are there open source projects devoted to this functionality? It's becoming more and more of a sticking point for working with LLMs: grabbing the text without navigation and other crap, while maintaining formatting, links, etc.
For my specific purposes it has always been good enough to apply some simple heuristics. But that wouldn't have been possible without access to post-rendering information, which only a real browser (https://pptr.dev) can reliably produce.
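The heuristics can be as crude as something like this, applied to the per-element data a rendered page gives you (illustrative thresholds only, not the rules I actually used):

    interface RenderedBlock {
      text: string;
      fontSize: number;
      visible: boolean;
      top: number;
    }

    // Crude boilerplate filter over post-rendering information.
    function keepForIndex(block: RenderedBlock): boolean {
      if (!block.visible) return false;                 // hidden elements
      if (block.fontSize < 10) return false;            // fine print, cookie banners
      if (block.text.trim().length < 40) return false;  // nav items, button labels
      return true;
    }

    const indexableText = (blocks: RenderedBlock[]) =>
      blocks.filter(keepForIndex).map(b => b.text).join('\n');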
There are many software libraries that can output just the text from HTML or run JS. For C# there's HTML Agility Pack and PuppeteerSharp, for example. I've used them for web scraping.
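In the Node ecosystem the non-browser route looks much the same; a rough sketch with jsdom (a different library from the C# ones above):

    import { JSDOM } from 'jsdom';

    // Static extraction: parse the HTML and take the text, no browser involved.
    // This misses anything that client-side JavaScript would have rendered.
    function textFromHtml(html: string): string {
      const dom = new JSDOM(html);
      return dom.window.document.body.textContent ?? '';
    }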
You don't need to store it indefinitely though, and there's not much point in crawling faster than you can process the data.
The couple of kilobytes per document is the actual storage footprint. Sure, you need to massage the data, but that's almost entirely CPU bound. You also need a lot of RAM to keep the hot parts of the index.
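As a rough back-of-envelope, with made-up but plausible numbers (~2 KB of extracted text per document, a couple hundred billion documents):

    2 KB/doc x 2 x 10^11 docs ≈ 4 x 10^14 bytes ≈ 400 TB

That's a lot of disk, but a small fraction of Google's stated 100,000,000 GB (~100 PB), which is why I'd say storing the extracted text isn't the dominant cost.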
"The Google Search index contains hundreds of billions of web pages and is well over 100,000,000 gigabytes in size."
https://www.google.com/intl/en_uk/search/howsearchworks/how-...
Doesn't mean you have to be as big as Google to do something useful of course.