True, but then Google doesn't just download the page source and index that. They run JavaScript in some cases to get to the actual content. This must come at a significant cost. Their index is enormous as well:
"The Google Search index contains hundreds of billions of web pages and is well over 100,000,000 gigabytes in size."
Sure, download and run the JavaScript, but then you can snapshot the DOM, grab the text, and discard all the rest. The HTML and JS are of little practical value for the index after that point.
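Something like this, as a minimal sketch (Puppeteer; the function name and wait strategy are arbitrary choices):

    import puppeteer from 'puppeteer';

    async function extractText(url: string): Promise<string> {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      // Let the page's own JavaScript run before snapshotting the DOM.
      await page.goto(url, { waitUntil: 'networkidle2' });
      // innerText is the rendered text; the HTML and JS can be thrown away after this.
      const text = await page.evaluate(() => document.body.innerText);
      await browser.close();
      return text;
    }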
Google's index is likely very large because they don't have any real economic incentive to keep it small.
>... but then you can snapshot the DOM, grab the text, and discard all the rest
Yes, absolutely, I didn't mean to imply otherwise. But first you have to figure out what you can discard beyond the HTML tags themselves to avoid indexing all the garbage that is on each and every page.
When I tried to do this I came to the conclusion that I needed to actually render the page to find out where on the page a particular piece of text was, what font size it had, if it was even visible, etc. And then there's JavaScript of course.
So what I'm saying is that storing a couple of kilobytes is probably not the most costly part of indexing a page.
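Concretely, the post-render information I mean looks roughly like this (a Puppeteer sketch; the selector list and returned fields are just for illustration):

    import puppeteer from 'puppeteer';

    // Ask the rendered page where each element sits, what font size it has,
    // and whether it is visible at all -- none of this exists in the raw HTML.
    async function inspectBlocks(url: string) {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'networkidle2' });
      const blocks = await page.evaluate(() =>
        Array.from(document.querySelectorAll('p, h1, h2, h3, li')).map(el => {
          const style = window.getComputedStyle(el);
          const rect = el.getBoundingClientRect();
          return {
            text: (el as HTMLElement).innerText,
            fontSize: parseFloat(style.fontSize),
            visible: style.display !== 'none' && style.visibility !== 'hidden'
              && rect.width > 0 && rect.height > 0,
            top: rect.top,
          };
        })
      );
      await browser.close();
      return blocks;
    }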
> When I tried to do this I came to the conclusion that I needed to actually render the page to find out where on the page a particular piece of text was, what font size it had, if it was even visible, etc. And then there's JavaScript of course.
Are there open source projects devoted to this functionality? It's becoming more and more of a sticking point for working with LLMs: grabbing the text without navigation and other crap, while maintaining formatting, links, etc.
For my specific purposes it has always been good enough to apply some simple heuristics. But that wouldn't have been possible without access to post-rendering information, which only a real browser (https://pptr.dev) can reliably produce.
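The heuristics can be as crude as something like this, applied to the per-element data a rendered page gives you (illustrative thresholds only, not the rules I actually used):

    interface RenderedBlock {
      text: string;
      fontSize: number;
      visible: boolean;
      top: number;
    }

    // Crude boilerplate filter over post-rendering information.
    function keepForIndex(block: RenderedBlock): boolean {
      if (!block.visible) return false;                 // hidden elements
      if (block.fontSize < 10) return false;            // fine print, cookie banners
      if (block.text.trim().length < 40) return false;  // nav items, button labels
      return true;
    }

    const indexableText = (blocks: RenderedBlock[]) =>
      blocks.filter(keepForIndex).map(b => b.text).join('\n');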
There are many software libraries that can output just the text from HTML or run JS. For C# there's HTML Agility Pack and PuppeteerSharp, for example. I've used them for web scraping.
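In the Node ecosystem the non-browser route looks much the same; a rough sketch with jsdom (a different library from the C# ones above):

    import { JSDOM } from 'jsdom';

    // Static extraction: parse the HTML and take the text, no browser involved.
    // This misses anything that client-side JavaScript would have rendered.
    function textFromHtml(html: string): string {
      const dom = new JSDOM(html);
      return dom.window.document.body.textContent ?? '';
    }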
You don't need to store it indefinitely though, and there's not much point in crawling faster than you can process the data.
The couple of kilobytes per document is the actual storage footprint. Sure, you need to massage the data, but that's almost entirely CPU bound. You also need a lot of RAM to keep the hot parts of the index.
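As a rough back-of-envelope, with made-up but plausible numbers (~2 KB of extracted text per document, a couple hundred billion documents):

    2 KB/doc x 2 x 10^11 docs ≈ 4 x 10^14 bytes ≈ 400 TB

That's a lot of disk, but a small fraction of Google's stated 100,000,000 GB (~100 PB), which is why I'd say storing the extracted text isn't the dominant cost.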
"The Google Search index contains hundreds of billions of web pages and is well over 100,000,000 gigabytes in size."
https://www.google.com/intl/en_uk/search/howsearchworks/how-...
Doesn't mean you have to be as big as Google to do something useful of course.