
One random thought... Google goes to SomeWebsite.com. The site has only enough HTML to load a big ol' JavaScript app, which Google slowly crawls. Well, that JS app makes a bunch of AJAX calls. There's no reason I can think of that would prevent Google from remembering which AJAX calls were made, and then just crawling the URLs for those calls on subsequent visits. Why load SomeWebsite.com's JavaScript every time you want to index the site, when you can just remember that the JS calls SomeWebsite.com/some-endpoint.json? Sucking the JSON out of an endpoint might even be faster than indexing regular HTML. Haven't written a lot of crawlers, so I'm mostly guessing here.
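A rough sketch of that "remember the endpoints" idea, assuming a headless-browser crawler (Puppeteer here; the URL and the idea of caching the result are my additions, not anything Google has confirmed):

    // First visit: render the page and record which XHR/fetch URLs the JS hits.
    // Later visits could fetch those JSON endpoints directly instead of re-rendering.
    import puppeteer from 'puppeteer';

    async function discoverEndpoints(url: string): Promise<string[]> {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      const endpoints = new Set<string>();

      page.on('request', req => {
        const type = req.resourceType();
        if (type === 'xhr' || type === 'fetch') endpoints.add(req.url());
      });

      await page.goto(url, { waitUntil: 'networkidle0' }); // let the JS app fire its calls
      await browser.close();
      return [...endpoints];
    }

    // e.g. discoverEndpoints('https://SomeWebsite.com').then(console.log);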


Crawling AJAX data alone makes no sense because it could be just a piece of JSON, and Google needs a rendered HTML page with a URL it can show in the results. If you have data that isn't available at a separate URL (e.g. it's loaded when the user presses a button), it won't be indexed.
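For example, something like this (endpoint and element ids are made up) has no URL of its own for Google to land on, so the reviews never show up as an indexable page:

    // The reviews load only after a click, so there's no separate URL for them.
    type Review = { author: string; text: string };

    document.querySelector('#show-reviews')?.addEventListener('click', async () => {
      const res = await fetch('/api/reviews.json');   // only fetched on user action
      const reviews: Review[] = await res.json();
      const list = document.querySelector('#reviews')!;
      list.innerHTML = reviews.map(r => `<li>${r.author}: ${r.text}</li>`).join('');
    });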


Because then the crawler would be assuming that the main page with the JS loader on it never changes. If it never loads that page and always requests the XMLHttpRequest endpoints directly, then changes to the site as a whole might never be indexed.
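One cheap way around that (just a sketch, all names made up): re-fetch the loader HTML, hash it, and only reuse the remembered JSON endpoints if nothing changed; otherwise fall back to a full JS render to rediscover them. It still misses changes buried in the JS bundles themselves unless you fingerprint those too.

    import { createHash } from 'node:crypto';

    // Fingerprint the loader page so cached endpoints are only trusted
    // while the HTML stays identical between crawls.
    async function pageFingerprint(url: string): Promise<string> {
      const html = await (await fetch(url)).text();
      return createHash('sha256').update(html).digest('hex');
    }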


You still have to run the JS on the original page to know which text loaded from the JSON actually gets shown. You might have N pages loading the same file but showing different things.
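Rough illustration of that case: every page fetches the same products.json, but the URL path decides what gets rendered, so the JSON alone tells you nothing about what /shoes vs. /hats actually displays (file name and fields are assumptions for illustration):

    type Product = { name: string; category: string };

    async function render(): Promise<void> {
      const category = location.pathname.slice(1);     // "shoes" on /shoes, "hats" on /hats
      const res = await fetch('/api/products.json');   // same file for every page
      const products: Product[] = await res.json();
      const visible = products.filter(p => p.category === category);
      document.body.innerHTML = visible.map(p => `<p>${p.name}</p>`).join('');
    }

    render();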



