
Seems like an easy solution to this problem would use two functions.

The first function takes the output of the page and renders it, so that only what's actually visible to the user gets indexed. So no headers, no JSON data, nothing, unless it's in the final rendered page. This would require jsdom or some other DOM implementation. Hardly hard for Google (Chrome) to achieve, and it's been done multiple times.

The second function makes the same call twice, passing the page to function one each time, then compares the two results. If you make two calls right after each other and some data differs between them, you discard that data from your search index. You only index data that appears in both calls.
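
Here's a rough sketch of the two functions in Python, just to make the idea concrete. Big caveat: it uses the stdlib HTML parser as a stand-in for a real DOM implementation, so it does not execute any JavaScript the way jsdom or headless Chrome would. The names `visible_text`, `stable_text`, `HIDDEN_TAGS` and the `fetch` callable are all made up for illustration, and the list of hidden tags is a simplification.

```python
from html.parser import HTMLParser

# Tags whose contents are never shown to the user.
# Simplified, assumed list; a real renderer would also honor CSS and scripts.
HIDDEN_TAGS = {"script", "style", "head", "template", "noscript"}


class VisibleTextParser(HTMLParser):
    """Collects text chunks that sit outside of hidden tags."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.hidden_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Function one: reduce a page to its user-visible text chunks."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


def stable_text(fetch, url):
    """Function two: fetch twice, keep only chunks present in both results.

    `fetch` is any callable that takes a URL and returns HTML as a string.
    """
    first = visible_text(fetch(url))
    second = set(visible_text(fetch(url)))
    return [chunk for chunk in first if chunk in second]
```

With a real crawler you'd pass in whatever does the HTTP request (or the headless render) as `fetch`. Anything that differs between the two back-to-back fetches, like a rotating ad label, drops out of the result.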

Now you don't have the issue of "dynamic content" anymore...



Typically dynamic content doesn't change from second to second; it changes after 5 minutes, an hour, or a day. It's also extremely site-specific.

But I do like your idea.

To go a bit further on your idea: you could apply machine learning to analyse the changes. For example, ML could determine what is probably the "content area" of the page by building a NN for each website whose training data self-expires after about a month (to account for redesigns over time).

The major problem will still be ads in the middle of the content, especially those odd scrolling ads that show a different "picture" at each scroll position, as well as video ads that are likely to be different in each screenshot.

Another form of ad is the "linked" word, where random words in a paragraph become links to shitty websites that define the word but show a bunch of other ads.

I suppose Google could simply install uBlock in its training data collector harness to help with that stuff. >()



