
I've noticed that I have started doing that recently - appending reddit to my queries.

There just seems to be a load of imitation sites now: 6 different wrapper sites for GitHub, 8 for Stack Overflow, a couple for GitLab, something aggregating a load of forums. So the first couple of pages of results are the exact same content, just from 15 different sites that copy the originals.

At least with a community site there tends to be actual discussion and/or useful links to the relevant content.
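
For example, either of these works (the query itself is just an illustration; site: restricts results to the domain, while the plain keyword only biases the ranking):

    rust borrow checker error in closure reddit
    rust borrow checker error in closure site:reddit.com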




Those are infuriating. I hate to see ACTUAL content creators having their livelihoods stolen this way. Why wouldn't Google filter out the worst offenders? It takes literally one minute to get a nice list of a dozen imitation sites that nobody would miss. Maybe Google feels a little inhibited from 'choosing the winners' for all but the largest cases?


One FTE at Google could probably filter out like 99% of the SEO spam sites in technical English queries.

It would be a winning battle, since it is less work to blacklist a site than to make a high-scoring one.

I guess Google Search internally is a mess. Maybe they have no clue what they are doing or have some really bad directors and lower managers messing stuff up.

Maybe there is so much black-box ML called from thousands of Perl files that the engineers don't understand what is happening.


I often wonder how much modern IT infrastructure is simply this mess of 'we have no idea how it really works' black boxes strung together with API calls.

I suspect you're right about how much of a true understanding they (at Google) still have of the behaviour of their search engine.


>Why wouldn't Google filter out the worst offenders?

There are no Google adverts on GitHub, Stack Overflow, etc., but there are on many of the copycat sites.


I'm not sure about these days, but historically the engineers on Google search wanted to fix these problems algorithmically, rather than delisting specific sites by hand.


And, again historically, Amit Singhal and team preferred hand-tuned ranking algorithms to powerful-but-opaque L2R (learning-to-rank) approaches.


Here is my uBlock filter with hundreds of GitHub/StackOverflow copycats: https://github.com/quenhus/uBlock-Origin-dev-filter

It blocks copycats and hides them from multiple search engines. You may also use the list with uBlacklist.
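
For anyone wondering what the entries look like: they're ordinary uBlock Origin cosmetic filters that hide search result blocks linking to a copycat domain. Roughly like this (the domain is a made-up placeholder, and the .g selector is just Google's current result container class, so treat it as a sketch rather than a line copied from the list):

    ! Hide Google results that link to a known copycat domain
    google.*##.g:has(a[href*="sowrapper.example"])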


If you can do this, so can Google. This just shows they refuse to.


> If you can do this, so can Google. This just shows they refuse to.

If they immediately blocked these sites then Google would get a lot of flak for censoring the web.

I dislike these sites as much as anyone. A while back I even tweeted about[0] having a dream where I wrote a browser extension to intercept and redirect these copycat sites to the real site.
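
For what it's worth, that redirect would be roughly one declarativeNetRequest rule per copycat domain in a Chrome MV3 extension. A minimal sketch, assuming the copycat keeps the original's URL paths (the domain names here are made-up placeholders):

    [
      {
        "id": 1,
        "priority": 1,
        "action": {
          "type": "redirect",
          "redirect": { "transform": { "host": "stackoverflow.com" } }
        },
        "condition": {
          "urlFilter": "||so-mirror.example^",
          "resourceTypes": ["main_frame"]
        }
      }
    ]

In practice many copycats rewrite their URL paths, so each one would also need its own path mapping on top of this.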

In my mind this falls into the same category as phone spam. The phone networks could block these calls, but how would you feel if you knew your phone company was auto-filtering incoming calls without you having any control over that? It's a very thin line.

Hopefully one day algorithms will be smart enough to auto de-rank copycat sites or blatant plagiarism so they don't show up on the first page.

[0]: https://twitter.com/nickjanetakis/status/1473671136928018434


They already de-rank plenty of sites for countless abuses, especially for gaming search. They have been doing this for a long time, and no one has ever called it censorship. This is the first time I've heard of anyone even suggesting this.

Also, their ranking algorithm is extremely complex. To suggest one complex algorithm is censorship and another is unbiased search results is to have a very naive understanding of how search works.


>algorithms will be smart enough to auto de-rank copycat sites or blatant plagiarism

So... if google creates an algorithm to detect copycatting/plagiarism it's okay for them to deploy it, but it's not okay if they do it by hand?


> So... if google creates an algorithm to detect copycatting/plagiarism it's okay for them to deploy it, but it's not okay if they do it by hand?

No. Having thought more about my comment a day later, I don't know what a fair answer is. Being ranked on page 216 by an algorithm or being de-listed manually is basically the same outcome.


I have found that installing uBlacklist (a browser extension) and blocking these sites from search results as I encounter them helps noticeably. There are only so many of these "clone" sites that rank highly on Google, so I found it pretty easy to keep up with them for the things I usually search for. There are even shared uBlacklist lists for things like SO clones, but I haven't bothered to use them.
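
If anyone wants to try it, uBlacklist entries are just match patterns, one per line; the domains below are made-up placeholders, not real clone sites:

    *://*.sowrapper.example/*
    *://githubmirror.example/*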


Yeah, I have that one too, and search results get notably better just by adding some 20 sites to it for tech queries.

It makes me wonder how Google can mess this up.


It's not a bug, it's a feature. You search more times, see more ads.


There HAS to be a way for Google to detect that a site is a copy and de-rank it. I refuse to believe their army of PhDs can't figure this out. Google's incentives are wrong. They make more money from SEO spam with ads than from the original sites.
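
Detecting wholesale copies isn't even exotic. A toy version is just comparing word-shingle sets between two pages; real systems use things like MinHash/SimHash to do it at web scale, and none of this is a claim about what Google's pipeline actually runs:

    # Toy near-duplicate check: word n-gram shingles + Jaccard similarity.
    # Illustrative only -- not what any search engine actually uses.

    def shingles(text, n=5):
        """Set of n-word shingles for a piece of text."""
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))}

    def jaccard(a, b):
        """Jaccard similarity: |a & b| / |a | b|."""
        return len(a & b) / len(a | b) if (a or b) else 1.0

    def looks_like_copy(original, candidate, threshold=0.8):
        """Flag a candidate page whose shingles overlap heavily with the original."""
        return jaccard(shingles(original), shingles(candidate)) >= threshold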


appending "wiki" is also really useful if you're looking for straight facts


It’s sad, but I’ve also noticed I have to add "wiki" more and more, because Wikipedia is increasingly not the first result for searches where it clearly should be. Instead there’s often the stupid Google widget obviously copying Wikipedia’s content without a direct link to the actual page.


I've seen Encyclopedia Britannica ranking above Wikipedia. It was really weird; I read both, and Wikipedia was better.


We need AdBlock lists for search engines at this point.




