The irony is that (I just checked) saowen.com has stolen your "The day I found Saowen.com had stolen my content" content as well! It's right up there on their front page.
I am vehemently against Saowen.com stealing content, but in practice this is quite similar to how outline works (removing CSS or bypassing a paywall). I wonder why people (especially on HN) don't have an issue with those:
I think what outline does would be much harder to implement without JS. As it stands, it seems to parse the URL, fetch the page with a specific UA / some other magic, then automatically format it, without having to store anything on disk and without having to generate static html. Wouldn't this be much, much harder without using JS?
> without having to store anything on disk and without having to generate static html. Wouldn't this be much, much harder without using JS?
While it's possible to do without storing anything on disk (strange requirement imo - but you can keep it in memory on the server if you really want), it's a lot more practical to do this work in the backend. And that's also what outline is doing:
outline's frontend will do an AJAX request to outline's servers to actually fetch the article from a blog and serve that back[1]. So they could do this easily without frontend javascript. But I think the UX would suffer on many levels. Having a frontend in javascript allows outline to do better caching, better user experience, etc. The only downside to using JS is that it doesn't work for people still in 1995 and for people who disable their javascript.
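To illustrate the "could be done without frontend javascript" point, here is a minimal sketch of a backend-only version, assuming the server has already fetched the page HTML. It strips scripts and styles and returns a plain static page. The names `render_static` and `_TextExtractor` are made up for this example, and this is not a claim about how outline is actually implemented.

```python
from html import escape
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def render_static(page_html: str) -> str:
    """Return a minimal static HTML page holding only the visible text."""
    parser = _TextExtractor()
    parser.feed(page_html)
    parser.close()
    paragraphs = "\n".join(f"<p>{escape(p)}</p>" for p in parser.parts)
    return f"<!doctype html>\n<html><body>\n{paragraphs}\n</body></html>"
```

Everything here happens in memory on the server, so nothing has to be written to disk, and the browser only ever receives static HTML.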
I've had the same issue with saowen.com copying articles off my blog (https://sighack.com) onto their website and I'm glad OP called them out on it publicly.
Interestingly, they always seem to backdate the articles by a few days from my actual date of posting! I also noticed they link to my website, but with a nofollow attribute.
Yeah that's likely the reason. It's just funny to me that such a simple hack can confuse Google into not being able to reliably pinpoint the original source.
This happens to all of my blog posts - on dozens of "aggregator" sites. Some, like Outline, are happy to offer an opt-out. But the rest are just SEO farms - lifting content and passing it off as their own.
I occasionally report them. If they're hosted in the EU, I'm usually successful in getting them taken down. US hosting companies just don't care. Chinese & Russian companies don't even answer emails.
I'd be tempted to include blogs about/links to Tiananmen Square protests etc then report the copies to the Chinese authorities. Possibly do something similar for Russia about homosexuality or whatever.
If they are mirroring your content, figure out a signature of their crawler, and start serving "special" content to them. Perhaps a bit of JS which does a window.location check and redirects if the hostname is not your own (chances are they might do some poor hostname search and replace, so you'd have to obfuscate).
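A rough server-side sketch of the "crawler signature" idea, assuming the scraper can be recognised by a User-Agent substring or a source network. Both signature values below are placeholders (the network is a reserved documentation range), not saowen.com's real crawler, and the function names are hypothetical.

```python
import ipaddress

# Hypothetical signature: placeholder values, not a real crawler's fingerprint.
SCRAPER_UA_MARKERS = ("examplescraper",)
SCRAPER_NETWORKS = (ipaddress.ip_network("203.0.113.0/24"),)  # documentation range


def looks_like_scraper(user_agent: str, remote_ip: str) -> bool:
    """True if the request matches a known User-Agent marker or network."""
    ua = user_agent.lower()
    if any(marker in ua for marker in SCRAPER_UA_MARKERS):
        return True
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Unparseable address: treat as a normal visitor.
        return False
    return any(addr in net for net in SCRAPER_NETWORKS)


def choose_body(user_agent: str, remote_ip: str, real_body: str, decoy_body: str) -> str:
    """Serve the decoy to suspected scrapers and the real page to everyone else."""
    return decoy_body if looks_like_scraper(user_agent, remote_ip) else real_body
```

A User-Agent is trivial to fake, so in practice the network check is probably the more reliable half of the signature.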
I wonder why Google is so bad at picking up that these pages are clickfarms, and why it doesn't return the original instead when searching.
In my native language (Norwegian), I have lately had issues with searching for stuff, opening the page linked by Google, and quickly realising it's not even legible Norwegian, just auto-generated content (Google Translate of some product review I guess?). Absolutely useless content, no idea why it ranks so high. How can they not manage to filter out stuff like this?
Good point. So if Bing started getting better at detecting duplicate articles, keeping track of duplicates per domain, and ranking them lower, that would be a big improvement over Google. Doesn't sound hard to do if you've got search engine size resources.
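For a sense of what "detecting duplicate articles" could look like at toy scale, here is a sketch using word shingles and Jaccard similarity. This is a textbook near-duplicate technique picked for illustration, not a description of what Google or Bing actually do, and the function names are made up.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word windows in the text, case-insensitive."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Share of k-word shingles the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


original = "the quick brown fox jumps over the lazy dog"
scraped = "The quick brown fox jumps over the lazy dog!"
unrelated = "completely different words appear in this other sentence"

print(jaccard(original, scraped))    # identical apart from case and punctuation
print(jaccard(original, unrelated))  # no shared three-word runs
```

A score near 1.0 flags a likely copy. Deciding which of the two pages is the original is the harder part, since publication dates can be faked.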
Saowen.com is clearly a mainland Chinese website. Assuming the copying is automatic, all you have to do is post an entry on the evils of communism with pictures of Winnie the Pooh interspersed within it. Add in a few photos of the Dalai Lama and a statement like "Xi Jinping is the biggest moron in history".
Then you report it to the Chinese authorities and the penalties for Saowen will be much much higher than a Google search takedown!
I think that's a useful walk-through for anyone coming across this that wouldn't know what to do to try to get the listing removed; with the appropriate 3 pieces of evidence it looks as if it worked well.
I wonder if it's possible to automate the process - i.e. to alert the content creators that saowen.com is stealing from and help them complete the process?
I assume if each page has the markup tag
> <strong class="fn" itemprop="author">nickmchardy.com</strong>
You can then either attempt an email to info@{the domain in the author tag} or scrape that site for email addresses.
Assuming that most of these are blogs, such as the case here, hopefully there wouldn't be too many addresses on each domain. So hopefully relatively easy to do ... ?
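A minimal sketch of that automation step, assuming each copied page really does carry an `itemprop="author"` element whose text is the source domain, as in the markup quoted above. The `info@` address is only a guess at a contact mailbox, and `guess_contact_addresses` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class _AuthorParser(HTMLParser):
    """Collects the text of elements marked itemprop="author"."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._capture = False

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("itemprop") == "author":
            self._capture = True

    def handle_data(self, data):
        if self._capture and data.strip():
            self.authors.append(data.strip())
            self._capture = False

    def handle_endtag(self, tag):
        self._capture = False


def guess_contact_addresses(page_html: str) -> list:
    """Return info@<domain> guesses for author values that look like domains."""
    parser = _AuthorParser()
    parser.feed(page_html)
    parser.close()
    # Keep only values that look like a bare domain (has a dot, no spaces).
    return [f"info@{a}" for a in parser.authors if "." in a and " " not in a]


page = '<strong class="fn" itemprop="author">nickmchardy.com</strong>'
print(guess_contact_addresses(page))  # ['info@nickmchardy.com']
```

Author values that are a person's name rather than a domain are skipped, since there is no address to derive from them.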
Good idea to disable the default feed option in WordPress to avoid this kind of stuff. It won't stop a determined plagiarist but should eliminate a lot of the automated stuff?
These Chinese aggregators are all the same, just like Toutiao from Bytedance (that TikTok company), not long ago the world's most valuable startup.
Together with Tencent's WeChat and other me-too sites they created the "self-media" cottage industry, which basically lends credibility to laymen and lets them publish all sorts of lowbrow and sensationalistic content to earn ad money.
If you check https://www.bilibili.com/, a very popular Chinese video site (I'd say also a stomping ground for anime pedophiles which the site was built upon. US listed btw.), you can find pirated US TV shows and basically every popular YouTube video, reprocessed, edited and watermarked as their "original" content, raking in money for their uploaders. Toutiao again is just the same model.
Part of the reason why Bytedance is so valuable is that Chinese people just love these things. They are their de facto "news" source: the vast majority of people didn't receive higher education and don't want to read serious articles (which Chinese state-controlled media also lack).
Nobody cares if it's original, they just want entertaining, explosive and easy-to-consume content.
Copying and plagiarism is so prevalent and pervasive that these "self-media" platforms have to introduce an "original" tag.
They will copy your content and edit it as a matter of course anyway, so add whatever you like and it won't deter them.
Change some words and the paragraph order, add funny pics and snippets, alter key parts to make it much more explosive and eyecatching, voila. They even offer part-time on-line jobs for this reprocessing.
Bilibili certainly has copyright issues, but it's the same issue as with YouTube and millions of other websites with user-uploaded pirated content. I wouldn't say Bilibili is much worse than any other website.
Funny, I've been using YouTube for years, and the only pirated content I've encountered was a few TV documentaries and some unwatchable movies, and they seem to be deleted often.
Never saw Youtube promote and recommend any pirated content to me, unlike Bilibili.
Bilibili's front page is almost normal and all good; their shady stuff seems buried deep. My iPad is full of pirated YouTube videos with hundreds of thousands of views.
I mean, interesting, but after reading the first link I found this pretty good rebuttal:
So it is odd to hear today that America stole secrets from Arkwright when Arkwright himself was claiming that he had disclosed the machines in his patent specifications to a degree sufficient to make and use them.