The day I found Saowen.com had stolen my content

Tor3 · on Dec 20, 2018

The irony is that (I just checked) saowen.com has stolen your "The day I found Saowen.com had stolen my content" content as well! It's right up there at their front page.

bausshf · on Dec 20, 2018

That's absolutely hilarious!

cm2187 · on Dec 20, 2018

And making money out of all the page views that hacker news is bringing them!

avaku · on Dec 20, 2018

I hope this doesn't lead to an infinite loop ;)

ar-jan · on Dec 20, 2018

I tried to document this on archive.org but got "This url is not available on the live web or can not be archived."

H4CK3RM4N · on Dec 20, 2018

Try archive.is

ar-jan · on Dec 20, 2018

I did, but it gave garbled content.

paraditedc · on Dec 20, 2018

It's on Saowen.com now as well:

https://www.saowen.com/a/3b4deadad676021a51cbcee5d5fc1ec4269...

I am vehemently against Saowen.com stealing content, but in practice this is quite similar to how Outline works (to remove CSS or bypass a paywall). I wonder why people (especially on HN) don't take issue with those:

https://outline.com/DnVGCw

https://www.outline.com/dmca.html

Edit: Replace "Not to defend the site" with "I am vehemently against Saowen.com stealing content"

chmod775 · on Dec 20, 2018

Outline makes articles more readable instead of making them worse by adding a metric ton of ads and trackers.

Outline is better from a moral standpoint and closer to, say, the Internet Archive (though not quite).

paraditedc · on Dec 20, 2018

My bad. I totally missed the ads and trackers part (because I use uBlock Origin).

forgotmypw2 · on Dec 20, 2018

only if you have js enabled, otherwise it's a blank page

9935c101ab17a66 · on Dec 20, 2018

I think what outline does would be much harder to implement without JS. As it stands, it seems to parse the URL, fetch the page with a specific UA / some other magic, then automatically format it, without having to store anything on disk and without having to generate static html. Wouldn't this be much, much harder without using JS?

askmike · on Dec 20, 2018

> without having to store anything on disk and without having to generate static html. Wouldn't this be much, much harder without using JS?

While it's possible to do without storing anything on disk (strange requirement imo - but you can keep it in memory on the server if you really want), it's a lot more practical to do this work in the backend. And that's also what outline is doing:

outline's frontend will do an AJAX request to outline's servers to actually fetch the article from a blog and serve that back[1]. So they could do this easily without frontend javascript. But I think the UX would suffer on many levels. Having a frontend in javascript allows outline to do better caching, better user experience, etc. The only downside to using JS is that it doesn't work for people still in 1995 and for people who disable their javascript.

[1]: example fetch call: https://outlineapi.com/v3/get_article?id=6ps979

edit: fixed typos

ryanlol · on Dec 20, 2018

>but this is practically quite similar to how outline works

Not at all. Outline doesn't essentially present the content as their own.

paraditedc · on Dec 20, 2018

Neither does Saowen.com. Above the title you can see:

> 2018-12-20 nickmchardy.com

The difference is that the link in Saowen.com directs to https://www.saowen.com/source/site/nickmchardy_com, whereas the one on outline goes to the source.

ryanlol · on Dec 20, 2018

I fundamentally disagree that this is effectively attributing the content.

deytempo · on Dec 20, 2018

The more they are talked about the higher google and other search engines will rank them

paraditedc · on Dec 20, 2018

I get the idea of backlinks. But I doubt our small discussion on HN would have much impact, given that the site has already copied thousands of posts at this point.

mvanga · on Dec 20, 2018

I've had the same issue with saowen.com copying articles off my blog (https://sighack.com) onto their website and I'm glad OP called them out on it publicly.

They even have an index page with a list of all my articles, updated as I post: https://www.saowen.com/source/site/sighack_com

Interestingly, they always seem to backdate the articles by a few days from my actual date of posting! I also noticed they link to my website, but with a nofollow attribute.

Completely nuts...

thecatspaw · on Dec 20, 2018

Backdating is most likely for SEO, in the hope that Google recognizes them as the original and your blog as the copy.

mvanga · on Dec 20, 2018

Yeah that's likely the reason. It's just funny to me that such a simple hack can confuse Google into not being able to reliably pinpoint the original source.

paraditedc · on Dec 20, 2018

Finally a problem that blockchain technology can solve?

cm2187 · on Dec 20, 2018

It's kind of ironic (and sad) that dereferencing from google is as good as removing the offending content from the internet.

edent · on Dec 20, 2018

This happens to all of my blog posts - on dozens of "aggregator" sites. Some, like Outline, are happy to offer an opt-out. But the rest are just SEO farms - lifting content and passing it off as their own.

I occasionally report them. If they're hosted in the EU, I'm usually successful in getting them taken down. US hosting companies just don't care. Chinese & Russian companies don't even answer emails.

It's rather frustrating.

superflyguy · on Dec 20, 2018

I'd be tempted to include blogs about/links to Tiananmen Square protests etc then report the copies to the Chinese authorities. Possibly do something similar for Russia about homosexuality or whatever.

dhimes · on Dec 20, 2018

Exactly my thought. Some really nasty base64 encoded images and stuff, if I could identify them by IP or some other way.

spydum · on Dec 20, 2018

If they are mirroring your content, figure out a signature of their crawler, and start serving "special" content to them.. Perhaps a bit of JS which does a window.location check and redirects if it's not your own (chances are they might do some poor hostname search and replace, so you'd have to obfuscate).
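A rough sketch of the server-side half of this idea, in Python: decide per request whether it matches the crawler's signature and pick which body to serve. The User-Agent fragment and the IP range below are placeholders (the range is a reserved documentation block), and the function names are made up for illustration; real values would have to come from your own access logs.

```python
import ipaddress

# Hypothetical signature values; replace with what your access logs show.
SUSPECT_UA_FRAGMENTS = ("examplebot",)
SUSPECT_NETWORKS = (ipaddress.ip_network("203.0.113.0/24"),)  # reserved documentation range


def is_suspect_crawler(user_agent: str, remote_ip: str) -> bool:
    """Return True if the request matches the (assumed) crawler signature."""
    ua = user_agent.lower()
    if any(fragment in ua for fragment in SUSPECT_UA_FRAGMENTS):
        return True
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in SUSPECT_NETWORKS)


def choose_body(user_agent: str, remote_ip: str, normal_body: str, decoy_body: str) -> str:
    """Serve the decoy only to requests that look like the crawler."""
    if is_suspect_crawler(user_agent, remote_ip):
        return decoy_body
    return normal_body
```

Matching on either signal alone is easy to evade, so this would probably only catch a lazy scraper.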

dhimes · on Dec 20, 2018

Can you id them by ip?

th0br0 · on Dec 20, 2018

Blurring out the ticket id isn't that useful when the TXT record still exists...

matsemann · on Dec 20, 2018

I wonder why Google is so bad at picking up that these pages are clickfarms, and not instead return the original when searching.

In my native language (Norwegian), I have lately had issues with searching for stuff: I open the page linked by Google and quickly realise it's not even legible Norwegian, just auto-generated content (Google Translate of some product review, I guess?). Absolutely useless content, no idea why it ranks so high. How can they not manage to filter out stuff like this?

peteretep · on Dec 20, 2018

> I wonder why Google is so bad at picking up that these pages are clickfarms, and not instead return the original when searching.

They have a monopoly on search, why do they care?

dublo7 · on Dec 20, 2018

Good point. So if Bing started getting better at detecting duplicate articles, keeping track of duplicates per domain, and ranking them lower, that would be a big improvement over Google. Doesn't sound hard to do if you've got search-engine-sized resources.
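To make "detecting duplicates and counting them per domain" concrete, here is a toy sketch in Python using word shingles and Jaccard similarity. This is one common near-duplicate technique, not a claim about what Bing or Google actually do, and the function names, the 0.8 threshold, and the `(domain, first_seen, text)` input shape are all assumptions for illustration.

```python
from collections import defaultdict


def shingles(text: str, size: int = 3) -> set:
    """Set of overlapping word n-grams (shingles) in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def duplicates_per_domain(pages, threshold: float = 0.8) -> dict:
    """Count near-duplicate pages per domain.

    pages: iterable of (domain, first_seen, text). The copy the crawler saw
    first is treated as the original; later near-identical pages on other
    domains count against the domain hosting them.
    """
    originals = []  # (domain, shingle set) of pages treated as originals
    counts = defaultdict(int)
    for domain, _first_seen, text in sorted(pages, key=lambda p: p[1]):
        current = shingles(text)
        is_copy = any(
            other != domain and jaccard(current, earlier) >= threshold
            for other, earlier in originals
        )
        if is_copy:
            counts[domain] += 1
        else:
            originals.append((domain, current))
    return dict(counts)
```

Note that `first_seen` is meant to be the time the crawler first saw the page, not the date the page claims for itself, which is what backdating would game. The pairwise comparison is quadratic, so a real index would need something like MinHash instead.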

potatofarmer45 · on Dec 20, 2018

Saowen.com is clearly a mainland Chinese website. Assuming the copying is automatic, all you have to do is post an entry on the evils of communism with pictures of Winnie the Pooh interspersed within it. Add in a few photos of the Dalai Lama and a statement like "Xi Jinping is the biggest moron in history".

Then you report it to the Chinese authorities and the penalties for Saowen will be much much higher than a Google search takedown!

djaychela · on Dec 20, 2018

I think that's a useful walk-through for anyone coming across this that wouldn't know what to do to try to get the listing removed; with the appropriate 3 pieces of evidence it looks as if it worked well.

I wonder if it's possible to automate the process - i.e. to alert the content creators that saowen.com is stealing from and help them complete the process?

Browun · on Dec 20, 2018

I assume each page has this markup tag:

> <strong class="fn" itemprop="author">nickmchardy.com</strong>

You can then either try emailing info@ at the domain named in the author tag, or scrape that site for email addresses.

Assuming that most of these are blogs, as is the case here, hopefully there wouldn't be too many addresses on each domain. So hopefully relatively easy to do ... ?

Would be interested in pursuing this though
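A minimal sketch of the first step in Python, assuming the copied pages really do carry the source domain in an itemprop="author" element as in the tag quoted above. The class and function names are made up for illustration, and the info@ address is only a guess that the mailbox exists.

```python
import re
from html.parser import HTMLParser


class AuthorTagParser(HTMLParser):
    """Collect the text of elements marked itemprop="author"."""

    def __init__(self):
        super().__init__()
        self._in_author = False
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("itemprop") == "author":
            self._in_author = True

    def handle_data(self, data):
        if self._in_author and data.strip():
            self.authors.append(data.strip())

    def handle_endtag(self, tag):
        self._in_author = False


# Very loose check that the author text looks like a domain name.
DOMAIN_RE = re.compile(r"^[a-z0-9-]+(\.[a-z0-9-]+)+$")


def guess_contact_addresses(html: str) -> list:
    """Return info@<domain> guesses for every author tag that holds a domain."""
    parser = AuthorTagParser()
    parser.feed(html)
    parser.close()
    domains = [author.lower() for author in parser.authors]
    return ["info@" + domain for domain in domains if DOMAIN_RE.match(domain)]
```

Scraping the source site for a real contact address would be the more reliable second step, since plenty of blogs won't have an info@ mailbox.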

dazc · on Dec 20, 2018

Good idea to disable the default feed option in WordPress to avoid this kind of stuff. It won't stop a determined plagiarist but should eliminate a lot of the automated stuff?

Looks like the author has done that now.

new_here · on Dec 20, 2018

Is there not some collective action that can be taken to have Google penalise Saowen.com and other sites that engage in this kind of plagiarism?

matsemann · on Dec 20, 2018

> This article hot links images hosted at nickmchardy.com

Yeahhh, I would quickly have changed the contents of those images..

deytempo · on Dec 20, 2018

Not just that, you could redirect requests from their URL to images of whatever you want wherever you want via htaccess
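The htaccess rules themselves depend on your server setup, but the decision they encode is simple. Here it is as a Python sketch, with a hypothetical function name and placeholder file names: look at the Referer header and swap the image only when the request comes from a host on your block list.

```python
from urllib.parse import urlsplit


def image_for_request(referer, blocked_hosts, real_image, substitute_image):
    """Pick which image to serve based on the Referer header.

    referer: the raw Referer header value, or None if the client sent none.
    blocked_hosts: set of hostnames whose pages should get the substitute.
    """
    if not referer:
        # Many legitimate clients send no Referer, so serve the real image.
        return real_image
    host = (urlsplit(referer).hostname or "").lower()
    for blocked in blocked_hosts:
        # Match the blocked host itself and any of its subdomains.
        if host == blocked or host.endswith("." + blocked):
            return substitute_image
    return real_image
```

For example, `image_for_request("https://www.saowen.com/a/abc", {"saowen.com"}, "photo.png", "notice.png")` picks the substitute, while a request with no Referer still gets the real image. This only affects hot-linked images; it does nothing once they copy the files to their own server.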

jsjohnst · on Dec 20, 2018

Just be careful doing that when it’s a tabloid trash blogger who is linking to your photos. Some fights are best left un-fought.

hartator · on Dec 20, 2018

It’s SEO 101, but avoid mentioning the thief's domain name and linking to it.

cauldron · on Dec 20, 2018

These Chinese aggregators are all the same, just like Toutiao from Bytedance (that TikTok company), not long ago the world's most valuable startup.

Together with Tencent's WeChat and other me-too sites they created the "self-media" cottage industry, which basically lends credibility to laymen and lets them publish all sorts of lowbrow and sensationalistic content to earn ad money.

If you check https://www.bilibili.com/, a very popular Chinese video site (I'd say also a stomping ground for anime pedophiles which the site was built upon. US listed btw.), you can find pirated US TV shows and basically every popular YouTube video, reprocessed, edited and watermarked as their "original" content, raking in money for their uploaders. Toutiao again is just the same model.

Part of the reason why Bytedance is so valuable is that Chinese users just love these things; they are their de facto "news" source. The vast majority of people didn't receive higher education and don't want to read serious articles (which Chinese state-controlled media also lack).

Nobody cares if it's original, they just want entertaining, explosive and easy-to-consume content.

ar-jan · on Dec 20, 2018

Time to add a map of Taiwan as independent state, Senkaku Islands as Japanese, etc., as an aside to any and all articles published? ;)

cauldron · on Dec 20, 2018

Copying and plagiarism is so prevalent and pervasive that these "self-media" platforms have to introduce an "original" tag.

They will copy your content and edit it as the due process anyway, so add whatever you like and it won't deter them.

Change some words and the paragraph order, add funny pics and snippets, alter key parts to make it much more explosive and eyecatching, voila. They even offer part-time on-line jobs for this reprocessing.

akfanta · on Dec 20, 2018

> the vast majority of people didn't receive higher education and don't want to read serious articles

> Nobody cares if it's original, they just want entertaining, explosive and easy-to-consume content.

This is hardly unique to the Chinese.

paraditedc · on Dec 20, 2018

Bilibili certainly has copyright issues, but it's the same issue as with YouTube and millions of other websites with user-uploaded pirated content. I wouldn't say Bilibili is much worse than any other website.

cauldron · on Dec 20, 2018

Is this whataboutism?

Funny, I've been using YouTube for years, and the only pirated content I've encountered was a few TV documentaries and some unwatchable movies, and they seem to be deleted often.

Never saw Youtube promote and recommend any pirated content to me, unlike Bilibili.

Bilibili's front page is almost normal and all good, their shady stuff seems buried deep, my ipad is full of pirated Youtube videos with hundreds of thousands of views.

paraditedc · on Dec 20, 2018

> Is this whataboutism?

1. The topic on this thread is copyright, not bilibili or YouTube per se.

2. You brought up YouTube yourself.

I agree with the rest of your statements. YouTube is better than other websites in protecting copyright.

aaaaaaaaaab · on Dec 20, 2018

Chinese stealing IP. Color me surprised!

dang · on Dec 22, 2018

Please don't post unsubstantive comments here, and especially not nationalistic flamebait.

andybak · on Dec 20, 2018

Just for a bit of historical perspective:

https://www.ipwatchdog.com/2017/07/05/americas-industrial-re...

https://www.pri.org/stories/2014-02-18/us-complains-other-na...

https://www.realclearmarkets.com/articles/2018/07/30/ip_thef...

leereeves · on Dec 20, 2018

Are you suggesting that because people 200 years ago did something, we have to tolerate the same thing now? That's just silly.

andybak · on Dec 21, 2018

Of course I'm not. That's why I used the precise phrase I used. Key words: "historical" and "perspective".

leereeves · on Dec 21, 2018

Then how is this "historical perspective" relevant to the current discussion?

9935c101ab17a66 · on Dec 20, 2018

I mean, interesting, but after reading the first link I found this pretty good rebuttal:

> So it is odd to hear today that America stole secrets from Arkwright when Arkwright himself was claiming that he had disclosed the machines in his patent specifications to a degree sufficient to make and use them.

(Comment #5: Edward Heller, July 6, 2017, 9:43 am)

andybak · on Dec 21, 2018

Aren't patents all about disclosure? They were meant to prevent secrecy in return for a limited monopoly.