Scraping content issues

snorkel · on May 30, 2007

Anyone can legally complain if you copy their content. So they send a C&D letter and you remove whatever offended them, no harm done.

If you want to prevent bots from scraping your content, take advantage of the fact that most bots don't execute JavaScript: in your server code, render the content of each page with some simple encoding that makes the text unreadable, then add a piece of JavaScript in a window.onload handler that decodes and displays the content.
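A rough sketch of that idea in Python, assuming base64 as the "simple encoding" (the comment above does not name one). The names `render_page`, `encode_content`, `decode_content`, and the `data-payload` attribute are made up for illustration. The server emits only the encoded payload, and the embedded script decodes it in the browser once the page loads.

```python
import base64
import html

# Hypothetical page template: the readable text never appears in the markup,
# only its base64 form inside a data attribute.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>{title}</title></head>
<body>
<div id="content" data-payload="{payload}"></div>
<script>
// Runs only in clients that execute JavaScript.
window.onload = function () {{
  var el = document.getElementById("content");
  var raw = atob(el.getAttribute("data-payload"));
  var bytes = Uint8Array.from(raw, function (c) {{ return c.charCodeAt(0); }});
  el.textContent = new TextDecoder("utf-8").decode(bytes);
}};
</script>
</body>
</html>
"""


def encode_content(text: str) -> str:
    """Encode page text as base64 (an assumed choice of encoding)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_content(payload: str) -> str:
    """Inverse of encode_content, mirroring what the browser script does."""
    return base64.b64decode(payload).decode("utf-8")


def render_page(title: str, text: str) -> str:
    """Return HTML whose body text is present only in encoded form."""
    return PAGE_TEMPLATE.format(
        title=html.escape(title),
        payload=encode_content(text),
    )


if __name__ == "__main__":
    print(render_page("Example", "This paragraph is hidden from simple bots."))
```

This is obfuscation, not protection: any scraper that knows the scheme can call the equivalent of `decode_content` itself, and a bot that runs JavaScript sees the decoded text directly.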

dpapathanasiou · on May 30, 2007

So they send a C&D letter and you remove whatever offended them, no harm done.

Agreed.

It's easier to ask forgiveness (or just stop scraping if they complain) than get permission.

tocomment · on May 30, 2007

I've always wondered: are there any bots that can run JavaScript? Is there a JS engine you can stick in your bot so it reads everything on a page just as a user would see it?

lupin_sansei · on May 31, 2007

Yes. You can either automate IE http://support.microsoft.com/kb/167658 (or Firefox http://www.iol.ie/~locka/mozilla/mozilla.htm ) from your code and make a bot that way.

Or use a "bot" like this which uses IE under the hood: http://search.cpan.org/dist/Win32-IE-Mechanize/lib/Win32/IE/Mechanize.pm or this one which uses Mozilla under the hood: http://search.cpan.org/~slanning/Mozilla-Mechanize-0.05/lib/Mozilla/Mechanize.pm

andre · on May 30, 2007

If a company has no terms of use or any other kind of policy on their site, what are the issues in scraping the content? Is there any way to prevent it?

ks · on May 30, 2007

You always have a copyright, even if you don't say so on the page.

They could of course add a robots.txt and stop nicely behaved scrapers that way, but stopping all scraping is impossible. There's always a way. The best you can hope for is to make it so hard that they don't bother creating a custom-made scraper.
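For reference, here is roughly what a "nicely behaved" scraper does with robots.txt, sketched with Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT`, the `ExampleBot` user agent, the example.com URLs, and the `is_allowed` helper are all assumptions for illustration. The check is voluntary, which is why it only stops scrapers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /articles/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/articles/1", "https://example.com/about"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

A polite scraper would run a check like this before each request and skip disallowed URLs; nothing on the server side enforces it.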