Scraping content issues
4 points by andre on May 30, 2007 | 6 comments



Anyone can legally complain if you copy their content. So they send a C&D letter and you remove whatever offended them, no harm done.

If you want to prevent bots from scraping your content, take advantage of the fact that most bots don't execute JavaScript: in your server code, render each page's content with some simple encoding that makes the text unreadable, then add a piece of JavaScript in window.onload that decodes and displays the content.
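A rough sketch of that idea, assuming a Python backend and Base64 as the "simple encoding" (both are my assumptions, and the function names `encode_text` and `render_page` are made up for illustration). Base64 is trivially reversible, so this only deters bots that never run scripts:

```python
import base64
import html


def encode_text(text: str) -> str:
    """Base64-encode UTF-8 text so it is not readable in the raw HTML."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def render_page(text: str) -> str:
    """Return an HTML page whose content is decoded by JavaScript on load.

    Hypothetical helper: the plain text never appears in the markup,
    only its encoded form in a data attribute.
    """
    payload = html.escape(encode_text(text), quote=True)
    # Doubled braces below are literal braces inside the f-string.
    return f"""<!DOCTYPE html>
<html>
<body>
<div id="content" data-payload="{payload}"></div>
<script>
window.onload = function () {{
  var el = document.getElementById("content");
  // atob() yields a byte string; decode those bytes as UTF-8 before display.
  var bytes = Uint8Array.from(atob(el.dataset.payload), function (c) {{
    return c.charCodeAt(0);
  }});
  el.textContent = new TextDecoder("utf-8").decode(bytes);
}};
</script>
</body>
</html>"""
```

A bot that only fetches the HTML sees the encoded payload; a browser (or any bot that runs the script) sees the original text.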


So they send a C&D letter and you remove whatever offended them, no harm done.

Agreed.

It's easier to ask forgiveness (or just stop scraping if they complain) than get permission.


I've always wondered: are there any bots that can do JavaScript? Is there a JS engine you can stick in your bot so it reads everything on a page just like a user would see it?


Yes. You can automate IE (http://support.microsoft.com/kb/167658) or Firefox (http://www.iol.ie/~locka/mozilla/mozilla.htm) from your code and make a bot that way.

Or use a "bot" like this which uses IE under the hood: http://search.cpan.org/dist/Win32-IE-Mechanize/lib/Win32/IE/Mechanize.pm or this one which uses Mozilla under the hood: http://search.cpan.org/~slanning/Mozilla-Mechanize-0.05/lib/Mozilla/Mechanize.pm


If a company has no terms of use or any other kind of policy on its site, what are the issues with scraping the content? Is there any way to prevent it?


You always have a copyright, even if you don't say so on the page.

They could of course add a robots.txt and stop well-behaved scrapers that way, but stopping all scraping is impossible. There's always a way. The best you can hope for is to make it so hard that nobody bothers writing a custom scraper.
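For what "well-behaved" looks like on the scraper side, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the `is_allowed` helper are all made up for illustration; a real scraper would fetch the site's actual robots.txt first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site owner might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /articles/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/about", "https://example.com/articles/1"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Nothing enforces this check, which is the point above: it only stops scrapers that choose to ask.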



