Navigating the WARC file format (commoncrawl.org)
40 points by zbowling on April 2, 2014 | 2 comments


Wait, Common Crawl is part of Automattic? I had no idea!


It's not. The example shows them crawling a page from 102jamzorlando.cbslocal.com, which is hosted by WordPress.com. Apparently Automattic inserts that recruitment ad into the headers of every site they host. (At least, all the ones I've checked so far.)



