
would love to have even smaller subsets (like 5gb) that students can casually play around with too to practice and learn tools and algos :) (if it's not too much trouble!)



You can fetch a single WARC file directly, for example:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/warc/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.gz

They are around 850 MB each.
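
If you just want to poke at one, something like this should do it (a sketch, assuming s3cmd is installed and configured; the file lands in the current directory under its own name):

  # download a single ~850 MB WARC file
  s3cmd get s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/warc/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.gz
  # WARC files are gzipped record streams, so you can peek at the first records without unpacking everything
  zcat CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.gz | head -n 40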

The text extracts and metadata files are generated from the individual WARC files, so it is pretty easy to get the corresponding sets of files. For the WARC above, they would be:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/wat/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.wat.gz

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/wet/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.wet.gz
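
In other words, the WAT/WET paths can be derived mechanically from the WARC path; a small sketch of that substitution (nothing assumed beyond the naming shown above):

  WARC=s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/warc/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.gz
  # metadata (WAT): swap the /warc/ directory for /wat/ and add the .wat suffix
  echo "$WARC" | sed -e 's|/warc/|/wat/|' -e 's|\.warc\.gz$|.warc.wat.gz|'
  # text extract (WET): same idea with /wet/
  echo "$WARC" | sed -e 's|/warc/|/wet/|' -e 's|\.warc\.gz$|.warc.wet.gz|'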


Is there any way to get incrementals? It would be extremely valuable to get the pages that were added, changed, or deleted each day; some kind of daily feed of a more limited size.


  s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/
That should get you about 90% of the way there.
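
The rest is just listing one segment's WARC directory and pulling down however many files you want, roughly like this (untested sketch; the awk column assumes the usual s3cmd ls output of date, time, size, URI):

  # fetch the first few WARC files from one segment as a small sample set
  s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/warc/ \
    | awk '{print $4}' | head -n 5 \
    | while read -r f; do s3cmd get "$f"; done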


Totally true. Smaller versions aren't just helpful for casual/student use; they also help with code development and debugging. Otherwise, algorithm development gets bogged down in scaling issues.


Also interesting for machine learning, where you can use it as a background collection.



