
would love to have even smaller subsets (like 5gb) that students can casually play around with too to practice and learn tools and algos :) (if it's not too much trouble!)



You can fetch a single WARC file directly, for example:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/warc/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.gz

They are around 850 MB each.
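
If you just want to poke at one, something like this should do it (a sketch, assuming s3cmd is installed and configured; the file lands in the current directory under its own name):

  # download a single ~850 MB WARC file
  s3cmd get s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/warc/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.gz
  # WARC files are gzipped record streams, so you can peek at the first records without unpacking everything
  zcat CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.gz | head -n 40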

The text extracts and metadata files are generated from the individual WARC files, so it is pretty easy to get the corresponding sets of files. For the WARC above, they would be:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/wat/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.wat.gz

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/wet/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.wet.gz
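
In other words, the WAT/WET paths can be derived mechanically from the WARC path; a small sketch of that substitution (nothing assumed beyond the naming shown above):

  WARC=s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/warc/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.gz
  # metadata (WAT): swap the /warc/ directory for /wat/ and add the .wat suffix
  echo "$WARC" | sed -e 's|/warc/|/wat/|' -e 's|\.warc\.gz$|.warc.wat.gz|'
  # text extract (WET): same idea with /wet/
  echo "$WARC" | sed -e 's|/warc/|/wet/|' -e 's|\.warc\.gz$|.warc.wet.gz|'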


Is there any way to get incrementals? It would be extremely valuable to get the pages that were added, changed, or deleted each day; some kind of daily feed of a more limited size.


  s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/
That should get you about 90% of the way there.
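
The rest is just listing one segment's WARC directory and pulling down however many files you want, roughly like this (untested sketch; the awk column assumes the usual s3cmd ls output of date, time, size, URI):

  # fetch the first few WARC files from one segment as a small sample set
  s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/warc/ \
    | awk '{print $4}' | head -n 5 \
    | while read -r f; do s3cmd get "$f"; done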


Totally true. Smaller versions aren't just helpful for casual/student use; they also help with code development and debugging. Otherwise, algorithm development gets bogged down in scaling issues.


Also interesting for machine learning, where you can use it as a background collection.



