Would love to have even smaller subsets (like 5 GB) that students can casually play around with too, to practice and learn tools and algorithms :) (if it's not too much trouble!)
The text extracts and metadata files are generated off individual WARC files, so it is pretty easy to get the corresponding sets of files. For the above it would be:
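As a rough sketch of that one-to-one correspondence: assuming the layout where each segment has sibling `warc/`, `wat/` (metadata) and `wet/` (text extract) directories, and the derived files reuse the WARC file name with `.warc.wat.gz` / `.warc.wet.gz` endings, the companion paths can be derived by string manipulation alone. The helper name `companion_paths` and the sample path are made up for illustration; check the naming against the crawl you are actually using.

```python
def companion_paths(warc_path: str) -> dict[str, str]:
    """Derive the assumed WAT/WET paths that correspond to one WARC file.

    Assumption: derived files live in sibling 'wat/' and 'wet/' directories
    and keep the WARC file's base name.
    """
    if "/warc/" not in warc_path or not warc_path.endswith(".warc.gz"):
        raise ValueError(f"not a recognised WARC path: {warc_path!r}")

    # Split at the last '/warc/' directory component.
    head, _, name = warc_path.rpartition("/warc/")
    stem = name[: -len(".warc.gz")]

    return {
        "warc": warc_path,
        "wat": f"{head}/wat/{stem}.warc.wat.gz",  # metadata
        "wet": f"{head}/wet/{stem}.warc.wet.gz",  # text extract
    }


# Hypothetical path, only to show the shape of the mapping.
example = "crawl-data/CC-MAIN-EXAMPLE/segments/0000000000000/warc/example-00000.warc.gz"
for kind, path in companion_paths(example).items():
    print(kind, path)
```

Picking a handful of WARC paths and mapping each one this way gives a small, self-consistent subset of all three file types.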
Is there any way to get incrementals? It would be extremely valuable to get the pages that were added/changed/deleted each day, as some kind of daily feed of a more limited size.
Totally true. Smaller versions are not just helpful for casual/student use; they also help with code development and debugging. Otherwise, algorithm development gets impeded by scaling issues.