
As of Aug 16, Common Crawl has 1.73bn pages. If the complementary set of URLs would benefit you, you can use their data dump as a seed.
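
If it helps, here's a minimal sketch of pulling seed URLs from Common Crawl's public CDX index instead of downloading the full dump (Python; the crawl label "CC-MAIN-2016-30", the domain, and the limit are placeholders):

    import json
    import requests

    # Query one crawl's CDX index; each response line is a JSON record.
    INDEX = "https://index.commoncrawl.org/CC-MAIN-2016-30-index"

    def seed_urls(domain, limit=100):
        resp = requests.get(INDEX, params={
            "url": domain,
            "matchType": "domain",  # every captured URL under the domain
            "output": "json",
            "limit": str(limit),
        })
        resp.raise_for_status()
        return [json.loads(line)["url"] for line in resp.text.splitlines()]

    for url in seed_urls("example.com"):
        print(url)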

If your index's metadata (such as Last-Modified) is small enough to upload to AWS, you can also reduce your re-crawl effort when they publish a fresh release.
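
Roughly, the skip-recrawl check could look like this (a sketch, assuming your index maps each URL to the naive UTC datetime of your last fetch; `my_index` and `needs_recrawl` are made-up names, but the 14-digit "timestamp" field is what the CDX records above carry):

    from datetime import datetime

    def needs_recrawl(my_index, cc_records):
        """Return URLs whose newest Common Crawl capture postdates our copy."""
        stale = []
        for rec in cc_records:
            captured = datetime.strptime(rec["timestamp"], "%Y%m%d%H%M%S")
            seen = my_index.get(rec["url"])
            if seen is None or captured > seen:
                stale.append(rec["url"])
        return stale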



It doesn't have to be small to donate to Common Crawl; they have a free S3 bucket.



