Yeah, it's sad because you have this cool middleware that can do so many things with the whole EIP theory and all the premade plugins.
What is actually eating this guy's lunch? I think there's a void there somewhere that isn't nicely filled at the moment, and some pain points get exposed by its demise. Thoughts?
Yes, that's the one! Those numbers refer to the number of papers in each torrent, so each one contains 100,000 papers, for a current total of 66+ million.
The torrents of 100,000 are broken into 1000-paper zip archives that can be downloaded individually, so it's pretty manageable if you want to just check out a random sampling of the papers.
I would love to see somebody do some kind of massive-scale analysis of the papers, but just extracting plain text from all those PDFs is a pretty herculean task, considering that many would need to be OCRed, and others come out pretty garbled/misformatted with pdftotext and the like.
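Something like the following is what I'd try as a first pass (totally untested sketch, not anything the archive ships with): run pdftotext on everything, and only push a file through ocrmypdf when almost no text comes back, which usually means a scanned PDF with no text layer. The paths, the 200-character threshold, and the function name are all just placeholders.

    # Hypothetical fallback pipeline: pdftotext first, OCR only when needed.
    import subprocess
    from pathlib import Path

    def extract_text(pdf: Path, out_dir: Path) -> Path:
        txt = out_dir / (pdf.stem + ".txt")
        # Fast path: born-digital PDFs already have an extractable text layer.
        subprocess.run(["pdftotext", "-enc", "UTF-8", str(pdf), str(txt)], check=True)
        # Heuristic: near-empty output suggests a scanned PDF, so OCR it.
        if len(txt.read_text(errors="ignore").strip()) < 200:
            ocred = out_dir / (pdf.stem + ".ocr.pdf")
            # ocrmypdf wraps tesseract and writes a new PDF with a text layer.
            subprocess.run(["ocrmypdf", "--force-ocr", str(pdf), str(ocred)], check=True)
            subprocess.run(["pdftotext", "-enc", "UTF-8", str(ocred), str(txt)], check=True)
        return txt

Even with that, you'd still be left with a long tail of garbled multi-column layouts, so it's more of a starting point than a solution.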
I thought about mirroring it; the repository DB is 200MB and simple in structure, but then you need quite a lot of HDD space on your side (20 or 200TB, maybe more, I can't recall).