
That page is a bit outdated because I am still fine-tuning the on-site archive system before I do a writeup.

I still use archiver-bot etc.; they're just not how I do the on-site archives. See https://github.com/gwern/gwern.net/blob/master/build/LinkArc... https://github.com/gwern/gwern.net/blob/master/build/linkArc... for that.

The quick summary: PDFs are automatically downloaded, hosted locally, and the links rewritten to point at the local PDF; for other URLs, after a delay, the CLI version of https://github.com/gildas-lormeau/SingleFile runs headless Chrome to dump a snapshot, which I manually review & improve as necessary, and then the links get rewritten to the snapshot HTML. The archives get some no-crawl HTTP headers and robots.txt exclusions to try to reduce copyright trouble.
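For concreteness, here is a minimal sketch of that flow in Python. This is not the actual implementation (which is the Haskell LinkArchive code linked above); the directory name, hashing scheme, and helper function are made up for illustration, and the single-file invocation assumes the CLI's documented URL-plus-output-file usage from its README.

    #!/usr/bin/env python3
    """Minimal sketch of the archiving flow described above (not gwern's code).

    Assumptions: ARCHIVE_DIR, the hash-based filenames, and archive() are
    hypothetical; the `single-file` CLI (npm i -g single-file-cli) is on PATH
    and drives headless Chrome itself when given a URL and an output path.
    """
    import hashlib
    import pathlib
    import subprocess
    import urllib.request

    ARCHIVE_DIR = pathlib.Path("archive")  # hypothetical local mirror directory
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)

    def archive(url: str) -> pathlib.Path:
        """Mirror `url` locally; return the path the link should be rewritten to."""
        name = hashlib.sha1(url.encode()).hexdigest()
        if url.lower().endswith(".pdf"):
            # PDFs: download immediately and host locally.
            out = ARCHIVE_DIR / f"{name}.pdf"
            urllib.request.urlretrieve(url, out)
        else:
            # Other URLs: snapshot with SingleFile via headless Chrome.
            # In the real pipeline this happens after a delay, and the snapshot
            # is reviewed & improved by hand before the link is rewritten.
            out = ARCHIVE_DIR / f"{name}.html"
            subprocess.run(["single-file", url, str(out)], check=True)
        # The directory serving these files would also get no-crawl HTTP headers
        # (e.g. X-Robots-Tag) and a robots.txt Disallow to reduce copyright trouble.
        return out

Because the review step is manual, a real version would queue snapshots for inspection rather than rewriting links immediately after the dump.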



THANK YOU for scratching that itch.



