Thanks for posting @mieubrisse! I haven't posted it on HN myself in a long time but I just released ArchiveBox v0.7.2 a couple days ago, so it's great timing.
I encourage people to also check out the list of ArchiveBox alternatives we maintain if ArchiveBox doesn't quite fit your needs.
Can I just say how amazingly awesome it is that you and the project maintain a list of alternatives? More projects need to do this kind of thing to encourage cross-pollination of effort.
You're welcome! I stumbled across it by coincidence today and had an epiphany, "Whoa, forget DeFi - it never occurred to me how important decentralized archiving is. What if archive.org goes down? How would we rebuild?" Thanks for making this!
Love this. That being said, I tried a bunch of these and landed on Shiori; my take was that ArchiveBox is great if you definitely want options and want to be comprehensive, but if you're mostly just going for the articles and text and want something simpler, this is it. (I teach at a college and don't want to lose good articles, and it also gives me some nice uniform formatting.)
I spun up my own ArchiveBox after archive.org wouldn't let me archive some news stories and I heard about them removing other content. Instead of calling the Internet Archive the Wayback Machine, I now call it the maybe back machine. IA is a centralized service and subject to the government and other powerful pressures that any popular centralized service faces. If you want to archive something that people in power might, now or in the future, want erased, you should decentralize it to somewhere like an ArchiveBox. This is especially useful if you are writing a book with many citations.
As the ArchiveBox creator I give a good chunk of the ArchiveBox donations I get to Archive.org, and I talk with them a few times a year to share knowledge. I think both centralized and decentralized approaches have their place; neither one can cover every use-case or doomsday scenario fully.
ArchiveBox also saves URLs it ingests to Archive.org by default for this reason!
Are you assuming "people in power" were tied to those situations, though? Specifically, did you check if they were following a robots.txt? I have some criticisms for how they handle robots.txt, but if that's the root cause then it paints a very different picture.
I recall trying to archive a story on a Des Moines local news site. It was publicly available and searchable. I understand people can request of IA not to allow their content to be stored, and there are situations where content is removed on request. Beyond that, it is all opaque to me what goes on there.
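For anyone wanting to test the robots.txt question raised above, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt body, the `news.example.com` URL, and the choice of `ia_archiver` as the agent name to test are all assumptions for illustration; a real check would fetch the site's actual file, and it says nothing about removals made on request.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body (an assumption for this sketch); a real check
# would download the file from the news site in question.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://news.example.com/local/story.html"  # made-up URL
    for agent in ("ia_archiver", "SomeOtherBot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, story))
```

If the crawler's agent is disallowed while other agents are not, that points to a site-level opt-out rather than outside pressure.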
This is uncanny, I just discovered ArchiveBox earlier today and set up a self-hosted instance on some home hardware for a collection of bookmarks of useful guides, tutorials, and references I've collected over the years.
Setting it up on K8s with sonic [1] as the search backend and importing a few hundred URLs only took ~an hour or so, and the cached pages look great for the most part.
I looked at ArchiveBox and several similar projects a while ago, but realised I didn't want anything so complex. I just wanted bookmarks, with free-text content search so I could find something again based on more than just a title.
Emphasis on tiny system requirements and dependencies (single binary, no service dependencies). As a consequence the text indexing is very basic (basic HTML scrape). But it's working for me :-)
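To make "basic HTML scrape" plus free-text search concrete, here is a rough sketch of the general idea using only the Python standard library. It is not the commenter's tool; `TextExtractor`, `BookmarkIndex`, and the sample page are hypothetical names and data made up for this example.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_text(html):
    """Return the text content of an HTML string as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BookmarkIndex:
    """In-memory inverted index mapping each word to the URLs containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, url, html):
        for word in tokenize(extract_text(html)):
            self._postings[word].add(url)

    def search(self, query):
        """Return the URLs whose text contains every word in the query."""
        words = tokenize(query)
        if not words:
            return set()
        results = set(self._postings.get(words[0], set()))
        for word in words[1:]:
            results &= self._postings.get(word, set())
        return results


if __name__ == "__main__":
    index = BookmarkIndex()
    index.add(
        "https://example.com/a",
        "<html><head><title>Backup guide</title><script>var secret = 1;</script>"
        "</head><body><p>Use rsync for backups.</p></body></html>",
    )
    print(index.search("rsync guide"))
```

A search here matches on page text (including the title), not just the bookmark title, which is the property the comment above is after. There is no ranking or stemming, in keeping with the "very basic" trade-off.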
Yeah sorry this is on my list of things to fix, just haven't gotten around to it.
It's annoying because the site is autogenerated from the README markdown and it's tricky to add custom CSS without increasing build process complexity a bunch. PRs welcome!
I researched various archiving alternatives for something I needed recently. I subscribe to a paid Substack for an educational course that will end mid-year, and I want to archive the course posts before it ends (the course provider has even recommended people end their Substack subscription after it ends).
For this purpose, I found the SingleFile browser extension to be the best fit. It's a browser extension, so paywall cookies are already present, and I just manually archive the previous week's content, after the discussion phase has concluded. It creates a single self-contained file with all images and comments, etc., but all non-page-local links still resolve externally (which is as-desired, for my use case). It can be configured to auto-generate a convenient filename, and to use self-extracting compression.
I preferred this to an automated process based on, e.g., RSS, because I can ensure the archive occurs after all the useful back-and-forth in the course comments has concluded, and it's trivial to set up and use.
SingleFile is amazing. I also recommend ArchiveWeb.page / browsertrix. Both projects truly do more to solve the hard problems of internet archiving than ArchiveBox (which is just a wrapper + admin UI for a collection of tools).
ArchiveBox actually uses SingleFile internally as one of our methods to save every page (among others), and we try to send a portion of our donations periodically to @gildas-lormeau to support his awesome work on it!
I also use browser extensions to save a replica of certain pages: SingleFile, plus FireShot and/or GoFullPage (I use the paid option on both of those). I like the SingleFile extension because it can be configured to save pages automatically. Videos are recordable with Camtasia (paid), but there are free options...
SingleFile is so good. I am upset that Firefox can't screenshot correctly by itself, again. I used to run a URL-to-image service for both archival and sharing that was dead simple: just fetch it with Firefox headless and take a screenshot. The floating footers on a lot of sites, as well as some adware, interfere with Firefox screenshots now, so I just stopped backing up pages. SingleFile is getting a lot of use since I found out about it.
My primary concern about ArchiveBox (and the WARC stuff) is the TB of existing archival stuff I already have.
I serve the output of SingleFile on my home network. It generates html, so I just push it to my file store. That said, my use-case (archiving a paid Substack course that is well worth paying for) is definitely only for personal use.
I also came across ArchiveBox a few days ago to see if I should migrate off my home-grown solution with Puppeteer, SingleFile & readability.js.
I've been working on getting it deployed to fly.io with LSVD so it can scale to zero while storing everything on an S3-backed volume as described here[0].
My biggest disappointment so far is that it seems like a fairly large lift to make uBlock Origin work because extensions don't work in headless Chrome (?). It seems like using Pi-hole is the current best method to block ads [1].
ArchiveBox doesn't really phone home for any reason, so unfortunately I don't have good analytics to know how many real users there are (I'm the ArchiveBox creator). We also set noindex/nofollow on public snapshot content by default, so unfortunately we can't search the web to find public ArchiveBox instances either (for good reason, otherwise users would immediately get tons of automated DMCA notices / copyright trouble).
We do add `ArchiveBox/v0.x.x` to the user agent for all requests by default + push URLs to Archive.org. So in theory someone at Archive.org could look in their server logs for that string and get a pretty good idea of the daily activity (at least for users with default settings). I've asked them a few times in person to run that search but never gotten a follow-up. They're probably very busy and it's just for curiosity, but it would be nice to know someday!
I may add an opt-in federation option at some point in the far future. It would be great to figure out a way to link willing donors' ArchiveBox instances together for public benefit.
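The server-log search mentioned above could look roughly like this. It is only a sketch: it assumes access logs in the common "combined" format (timestamp in brackets, user agent as the last quoted field), and the sample log lines, IP addresses, and request paths are made up. Nothing here reflects how Archive.org actually stores or queries its logs.

```python
import re
from collections import Counter

# Assumption: "combined" log format, with the day inside [..] and the
# user agent as the final quoted field on the line.
LINE_RE = re.compile(
    r'\[(?P<day>\d{2}/\w{3}/\d{4}):[^\]]*\].*"(?P<agent>[^"]*)"\s*$'
)
# Matches the ArchiveBox/v0.x.x marker described in the comment above.
AGENT_RE = re.compile(r"ArchiveBox/v\d+\.\d+\.\d+")


def daily_archivebox_hits(lines):
    """Count requests per day whose user agent contains ArchiveBox/v<version>."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and AGENT_RE.search(match.group("agent")):
            counts[match.group("day")] += 1
    return counts


# Fabricated sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Jan/2024:08:01:02 +0000] "GET /x HTTP/1.1" 200 512 "-" "Mozilla/5.0 ArchiveBox/v0.7.2"',
    '203.0.113.9 - - [10/Jan/2024:09:15:40 +0000] "GET /y HTTP/1.1" 200 128 "-" "curl/8.4.0"',
    '198.51.100.7 - - [10/Jan/2024:11:30:00 +0000] "GET /z HTTP/1.1" 200 256 "-" "Mozilla/5.0 ArchiveBox/v0.6.2"',
    '198.51.100.7 - - [11/Jan/2024:07:00:00 +0000] "GET /z HTTP/1.1" 200 256 "-" "Mozilla/5.0 ArchiveBox/v0.7.2"',
]

if __name__ == "__main__":
    for day, hits in sorted(daily_archivebox_hits(SAMPLE_LOG).items()):
        print(day, hits)
```

As noted above, this would only count users who kept the default user agent, so it gives a lower bound on activity rather than a user count.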
For anyone who uses Chrome and wants to view their archived pages in the browser as if they were still online (URL and everything intact), and also full-text search through their browsing history that was archived (like AB plans to add in future, I think, right nikki?) you can check out DownloadNet: https://github.com/dosyago/DownloadNet
You can have multiple archives, and even use a mode where you only archive pages you bookmark rather than everything.
Over the last year I've been working on an open source Golang tool with a more modest approach for now (just command line) but a similar goal (keeping personal info). In my tool, formats are described using simple YAML templates and stored in a SQLite db file (https://github.com/khromalabs/keeper). Glad to know about more open source tools exploring similar ideas.
ArchiveBox is a great bit of kit and I've been using it for a while. I'm currently ingesting my browser bookmarks from Nextcloud Bookmarks (using floccus sync from my browser) via RSS. That said, even though its archiving features are poorer, I've been looking into using Linkwarden for the partner approval factor and better integration with my SSO setup.
Note that you no longer need to create a user manually though, so this shouldn't be an issue anymore. Just set the ADMIN_USERNAME and ADMIN_PASSWORD env vars and it'll autocreate the user and collection on first run.
This is awesome, I couldn't identify from the readme how you tell it what to save and was wondering whether this could be driven by a Browser add-on/extension?
That's cool. I've been using ArchiveBox together with other tools to achieve this. It may be cool to get some integrations too, rather than yet another from-scratch RAG.
I encourage people to also check out the list of ArchiveBox alternatives we maintain if ArchiveBox doesn't quite fit your needs.
https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...