ArchiveBox: Open-source self-hosted web archiving

nikisweeting · on Jan 11, 2024

Thanks for posting @mieubrisse! I haven't posted it on HN myself in a long time but I just released ArchiveBox v0.7.2 a couple days ago, so it's great timing.

I encourage people to also check out the list of ArchiveBox alternatives we maintain if ArchiveBox doesn't quite fit your needs.

https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...

dugite-code · on Jan 12, 2024

Can I just say how amazingly awesome it is that you and the project maintain a list of alternatives? More projects need to do this kind of thing to encourage cross-pollination of effort.

j45 · on Jan 12, 2024

This is one of the most comprehensive lists I've ever seen.

mieubrisse · on Jan 12, 2024

You're welcome! I stumbled across it by coincidence today and had an epiphany, "Whoa, forget DeFi - it never occurred to me how important decentralized archiving is. What if archive.org goes down? How would we rebuild?" Thanks for making this!

antman · on Jan 13, 2024

Do any of these support highlighted webpages? (Except hypothes.is, which is super unfriendly for self-hosting.)

nikisweeting · on Jan 15, 2024

Unfortunately no fully self-hostable options that are fleshed out besides Hypothesis as far as I know. A lot of the paid options have it though.

Let me know if you find a good option later on and I'll add it to the list.

jrm4 · on Jan 12, 2024

Love this. That being said, I tried a bunch of these and landed on Shiori; my take was that ArchiveBox is great if you definitely want options and want to be comprehensive, but if you're mostly just going for the articles and text and want something simpler, this is it. (I teach at a college and don't want to lose good articles, and it also gives me some nice uniform formatting.)

https://github.com/go-shiori/shiori

kornhole · on Jan 11, 2024

I spun up my own Archivebox after archive.org wouldn't let me archive some news stories and I heard about them removing other content. Instead of calling the Internet Archive the wayback machine, I now call it the maybe back machine. IA is a centralized service and subject to the government and other powerful pressures any centralized popular service faces. If you want to archive something that people in power might want erased, now or in the future, you should decentralize it to somewhere like an archivebox. This is especially useful if you are writing a book with many citations.

nikisweeting · on Jan 11, 2024

As the ArchiveBox creator I give a good chunk of the ArchiveBox donations I get to Archive.org, and I talk with them a few times a year to share knowledge. I think both centralized and decentralized approaches have their place, neither one can cover every use-case or doomsday scenario fully.

ArchiveBox also saves URLs it ingests to Archive.org by default for this reason!

Dylan16807 · on Jan 11, 2024

Are you assuming "people in power" were tied to those situations, though? Specifically, did you check if they were following a robots.txt? I have some criticisms for how they handle robots.txt, but if that's the root cause then it paints a very different picture.

kornhole · on Jan 12, 2024

I recall trying to archive a story on a Des Moines local news site. It was publicly available and searchable. I understand people can request of IA not to allow their content to be stored, and there are situations where content is removed on request. Beyond that, it is all opaque to me what goes on there.

abound · on Jan 12, 2024

This is uncanny, I just discovered ArchiveBox earlier today and set up a self-hosted instance on some home hardware for a collection of bookmarks of useful guides, tutorials, and references I've collected over the years.

Setting it up on K8s with sonic [1] as the search backend and importing a few hundred URLs only took ~an hour or so, and the cached pages look great for the most part.

[1] https://github.com/valeriansaliou/sonic

tardisx · on Jan 12, 2024

I looked at ArchiveBox and several similar projects a while ago, but realised I didn't want anything so complex. I just wanted bookmarks, with free-text content search so I could find something again based on more than just a title.

So I wrote my own: https://github.com/tardisx/linkwallet

Emphasis on tiny system requirements and dependencies (single binary, no service dependencies). As a consequence the text indexing is very basic (basic HTML scrape). But it's working for me :-)
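
For readers wondering what a "basic HTML scrape" for free-text search might look like, here is a minimal Python sketch using only the standard library. It is not linkwallet's actual code; the names `TextScraper` and `extract_text` are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class TextScraper(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the page's visible text as one space-separated string."""
    scraper = TextScraper()
    scraper.feed(html)
    scraper.close()  # flush any buffered text
    return " ".join(scraper.chunks)
```

The extracted string could then be stored next to the bookmark and matched with a plain substring or token search, which keeps the whole thing free of external services.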

qingcharles · on Jan 14, 2024

This looks perfect for my needs, thank you :)

parasti · on Jan 13, 2024

The screenshot section single-handedly breaks mobile UX due to overflow.

nikisweeting · on Jan 15, 2024

Yeah sorry this is on my list of things to fix, just haven't gotten around to it.

It's annoying because the site is autogenerated from the README markdown and it's tricky to add custom CSS without increasing build process complexity a bunch. PRs welcome!

dundarious · on Jan 11, 2024

I researched various archiving alternatives for something I needed recently. I subscribe to a paid Substack for an educational course that will end mid-year, and I want to archive the course posts before it ends (the course provider has even recommended people end their Substack subscription after it ends).

For this purpose, I found the SingleFile browser extension to be the best fit. It's a browser extension, so paywall cookies are already present, and I just manually archive the previous week's content, after the discussion phase has concluded. It creates a single self-contained file with all images and comments, etc., but all non-page-local links still resolve externally (which is as-desired, for my use case). It can be configured to auto-generate a convenient filename, and to use self-extracting compression.

I preferred this to an automated process based on, e.g., RSS, because I can ensure the archive occurs after all the useful course comments back-and-forth has concluded, and it's trivial to set up and use.

nikisweeting · on Jan 11, 2024

SingleFile is amazing. I also recommend ArchiveWeb.page / browsertrix. Both projects truly do more to solve the hard problems of internet archiving than ArchiveBox (which is just a wrapper + admin UI for a collection of tools).

ArchiveBox actually uses SingleFile internally as one of our methods to save every page (among others), and we try to send a portion of our donations periodically to @gildas-lormeau to support his awesome work on it!

asdefghyk · on Jan 11, 2024

I also use some browser extensions to save a replica of certain pages: SingleFile, FireShot, and/or GoFullPage (I use the paid option on the latter two). I like the SingleFile extension because it can be configured to save pages automatically. Videos are recordable with Camtasia (paid), but there are free options...

genewitch · on Jan 12, 2024

singlefile is so good i am upset that firefox can't screenshot correctly by itself, again. I used to run a URL to image service for both archival and sharing that was dead simple - just fetch it with firefox headless and take a screenshot. The floating footers on a lot of sites, as well as some adware interfere with firefox screenshots now, so i just stopped backing up pages. Singlefile is getting a lot of use since i found out about it.

My primary concern about archivebox (and the WARC stuff) is the TB of existing archival stuff i already have.

kornhole · on Jan 11, 2024

That is a great solution for local copies. Archivebox is on a web server to make the archives available to anyone on the internet.

dundarious · on Jan 12, 2024

I serve the output of SingleFile on my home network. It generates html, so I just push it to my file store. That said, my use-case (archiving a paid Substack course that is well worth paying for) is definitely only for personal use.

dtkav · on Jan 12, 2024

I also came across ArchiveBox a few days ago to see if I should migrate off my home-grown solution with Puppeteer, SingleFile & readability.js.

I've been working on getting it deployed to fly.io with LSVD so it can scale to zero while storing everything on an S3-backed volume as described here[0].

My biggest disappointment so far is that it seems like a fairly large lift to make ublock origin work because extensions don't work in headless chrome (?). It seems like using pihole is the current best method to block ads [1].

[0] https://community.fly.io/t/bottomless-s3-backed-volumes/1564... [1] https://github.com/ArchiveBox/ArchiveBox/issues/211

nikisweeting · on Jan 15, 2024

Extensions do work in headless chrome! It's tricky but it should work by adding these to the default CHROME_ARGS:

`--disable-extensions-except=/path/to/your/extension/` `--load-extension=/path/to/your/extension/`
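
As a rough illustration of the two flags above, here is a small Python sketch that appends them to an existing list of Chrome arguments. The helper name `with_extension` and the base list `["--headless"]` are placeholders; how ArchiveBox itself parses `CHROME_ARGS` is not covered here.

```python
def with_extension(chrome_args, extension_dir):
    """Return a copy of chrome_args with the two extension flags appended."""
    return [
        *chrome_args,
        # Disable every extension except the one we want to load.
        f"--disable-extensions-except={extension_dir}",
        # Load the unpacked extension from this directory.
        f"--load-extension={extension_dir}",
    ]


# Placeholder base arguments; substitute whatever defaults you already use.
args = with_extension(["--headless"], "/path/to/your/extension/")
```

Both flags point at the same unpacked extension directory, which is why the helper takes a single path.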

loceng · on Jan 11, 2024

Are there any figures available anywhere as to how many people actively-passively maintain a personal-private archive?

nikisweeting · on Jan 11, 2024

ArchiveBox doesn't really phone home for any reason, so unfortunately I don't have good analytics to know how many real users there are (I'm the ArchiveBox creator). We also set noindex/nofollow on public snapshot content by default, so unfortunately we can't search the web to find public ArchiveBox instances either (for good reason, otherwise users would immediately get tons of automated DMCA notices / copyright trouble).

We do add `ArchiveBox/v0.x.x` to the user agent for all requests by default + push URLs to Archive.org. So in theory someone at Archive.org could look in their server logs for that string and get a pretty good idea of the daily activity (at least for users with default settings). I've asked them a few times in person to run that search but never gotten a follow-up. They're probably very busy and it's just for curiosity, but it would be nice to know someday!
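
The log search described above could be sketched roughly like this in Python. It assumes the user agent appears somewhere in each log line as `ArchiveBox/v<version>` (the "v" is treated as optional); the log format, the pattern, and the function name `count_versions` are assumptions for illustration, not anything Archive.org actually runs.

```python
import re
from collections import Counter

# Assumed shape of the user-agent token: "ArchiveBox/v0.7.2" or "ArchiveBox/0.7.2".
UA_PATTERN = re.compile(r"ArchiveBox/v?(\d+(?:\.\d+)*)")


def count_versions(log_lines):
    """Count requests per ArchiveBox version found in an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Grouping the same matches by date instead of version would give the daily activity figure, at least for instances running with default settings.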

The only other metrics we have as of 2024/01:

- ~5m Docker Hub image pulls

- ~17k Github Stars

- ~1k issues and PRs, 100+ contributors

- ~1k browser extension users

nikisweeting · on Jan 12, 2024

I may add an opt-in federation option at some point in the far future. It would be great to figure out a way to link willing donors' ArchiveBox instances together for public benefit.

Follow here for progress: https://github.com/ArchiveBox/ArchiveBox/issues/50

keepamovin · on Jan 12, 2024

For anyone who uses Chrome and wants to view their archived pages in the browser as if they were still online (URL and everything intact), and also full-text search through their browsing history that was archived (like AB plans to add in future, I think, right nikki?) you can check out DownloadNet: https://github.com/dosyago/DownloadNet

You can have multiple archives, and even use a mode where you only archive pages you bookmark rather than everything.

rgomez · on Jan 12, 2024

For the past year I've been working on an open source tool in Golang with a more modest approach for now (just command line) but a similar goal (keeping personal info). In my tool, formats are described using simple YAML templates and stored in a sqlite db file (https://github.com/khromalabs/keeper). Glad to know about more open source tools exploring similar ideas.

dugite-code · on Jan 12, 2024

ArchiveBox is a great bit of kit and I've been using it for a while. I'm currently ingesting my browser bookmarks from Nextcloud bookmarks (using floccus sync from my browser) via RSS. That said, even though its archiving features are poorer, I've been looking into using linkwarden for the partner approval factor and better integration with my SSO setup.

A4ET8a8uTh0 · on Jan 12, 2024

For those who want to test in unraid and run into root issue after initial setup:

https://3xn.nl/projects/category/unraid/

First time user, but it's one of those things I did not know I wanted.

nikisweeting · on Jan 12, 2024

Direct link: https://3xn.nl/projects/2022/02/17/archivebox-root-issue-in-...

Note you no longer need to create a user manually though, so this shouldn't be an issue anymore. Just set the ADMIN_USERNAME and ADMIN_PASSWORD env vars and it'll autocreate the user and collection on first run.

https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#...
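
A minimal Python sketch of preparing those two variables before launching ArchiveBox, assuming you start it from a script. Only the names `ADMIN_USERNAME` and `ADMIN_PASSWORD` come from the comment above; the helper `admin_env` and the example values are hypothetical.

```python
import os


def admin_env(username, password, base_env=None):
    """Return an environment mapping with the admin credentials set."""
    if not username or not password:
        raise ValueError("both username and password are required")
    # Copy the base environment so the caller's mapping is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["ADMIN_USERNAME"] = username
    env["ADMIN_PASSWORD"] = password
    return env
```

The resulting mapping can be passed as `env=` to `subprocess.run` for whatever command you use to start ArchiveBox, or translated into the equivalent environment settings of your container setup.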

A4ET8a8uTh0 · on Jan 12, 2024

I ran into it on most versions of unraid, but I used community apps and not github, which may be the reason I got the error.

theK · on Jan 12, 2024

This is awesome, I couldn't identify from the readme how you tell it what to save and was wondering whether this could be driven by a Browser add-on/extension?

nikisweeting · on Jan 15, 2024

There is a section in the README outlining all the input formats it can take, including a link to the browser extension:

https://github.com/ArchiveBox/ArchiveBox#input-formats

CrypticShift · on Jan 12, 2024

This is one of those great projects that would benefit from local LLM integration.

nikisweeting · on Jan 12, 2024

Stay tuned ;)

https://github.com/ArchiveBox/ArchiveBox/issues/1139

rcarmo · on Jan 12, 2024

I’d like to see ArchiveBox and sonic as a data source for retrieval augmented generation, really. It would make more sense.

viraptor · on Jan 12, 2024

That's cool. I've been using archivebox together with other tools to achieve this. It may be cool to get some integrations too, rather than yet another from-scratch RAG.

valsk · on Jan 11, 2024

This was created 5 years ago..

nikisweeting · on Jan 12, 2024

Actually closer to 7 years ago :)

You can learn about the origin story / motivation here:

https://github.com/ArchiveBox/ArchiveBox#background--motivat...

https://2020.pycon.co/en/talks/5/ (a conference talk I gave about it)

codsane · on Jan 11, 2024

And it’s still being maintained :)