Make Your Own Internet Archive with Archive Box (nixintel.info)
257 points by adamhearn on Jan 19, 2021 | 77 comments


I feel like a simple automatic capture of timestamp + url + screenshot would already be very useful. This gives you a visual memory of the things you've seen on the web. I've wanted to develop this for a while, as a browser plugin.

Being able to skim the past month or two and click around the thumbnails would already be amazing. I've wanted to do that many times before: to check whether my memory was correct, whether a page changed since I last saw it, or to figure out when I last saw something online.

You don't need a special viewer for it, since your operating system's file explorer can already browse the screenshots, and you don't need to set up a crawl. Screenshots also compress well, as WebP or as PNG after crunching.


A few years ago, in an attempt to increase productivity, I used a screen recorder that took a screenshot every 10 seconds and played it back at the end of every day. So I had a timelapse of how I was spending my time -- mostly online. It was very enlightening.

The most efficient format to store a sequence of screenshots in is video, because most of them will have heavily overlapping data.


Huh. That's a pretty nifty thing to do. Just wrote a Python script to do that for me and it's running in the background right now. Shall be interesting to come back to it this evening. Do you recall how long each screenshot would show in the final video (basically, what was the framerate)? Currently considering about 4 frames per second, but would love to get your take on it :)


It's up to preference really. I had mine set to max frame rate, then I'd use the [] keys in VLC to slow down the speed. It would take 1-2 minutes to view my day (8-12 hours).

I think at one point I used an ImageMagick script to add timestamps to them.

Originally I used a Windows thing called TimeSnapper which as a bonus lets you scrub through time and shows when there was / wasn't computer activity.
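
For anyone who wants to try this, a rough sketch of the whole loop on Linux might look something like the following (assuming scrot, ImageMagick, and ffmpeg are installed; the paths, interval, and framerate are just placeholders — run the capture loop in the background, then stamp and assemble at the end of the day):

    # grab a timestamped screenshot every 10 seconds
    while true; do
      scrot "$HOME/timelapse/$(date +%Y%m%d-%H%M%S).png"
      sleep 10
    done
    # burn the timestamp (taken from the filename) into each frame
    for f in "$HOME"/timelapse/*.png; do
      mogrify -gravity southeast -pointsize 24 -fill white -undercolor black \
        -annotate +10+10 "$(basename "$f" .png)" "$f"
    done
    # assemble the frames into a video at ~4 fps
    ffmpeg -framerate 4 -pattern_type glob -i "$HOME/timelapse/*.png" \
      -c:v libx264 -pix_fmt yuv420p timelapse.mp4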


I did it before, just saving the images and using feh to play with the playback speed. It's easy to experiment with, so just see what works for you.


I've dreamed about this as well, basically a personalized FullStory that allows you to search and replay all of your sessions across sites.

Easy block list for sensitive things like banking, internal sites, email, etc.

I currently use the Session Buddy Chrome extension, which helps in some cases (I was able to find a hard-to-Google repo today, for example), but the historical context is largely missing.


Sounds like this might be what you want if you only want screenshots + timestamps and nothing else:

    archivebox oneshot --extract=screenshot 'https://example.com'
or

    archivebox add --extract=screenshot < ~/Desktop/browser_bookmarks.html


> screenshot

Wouldn't it be more useful and take less space to use SingleFile?


You'd think so, but code (even web code) needs to be executed and is brittle. Some of it doesn't even work as an archive right after saving. All my images from 10-20 years ago work perfectly today. None of my code does without some major effort.


This doesn't allow full text search easily, though.


A PDF with an image of the page on the first page, then the plain text of the page flowed over the following pages.
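
Something like this would do it, assuming img2pdf, pandoc, and poppler's pdfunite are available (filenames are placeholders):

    # page 1: an image of the rendered page
    img2pdf screenshot.png -o page_image.pdf
    # following pages: the plain text of the page, reflowed
    pandoc page_text.txt -o page_text.pdf
    # staple them together into one archive file per URL
    pdfunite page_image.pdf page_text.pdf archived_page.pdf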


+ (text minus stop words)


ocr?


why go from text to image and back to text? seems wasteful and error prone...


It's a hard problem to figure out what's readable text on a page, and what isn't. Even Google has a hard time figuring that out. OCR works very well with screenshots, and is purely computation time. But the real reason is generally just having timestamps, urls, and screenshots is good enough. I usually remember about when it was, and some words in the url, and don't need the heavyweight text search setup.
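
If you do want searchability later, the OCR step is cheap to bolt on. A minimal sketch, assuming Tesseract is installed and the screenshots sit in one folder (paths are placeholders):

    # OCR each screenshot into a sidecar .txt file
    for f in ~/screenshots/*.png; do
      tesseract "$f" "${f%.png}" 2>/dev/null   # writes ${f%.png}.txt
    done
    # "full text search" is then just grep over the sidecar files
    grep -ril "some phrase i half remember" ~/screenshots/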


Just hard with the "read more" buttons.


Trying to parse the SPAs of today is just painful. Simpler to just render the page screenshot and OCR! Guaranteed to only index text that actually matters.


This article is blogspam.

The repository has enough information on its own: https://github.com/ArchiveBox/ArchiveBox


Disagree. I find links to repositories to be less accessible than blog posts.


It has its own website too.

https://archivebox.io/


Interesting side note:

It seems like a lot of people in this thread have an interest in retaining a "replayable timeline" of their own browsing/reading history.

There's probably enough support here to gather a few contributors for an open source project.


I seem to remember Google's Larry Page once proposed a similar thing in the early days: a product that would record everything you read on your computer (to make it searchable later). But now I can't find it mentioned anywhere. Am I imagining things?


A "remember everything for me" tool is often called a "Memex" https://en.wikipedia.org/wiki/Memex


If you use Google Chrome as your primary browser this exists at https://myactivity.google.com/item


There are a bunch of projects trying to do different flavors of this already, check out some of these:

https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...


Hey all, @pirate (ArchiveBox maintainer) here, thanks for posting this @adamhearn.

If you like ArchiveBox, check out our new Twitter account for the project (https://twitter.com/ArchiveBoxApp). We just opened it and we'll be posting announcements and prerelease sneak peeks there in the future.


i use this every single day and think very highly of it. thanks for reminding me - i'm going to sponsor this developer on github...


That's the right thought, aligned with the spirit of open source.


Quote: "..even if you instruct it to begin archiving a site then it can easily fail if that site’s robots.txt prevents crawling"

Huh? Do the big corporations actually care about robots.txt anymore? Nowadays it's more of a "netiquette" thing than anything else. Google definitely ignores it. Dunno what DuckDuckGo does.


How long until this is a feature baked into a mainstream web browser? Archive, prefetch, cache, all variants on a theme. History, bookmarks, local search engine, all the same.


I often wish that I could do a full text search of every page I've already visited.


Safari’s history used to do that, somewhat crudely. You could open up the history and search for any word that appeared on a page you'd browsed, and it would filter the list down to the matching pages.


Not exactly what you are asking here (if I understand correctly) but I have been using historio.us for a year or so and I am happy with it.


I'm working on that in my "self host the internet offline from your browsing history" project:

https://github.com/c9fe/22120

It makes a web archive from everything you browse, and lately I've been working on the full-text search.


Seems it requires chrome to work?


Wow, your reading comprehension is amazingly good. Yep, that's correct.


There's a link to more info on the Chrome thing but it 404s

https://github.com/c9fe/22120/issues/57



Thank you for saving that, pirate niki! I think i saved a copy as well, right? Yep, here https://archive.is/jcURO

I've got lots of these archives of issues and comments lying around. It's good to see more!

Btw i knew you'd show up on a comment thread where i posted my stuff. You're like obsessed with my project, right? Why, haha? Maybe because you've got your own archive project, the archive box. "Competition". Hahaha. I hope you haven't got to the point of "search alerts" obsessed, hahaha, don't fret too much, niki pirate! Everyone check out pirate's the archive box, he made the trip here, do him a favor!

https://github.com/ArchiveBox/ArchiveBox

Thanks for showing up niiki, it's good to see you again. See you next time!


I didn't save it, it was picked up by archive.org's own crawler in Nov. Also I think you're way overthinking this, I just saw this because it's a comment thread on a post about ArchiveBox. It's a little weird that you frame it as competition, there are plenty of open source archiving projects, it's not a war. We can be civil, you don't have to mock me or do stuff like blocking me on Github, I have no beef with 22120.


I'm just looking for the link, I'll post it when i find it. Just a sec!

Edit ok!

Here's my reply: https://pastebin.com/0gM0LN9j?oh46khe5gh8in


You can pretend it's like that, but that's not how it is. I'll reply to you tomorrow.


I think the way it is now in the readme accurately reflects the state of development. Still possible, but not actively being worked on, and not currently interested in any non-Chrome stuff.

But at any time, it may just come back, you know? Boom.. and then it's back. Haha !

Which suits me perfectly for where i am right now. :)


Forever. Site owners will shit purple Twinkies and put the developers under a copyright cosh if the feature is released and not removed.


It could respect caching headers, just make it visible in the UI and show “expired” content if current content is unavailable. I don’t see how this would be an issue for site owners, they could adjust their content headers if they want something different.


Is there a list of web page archive formats I could look at? There are a few things I'd love to do where it would be very handy to have one file per page.


The main archive formats for web content are WARC, ZIM, Memento, and static HTML (e.g. from a tool like wget or SingleFile).

If you want 1 page per URL, I recommend SingleFile.

Lots more info here if you want to compare different software options: https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...


I use this with an automated script that watches my Twitter activity. If I like a tweet, it checks whether the tweet contains a URL and, if so, archives it.
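
The original script isn't shown here, but the general idea can be sketched with a hypothetical liked_tweets.json dump (one JSON object per line with a "text" field) piped into ArchiveBox, which accepts URLs on stdin:

    # extract any URLs from liked tweets and hand them to ArchiveBox
    jq -r '.text' liked_tweets.json \
      | grep -oE 'https?://[^ "]+' \
      | archivebox add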


This would be a nice thing to be able to run on a Synology NAS or other kind of device that typically has terabytes of storage.


that's what i do - there's a docker image, 1 line script + cron job. it archives an rss feed of links i gather
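
Roughly something like the following, assuming the official Docker image and a data dir at /home/me/archivebox (the feed URL is a placeholder); --depth=1 tells ArchiveBox to also archive the pages linked from the feed:

    # crontab entry: pull the feed of saved links into ArchiveBox every night at 03:00
    0 3 * * * docker run -v /home/me/archivebox:/data archivebox/archivebox add --depth=1 'https://example.com/links.rss'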


How do you generate that rss feed?


getpocket.com


It runs quite well in docker. I still feed my instance by hand but eventually need to write a firefox extension to push history semi-live.


so.. you CAN have a box that is "the internet"....


Yes Jen


How can I use this to archive sites/pages that require logging in to see?



From the blog comments, I think this is what you’re after https://github.com/c9fe/22120


Tried this a while ago, disappointed by the HD usage.

My solution, as a heavy TTS user: I have Balabolka set up to read copied text, which naturally leaves a log for future reference. There are extensions to auto-copy highlighted text and append URLs, which makes the whole flow straightforward. Each day's log is around 1-5 MB of text saved in a big folder. The biggest limitation is trying to do advanced searches of the unstructured text files by complex keywords within date ranges. I'm sure I could set up each clip with delimiters so the logs can be imported into a searchable DB, just too lazy.


I think you tried a very old version ;) all that has long since changed. As of v0.5, ArchiveBox keeps everything in a SQLite3 DB, and full-text search is implemented with Sonic.


You will need a lot of disk storage, right?


There are different extractors/services, and you can toggle them pretty easily. By default it screenshots everything, exports a PDF, saves something like 4 different HTML copies, and submits the link to the Wayback Machine. It also tries to extract the important text and stores that separately. You could easily configure it to only extract text, turn off some of the HTML extractors, or disable the PDF and screenshot captures if you want to prioritize disk space.
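
For example, something along these lines (these are the standard ArchiveBox SAVE_* config toggles; exact key names can differ between versions, so treat this as a sketch):

    # keep text-oriented extractors, skip the heavy ones
    archivebox config --set SAVE_PDF=False
    archivebox config --set SAVE_SCREENSHOT=False
    archivebox config --set SAVE_MEDIA=False        # skip youtube-dl audio/video rips
    archivebox config --set SAVE_READABILITY=True   # keep the extracted article text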


It doesn't show in the screenshot in the article, but in Aug 2020 ArchiveBox implemented the "readability" article text extractor; see the description in the release notes: https://github.com/pirate/ArchiveBox/releases/tag/v0.4.14 and the module that does the work: https://github.com/pirate/readability-extractor

By only extracting text and article images you could go deep into an archive. If you skip images, much more so.


That probably depends on the scope of what you're looking to archive. If you're looking to make a local backup of your bookmarks folder (as one of the intentions seems to be), it's probably not an unreasonable amount of storage. Maybe a few GB at most (if you have a moderate to large bookmarks folder), depending on how many sites there are and how heavy they are?


For reference, archivebox uses 250GB for 5000 links in my setup.


That is an insane amount of storage for so few links. Is your setup somehow very greedy?

Saving an article-only view (images + text) should probably do better.

I suspect your numbers come from JavaScript and CSS, etc.? Is there a way for ArchiveBox to not download React 5000 times, but share source files? Most likely the custom bundles that sites compile will make this impossible most of the time. Just thinking out loud here.


It's recommended to run it on a compressed filesystem like ZFS. On mine it's using ~75GB for ~3000 URLs. It varies greatly depending on the content, usually the vast majority of storage is from video/audio ripped with youtube-dl.
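
Setting that up is a one-liner if you already have a pool; a minimal sketch, assuming a pool called tank (names are placeholders):

    # create a compressed dataset for the archive data
    zfs create -o compression=lz4 tank/archivebox
    # later, check how much the compression is actually saving
    zfs get compressratio tank/archivebox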


A real OSINT archive box would also capture all non-inline JavaScript, CSS and blob: files.


How does archive.is trick news sites into showing content without the paywall? Is it pure user agent spoofing?

I'm wondering if this could be applied here.


Yeup, that's exactly why we expose the USER_AGENT options in the ArchiveBox config ;)

https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overv...

I don't want to officially endorse using the Google bot user agent, but you're welcome to try it on your own and see if it improves the experience.
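
For example (the exact config key names may vary by version, and spoofing a crawler UA can violate a site's terms, so this is purely illustrative):

    # pretend to be the Google crawler for the headless-Chrome and wget extractors
    UA='Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
    archivebox config --set CHROME_USER_AGENT="$UA"
    archivebox config --set WGET_USER_AGENT="$UA"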


How does ArchiveBox compare to https://archivarix.com? I recently used Archivarix to back up a large website (93k pages), but it messed up the JS/CSS.


Can you configure this tool to login to websites (for paid news subscriptions) and get past those paywalls?


Yeah, it supports it but there are security considerations if you're doing it for anything more serious than news content. See here: https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overv...


Thanks, much appreciated. This is a very informative set of things to watch out for that I wouldn't have thought of otherwise.


Make sure you read through this section as well to fully understand the security concerns:

https://github.com/ArchiveBox/ArchiveBox#caveats


That is the default for the screenshot, PDF, and one of the HTML archives: they use your Chrome cookies.
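
If the Chrome profile isn't picked up automatically, you can point ArchiveBox at it explicitly; a sketch assuming the default Linux profile path (adjust for your OS):

    # reuse the cookies/session from an existing logged-in Chrome profile
    archivebox config --set CHROME_USER_DATA_DIR="$HOME/.config/google-chrome"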


Likely not without some modification, but you could try this:

https://www.jacoduplessis.co.za/bypass-paywall/



