Make Your Own Internet Archive with Archive Box (nixintel.info)
257 points by adamhearn on Jan 19, 2021 | 77 comments


I feel like a simple automatic capture of timestamp + url + screenshot would already be very useful. This gives you a visual memory of the things you've seen on the web. I've wanted to develop this for a while, as a browser plugin.

Being able to skim the past month or two and click around the thumbnails would already be amazing. I've wanted to do that many times before: to check whether my memory was correct, whether a page changed since I last saw it, or to figure out when I last saw something online.

You don't need a special viewer for it, since your operating system's file explorer can already browse the screenshots, and you don't need to set up a crawl. Screenshots also compress well, as WebP or as PNG after crunching.


A few years ago, in an attempt to increase productivity, I used a screen recorder that took a screenshot every 10 seconds and played it back at the end of every day. So I had a timelapse of how I was spending my time -- mostly online. It was very enlightening.

The most efficient format to store a sequence of screenshots in is video, because most of them will have heavily overlapping data.


Huh. That's a pretty nifty thing to do. Just wrote a Python script to do that for me and it's running in the background right now. Shall be interesting to come back to it this evening. Do you recall how long each screenshot would show in the final video (basically, what was the framerate)? Currently considering about 4 frames per second, but would love to get your take on it :)


It's up to preference really. I had mine set to max frame rate, then I'd use the [] keys in VLC to slow down the speed. It would take 1-2 minutes to view my day (8-12 hours).

I think at one point I used an ImageMagick script to add timestamps to them.

Originally I used a Windows thing called TimeSnapper which as a bonus lets you scrub through time and shows when there was / wasn't computer activity.
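
For anyone who wants to try this, a rough sketch of the whole loop on Linux might look something like the following (assuming scrot, ImageMagick, and ffmpeg are installed; the paths, interval, and framerate are just placeholders — run the capture loop in the background, then stamp and assemble at the end of the day):

    # grab a timestamped screenshot every 10 seconds
    while true; do
      scrot "$HOME/timelapse/$(date +%Y%m%d-%H%M%S).png"
      sleep 10
    done
    # burn the timestamp (taken from the filename) into each frame
    for f in "$HOME"/timelapse/*.png; do
      mogrify -gravity southeast -pointsize 24 -fill white -undercolor black \
        -annotate +10+10 "$(basename "$f" .png)" "$f"
    done
    # assemble the frames into a video at ~4 fps
    ffmpeg -framerate 4 -pattern_type glob -i "$HOME/timelapse/*.png" \
      -c:v libx264 -pix_fmt yuv420p timelapse.mp4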


I did it before, just saving the images and using feh to play with the playback speed. It's easy to experiment with, so just see what works for you.


I've dreamed about this as well, basically a personalized FullStory that allows you to search and replay all of your sessions across sites.

Easy block list for sensitive things like banking, internal sites, email, etc.

I currently use the Session Buddy Chrome extension, which helps in some cases (I was able to find a hard-to-Google repo today, for example), but the historical context is largely missing.


Sounds like this might be what you want if you only want screenshots + timestamps and nothing else:

    archivebox oneshot --extract=screenshot 'https://example.com'
or

    archivebox add --extract=screenshot < ~/Desktop/browser_bookmarks.html


> screenshot

Wouldn't it be more useful and take less space to use SingleFile?


You'd think so, but code (even web code) needs to be executed and is brittle. Some of it doesn't even work as an archive right after saving. All my images from 10-20 years ago work perfectly today. None of my code does without some major effort.


This doesn't allow full text search easily, though.


A PDF with an image of the page on the first page, then the plain text of the page flowed over the following pages.
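
Something like this would do it, assuming img2pdf, pandoc, and poppler's pdfunite are available (filenames are placeholders):

    # page 1: an image of the rendered page
    img2pdf screenshot.png -o page_image.pdf
    # following pages: the plain text of the page, reflowed
    pandoc page_text.txt -o page_text.pdf
    # staple them together into one archive file per URL
    pdfunite page_image.pdf page_text.pdf archived_page.pdf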


+ (text minus stop words)


ocr?


why go from text to image and back to text? seems wasteful and error prone...


It's a hard problem to figure out what's readable text on a page, and what isn't. Even Google has a hard time figuring that out. OCR works very well with screenshots, and is purely computation time. But the real reason is generally just having timestamps, urls, and screenshots is good enough. I usually remember about when it was, and some words in the url, and don't need the heavyweight text search setup.
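
If you do want searchability later, the OCR step is cheap to bolt on. A minimal sketch, assuming Tesseract is installed and the screenshots sit in one folder (paths are placeholders):

    # OCR each screenshot into a sidecar .txt file
    for f in ~/screenshots/*.png; do
      tesseract "$f" "${f%.png}" 2>/dev/null   # writes ${f%.png}.txt
    done
    # "full text search" is then just grep over the sidecar files
    grep -ril "some phrase i half remember" ~/screenshots/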


Just hard with the "read more" buttons.


Trying to parse the SPAs of today is just painful. Simpler to just render the page screenshot and OCR! Guaranteed to only index text that actually matters.


This article is blogspam.

The repository has enough information on its own: https://github.com/ArchiveBox/ArchiveBox


Disagree. I find links to repositories to be less accessible than blog posts.


It has its own website too.

https://archivebox.io/


Interesting side note:

It seems like a lot of people in this thread have an interest in retaining a "replayable timeline" of their own browsing/reading history.

There's probably enough support here to gather a few contributors for an open source project.


I seem to remember Google's Larry Page once proposed a similar thing in the early days: a product that would record everything you read on your computer (to make it searchable later). But now I can't find it mentioned anywhere. Am I imagining things?


A "remember everything for me" tool is often called a "Memex" https://en.wikipedia.org/wiki/Memex


If you use Google Chrome as your primary browser this exists at https://myactivity.google.com/item


There are a bunch of projects trying to do different flavors of this already, check out some of these:

https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...


Hey all, @pirate (ArchiveBox maintainer) here, thanks for posting this @adamhearn.

If you like ArchiveBox, check out our new Twitter account for the project (https://twitter.com/ArchiveBoxApp). We just opened it and we'll be posting announcements and prerelease sneak peeks there in the future.


i use this every single day and think very highly of it. thanks for reminding me - i'm going to sponsor this developer on github...


That's the right thought, aligned with the spirit of open source.


Quote: "..even if you instruct it to begin archiving a site then it can easily fail if that site’s robots.txt prevents crawling"

Huh? Do the big corporations actually care about robots.txt anymore? Nowadays it's more of a "netiquette" thing than anything else. Google definitely ignores it. Dunno what DuckDuckGo does.


How long until this is a feature baked into a mainstream web browser? Archive, prefetch, cache, all variants on a theme. History, bookmarks, local search engine, all the same.


I often wish that I could do a full text search of every page I've already visited.


Safari’s history used to do that, somewhat crudely. You could open up the history and search for any word that appeared on a page you'd browsed, and it would filter the list down to the matching pages.


Not exactly what you are asking here (if I understand correctly) but I have been using historio.us for a year or so and I am happy with it.


I'm working on that in my "self host the internet offline from your browsing history" project:

https://github.com/c9fe/22120

It makes a web archive from everything you browse, and lately I've been working on the full-text search.


Seems it requires chrome to work?


Wow, your reading comprehension is amazingly good. Yep, that's correct.


There's a link to more info on the Chrome thing but it 404s

https://github.com/c9fe/22120/issues/57



Thank you for saving that, pirate niki! I think i saved a copy as well, right? Yep, here https://archive.is/jcURO

I've got lots of these archives of issues and comments lying around. It's good to see more!

Btw i knew you'd show up on a comment thread where i posted my stuff. You're like obsessed with my project, right? Why, haha? Maybe because you've got your own archive project, the archive box. "Competition". Hahaha. I hope you haven't got to the point of "search alerts" obsessed, hahaha, don't fret too much, niki pirate! Everyone check out pirate's the archive box, he made the trip here, do him a favor!

https://github.com/ArchiveBox/ArchiveBox

Thanks for showing up niiki, it's good to see you again. See you next time!


I didn't save it, it was picked up by archive.org's own crawler in Nov. Also I think you're way overthinking this, I just saw this because it's a comment thread on a post about ArchiveBox. It's a little weird that you frame it as competition, there are plenty of open source archiving projects, it's not a war. We can be civil, you don't have to mock me or do stuff like blocking me on Github, I have no beef with 22120.


I'm just looking for the link, I'll post it when i find it. Just a sec!

Edit ok!

Here's my reply: https://pastebin.com/0gM0LN9j?oh46khe5gh8in


You can pretend it's like that, but that's not how it is. I'll reply to you tomorrow.


I think the way it is now in the readme accurately reflects the state of development. Still possible, but not actively being worked on, and not currently interested in any non-Chrome stuff.

But at any time, it may just come back, you know? Boom.. and then it's back. Haha !

Which suits me perfectly for where i am right now. :)


Forever. Site owners will shit purple Twinkies and put the developers under a copyright cosh if the feature is released and not removed.


It could respect caching headers, just make it visible in the UI and show “expired” content if current content is unavailable. I don’t see how this would be an issue for site owners, they could adjust their content headers if they want something different.


Is there a list of web page archive formats I could look at? There are a few things I'd love to do where it would be very handy to have one file per page.


The main archive formats for web content are WARC, ZIM, Memento, and static HTML (e.g. from a tool like wget or SingleFile).

If you want 1 page per URL, I recommend SingleFile.

Lots more info here if you want to compare different software options: https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...


I use this with an automated script that watches my Twitter activity. If I like a tweet, it checks whether the tweet contains a URL and, if so, archives it.
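
The original script isn't shown here, but the general idea can be sketched with a hypothetical liked_tweets.json dump (one JSON object per line with a "text" field) piped into ArchiveBox, which accepts URLs on stdin:

    # extract any URLs from liked tweets and hand them to ArchiveBox
    jq -r '.text' liked_tweets.json \
      | grep -oE 'https?://[^ "]+' \
      | archivebox add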


This would be a nice thing to be able to run on a Synology NAS or other kind of device that typically has terabytes of storage.


that's what i do - there's a docker image, 1 line script + cron job. it archives an rss feed of links i gather
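
Roughly something like the following, assuming the official Docker image and a data dir at /home/me/archivebox (the feed URL is a placeholder); --depth=1 tells ArchiveBox to also archive the pages linked from the feed:

    # crontab entry: pull the feed of saved links into ArchiveBox every night at 03:00
    0 3 * * * docker run -v /home/me/archivebox:/data archivebox/archivebox add --depth=1 'https://example.com/links.rss'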


How do you generate that rss feed?


getpocket.com


It runs quite well in docker. I still feed my instance by hand but eventually need to write a firefox extension to push history semi-live.


so.. you CAN have a box that is "the internet"....


Yes Jen


How can I use this to archive sites/pages that require logging in to see?



From the blog comments, I think this is what you’re after https://github.com/c9fe/22120


Tried this a while ago, disappointed by the HD usage.

My solution, as a heavy TTS user: I have Balabolka set up to read copied text, which naturally leaves a log for future reference. There are extensions to auto-copy highlighted text and append URLs, which makes the whole flow straightforward. Each day's log is around 1-5 MB of text saved in a big folder. The biggest limitation is trying to do advanced searches of the unstructured text files by complex keywords within date ranges. I'm sure I could set up each clip with delimiters so the logs can be imported into a searchable DB, just too lazy.


I think you tried a very old version ;) all that has long since changed. As of v0.5, ArchiveBox keeps everything in a SQLite3 DB, and full-text search is implemented with Sonic.


You will need a lot of disk storage, right?


There are different extractors/services, and you can toggle them pretty easily. By default it screenshots everything, exports a PDF, saves something like 4 different HTML copies, and submits the link to the Wayback Machine. It also tries to extract the important text and stores that separately. You could easily configure it to only extract text, turn off some of the HTML extractors, or disable the PDF and screenshot captures if you want to prioritize disk space.
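
For example, something along these lines (these are the standard ArchiveBox SAVE_* config toggles; exact key names can differ between versions, so treat this as a sketch):

    # keep text-oriented extractors, skip the heavy ones
    archivebox config --set SAVE_PDF=False
    archivebox config --set SAVE_SCREENSHOT=False
    archivebox config --set SAVE_MEDIA=False        # skip youtube-dl audio/video rips
    archivebox config --set SAVE_READABILITY=True   # keep the extracted article text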


It doesn't show in the screenshot in the article, but in Aug 2020 ArchiveBox implemented the "readability" article text extractor; see the description in the release notes: https://github.com/pirate/ArchiveBox/releases/tag/v0.4.14 and the module that does the work: https://github.com/pirate/readability-extractor

By only extracting text and article images you could go deep into an archive. If you skip images, much more so.


That probably depends on the scope of what you're looking to archive. If you're looking to make a local backup of your bookmarks folder (as one of the intentions seems to be), it's probably not an unreasonable amount of storage. Maybe a few GB at most (if you have a moderate to large bookmarks folder), depending on how many sites there are and how heavy they are?


For reference, archivebox uses 250GB for 5000 links in my setup.


That is an insane amount of storage for so few links. Is your setup somehow very greedy?

Saving an article-only view (images + text) should probably do better.

I suspect your numbers come from JavaScript and CSS, etc.? Is there a way for ArchiveBox to not download React 5000 times, but share source files? Most likely the custom bundles that sites compile will make this impossible most of the time. Just thinking out loud here.


It's recommended to run it on a compressed filesystem like ZFS. On mine it's using ~75GB for ~3000 URLs. It varies greatly depending on the content, usually the vast majority of storage is from video/audio ripped with youtube-dl.
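
Setting that up is a one-liner if you already have a pool; a minimal sketch, assuming a pool called tank (names are placeholders):

    # create a compressed dataset for the archive data
    zfs create -o compression=lz4 tank/archivebox
    # later, check how much the compression is actually saving
    zfs get compressratio tank/archivebox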


A real OSINT archive box would also capture all non-inline JavaScript, CSS and blob: files.


How does archive.is trick news sites into showing content without the paywall? Is it pure user agent spoofing?

I'm wondering if this could be applied here.


Yeup, that's exactly why we expose the USER_AGENT options in the ArchiveBox config ;)

https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overv...

I don't want to officially endorse using the Google bot user agent, but you're welcome to try it on your own and see if it improves the experience.
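
For example (the exact config key names may vary by version, and spoofing a crawler UA can violate a site's terms, so this is purely illustrative):

    # pretend to be the Google crawler for the headless-Chrome and wget extractors
    UA='Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
    archivebox config --set CHROME_USER_AGENT="$UA"
    archivebox config --set WGET_USER_AGENT="$UA"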


How does ArchiveBox compare to https://archivarix.com? I recently used Archivarix to back up a large website (93k pages), but it messed up the JS/CSS.


Can you configure this tool to login to websites (for paid news subscriptions) and get past those paywalls?


Yeah, it supports it but there are security considerations if you're doing it for anything more serious than news content. See here: https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overv...


Thanks, much appreciated. This is a very informative set of things to watch out for that I wouldn't have thought of otherwise.


Make sure you read through this section as well to fully understand the security concerns:

https://github.com/ArchiveBox/ArchiveBox#caveats


That is the default for the screenshot, PDF, and one of the HTML archives: they use your Chrome cookies.
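
If the Chrome profile isn't picked up automatically, you can point ArchiveBox at it explicitly; a sketch assuming the default Linux profile path (adjust for your OS):

    # reuse the cookies/session from an existing logged-in Chrome profile
    archivebox config --set CHROME_USER_DATA_DIR="$HOME/.config/google-chrome"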


Likely not without some modification, but you could try this:

https://www.jacoduplessis.co.za/bypass-paywall/



