I feel like a simple automatic capture of timestamp + url + screenshot would already be very useful. This gives you a visual memory of the things you've seen on the web. I've wanted to develop this for a while, as a browser plugin.
Being able to skim the past month or two and click around the thumbnails would already be amazing. I've wanted to do that many times before: to check if my memory was correct, to see if a page changed since I last saw it, or to figure out when I last saw something online.
You don't need a special viewer for it, since your operating system's file explorer can already browse the screenshots, and you don't need to set up a crawl. Screenshots also compress well, as WebP or as PNG after crunching them.
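However the captures end up being produced, naming the files timestamp-plus-URL is what makes the file explorer enough of a viewer, and Pillow can handle the WebP crunching. A rough sketch of what I have in mind (the slug helper and quality setting are just illustrative choices, not anything final):

    import re, time
    from PIL import Image  # pip install Pillow

    def save_capture(png_path, url):
        # Sortable filename: ISO-ish timestamp + a slug of the URL
        slug = re.sub(r"[^a-zA-Z0-9.-]+", "_", url)[:80]
        stamp = time.strftime("%Y-%m-%d_%H-%M-%S")
        out = f"{stamp}__{slug}.webp"
        # Re-encode the raw PNG capture as lossy WebP to save space
        Image.open(png_path).save(out, "WEBP", quality=80)
        return out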
A few years ago, in an attempt to increase productivity, I used a screen recorder that took a screenshot every 10 seconds and played it back at the end of every day. So I had a timelapse of how I was spending my time -- mostly online. It was very enlightening.
The most efficient format to store a sequence of screenshots in is video, because most of them will have heavily overlapping data.
Huh. That's a pretty nifty thing to do. Just wrote a python script to do that for me and it's running in the background right now. Shall be interesting to come back to it this evening. Do you recall how many seconds each screenshot was shown in the final video (basically, what was the framerate)? I'm currently considering about 4 frames per second but would love to get your take on it :)
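In case anyone else wants to try it, mine is roughly this, using the mss package (the interval and output folder are just my choices):

    import time
    from datetime import datetime
    from pathlib import Path

    import mss  # pip install mss

    OUT_DIR = Path("screenlog")   # wherever you want the frames
    INTERVAL = 10                 # seconds between captures

    OUT_DIR.mkdir(exist_ok=True)
    with mss.mss() as sct:
        while True:
            name = OUT_DIR / datetime.now().strftime("%Y-%m-%d_%H-%M-%S.png")
            sct.shot(mon=-1, output=str(name))  # mon=-1 = all monitors in one image
            time.sleep(INTERVAL)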
It's up to preference really. I had mine set to max frame rate, then I'd use the [] keys in VLC to slow down the speed. It would take 1-2 minutes to view my day (8-12 hours).
I think at one point I used an ImageMagick script to add timestamps to them.
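If I were doing it again today I'd just shell out to ImageMagick for the stamp and ffmpeg for the video. A sketch, assuming both tools are on PATH and the PNGs sort correctly by name:

    import subprocess
    from pathlib import Path

    frames = sorted(Path("screenlog").glob("*.png"))

    # Burn each frame's timestamp (taken from its filename) into the corner
    for f in frames:
        subprocess.run([
            "convert", str(f),
            "-gravity", "SouthEast", "-pointsize", "24", "-fill", "white",
            "-annotate", "+10+10", f.stem,
            str(f),
        ], check=True)

    # Pack the frames into a timelapse; 30 fps turns a 12 h day of
    # 10 s captures (~4300 frames) into roughly 2.5 minutes of video
    subprocess.run([
        "ffmpeg", "-y", "-framerate", "30",
        "-pattern_type", "glob", "-i", "screenlog/*.png",
        "-c:v", "libx264", "-pix_fmt", "yuv420p", "timelapse.mp4",
    ], check=True)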
Originally I used a Windows thing called TimeSnapper which as a bonus lets you scrub through time and shows when there was / wasn't computer activity.
I've dreamed about this as well, basically a personalized FullStory that allows you to search and replay all of your sessions across sites.
Easy block list for sensitive things like banking, internal sites, email, etc.
I currently use the Session Buddy Chrome extension, which helps in some cases (I was able to find a hard-to-Google repo today, for example), but the historical context is largely missing.
You'd think so, but code (even web code) needs to be executed and is brittle. Some of it doesn't work as an archive even right after saving. All my images from 10-20 years ago work perfectly today. None of my code does without some major effort.
It's a hard problem to figure out what's readable text on a page and what isn't. Even Google has a hard time figuring that out. OCR works very well on screenshots, and costs only computation time. But the real reason is that just having timestamps, URLs, and screenshots is generally good enough. I usually remember roughly when it was, and some words in the URL, so I don't need the heavyweight text-search setup.
Trying to parse the SPAs of today is just painful. Simpler to just render the page to a screenshot and OCR it! Guaranteed to only index text that actually matters.
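For anyone who wants to bolt search on top of a screenshot folder, pytesseract makes the OCR step pretty much a one-liner. A sketch, assuming the Tesseract binary is installed and the folder name is yours to change:

    from pathlib import Path

    import pytesseract          # pip install pytesseract (needs the tesseract binary)
    from PIL import Image

    # Dump recognized text next to each screenshot so grep/ripgrep can search it
    for shot in Path("screenlog").glob("*.png"):
        text = pytesseract.image_to_string(Image.open(shot))
        shot.with_suffix(".txt").write_text(text, encoding="utf-8")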
I seem to remember Google's Larry Page once proposed a similar thing in the early days: a product that would record everything you read on your computer (to make it searchable later). But now I can't find it mentioned anywhere, am I imagining things?
Hey all, @pirate (ArchiveBox maintainer) here, thanks for posting this @adamhearn.
If you like ArchiveBox, check out our new Twitter account for the project: https://twitter.com/ArchiveBoxApp. We just opened it and we'll be posting announcements and prerelease sneak peeks there in the future.
Quote: "..even if you instruct it to begin archiving a site then it can easily fail if that site’s robots.txt prevents crawling"
Huh? Do the big corporations actually care about robots.txt anymore? Nowadays it's more of a "netiquette" thing than anything else. Google definitely ignores it. Dunno what DuckDuckGo does.
How long until this is a feature baked into a mainstream web browser? Archive, prefetch, cache, all variants on a theme. History, bookmarks, local search engine, all the same.
Safari's history used to do that, somewhat crudely. You could open up the history and search for any word that appeared on a page you browsed, and it would filter the list down to the matching pages.
Thank you for saving that, pirate niki! I think I saved a copy as well, right? Yep, here: https://archive.is/jcURO
I've got lots of these archives of issues and comments lying around. It's good to see more!
Btw I knew you'd show up on a comment thread where I posted my stuff. You're like obsessed with my project, right? Why, haha? Maybe because you've got your own archive project, ArchiveBox. "Competition". Hahaha. I hope you haven't got to the point of setting up "search alerts", hahaha, don't fret too much, niki pirate! Everyone check out pirate's ArchiveBox, he made the trip here, do him a favor!
I didn't save it, it was picked up by archive.org's own crawler in Nov. Also I think you're way overthinking this, I just saw this because it's a comment thread on a post about ArchiveBox. It's a little weird that you frame it as competition, there are plenty of open source archiving projects, it's not a war. We can be civil, you don't have to mock me or do stuff like blocking me on GitHub, I have no beef with 22120.
I think the way it is now in the readme accurately reflects the state of development. Still possible, but I'm not actively working on it and not currently interested in any non-Chrome stuff.
But at any time, it may just come back, you know? Boom... and then it's back. Haha!
Which suits me perfectly for where i am right now. :)
It could respect caching headers, just make it visible in the UI and show “expired” content if current content is unavailable. I don’t see how this would be an issue for site owners, they could adjust their content headers if they want something different.
Is there a list of web page archive formats I could look at? There are a few things I'd love to do where it would be very handy to have one file per page.
My solution, as a heavy TTS user: I have Balabolka set up to read copied text, which naturally leaves a log for future reference. There are extensions to auto-copy highlighted text and append URLs, which makes the whole flow straightforward. Each day's log is around 1-5 MB of text saved in a big folder. The biggest limitation is doing advanced searches over the unstructured text files by complex keywords within date ranges. I'm sure I could set up each clip with delimiters so the logs can be imported into a searchable DB, I'm just too lazy.
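For what it's worth, SQLite's FTS5 makes the "searchable DB" part nearly free once the clips have delimiters. A sketch, assuming one log file per day named by date and clips separated by a line of dashes (that format is made up, swap in whatever delimiter you actually use):

    import sqlite3
    from pathlib import Path

    db = sqlite3.connect("clips.db")
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS clips USING fts5(day, body)")

    # One file per day, e.g. 2021-01-15.txt, clips separated by a '----' line (assumed)
    for log in Path("tts-logs").glob("*.txt"):
        day = log.stem
        for clip in log.read_text(encoding="utf-8").split("----"):
            if clip.strip():
                db.execute("INSERT INTO clips VALUES (?, ?)", (day, clip.strip()))
    db.commit()

    # Complex keyword search restricted to a date range
    rows = db.execute(
        "SELECT day, snippet(clips, 1, '[', ']', '...', 8) "
        "FROM clips WHERE clips MATCH ? AND day BETWEEN ? AND ?",
        ('archive AND "full text"', "2021-01-01", "2021-03-31"),
    ).fetchall()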
I think you tried a very old version ;) All that has long since changed. As of v0.5, ArchiveBox keeps everything in a SQLite3 DB and full-text search is implemented with Sonic.
There are different extractors/services, and you can toggle them pretty easily. By default it screenshots everything, exports a PDF, saves like 4 different HTML copies and submits the link to the wayback machine. It also tries to extract important text, and stores that separately. You could easily configure it to only extract text, turn off some HTML extractors, or disable the PDF and screenshot captures if you want to prioritize disk space.
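For reference, toggling those is just a handful of config keys. I'm going from memory on the exact names, so double-check the configuration docs for your version:

    # as environment variables or in ArchiveBox.conf -- option names assumed from the docs
    SAVE_PDF=False
    SAVE_SCREENSHOT=False
    SAVE_MEDIA=False         # skip youtube-dl audio/video rips, usually the biggest disk hog
    SAVE_WGET=False
    SAVE_DOM=True            # keep one rendered HTML copy
    SAVE_READABILITY=True    # keep the extracted article text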
That probably depends on the scope of what you're looking to archive. If you're looking to make a local backup of your bookmarks folder (as one of the intentions seems to be), it's probably not an unreasonable amount of storage. Maybe a few GB at most (if you have a moderate to large bookmarks folder), depending on how many sites there are and how heavy they are?
That is an insane amount of storage for so few links. Is your setup somehow very greedy?
Saving an article-only view (images + text) should probably do better.
I suspect your numbers come from JavaScript and CSS, etc.? Is there a way for ArchiveBox to not download React 5000 times, but share source files instead? Most likely the custom bundles that sites compile make this impossible most of the time. Just thinking out loud here.
It's recommended to run it on a compressed filesystem like ZFS. On mine it's using ~75GB for ~3000 URLs. It varies greatly depending on the content, usually the vast majority of storage is from video/audio ripped with youtube-dl.
How does ArchiveBox compare to https://archivarix.com? I recently used Archivarix to back up a large website (93k pages), but it messed up the JS/CSS.