INTERNETARCHIVE.BAK: *The INTERNETARCHIVE.BAK project (also known as IA.BAK or I...

phendrenad2 · on Sept 8, 2020

I wish there were a way to get a low-rez copy of their entire archive. So, only text, no images, binaries, PDFs (other than PDFs converted to text which they seem to do). As it stands the archive is so huge, the barrier to mirroring is high.

dredmorbius · on Sept 8, 2020

Agreed.

When scoping out the size of Google+, one of ArchiveTeam's recent projects, it emerged that the typical size of a post was roughly 120 bytes, but total page weight a minimum of 1 MB, for a 1% payload to throw-weight ratio. This seems typical of much the modern Web. And that excludes external assets: images, JS, CSS, etc.

If just the source text and sufficient metadata were preserved, all of G+ would be startlingly small -- on the order of 100 GB I believe. Yes, posts could be longer (I wrote some large ones), and images (associated with about 30% of posts by my estimate) blew things up a lot. But the scary thing is actually how little content there really was. And while G+ certainly had a "ghost town" image (which I somewhat helped define), it wasn't tiny --- there were plausibly 100 - 300 million users with substantial activity.

But IA's WBM has a goal and policy of preserving the Web as it manifests, which means one hell of a lot of cruft and bloat. As you note, increasingly a liability.

ta8908695 · on Sept 8, 2020

The external assets for a page could be archived separately though, right? I would think that the static G+ assets: JS, CSS, images, etc. could be archived once, and then all the remaining data would be much closer the 120B of real content. Is there a technical reason that's not the case?

dredmorbius · on Sept 8, 2020

In theory.

In practice, this would likely involve recreating at least some of the presentation side of numerous changing (some constantly) Web apps. Which is a substantial programming overhead.

WARC is dumb as rocks, from a redundancy standpoint, but also atomically complete, independent (all WARCs are entirely self-contained), and reliable. When dealing with billions of individual websites, these are useful attributes.

It's a matter of trade-offs.