When scoping out the size of Google+, one of ArchiveTeam's recent projects, it emerged that the typical size of a post was roughly 120 bytes, but total page weight a minimum of 1 MB, for a payload to throw-weight ratio of roughly 0.01%. This seems typical of much of the modern Web. And that excludes external assets: images, JS, CSS, etc.
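For anyone checking the arithmetic (a quick sketch in Python, using only the rough figures above):

    post_bytes = 120          # typical post payload, per the estimate above
    page_bytes = 1_000_000    # minimum page weight, ~1 MB
    print(f"{post_bytes / page_bytes:.4%}")  # -> 0.0120%, i.e. roughly 0.01%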
If just the source text and sufficient metadata were preserved, all of G+ would be startlingly small -- on the order of 100 GB, I believe. Yes, posts could be longer (I wrote some large ones), and images (associated with about 30% of posts by my estimate) blew things up a lot. But the scary thing is actually how little content there really was. And while G+ certainly had a "ghost town" image (which I somewhat helped define), it wasn't tiny -- there were plausibly 100-300 million users with substantial activity.
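A back-of-envelope check on that 100 GB figure (the post count here is purely my illustrative assumption; only the ~120 B/post figure comes from above):

    avg_post_bytes = 120             # typical post payload, from above
    total_posts = 1_000_000_000      # assumed ~1e9 posts -- hypothetical
    print(avg_post_bytes * total_posts / 1e9, "GB")  # -> 120.0 GB, order of 100 GB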
But the Internet Archive's Wayback Machine (WBM) has a goal and policy of preserving the Web as it manifests, which means one hell of a lot of cruft and bloat. As you note, that's increasingly a liability.
The external assets for a page could be archived separately though, right? I would think that the static G+ assets (JS, CSS, images, etc.) could be archived once, and then all the remaining data would be much closer to the 120 B of real content. Is there a technical reason that's not the case?
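(What I have in mind is plain content-addressed deduplication -- a rough sketch under that assumption, not how any particular archiver actually works; store() is a hypothetical storage call:)

    import hashlib

    seen = set()  # digests of assets already archived

    def archive_once(body: bytes) -> str:
        """Store an asset only the first time its content is seen."""
        digest = hashlib.sha256(body).hexdigest()
        if digest not in seen:
            seen.add(digest)
            store(digest, body)  # hypothetical storage backend
        return digest  # later captures just reference the digest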
In practice, this would likely involve recreating at least some of the presentation side of numerous changing (some constantly changing) Web apps. Which is a substantial programming overhead.
WARC is dumb as rocks, from a redundancy standpoint, but also atomically complete, independent (all WARCs are entirely self-contained), and reliable. When dealing with billions of individual websites, these are useful attributes.
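To make "self-contained" concrete, here's a minimal sketch of reading records with the warcio Python library (a real library and its documented iteration API; the filename is made up):

    from warcio.archiveiterator import ArchiveIterator

    # Every record carries its own headers, target URI, and payload,
    # so a single .warc.gz file is interpretable entirely on its own.
    with open('gplus-grab.warc.gz', 'rb') as stream:  # hypothetical file
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                uri = record.rec_headers.get_header('WARC-Target-URI')
                body = record.content_stream().read()
                print(uri, len(body))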