I installed the "Webrecorder ArchiveWeb.page" Chrome extension and wanted to "record" (whatever that means, my assumption/hope is to make a working copy) a fancy interactive JavaScript website.
But when I visit that website and select "Start Recording" from the Webrecorder Chrome extension, after a couple of requests the tab "crashes", shows the "dead tab" icon, and reports "Error Code: STATUS_ACCESS_VIOLATION".
If you have other extensions, disable all of them before trying to record.
Content injected by other extensions can't be accessed due to recent security changes in Chromium, and that can cause this error. We'll add it to the Common Errors page and fix the links. Thanks!
This is actually an issue with their docs that I encountered a few weeks ago when I was first experimenting with this tool. They apparently added a Spanish-language version of the docs, which introduced an extra directory level in the URLs, but failed to set up redirects or even update the existing links in the documentation.
Very cool tool. I tried using the Chrome extension to archive a Matterport (3D tour service) page, but got a fatal Chrome error ("Error code: STATUS_ACCESS_VIOLATION") in the process.
It took me a minute to grasp that "capture" here means not recording the screen but making an offline copy of the website, similar to crawling it, except that it (I think) also captures the interactions, not just the document links.
How do people archive sites 1:1 using tools like Playwright? I've tried taking screenshots (they look weird) and pulling the page content (I have problems viewing articles on sites like Medium).
Our automated crawler browsertrix-crawler (https://github.com/webrecorder/browsertrix-crawler) uses Puppeteer to run browsers in which we archive sites by loading pages, running behaviors such as auto-scroll, and recording the request/response traffic in the WARC format (the default in Webrecorder tools, then packaged into a portable WACZ file: https://specs.webrecorder.net/wacz/1.1.1/). We have custom behaviors for some social media and video sites to make sure that content is captured appropriately. It is a bit of a cat-and-mouse game, as we have to keep updating these behaviors as sites change, but for the most part it works pretty well. The crawler also has some job queuing functionality, supports multiple workers/browsers, and is highly configurable with timeouts, page limits, etc.
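Roughly, the capture loop looks like this (a simplified TypeScript sketch of the idea, not our actual code; a real archiver would serialize these records as WARC, e.g. with a warcio-style library, whereas here they just land in an array):

```typescript
// Simplified sketch of the capture idea, not browsertrix-crawler's code:
// load a page with Puppeteer, run an auto-scroll behavior, and record
// request/response traffic.
import puppeteer from 'puppeteer';

interface CapturedResponse {
  url: string;
  status: number;
  headers: Record<string, string>;
  body: Buffer | null;
}

async function capturePage(url: string): Promise<CapturedResponse[]> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const records: CapturedResponse[] = [];

  // Record every response the page triggers, including XHR/fetch traffic.
  page.on('response', async (response) => {
    let body: Buffer | null = null;
    try {
      body = await response.buffer(); // unavailable for e.g. redirects
    } catch {
      // skip bodies we can't read
    }
    records.push({
      url: response.url(),
      status: response.status(),
      headers: response.headers(),
      body,
    });
  });

  await page.goto(url, { waitUntil: 'networkidle2' });

  // Crude stand-in for an auto-scroll behavior: scroll in steps to
  // trigger lazy-loaded content before we stop capturing.
  await page.evaluate(async () => {
    for (let y = 0; y < document.body.scrollHeight; y += 400) {
      window.scrollTo(0, y);
      await new Promise((resolve) => setTimeout(resolve, 250));
    }
  });

  await browser.close();
  return records;
}

capturePage('https://example.com').then((recs) =>
  console.log(`captured ${recs.length} responses`)
);
```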
The trickier part is in replaying the archived websites, since a certain amount of rewriting has to happen to make sure the HTML and JS work against archived assets rather than the live web. One implementation of this is replayweb.page (https://github.com/webrecorder/replayweb.page), which does all of the rewriting client-side in the browser. This lets you interact with archived websites in WARC or WACZ format as if interacting with the original site. replayweb.page can run locally in your browser without needing to send any data to a server, or it can be hosted, including in an embedded mode.
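To make the rewriting point concrete, here is a toy illustration (not replayweb.page's implementation, which also hooks fetch/XHR and rewrites JS client-side): static src/href URLs in archived HTML get pointed back into the archive instead of the live web. The /replay/<timestamp>/ prefix is a made-up example:

```typescript
// Toy illustration of the rewriting problem: point static src/href URLs
// in archived HTML at archived copies instead of the live web.
function rewriteHtml(html: string, archivePrefix: string): string {
  return html.replace(
    /(src|href)=(["'])(https?:\/\/[^"']+)\2/gi,
    (_match, attr, quote, url) => `${attr}=${quote}${archivePrefix}${url}${quote}`
  );
}

// <img src="https://example.com/a.png"> becomes
// <img src="/replay/20240101000000/https://example.com/a.png">
console.log(
  rewriteHtml(
    '<img src="https://example.com/a.png">',
    '/replay/20240101000000/'
  )
);
```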
We experimented with moving to Playwright, but Playwright doesn't handle long-running browser sessions well: the devs have (maybe rightly) prioritized testing over archival use cases and want you to spin up a fresh browser each time. For archival purposes that doesn't work as well, because we're not able to save the browser profile to retain cookies such as login credentials, so we've moved back to Puppeteer for now.
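For what it's worth, the profile-persistence point looks roughly like this in Puppeteer (a sketch; './archive-profile' is just a hypothetical path):

```typescript
// Puppeteer can reuse a browser profile directory across runs, so cookies
// (e.g. login sessions) survive between launches.
import puppeteer from 'puppeteer';

async function launchWithProfile() {
  const browser = await puppeteer.launch({
    userDataDir: './archive-profile', // cookies, localStorage, etc. persist here
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // ...archive as usual; the next launch with the same userDataDir
  // starts with the stored session intact.
  await browser.close();
}

launchWithProfile();
```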
It's ultimately a cat and mouse game with many websites actively trying to sabotage archival efforts.
Unless this is core functionality in something you're working on, most people will be better off using the SavePageNow API from archive.org and integrating with that. This is what I ultimately ended up doing for one of my projects.[1]
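If you go that route, the simplest form is just an HTTP request to the save endpoint. A minimal sketch (this uses the unauthenticated GET form, which is rate-limited; the authenticated SPN2 API adds job status, capture options, etc.):

```typescript
// Minimal sketch of the simple Save Page Now form: a GET to
// https://web.archive.org/save/<url> asks the Wayback Machine to
// capture the page. Uses the global fetch available in Node 18+.
async function savePage(url: string): Promise<string | null> {
  const res = await fetch(`https://web.archive.org/save/${url}`);
  // The snapshot path has historically come back in Content-Location,
  // but treat that as an assumption rather than a guarantee.
  return res.headers.get('content-location');
}

savePage('https://example.com').then((loc) =>
  console.log(loc ?? 'capture requested; no snapshot location returned')
);
```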
Yeah this is true, but it works for enough websites to be a meaningful option, and it is almost always going to work better than something you have home-rolled (unless your core product is a direct competitor or something, in which case all bets are off ;))
A good example of this is Pinboard, which claims to offer website archiving. A friend has over 100,000 links saved there (with an archival account). When we spent a few minutes a few weeks ago looking at the archive links for those items, we couldn't find a single correct, working, accessible archive among the most recently archived links (listed as archived 5 weeks ago, so it's also not up to date).
Most of the time a client just cares about the information, not the typesetting or the layout, and in that case most of the data can just be pulled from the web inspector (downloading files). I've yet to have a client ask me to also copy the typesetting and layout so I never learned how to do that.
Heading over to https://archiveweb.page/guide :
> Help! I have an error!
> See the common errors [https://archiveweb.page/troubleshooting/errors] to see if your issue is listed there, or contact us [https://archiveweb.page/contact] if it is not.
https://archiveweb.page/troubleshooting/errors -> 404 Page not found
So on to https://archiveweb.page/contact -> 404 Page not found
Rebooting my computer and restarting Chrome did not solve the issue. ¯\_(ツ)_/¯