I installed the "Webrecorder ArchiveWeb.page" Chrome extension and wanted to "record" (whatever that means, my assumption/hope is to make a working copy) a fancy interactive JavaScript website.
But when I visit that website and select "Start Recording" from the Webrecorder Chrome extension, after a couple of requests the tab "crashes", shows the "dead tab" icon, and reports "Error Code: STATUS_ACCESS_VIOLATION".
If you have other extensions, disable all of them before trying to record.
Content injected by other extensions can't be accessed due to recent security changes in Chromium, and that can cause this error. We'll add it to the Common Errors page and fix the links. Thanks!
This is actually an issue with their docs that I encountered a few weeks ago when I was first experimenting with this tool. They apparently added a Spanish-language version of the docs, which introduced an extra directory level in the URLs, but failed to set up redirects or even update the existing links in the documentation.
Very cool tool. I tried using the Chrome extension to archive a Matterport (3D tour service) page, but got a fatal Chrome error ("Error code: STATUS_ACCESS_VIOLATION") in the process.
It took me a minute to grasp that "capture" here means not recording the screen but making an offline copy of the website, similar to crawling it, except that it (I think) also captures the interactions, not just the document links.
How do people archive sites 1:1 using tools like Playwright? I've tried taking screenshots (they look weird) and pulling the page content (I have problems viewing articles on sites like Medium).
Our automated crawler browsertrix-crawler (https://github.com/webrecorder/browsertrix-crawler) uses Puppeteer to run browsers in which we archive sites by loading pages, running behaviors such as auto-scroll, and recording the request/response traffic in the WARC format (the default in Webrecorder tools, then packaged into a portable WACZ file: https://specs.webrecorder.net/wacz/1.1.1/). We have custom behaviors for some social media and video sites to make sure that content is captured appropriately. It is a bit of a cat-and-mouse game, as we have to keep updating these behaviors as sites change, but for the most part it works pretty well. The crawler also has some job queuing functionality, supports multiple workers/browsers, and is highly configurable with timeouts, page limits, etc.
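Roughly, the capture loop looks like this (a simplified TypeScript sketch of the idea, not our actual code; a real archiver would serialize these records as WARC, e.g. with a warcio-style library, whereas here they just land in an array):

```typescript
// Simplified sketch of the capture idea, not browsertrix-crawler's code:
// load a page with Puppeteer, run an auto-scroll behavior, and record
// request/response traffic.
import puppeteer from 'puppeteer';

interface CapturedResponse {
  url: string;
  status: number;
  headers: Record<string, string>;
  body: Buffer | null;
}

async function capturePage(url: string): Promise<CapturedResponse[]> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const records: CapturedResponse[] = [];

  // Record every response the page triggers, including XHR/fetch traffic.
  page.on('response', async (response) => {
    let body: Buffer | null = null;
    try {
      body = await response.buffer(); // unavailable for e.g. redirects
    } catch {
      // skip bodies we can't read
    }
    records.push({
      url: response.url(),
      status: response.status(),
      headers: response.headers(),
      body,
    });
  });

  await page.goto(url, { waitUntil: 'networkidle2' });

  // Crude stand-in for an auto-scroll behavior: scroll in steps to
  // trigger lazy-loaded content before we stop capturing.
  await page.evaluate(async () => {
    for (let y = 0; y < document.body.scrollHeight; y += 400) {
      window.scrollTo(0, y);
      await new Promise((resolve) => setTimeout(resolve, 250));
    }
  });

  await browser.close();
  return records;
}

capturePage('https://example.com').then((recs) =>
  console.log(`captured ${recs.length} responses`)
);
```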
The trickier part is in replaying the archived websites, since a certain amount of rewriting has to happen to make sure the HTML and JS work against archived assets rather than the live web. One implementation of this is replayweb.page (https://github.com/webrecorder/replayweb.page), which does all of the rewriting client-side in the browser. This lets you interact with archived websites in WARC or WACZ format as if interacting with the original site. replayweb.page can run locally in your browser without needing to send any data to a server, or it can be hosted, including in an embedded mode.
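To make the rewriting point concrete, here is a toy illustration (not replayweb.page's implementation, which also hooks fetch/XHR and rewrites JS client-side): static src/href URLs in archived HTML get pointed back into the archive instead of the live web. The /replay/<timestamp>/ prefix is a made-up example:

```typescript
// Toy illustration of the rewriting problem: point static src/href URLs
// in archived HTML at archived copies instead of the live web.
function rewriteHtml(html: string, archivePrefix: string): string {
  return html.replace(
    /(src|href)=(["'])(https?:\/\/[^"']+)\2/gi,
    (_match, attr, quote, url) => `${attr}=${quote}${archivePrefix}${url}${quote}`
  );
}

// <img src="https://example.com/a.png"> becomes
// <img src="/replay/20240101000000/https://example.com/a.png">
console.log(
  rewriteHtml(
    '<img src="https://example.com/a.png">',
    '/replay/20240101000000/'
  )
);
```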
We experimented with moving to Playwright, but Playwright doesn't handle long-running browser sessions well: the devs have (maybe rightly) prioritized testing over archival use cases and want you to spin up a fresh browser each time. For archival purposes that doesn't work as well, because we're not able to save the browser profile to retain cookies such as login credentials, so we've moved back to Puppeteer for now.
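For what it's worth, the profile-persistence point looks roughly like this in Puppeteer (a sketch; './archive-profile' is just a hypothetical path):

```typescript
// Puppeteer can reuse a browser profile directory across runs, so cookies
// (e.g. login sessions) survive between launches.
import puppeteer from 'puppeteer';

async function launchWithProfile() {
  const browser = await puppeteer.launch({
    userDataDir: './archive-profile', // cookies, localStorage, etc. persist here
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // ...archive as usual; the next launch with the same userDataDir
  // starts with the stored session intact.
  await browser.close();
}

launchWithProfile();
```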
It's ultimately a cat and mouse game with many websites actively trying to sabotage archival efforts.
Unless this is core functionality in something you're working on, most people will be better off using the SavePageNow API from archive.org and integrating with that. This is what I ultimately ended up doing for one of my projects.[1]
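If you go that route, the simplest form is just an HTTP request to the save endpoint. A minimal sketch (this uses the unauthenticated GET form, which is rate-limited; the authenticated SPN2 API adds job status, capture options, etc.):

```typescript
// Minimal sketch of the simple Save Page Now form: a GET to
// https://web.archive.org/save/<url> asks the Wayback Machine to
// capture the page. Uses the global fetch available in Node 18+.
async function savePage(url: string): Promise<string | null> {
  const res = await fetch(`https://web.archive.org/save/${url}`);
  // The snapshot path has historically come back in Content-Location,
  // but treat that as an assumption rather than a guarantee.
  return res.headers.get('content-location');
}

savePage('https://example.com').then((loc) =>
  console.log(loc ?? 'capture requested; no snapshot location returned')
);
```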
Yeah this is true, but it works for enough websites to be a meaningful option, and it is almost always going to work better than something you have home-rolled (unless your core product is a direct competitor or something, in which case all bets are off ;))
A good example of this is Pinboard, which claims to offer website archiving. A friend has over 100,000 links saved there (with an archival account). When we spent a few minutes a few weeks ago looking at the archive links for those items, we couldn't find a single correct, working, accessible archive among the most recently archived links (listed as archived 5 weeks ago, so it's also not up to date).
Most of the time a client just cares about the information, not the typesetting or the layout, and in that case most of the data can just be pulled from the web inspector (downloading files). I've yet to have a client ask me to also copy the typesetting and layout so I never learned how to do that.
Heading over to https://archiveweb.page/guide :
> Help! I have an error!
> See the common errors [https://archiveweb.page/troubleshooting/errors] to see if your issue is listed there, or contact us [https://archiveweb.page/contact] if it is not.
https://archiveweb.page/troubleshooting/errors -> 404 Page not found
So on to https://archiveweb.page/contact -> 404 Page not found
Rebooting my computer and restarting Chrome did not solve the issue. ¯\_(ツ)_/¯