Here is the response I got. the route "/11840.html" is not recognized Internet A...

marttt · on July 28, 2022

Thanks for your comment. I realize now that archive.org's "archive string" (here 20220719195142if_) is updated automatically. So if I use this string + some other URL, then I get redirected to a current snapshot of that other site, e.g.

https://web.archive.org/web/20220719195142if_/http://ranprie...

points me to

https://web.archive.org/web/20220723235055if_/https://ranpri...

I suppose the string consists of date + time in hhmmss format + if_? Anyhow, looks like arbitrary strings (e.g. 19991230225818if_) also get redirected to the next existing snapshot counting from that string. This is really nice and simple for text browser scripts.
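
A minimal sketch of that pattern for scripts, assuming the URL layout seen in the examples above (`/web/<timestamp>if_/<original URL>`). `wayback_url` is a hypothetical helper name, not part of any archive.org tool, and it only builds the URL. Fetching it with a client that follows redirects should then land on the nearest existing snapshot, as described above.

```shell
#!/bin/sh
# wayback_url TIMESTAMP URL
# Hypothetical helper: prints a Wayback Machine URL built from a 14-digit
# timestamp (yyyymmddhhmmss) and the original URL. No network access here.
wayback_url() {
    printf 'https://web.archive.org/web/%sif_/%s\n' "$1" "$2"
}

# Example with an arbitrary timestamp and a placeholder site.
wayback_url 19991230225818 http://example.com/
```

The result can be passed straight to a text browser or to `curl -L`.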

Is there some straightforward way to list all of archive.org's snapshots (of a particular site) without a javascript-enabled browser?

jwilk · on July 28, 2022

> Is there some straightforward way to list all of archive.org's snapshots (of a particular site) without a javascript-enabled browser?

I use https://github.com/jsvine/waybackpack.

  $ waybackpack --list https://diziet.dreamwidth.org/11840.html
  ...
  https://web.archive.org/web/20220727234836/https://diziet.dreamwidth.org/11840.html
  https://web.archive.org/web/20220728045504/https://diziet.dreamwidth.org/11840.html
  https://web.archive.org/web/20220728084126/https://diziet.dreamwidth.org/11840.html

1vuio0pswjnm7 · on July 28, 2022

"Is there some straightforward way to list all of archive.org's snapshots (of a particular site) without a javascript-enabled browser?"

https://archive.org/services/docs/api/wayback-cdx-server.htm...

FWIW, below is a quick and dirty script I use for a variety of purposes, such as accessing www search result URLs so I do not have to (a) use sites that do not support TLS1.3, (b) use sites that require SNI or (c) use DNS. I will call this script "www".

Example usage:

    alias links="links -no-connect"
    x=https://ranprieur.com
    # retrieve first 5 snapshots (default)
    echo $x|www >1.htm
    # retrieve first 3 snapshots
    echo $x|www 3 >1.htm
    # retrieve last 3 snapshots
    echo $x|www -3 >1.htm 
    # retrieve all snapshots
    echo $x|www 0 >1.htm
    links 1.htm

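    #!/bin/sh

    # snapshot limit from the first argument (default 5); see the usage examples above
    LIMIT=${1-5};
    # read the target URL from stdin
    read x0;
    x0=$(echo $x0|sed 's/%/&&/g');
    x1=web.archive.org;
    # query the CDX server, then turn each "timestamp original" line into an HTML list item linking to that snapshot
    curl -A "" "https://$x1/cdx/search/cdx?url=$x0&fl=timestamp,original&limit=$LIMIT&showDupeCount=true" \
    |(echo "<h2>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;${x0}</h2><ol><pre>";sed -n "/[0-9]\{14\} [hf]/{s/\(.* \)\(.*\)/<li><a href=https:\/\/${x1}\/web\/\1if_\/\2>\1<\/a>/;s/ </</;s/ //2p;}");
    # close the tags in the reverse order they were opened
    echo "</pre></ol>";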

NB. I used curl here because this is an example for HN. That does not mean I am a curl user.
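
For scripts that only need plain snapshot URLs rather than an HTML page, the same CDX output can be reduced to one URL per line. This is a rough sketch, not the script above: `cdx_to_urls` is a hypothetical name, and it assumes the two space-separated fields requested with `fl=timestamp,original`. The sample input is made up and stands in for a real response.

```shell
#!/bin/sh
# cdx_to_urls: reads "timestamp original-url" lines on stdin and prints one
# snapshot URL per line. Lines whose first field is not a 14-digit number
# are skipped.
cdx_to_urls() {
    awk 'NF >= 2 && $1 ~ /^[0-9]+$/ && length($1) == 14 {
        print "https://web.archive.org/web/" $1 "if_/" $2
    }'
}

# Made-up sample input in place of a real CDX response.
printf '%s\n' \
    '20220727234836 https://example.com/' \
    '20220728045504 https://example.com/' | cdx_to_urls
```

In real use the input would come from the CDX query, e.g. `curl -s "https://web.archive.org/cdx/search/cdx?url=example.com&fl=timestamp,original&limit=3" | cdx_to_urls`.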

I also have a small script I use for the Common Crawl archives. They also use CDX but the results are WARC files compressed with gzip. I wrote a small program in C to extract the gzip'd results after HTTP/1.1 pipelining. For retrieving results without pipelining (i.e., many TCP connections), I modified tnftp to accept a Range header.

marttt · on July 28, 2022

This is excellent, many thanks for sharing.

I don't know how it never occurred to me that the Wayback Machine might also have an API. :/ Also, there is lots of interesting material for things like the above on Common Crawl: https://commoncrawl.org/the-data/examples/

I guess my text-only browsing just got a bunch of extra batteries (thus far simply w3m + a few wget-etc scripts).

easrng · on July 28, 2022

The number is a timestamp, and the if_ just hides the toolbar; it's optional. It presumably stands for IFrame, since that's what it's used for: rewriting iframe src attributes so they don't show the toolbar.
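
To make the timestamp part concrete, here is a small sketch that splits the 14 digits into date and time and ignores any trailing modifier such as if_. `ts_to_date` is a hypothetical helper, and the yyyymmddhhmmss layout is assumed from the snapshot URLs earlier in the thread.

```shell
#!/bin/sh
# ts_to_date: formats a 14-digit Wayback timestamp, optionally followed by
# a modifier like if_, as "yyyy-mm-dd hh:mm:ss". Prints nothing if the
# argument does not start with 14 digits.
ts_to_date() {
    printf '%s\n' "$1" | sed -n 's/^\([0-9]\{4\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\).*$/\1-\2-\3 \4:\5:\6/p'
}

ts_to_date 20220719195142if_   # prints: 2022-07-19 19:51:42
```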