If you're looking for a convenient way to search web pages on the Internet Archive, you may also use my browser extension for viewing archived and cached versions of web pages. It supports 15 data sources, and page archiving is also planned.
Getting them from the Archive, though, was an exercise in frustration. IA offers (and heavily recommends) using the torrent download option to ease its bandwidth costs.
Unfortunately, for whatever reason, there's no way to pull down the .torrent files using this method.
In the end I had to simply pull the MPEG-2 videos down one by one over the course of several months (due to speed limiting on IA's end).
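(For anyone trying the same thing today, the `ia` command-line client discussed elsewhere in this thread can usually pull the per-item .torrent files with a glob filter; a rough sketch follows, with a made-up collection name.)

    # list the identifiers in a collection, then grab only each item's .torrent
    ia search 'collection:example-collection' --itemlist > items.txt
    while read -r item; do
        ia download "$item" --glob='*.torrent'
    done < items.txt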
# If you are not a Python user or want to try something different (faster), this can be done with sh, sed, openssl, curl/wget/etc. plus a simple utility I wrote called "yy025" (https://news.ycombinator.com/item?id=17689152). yy025 is a more generalised "Swiss Army Knife" for making requests to any website. This solution uses a traditional method called "HTTP pipelining".
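# A minimal sketch of the idea (the item identifiers are made up, plain printf stands in for yy025, and the exact sed pattern depends on the page markup):

    {
      # one pipelined request per item page, all sent over a single TLS connection
      for item in item-one item-two item-three; do
        printf 'GET /details/%s HTTP/1.1\r\nHost: archive.org\r\nConnection: keep-alive\r\n\r\n' "$item"
      done
      # last request closes the connection so s_client exits once the server is done
      printf 'GET /details/item-four HTTP/1.1\r\nHost: archive.org\r\nConnection: close\r\n\r\n'
    } |
    openssl s_client -quiet -connect archive.org:443 2>/dev/null |
    sed -n 's|.*href="\(/download/[^"]*\.torrent\)".*|https://archive.org\1|p' > torrent-urls.txt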
# Additional command-line options for openssl s_client omitted for the sake of brevity. The above outputs the torrent URLs. Feed those to curl or wget or whatever similar program you choose, or maybe directly to a torrent client. Something like:
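# For example, assuming the torrent-urls.txt produced above (rate-limited a little to be polite):

    wget --wait=2 --input-file=torrent-urls.txt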
You are probably thinking of pipelining in terms of the popular web browsers. Those programs want to do pipelining so they can load up resources (read: today, ads) from a variety of domains in order to present a web page with graphics and advertising.
That never really worked. Thus, we have HTTP/2, authored by an ad sales company. It is very important for an ad sales company that web pages contain not only what the user is requesting but also heaps of automatically followed pointers to third party resources hosted on other domains. That is, pages need to be able to contain advertising. HTTP/1.1 pipelining is of little benefit to the ad ecosystem.
However, sometimes the user is not trying to load up a graphical web page full of third party resources. Here, the HN commenter is just trying to get some HTML, extract some URLs and then download some files. The HTML is all obtained from the same domain. This is text retrieval, nothing more.
If all the resources the user wants are from the same domain, e.g., archive.org, then pipelining works great. I have been using HTTP/1.1 pipelining to do this for several decades and it has always worked flawlessly.
Typically httpd settings for any website would allow at least 100 pipelined requests per connection. As you might imagine, often the httpd settings are just unchanged defaults. Today the limits I see are often much higher, e.g., several hundred.
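For instance, a stock Apache httpd ships with keep-alive on and 100 requests per connection (these are the upstream defaults, not anything specific to archive.org):

    KeepAlive On
    MaxKeepAliveRequests 100
    KeepAliveTimeout 5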
It is very rare in my experience to find a site that has pipelining disabled. More likely a site will disable Connection: keep-alive and force every request to Connection: close, but even that I rarely see.
The HTTP/1.1 specification suggests a max connection limit per browser of two. There is no suggested limit on the number of requests per connection. In terms of efficiency, the more the better. How many connections does a popular web browser make when loading an "average" web page today? It is a lot more than two! In any event, pipelining as I have shown here stays under the two connection limit.
Is there a tutorial or introduction to the Internet Archive? There's a giant mass of fascinating stuff, but I've always had a hard time getting a handle on it.
It's a decent client, but be aware that you might want to increase the file descriptor limit: in my experience using it to upload a fairly large folder structure, the client at the moment doesn't properly close files.
A simple `ulimit -n <limit>` with a raised descriptor limit should take care of it.
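Something along these lines before kicking off the upload (the limit value and item identifier are just placeholders):

    # raise the per-shell open-file limit, then upload as usual
    ulimit -n 4096
    ia upload my-item-identifier ./large-folder/ --metadata='mediatype:data'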
Meta: is "The Swiss Army Knife of $Something" the new "$Something for humans"? I really hate these phrases: they can technically be put in front of almost anything and stay semi-accurate, yet they give no additional information beyond being marketing-speak.
I chose the phrasing because I deemed it accurate for the situation. This script has multiple functions and does a number of distinct things well depending on how it is invoked.
My org has been using the client and python library for a couple of years to interact with IA. It's a fantastic tool -- Jake Johnson's a superhero in my book!
https://github.com/dessant/view-page-archive