If you're looking for a convenient way to search web pages on the Internet Archive, you may also use my browser extension for viewing archived and cached versions of web pages. It supports 15 data sources, and page archiving is also planned.
Getting them from the Archive, though, was an exercise in frustration. IA offers (and heavily recommends) using the torrent download option to ease its bandwidth costs.
Unfortunately, for whatever reason, there's no way to pull down the .torrent files using this method.
In the end I had to simply pull the MPEG-2 videos down one by one over the course of several months (due to speed limiting on IA's end).
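(For anyone trying the same thing today, the `ia` command-line client discussed elsewhere in this thread can usually pull the per-item .torrent files with a glob filter; a rough sketch follows, with a made-up collection name.)

    # list the identifiers in a collection, then grab only each item's .torrent
    ia search 'collection:example-collection' --itemlist > items.txt
    while read -r item; do
        ia download "$item" --glob='*.torrent'
    done < items.txt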
# If you are not a Python user or want to try something different (faster), this can be done with sh, sed, openssl, curl/wget/etc. plus a simple utility I wrote called "yy025" (https://news.ycombinator.com/item?id=17689152). yy025 is a more generalised "Swiss Army Knife" for making requests to any website. This solution uses a traditional method called "HTTP pipelining".
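# A minimal sketch of the idea (the item identifiers are made up, plain printf stands in for yy025, and the exact sed pattern depends on the page markup):

    {
      # one pipelined request per item page, all sent over a single TLS connection
      for item in item-one item-two item-three; do
        printf 'GET /details/%s HTTP/1.1\r\nHost: archive.org\r\nConnection: keep-alive\r\n\r\n' "$item"
      done
      # last request closes the connection so s_client exits once the server is done
      printf 'GET /details/item-four HTTP/1.1\r\nHost: archive.org\r\nConnection: close\r\n\r\n'
    } |
    openssl s_client -quiet -connect archive.org:443 2>/dev/null |
    sed -n 's|.*href="\(/download/[^"]*\.torrent\)".*|https://archive.org\1|p' > torrent-urls.txt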
# Additional command-line options for openssl s_client omitted for the sake of brevity. The above outputs the torrent URLs. Feed those to curl or wget or whatever similar program you choose, or maybe directly to a torrent client. Something like:
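# For example, assuming the torrent-urls.txt produced above (rate-limited a little to be polite):

    wget --wait=2 --input-file=torrent-urls.txt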
You are probably thinking of pipelining in terms of the popular web browsers. Those programs want to do pipelining so they can load up resources (read: today, ads) from a variety of domains in order to present a web page with graphics and advertising.
That never really worked. Thus, we have HTTP/2, authored by an ad sales company. It is very important for an ad sales company that web pages contain not only what the user is requesting but also heaps of automatically followed pointers to third party resources hosted on other domains. That is, pages need to be able to contain advertising. HTTP/1.1 pipelining is of little benefit to the ad ecosystem.
However, sometimes the user is not trying to load up a graphical web page full of third party resources. Here, the HN commenter is just trying to get some HTML, extract some URLs and then download some files. The HTML is all obtained from the same domain. This is text retrieval, nothing more.
If all the resources the user wants are from the same domain, e.g., archive.org, then pipelining works great. I have been using HTTP/1.1 pipelining to do this for several decades and it has always worked flawlessly.
Typically httpd settings for any website would allow at least 100 pipelined requests per connection. As you might imagine, often the httpd settings are just unchanged defaults. Today the limits I see are often much higher, e.g., several hundred.
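For instance, a stock Apache httpd ships with keep-alive on and 100 requests per connection (these are the upstream defaults, not anything specific to archive.org):

    KeepAlive On
    MaxKeepAliveRequests 100
    KeepAliveTimeout 5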
It is very rare in my experience to find a site that has pipelining disabled. More likely a site will disable Connection: keep-alive and force every request to Connection: close, but even that I rarely see.
The HTTP/1.1 specification suggests a max connection limit per browser of two. There is no suggested limit on the number of requests per connection. In terms of efficiency, the more the better. How many connections does a popular web browser make when loading an "average" web page today? It is a lot more than two! In any event, pipelining as I have shown here stays under the two connection limit.
Is there a tutorial or introduction to the Internet Archive? There's a giant mass of fascinating stuff, but I've always had a hard time getting a handle on it.
It's a decent client, but be aware that you might want to increase the file descriptor limit: in my experience using it to upload a fairly large folder structure, the client at the moment doesn't properly close files.
A simple `ulimit -n <limit>` with a raised descriptor limit should take care of it.
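Something along these lines before kicking off the upload (the limit value and item identifier are just placeholders):

    # raise the per-shell open-file limit, then upload as usual
    ulimit -n 4096
    ia upload my-item-identifier ./large-folder/ --metadata='mediatype:data'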
Meta: is "The Swiss Army Knife of $Something" the new "$Something for humans"? I really hate these phrases: they can technically be put in front of almost anything and stay semi-accurate, yet they give no additional information beyond being marketing-speak.
I chose the phrasing because I deemed it accurate for the situation. This script has multiple functions and does a number of distinct things well depending on how it is invoked.
My org has been using the client and python library for a couple of years to interact with IA. It's a fantastic tool -- Jake Johnson's a superhero in my book!
https://github.com/dessant/view-page-archive