Too bad dreamwidth hosted blogs are behind cloudflare and set to block non-corpo...

jeroenhd · on July 27, 2022

The page loads fine in Ladybird[1] on Arch. It's the browser purpose-built for SerenityOS[2] using an in-house HTTP/JS/TLS engine that hasn't matured to the point of practical usability yet. If I were a site administrator using some kind of weird metric to block a browser, this thing would definitely go on the blacklist.

As for a more common uncommon browser, GNOME Web (WebKit) also works fine.

Whatever is causing you to get blocked, it's not the browser engine you're using. Check your plugins, antivirus, MITM engines, and whatever else messes with your connection. It could also be a simple IP block because of a bad IP neighbour or a shared CGNAT server.

[1]: https://github.com/awesomekling/ladybird

[2]: https://serenityos.org/

superkuh · on July 28, 2022

I tried via 3 different routable IPv4s from different netblocks. I tried the same browser on 3 different physical computers and OS installs.

I get that "It works for me" for some of you with non-corporate browsers. But please understand "It doesn't work for me", and it's not because I have some weird antivirus or packet mangling or a bad IP. It's because Cloudflare's heuristics are biased against browsers that implement some, but not the latest, JS features. That's Cloudflare's and Dreamwidth's fault, not mine, and they are in the wrong.

Blocking is bad by default and they must justify and adapt, not the users.

jeroenhd · on July 28, 2022

"It works for me" is as useless as "it blocks all non-corporate browsers".

It doesn't block all non-corporate browsers. It apparently blocks your browser, whatever that may be, running from your system, communicating from your network. I don't know what happened to make Cloudflare hate your browser, but my blanket statements are as useful as yours.

They seem to be blocking elinks from non-residential (server) networks. I don't know why, so I don't know if it's warranted or not. With the number of bots Cloudflare has to deal with and the extreme minority of elinks2 users, I can imagine blocking them is a worthwhile tradeoff.

Either way, Cloudflare only provides the defaults; the website operator is responsible for its configuration. In my opinion, a website should be allowed to inconvenience the long tail of weird visitors for any reason they want. I understand that you disagree, but you'll have to convince support@dreamwidth.org if you want to improve the situation, not me.

Rediscover · on July 28, 2022

FWIW, it works on lynx version 2.8.9rel.1 which possibly could be considered non-corporate and slightly old (this version is from 20180708).

1vuio0pswjnm7 · on July 27, 2022

Here is the response I got.

   the route "/11840.html" is not recognized

Internet Archive works for Dreamwidth sites. For me, I add one line to a text file and the localhost forward proxy prefixes the URLs automatically.

https://web.archive.org/web/20220719195142if_/https://diziet...

FWIW, I use a non-corporate browser.

marttt · on July 28, 2022

Thanks for your comment. I just realized that archive.org's "archive string" (here 20220719195142if_) is updated automatically. So if I use this string + some other URL, I get redirected to a current snapshot of that other site, e.g.

https://web.archive.org/web/20220719195142if_/http://ranprie...

points me to

https://web.archive.org/web/20220723235055if_/https://ranpri...

I suppose the string consists of date + time in hhmmss format + if_? Anyhow, looks like arbitrary strings (e.g. 19991230225818if_) also get redirected to the next existing snapshot counting from that string. This is really nice and simple for text browser scripts.
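For text browser scripts, the guess above can be checked mechanically. Below is a minimal sketch that assumes the archive string is 14 digits (YYYYMMDDhhmmss) optionally followed by a modifier such as if_. The helper name split_archive_string is made up for illustration and is not part of any archive.org tooling.

```shell
#!/bin/sh
# split_archive_string: hypothetical helper, not an archive.org tool.
# Assumption: the string is 14 digits (YYYYMMDDhhmmss) plus an optional
# trailing modifier such as "if_".
split_archive_string() {
    s=$1
    # Keep only a leading run of exactly 14 digits; empty if there is none.
    digits=$(printf '%s' "$s" | sed -n 's/^\([0-9]\{14\}\).*$/\1/p')
    [ -n "$digits" ] || return 1
    # Whatever follows the digits is treated as the modifier.
    modifier=${s#"$digits"}
    date=$(printf '%s' "$digits" | sed 's/^\(....\)\(..\)\(..\).*$/\1-\2-\3/')
    time=$(printf '%s' "$digits" | sed 's/^........\(..\)\(..\)\(..\)$/\1:\2:\3/')
    printf '%s %s %s\n' "$date" "$time" "$modifier"
}

split_archive_string 20220719195142if_
```

The function returns a non-zero status when the input does not start with 14 digits, so a script can bail out early on malformed strings.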

Is there some straightforward way to list all of archive.org's snapshots (of a particular site) without a javascript-enabled browser?

jwilk · on July 28, 2022

> Is there some straightforward way to list all of archive.org's snapshots (of a particular site) without a javascript-enabled browser?

I use https://github.com/jsvine/waybackpack.

  $ waybackpack --list https://diziet.dreamwidth.org/11840.html
  ...
  https://web.archive.org/web/20220727234836/https://diziet.dreamwidth.org/11840.html
  https://web.archive.org/web/20220728045504/https://diziet.dreamwidth.org/11840.html
  https://web.archive.org/web/20220728084126/https://diziet.dreamwidth.org/11840.html

1vuio0pswjnm7 · on July 28, 2022

"Is there some straightforward way to list all of archive.org's snapshots (of a particular site) without a javascript-enabled browser?"

https://archive.org/services/docs/api/wayback-cdx-server.htm...

FWIW, below is a quick and dirty script I use for a variety of purposes, such as accessing www search result URLs so I do not have to (a) use sites that do not support TLS1.3, (b) use sites that require SNI or (c) use DNS. I will call this script "www".

Example usage:

    alias links="links -no-connect"
    x=https://ranprieur.com
    # retrieve first 5 snapshots (default)
    echo $x|www >1.htm
    # retrieve first 3 snapshots
    echo $x|www 3 >1.htm
    # retrieve last 3 snapshots
    echo $x|www -3 >1.htm 
    # retrieve all snapshots
    echo $x|www 0 >1.htm
    links 1.htm

    #!/bin/sh

    LIMIT=${1-5}; 
    read x0; 
    x0=$(echo $x0|sed 's/%/&&/g');
    x1=web.archive.org;
    curl -A "" "https://$x1/cdx/search/cdx?url=$x0&fl=timestamp,original&limit=$LIMIT&showDupeCount=true" \
    |(echo "<h2>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;${x0}</h2><ol><pre>";sed -n "/[0-9]\{14\} [hf]/{s/\(.* \)\(.*\)/<li><a href=https:\/\/${x1}\/web\/\1if_\/\2>\1<\/a>/;s/ </</;s/ //2p;}");
    echo "</pre></ol>";

NB. I used curl here because this is an example for HN. That does not mean I am a curl user.
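The sed expression in the script is dense, so here is a more readable sketch of just the conversion step, done with awk and without the HTML wrapping. It assumes the CDX response has one snapshot per line in the form "timestamp original" (which is what fl=timestamp,original asks for). The helper name cdx_to_urls and the sample input lines are made up for illustration.

```shell
#!/bin/sh
# cdx_to_urls: hypothetical helper. Reads "timestamp original" lines on
# stdin and prints one Wayback URL per line, using the if_ form from the
# script above. Lines that do not start with a 14-digit field are skipped.
cdx_to_urls() {
    awk 'NF >= 2 && length($1) == 14 && $1 ~ /^[0-9]+$/ {
        printf "https://web.archive.org/web/%sif_/%s\n", $1, $2
    }'
}

# Made-up sample input standing in for a CDX response.
printf '%s\n' \
    '20220719195142 https://example.com/' \
    'this line is not a snapshot record' \
    '20220723235055 https://example.com/' \
| cdx_to_urls
```

Piping the curl output from the script into this function instead of the sed command would give a plain list of URLs, which may be easier to feed to other tools than the HTML list.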

I also have a small script I use for the Common Crawl archives. They also use CDX but the results are WARC files compressed with gzip. I wrote a small program in C to extract the gzip'd results after HTTP/1.1 pipelining. For retrieving results without pipelining (i.e., many TCP connections), I modified tnftp to accept a Range header.

marttt · on July 28, 2022

This is excellent, many thanks for sharing.

I don't know how it never occurred to me that the Wayback Machine might also have an API. :/ Also, lots of interesting stuff for things like the above on Common Crawl: https://commoncrawl.org/the-data/examples/

I guess my text-only browsing just got a bunch of extra batteries (thus far simply w3m + a few wget-etc scripts).

easrng · on July 28, 2022

The number is a timestamp, and the if_ just hides the toolbar; it's optional. It presumably stands for IFrame, since that's what it's used for: rewriting iframe src attributes so they don't show the toolbar.
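Putting the two pieces together, a snapshot URL is the timestamp, the optional modifier, and the original URL joined under /web/. A minimal sketch follows; the function name wayback_url is made up for illustration, and the URL layout is inferred from the links quoted earlier in this thread rather than from any official specification.

```shell
#!/bin/sh
# wayback_url: hypothetical helper.
# Usage: wayback_url TIMESTAMP URL [MODIFIER]
# MODIFIER is optional; pass "if_" to request the toolbar-less view.
wayback_url() {
    ts=$1
    target=$2
    modifier=${3-}
    printf 'https://web.archive.org/web/%s%s/%s\n' "$ts" "$modifier" "$target"
}

wayback_url 20220719195142 https://example.com/ if_
wayback_url 20220719195142 https://example.com/
```

As noted above in the thread, the timestamp does not seem to need to match an existing snapshot exactly, so a script can pass an arbitrary one and follow the redirect.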

jonathantf2 · on July 27, 2022

Reading this from a Debian bullseye system just fine.

dvfjsdhgfv · on July 27, 2022

This is very wrong. If someone at DW is reading this, please don't do that.

acdha · on July 27, 2022

I’d want confirmation that it’s true first. People can configure custom UA blocking rules on Cloudflare but based on this I’d bet the problem is some custom configuration or plug-in interfering with the normal human activity challenge.

https://www.webpagetest.org/result/220727_AiDcRG_FMH/