
As the article notes, encryption does make a huge difference here. For global web use we've reached something like 70% of sessions

https://letsencrypt.org/stats/#percent-pageloads

However, that doesn't cover 70% of bytes. For example, software updates are often downloaded over HTTP (hopefully with digital signatures!). Debian and Debian-derived distributions distribute most of their package updates over HTTP, authenticated with PGP signatures.

Most of those packages are nonetheless compressed, which increases the variety in bit sequences, but then most of the downloads are of identical compressed files, which decreases the variety.

On the other hand, video streaming is often encrypted now, though still sometimes not. Even when it isn't encrypted for confidentiality or authenticity, it's often encrypted in order to impose DRM restrictions. In any case, it's usually also heavily compressed, which again increases the variety of bit sequences transmitted, even for unencrypted video.

To give a rough guess, I think the combination of encryption and compression means that the author is roughly correct with regard to information being transmitted today. Even when different people watch the same video on YouTube, YouTube is separately encrypting each stream, and nonrandom packet headers outside of the TLS sessions (and other mostly nonrandom associated traffic like DNS queries and replies) represent a very small fraction of this traffic.

It might be interesting to try to repeat the calculation if we made different assumptions so that compression was in wide use (hence the underlying messages are nearly random) but encryption wasn't (hence very large numbers of messages are byte-for-byte identical to each other). Can anybody give a rough approach to estimating the impact of that assumption?
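
To sketch one possible approach (every number here is a made-up assumption, not a measurement): model a catalog of distinct files with Zipf-like popularity, count every download as traffic, but count only the first copy of each file as new information. Something like:

  # Back-of-envelope toy model, all assumptions: equal-sized files,
  # Zipf popularity with exponent 1, compressed (so bytes look random)
  # but unencrypted (so repeat downloads are byte-identical).
  n_files = 10**6          # hypothetical number of distinct files
  total_downloads = 10**9  # hypothetical number of downloads

  weights = [1.0 / r for r in range(1, n_files + 1)]   # Zipf popularity
  total_w = sum(weights)
  downloads = [total_downloads * w / total_w for w in weights]

  served_at_least_once = sum(1 for d in downloads if d >= 1)
  novel_fraction = served_at_least_once / total_downloads
  print("roughly {:.2%} of bytes carry new information".format(novel_fraction))

Under those toy numbers almost everything is a repeat; the interesting question is how the answer moves as you vary the catalog size, the popularity exponent, and the file-size distribution.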




Even if every single bit of traffic was encrypted, a huge portion of that would still be identical - TCP headers, for one thing, would share a lot of common bits for each packet. Bandwidth reporting often ignores those, though, so I'm not sure about the data source for world bandwidth.
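
As a toy illustration (hand-built headers with made-up field values, not a capture): two TCP headers from the same flow differ mainly in the sequence number and checksum, so most header bytes repeat packet to packet.

  import struct

  def tcp_header(seq, ack, window, checksum):
      # src port, dst port, seq, ack, data offset + flags, window, checksum, urgent ptr
      return struct.pack("!HHIIHHHH", 443, 51514, seq, ack,
                         (5 << 12) | 0x018, window, checksum, 0)

  h1 = tcp_header(1000, 555, 64240, 0x1a2b)
  h2 = tcp_header(2460, 555, 64240, 0x3c4d)
  same = sum(a == b for a, b in zip(h1, h2))
  print(same, "of", len(h1), "header bytes identical")  # 16 of 20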


Wouldn’t some combination of unique session keys, PFS algorithms, or block encryption modes make this false? When would the headers encrypt to the same ciphertext unless you were using ECB mode with the same key?
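
For what it's worth, a minimal sketch with the pyca/cryptography package (throwaway keys and data, purely illustrative): an identical plaintext block only repeats in the ciphertext when both the key and the mode state are reused, e.g. ECB with one key; a per-session key or a per-message nonce breaks the repetition.

  import os
  from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

  block = b"0123456789abcdef"  # the same 16-byte "header" sent twice

  def encrypt(key, mode):
      enc = Cipher(algorithms.AES(key), mode).encryptor()
      return enc.update(block) + enc.finalize()

  key = os.urandom(16)
  print(encrypt(key, modes.ECB()) == encrypt(key, modes.ECB()))    # True: repeats show
  print(encrypt(key, modes.CTR(os.urandom(16))) ==
        encrypt(key, modes.CTR(os.urandom(16))))                   # False: fresh nonces
  print(encrypt(os.urandom(16), modes.ECB()) ==
        encrypt(os.urandom(16), modes.ECB()))                      # False: fresh keys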


I think you misinterpreted cortesoft's point, which I could try to make clearer:

"Even if every single bit of payload traffic was encrypted, a huge portion of the traffic actually sent over the wire would still be identical - TCP headers, for one thing, would share a lot of common bits for each packet."

So I think you're in agreement here.


The original question was “sent over the Internet”, not “sent over the wire”.


Anything in the IP header (perhaps excluding the TTL and checksum) and below is sent over the Internet.


If you’re looking at TCP headers, you might as well go all the way to the PHY layer. Your TCP headers will get encoded by an error-correcting code (good, they’ll look random again), but then the data will be split into frames, and each frame begins with an identical preamble sequence of bits.


> good, they’ll look random again

They might look random, but they contain the same amount of entropy as the original data and are longer. Also, the encoding is deterministic.


The code itself is deterministic, but usually the bits are interleaved and scrambled with a long, time-varying pseudorandom sequence. If you pass a short repetitive sequence in, it will look random coming out.
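
A toy additive scrambler (assuming the common PRBS7 polynomial x^7 + x^6 + 1; real PHYs use longer, standardized sequences) shows the effect: a short repetitive input XORed with the LFSR output looks pseudorandom, and the same LFSR state recovers it exactly.

  def prbs7(seed=0x7F):
      # 7-bit LFSR, taps for x^7 + x^6 + 1
      state = seed
      while True:
          bit = ((state >> 6) ^ (state >> 5)) & 1
          state = ((state << 1) | bit) & 0x7F
          yield bit

  repetitive = [0, 1] * 32                      # short repetitive input
  lfsr = prbs7()
  scrambled = [b ^ next(lfsr) for b in repetitive]
  print("".join(map(str, repetitive)))
  print("".join(map(str, scrambled)))           # looks pseudorandom

  lfsr = prbs7()                                # same seed -> same sequence
  recovered = [b ^ next(lfsr) for b in scrambled]
  assert recovered == repetitive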


That's interesting. I know a bit about ECC in general, but not too much about what the current state of implementations is. Do you know what the codes in use are called?


The scrambler's actually a separate block, usually immediately before or shortly after the encoder. It's common in wireless modems, which I'm more familiar with, but it seems like it's rare in wireline PHYs. Ethernet apparently uses an 8b10b encoding to maintain a similar number of 0s and 1s.

I can't speak on entropy, but turbo codes use an internal interleaver and the codewords can be quite long, so a short repetitive input sequence turns into a very random-looking output sequence. As you mentioned, though, two identical input sequences would still map to identical output sequences.


All of this just means that the author's estimate is an upper bound.



