> We're looking for a specific sequence, though, not a specific number of heads in a row. We don't even know what the sequence is since it hasn't been sent yet. Is that a problem? Not at all! We're looking for some sequence of length n, and given that both 0 and 1 are equally likely, the sequence 00110 is just as likely as 11111.
Interestingly enough, this isn't true!
First, let's test on a small example: how likely are the substrings "11" and "10" to appear in binary strings of length 3? Here's a table with the matches marked.
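In place of the table, here's a quick enumeration (a plain Python sketch, not from the original post) that gives the same picture:

    from itertools import product

    # Enumerate all binary strings of length 3 and mark which contain "11" and "10".
    for bits in product("01", repeat=3):
        s = "".join(bits)
        print(s, "11" if "11" in s else "--", "10" if "10" in s else "--")
    # "11" shows up in 011, 110, 111 (three strings);
    # "10" shows up in 010, 100, 101, 110 (four strings).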
"10" can appear in four ways, but "11" can only appear in three. Why is this?
Say you're scanning through a bit string, looking for "11111". You've seen "111" so far -- that's three matches. Now you encounter a "0". Your counter resets to zero matches until you see a "1" again.
Now say you're looking for "00110". You've seen "001" so far. Just like before, you encounter a "0". You still need to reset your counter, but this time the "0" may actually be the start of a new "00110". So your counter resets to one, not zero. This means "00110" matches are easier to find, and happen more frequently!
> This means "00110" matches are easier to find, and happen more frequently!
They do not happen more frequently in a random bit stream. You are absolutely and completely wrong about this.
They only happen more frequently if you switch the problem from counting occurrences to counting bitstreams that contain these occurrences.
The reason is that the sequence 111 contains two occurrences of the substring 11. If this happens in a bitstream and you are counting bitstreams, you only get a count of 1, whereas if you are counting occurrences you would still get two.
This will occur any time a sequence can be overlapped with itself.
You're right, I didn't word that very clearly. I meant "more strings contain this substring", not "this substring occurs more times total" (which, as you say, would be the usual meaning of "matches happen more frequently").
Why care about strings containing a particular substring versus total number of occurrences? The article referenced this PDF: https://www.cs.cornell.edu/~ginsparg/physics/INFO295/mh.pdf, which is about how many coin flips it takes to see a run of N heads. That's really what the argument in the second half of my post is about, but it applies to the "how many strings contain this substring" problem too, and that one seemed simpler to draw a table for.
Seems correct. The probability of at least one 6 is one minus the probability that both aren't a 6, which is 5/6 * 5/6 = 25/36; so the result is (36-25)/36 = 11/36.
Why wouldn't it just be the probability of either die rolling a 6, which would be 1/6 + 1/6 = 1/3? Genuinely curious, I was never super good at probability calculations.
Yeah, doing a brute-force count of the possibilities, the (6,6) outcome is double-counted by the 1/6 + 1/6 method as two distinct possibilities. Or that's what it looks like is happening. Been too long since my college combinatorics class.
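A brute-force enumeration of the 36 ordered outcomes (a quick sketch, not from the thread) shows both the double count and the 11/36 answer:

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))               # 36 ordered rolls
    at_least_one_six = sum(6 in roll for roll in outcomes)        # 11 outcomes
    naive_double_count = sum(roll.count(6) for roll in outcomes)  # 12: (6,6) counted twice
    print(at_least_one_six, "/ 36")     # 11/36
    print(naive_double_count, "/ 36")   # 12/36, i.e. the naive 1/6 + 1/6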
Which suggests the probability of the event not happening would be -1/6, which suggests the quasiprobability of the event not not happening. Obviously no ambiguity there. /s
I wrote a quick Python snippet to compute the probabilities for all sequences of length 3 over 20 flips:
    [(x, sum(1 for i in range(2**20) if x in '{:020b}'.format(i)) / 2.0**20)
     for x in map('{:03b}'.format, range(8))]
The output shows a 79% chance for "000" and a 91% chance for "010". "000" and "111" are actually the least likely substrings, with "010" and "101" in second place.
Furthermore, "001" has a 97% chance of occurring. It's pretty easy to see why "001" is so much more likely than the others. Unless your string perfectly alternates 0s and 1s, you're going to get repeating pairs. What happens after you get a "00"? If there's ever a 1 again, there were at least two 0s before it.
So you're almost guaranteed a "001" after "00", unless it's near the very end of the string. You only have a 50% chance of getting "000" from these though. A 1 completely resets the search for "000", but a 0 keeps the search alive for "001".
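To put numbers on that reset effect, here's a small simulation (my sketch, not from the original post), assuming a fair coin; the average wait for "000" comes out around 14 flips versus about 8 for "001":

    import random

    def flips_until(pattern, rng):
        """Flip a fair coin until `pattern` first appears; return the flip count."""
        window, flips = "", 0
        while window != pattern:
            window = (window + rng.choice("01"))[-len(pattern):]
            flips += 1
        return flips

    rng = random.Random(0)
    trials = 100_000
    for pattern in ("000", "001"):
        avg = sum(flips_until(pattern, rng) for _ in range(trials)) / trials
        print(pattern, round(avg, 2))   # roughly 14 for "000", 8 for "001"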
But the question being tested is whether a given string is in the bits that have been transferred over the internet. The number of times it is seen is not what we're talking about here.
Wow. That's really nonintuitive to me. Like the Monty Hall problem: you have something that's obviously an irrelevant detail that can't affect the probability… except… somehow it does.
There are problems that I find even more counterintuitive than Monty Hall, such as the sister's paradox.
Your neighbour has two children, one of whom is a girl.
What is the probability that the other child is a girl?
A. One may say 1/2, but actually, using basic Bayes' theorem: P(2 girls | at least 1 girl) = P(2 girls AND at least 1 girl) / P(at least 1 girl) = (1/4) / (3/4) = 1/3.
I think this problem is actually equivalent to the Monty Hall problem. But it can get much weirder.
Your neighbour has two children, one of whom is a girl born on a Sunday.
What is the probability that the other child is also a girl?
A. Similarly, counting all possible cases, one does not get 1/2 or 1/3, but rather 13/27.
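Counting the cases explicitly reproduces both answers (a sketch of my own, not from the comment; day 0 stands in for Sunday):

    from itertools import product

    children = list(product("BG", range(7)))       # (sex, day of week) for one child
    families = list(product(children, repeat=2))   # 14 * 14 = 196 ordered pairs

    # P(both girls | at least one girl)
    with_girl = [f for f in families if any(sex == "G" for sex, _ in f)]
    both = [f for f in with_girl if all(sex == "G" for sex, _ in f)]
    print(len(both), "/", len(with_girl))           # 49 / 147 = 1/3

    # P(both girls | at least one girl born on day 0)
    with_sunday_girl = [f for f in families if ("G", 0) in f]
    both2 = [f for f in with_sunday_girl if all(sex == "G" for sex, _ in f)]
    print(len(both2), "/", len(with_sunday_girl))   # 13 / 27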
> Your neighbour has two children, one of whom is a girl.
> What is the probability that the other child is a girl?
This is oft-mentioned but with this wording it's ambiguous. Here are three possible interpretations:
1. Your neighbor has two children. We observed both and found that there was exactly one girl. What is the probability that both are girls? (Answer: 0)
2. Your neighbor has two children. We observed one and found that she was a girl. What is the probability that the other (unobserved) child is a girl? (Answer: 1/2)
3. Your neighbor has two children. We observed both and found that there was at least one girl. What is the probability that both are girls? (Answer: 1/3)
Interpretation #1 is kind of silly but it serves to illustrate the ambiguity of the plain English wording. Interpretation #2 is how most people interpret the problem, and it's bolstered by the use of the word "other" in your prompt. For there to be an "other" child, there must have been a specific girl observed originally, rather than a general observation that there was at least one girl. Interpretation #3 is how you intended the problem.
You are horribly butchering Bayes' theorem. Both your examples are incorrect because your two events ('2 girls' and 'at least 1 girl') are not observed independently of one another, which is a necessary condition for using Bayes' theorem with a valid result. Great tool but wrong job.
What? Bayes' theorem does not require anything about the events; if anything, it is most useful when the events are not independent. If the events were independent, Bayes' theorem would tell you nothing new. https://en.wikipedia.org/wiki/Bayes%27_theorem#Statement_of_...
Edit: however, now that I look into it, I should have said "conditional probability" rather than Bayes'. I always mix the two up, my bad.
I love how Wikipedia helpfully informs us that 'Biologically, according to studies of the human sex ratio, it is more likely for any one person to be born into the male sex than the female sex.' as though it is somehow relevant or has anything to do with the actual paradox being discussed...
NOTE I've since edited the Wiki page, and deleted the sentence - since it isn't relevant and there was no citation for the claim!
Actually, it's ever so slightly higher than 1/2 because there is a non-zero chance of them being identical twins. If we assume an identical twin incidence rate of 0.4%, then the actual probability would be 63/125 that the second child is also a girl.
In the first case, I guess what's happening is that when we pick two people randomly with uniform distribution, we expect, with equal probability, any combination of boys and girls: MM, MF, FM, FF. When we observe a girl, we can exclude the case of two boys, so the remaining cases to consider are MF, FM, FF, where only one of the three is two girls.
So I guess what makes this counterintuitive is that we think of the example as independent variables actually affecting each other, but what happens is really that we're drawing from a collection without replacement, and obviously the mix of the collection is going to change as we do so.
This should apply also on a humankind, world-wide scale: if you have met three women in a row, the fourth person you meet is more likely to be a man. Gambler's fallacy! Or just drawing without replacement. A roulette table is drawing with replacement. :)
Which is probably the key intuition here - the expected number of matches is exactly the same, but because you can have more matches for the purely-repeated sequence in the same string the total number of strings containing those matches is smaller.
Sure, but the question is "how likely" not "how many times". Counting the expected number of times is a much easier problem because you can use the linearity of expectation.
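Concretely, for the length-3 toy example earlier in the thread, the expected counts match even though the containment probabilities don't (a sketch of my own, not from the thread):

    from itertools import product

    strings = ["".join(bits) for bits in product("01", repeat=3)]

    def occurrences(s, pat):
        return sum(s[i:i + len(pat)] == pat for i in range(len(s) - len(pat) + 1))

    for pat in ("11", "10"):
        expected = sum(occurrences(s, pat) for s in strings) / len(strings)
        contains = sum(pat in s for s in strings) / len(strings)
        print(pat, "E[count] =", expected, "P(contains) =", contains)
    # Both patterns have E[count] = 0.5, but P(contains) differs: 3/8 vs 4/8.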
(scroll about 1/3rd way down to the "(U) The OpenSesame Attack" section for the bits relevant to this discussion...)
You don't need to enumerate every n-bit sequence, you just need to enumerate a (shorter - by 62% in Samy's 12 bit case) sequence that contains all the n-bit sequences.
Does this still apply when your search space is on the order of thousands of petabytes, and the smallest unseen sequences are, according to the linked post, around 75 bits?
However, that doesn't cover 70% of bytes. For example, software updates are often downloaded over HTTP (hopefully with digital signatures!). Debian and Debian-derived distributions distribute most of their package updates over HTTP, authenticated with PGP signatures.
Most of those packages are nonetheless compressed, which increases the variety in bit sequences, but then most of the downloads are of identical compressed files, which decreases the variety.
On the other hand, video streaming is often encrypted now, but still sometimes not encrypted. But even when it's not encrypted for confidentiality or authenticity, it's often encrypted in order to impose DRM restrictions. In any case, it's usually also heavily compressed, which again increases the variety of bit sequences transmitted even for unencrypted video.
To give a rough guess, I think the combination of encryption and compression means that the author is roughly correct with regard to information being transmitted today. Even when different people watch the same video on YouTube, YouTube is separately encrypting each stream, and nonrandom packet headers outside of the TLS sessions (and other mostly nonrandom associated traffic like DNS queries and replies) represent a very small fraction of this traffic.
It might be interesting to try to repeat the calculation if we made different assumptions so that compression was in wide use (hence the underlying messages are nearly random) but encryption wasn't (hence very large numbers of messages are byte-for-byte identical to each other). Can anybody give a rough approach to estimating the impact of that assumption?
Even if every single bit of traffic was encrypted, a huge portion of that would still be identical - TCP headers, for one thing, would share a lot of common bits for each packet. Although bandwidth reporting often ignores those, so I am not sure about the data source for world bandwidth.
Wouldn’t some combination of unique session keys, PFS algorithms, or block encryption modes make this false? When would the headers encrypt to the same ciphertext unless you were using ECB mode with the same key?
I think you misinterpreted cortesoft's point, which I could try to make clearer:
"Even if every single bit of payload traffic was encrypted, a huge portion of the traffic actually sent over the wire would still be identical - TCP headers, for one thing, would share a lot of common bits for each packet."
If you’re looking at TCP headers, you might as well go all the way to the PHY layer. Your TCP headers will get encoded by an error correcting code (good, they’ll look random again) but then the data will be split into frames, and each frame begins with an identical preamble sequence of bits.
The code itself is deterministic, but usually the bits are interleaved and scrambled with a long, time-varying pseudorandom sequence. If you pass a short repetitive sequence in, it will look random coming out.
That's interesting. I know a bit about ECC in general, but not too much about what the current state of implementations is. Do you know what the codes in use are called?
The scrambler's actually a separate block, usually immediately before or shortly after the encoder. It's common in wireless modems, which I'm more familiar with, but it seems like it's rare in wireline PHYs. Ethernet apparently uses an 8b10b encoding to maintain a similar number of 0s and 1s.
I can't speak on entropy, but turbo codes use an internal interleaver and the codewords can be quite long, so a short repetitive input sequence turns into a very random looking output sequence. As you mentioned though, two identical input sequences would still map to identical output sequences.
I like this general approach, but presumably you don't want the point where 50% of the sequences have never been transmitted (on average, assuming randomly distributed data); you want the first point where at least one hasn't been transmitted. So you should be solving for the point where the probability of a given sequence of that length not appearing, multiplied by the number of sequences of that length, equals 1.
That said, it's a logarithmic scale, so the length is probably not that much shorter than the one he came up with.
> the length is probably not that much shorter than the one he came up with
What's your intuition for what "not that much shorter" means here? I'm not seeing the math clearly, but it seems like solving for 1/2 vs solving for 1/(2^many) would make a big difference when many is large double digits. I'm not sure how big "big" is, but I'd guess it would be large enough to dwarf the chosen two decimal places and 2-bit difference over a 5 year projection.
Another factor that would move things in the same direction (toward a shorter string being the correct answer) is that, if we are presuming random bits, I think we can multiply the total number of bits transmitted by the length of the bit string. This accounts for the fact that the match doesn't have to start on any particular boundary; rather, the target can be "swept through" the entire corpus bit by bit.
Note: I'm going to use bit strings rather than bit sequences.
Given a total of B = 3.4067 x 10^22 bits sent over the internet, I don't think there is any reasonable way to say for sure what the length of the smallest string that has never been sent is, but we can say that there is definitely a 75 bit string that has never been sent.
Take all of the internet transmissions, and concatenate them, giving a string S of B bits.
Every string that has been transmitted is a substring of S.
There are at most B substrings of length n in S, and hence at most B distinct strings of length n that have been transmitted.
If we pick n such that 2^n > B, there must be at least one string of length n that has not been transmitted. 2^n > B whenever n >= 75.
Hence, if the value of B is correct, then there must be a 75 bit string that has never been sent.
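The same pigeonhole bound in a couple of lines of Python (using the article's figure for B):

    B = 3.4067e22                   # total bits assumed transmitted (from the article)
    # A B-bit string has at most B distinct substrings of length n, so as soon as
    # 2**n > B, some n-bit string has never been transmitted.
    n = 75
    print(2**(n - 1) <= B < 2**n)   # True, so 75 is the smallest length this argument guarantees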
If you concatenate to get S you may introduce substrings that have never been themselves transmitted, though of course I agree that every sequence that has been transmitted is a substring of S.
Neat!
This analysis is consistent with (and more accurate than) the OP's; he assumed that the bitstring is random, and therefore arrives at the lower bound. If we somehow account for bias in the sequence, we may find a larger n.
Relatedly, there must be a (much longer) shortest sequence of bits that it is impossible to send over the internet, because beyond a certain length the payload must be split over multiple packets, requiring additional header bits in between.
In theory, the largest IPv4 packet that can be sent is 65,535 bytes. IPv6 is the same, but also allows for "jumbograms" that can be up to 4GB long, yeesh. However, the practical limit is the maximum frame size of the underlying link. (Standard) Ethernet and PPP both top out at 1500, ATM can handle 4456. Non-standard Ethernet jumbo frames can go up as high as 9216 bytes.
But assuming you consider "send over the internet" to imply that most destinations on the open internet would be able to receive it, that means a frame size of 1500 bytes, giving you 12000 bits to play with. This leads to the curious result that it's impossible to send a sequence of more than 11870 "1"s.
The shortest header you could have is a 160-bit IPv4 header, of which the last 32 bits (the destination address) can't all be 1 because 255.255.255.255 isn't routable, and actually 127.255.255.255 isn't either, so the earliest place you can start is bit 130. After 11870 "1"s, you reach the 4-bit version field of the next packet, which can't start with a 1 because it would indicate IPv8 or higher, and I don't think that exists... or at least it hasn't seen widespread adoption.
> At what value for X is there a 50% chance there [exists] a sequence of X-bits in length that hasn't been transmitted yet?
But if I understand correctly, the question that they answered is "At what value for X is there a 50% chance that, given some specific sequence of X bits, that sequence hasn't been transmitted yet?" Wouldn't these two questions have different answers?
Edit (now that I've thought about it more): On top of that, the expected value of a random variable isn't necessarily a value that it has a 50% chance to be. So even my interpretation of this result is wrong.
2^n is half of 2^(n+1), so you'd expect that getting 50% coverage of (n+1)-bit sequences would mean that you have 100% coverage of n-bit sequences. In other words, the answer to your question is at most 1 bit lower than the answer given.
But it is slightly more complicated, because this is an example of the Coupon Collector's Problem [1]. It takes about N ln N random samples to cover a set of size N. With roughly 2^74 samples, that is only just past N ln N for N = 2^68, so coverage gives out somewhere around 68 or 69 bits.
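A rough way to locate that cutoff, under the same uniform-sampling assumption (a sketch, not from the comment):

    import math

    samples = 2.0**74                        # rough number of windows drawn
    for n in range(60, 75):
        needed = 2.0**n * math.log(2.0**n)   # coupon collector: ~N ln N draws to cover N = 2**n
        if needed > samples:
            print("coverage runs out around n =", n)   # n = 69 with these numbers
            break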
> Wouldn't these two questions have different answers?
Yes, though in practice they're close.
To answer the actual question asked: we can say that there is a length n such that not all sequences of length n can be contained in our total of B transferred bits. This is a consequence of the pigeonhole principle.
Naively, this would be B = n * S, and S = 2^n. So B = n * 2^n. Note that we can actually compress things more though, if we assume that in the worst case every window in the internet bit-corpus is unique.
That this is possible isn't immediately obvious, but consider
01, 00110, 0001011100, 0000100110101111000, 000001000110010100111010110111110000 (which, interestingly, isn't in OEIS, but related sequences: https://oeis.org/A052944 and https://oeis.org/A242323 are).
These bitstrings contain, perfectly overlapping, every bitstring of length n, for n from 1 to 5. In general, it takes 2^n + n - 1 bits to convey every possible bitstring of length n, if you manage to overlap them perfectly (if someone can prove this, please do. I thought it was Gray code/Hamiltonian path over hypercube related, but I talked myself out of that).
EDIT: Someone else mentioned De Bruijn sequences, which these are. And they are based on Hamiltonian paths, although not over hypercubes :(. And my sequence is on OEIS, as https://oeis.org/A166315, just converted to decimal. /EDIT
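A quick check of the quoted strings (my sketch, not part of the comment):

    from itertools import product

    def covers_all(s, n):
        """True if every n-bit string appears as a substring of s."""
        return all("".join(bits) in s for bits in product("01", repeat=n))

    seqs = ["01", "00110", "0001011100", "0000100110101111000",
            "000001000110010100111010110111110000"]
    for n, s in enumerate(seqs, start=1):
        print(n, len(s) == 2**n + n - 1, covers_all(s, n))   # all True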
So the real answer is just B = 2^n + n - 1, but for n > 15 or so, n - 1 is so small we can ignore it. In other words, our length n is just log2(B). Given the assumption in the article of 3.4067e22 bits, the base-2 log of that is... 74.85. This is exactly 1 more than what the article says is the point where you'll have seen half the messages. This isn't a coincidence.
Which raises an interesting question that I leave as an exercise to the reader:
Why is the length for which you cannot have conveyed all bit strings exactly 1 bit more than the point where you've conveyed about half of them?
You’re giving the range for a 50% to 100% chance of non-transmission.
But in the spirit of the original question, I'd suppose we want to find n such that the probability of non-transmission is 2^-n, so that the expected number of never-transmitted n-bit sequences is 1? Is that number similarly close?
Some estimates put internet traffic at 2^70 bits per year and growing. For any short sequence of N bits, there are approximately 2^70 such sequences transmitted per year. Let's assume that, since they are encrypted, they are uniformly distributed. So what is the sequence length for which our p(seen all) = 50%? Seems like 71 bits is about right.
The traffic growth curve is probably ~2x per year, so we add about 1 more bit each year. The sum of all internet traffic for all time is therefore worth only about 1 more bit than the current year's traffic.
For any given "random" ( "high entropy" ) string of length X, there's some non-negligible chance it's already been sent.
But it's far less likely that a ( partially degraded ) non random string is sent. Why ?
Consider this:
"the cat sat on the hat" ( probably sent )
"the cut sat on the hat" ( still probably sent )
"thx cut set19n the mkt" ( waaaay less likely to be sent )
"thKxc8t suts n x4e m-t" ( probably never sent ... until now :) )
My reasoning is like, all random strings are ( happy / random ) in the same way. They all look alike. High entropy, but low organization / structure. Because of compression and encryption, any random string probably has as good a chance to be sent as any other, so looking at random strings doesn't really get us anywhere ( but I do make a very very rough calculation at the end that says probably all 7 byte strings have been sent ).
It's going to be far easier in my opinion to find an "almost-language" string ( partially degraded, like the above examples ) that's never been sent.
Remember Googlewhacks? One search result. One tactic was putting together uncommon words. Another was misspellings.
Basically the intuition / intuitive idea I'm trying to convey is : pick any random high entropy string of given short length, and pick any language string of given short length, and they are both, in my opinion, more likely to have been sent than an "almost language" string of same length. The more degraded you make it ( up to a point, heh ) the less likely it was ever sent.
Very rough calculation about random strings
So, assuming the question is for what X is p > 0.5, and assuming that about 1 zettabyte has been sent through the net over its entire history, that's 8 x 10^21 bits, or roughly 2^72.8, so roughly every 66-bit string has probably been sent.
So roughly every 7 byte string ever possible has been sent on the internet, with room to spare. Probably.
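The same rough model in code (my sketch; it assumes about 8 x 10^21 independent, uniform n-bit windows, which is crude but fine for an order-of-magnitude answer):

    import math

    B = 8e21                        # ~1 zettabyte of traffic, in bits (the assumption above)
    for n in (56, 64, 66, 68, 72):
        p_miss = math.exp(-B / 2**n)        # chance a specific n-bit string was never sent
        expected_unseen = 2**n * p_miss     # expected number of never-sent n-bit strings
        print(n, p_miss, expected_unseen)
    # Under this model nothing of length 66 or less is expected to be missing;
    # misses start to appear from about 68 bits up.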
It'd be interesting to know the length of a sequence that will never be sent over the Internet. In B. Schneier's "Applied Cryptography" there is a discussion of 256-bit numbers, the minimal amount of energy and mass required to guess them and conclusion that "brute-force attacks against 256-bit keys will be infeasible until computers are built from something other than matter and occupy something other than space."
The length of a never-transmitted-on-the-internet sequence is probably much shorter than 256 bits, even 128 bits (as the article mentions).
It's a bit ironic, in a special paradoxical-math-like way (like [0]), that if someone discovers it and posts it online it'll instantly (depending on HTTPS vs. HTTP, other encryption, whether or not someone sees the page, etc.) be wrong by definition. It's like figuring out what to call a UFO that has been identified.
"A Googlewhack is a contest for finding a Google search query consisting of exactly two words without quotation marks that returns exactly one hit. A Googlewhack must consist of two actual words found in a dictionary.
Published googlewhacks are short-lived, since when published to a web site, the new number of hits will become at least two, one to the original hit found, and one to the publishing site."
Does anyone remember a comedian that did a comedy special many, many years ago centered around Googlewhacks? I remember he got a tattoo of a Texas ID on his arm.
If the method stands up and is considered sound, that means the shortest string is increasing at a rate of ~1 bit every 3 years.
That implies that if the Internet does not grow any more, we should be confident of a > 50% chance of a UUID not being universally unique in about 150 years.
If the traffic doesn't go up significantly, the rate at which the shortest string grows will keep slowing down. Even with only the last 5 years, you can observe it's slowing down, and that's with a hefty amount of traffic increase to back it up. 128 bits is super safe for a long, long time!
This is how my intuition went: it's probably less than 128 bits because UUIDs are 128 bits, and they're universally unique.
But what's in a name? There's no natural law constraining the UUID standard, such that they must be actually universally unique. And 128 bits isn't such an incredible bit space.
MD5 hashes are 128 bits, and prone to manipulation in favor of collisions.
Don't get me wrong, 3.402823669209385*10^38 is a huge number, and we haven't used enough passwords to occupy every value in that key space, but I still don't imagine 128 bits provides truly universally unique coverage, just pretty okay uniqueness coverage.
MD5 is prone to manipulation due to its logic, not the size of its output. It also has many very efficient collision attacks but no feasible preimage attack (AFAIK), and they shave just a few bits off 2^128 (which is still a lot): https://crypto.stackexchange.com/a/41865
At one time, UUIDs were made by combining a fine timestamp with MAC-style hardware ids, giving a value that was conceptually unique in time and space. IIRC, it was eventually realized that choosing a value at random (e.g. using true RNG) was just as good.
2^128 is about 300 undecillion. Roughly 800 billion values per nanosecond for the age of the universe.
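The arithmetic checks out (a quick sketch):

    # 2**128 values spread over the age of the universe (~13.8 billion years), per nanosecond.
    age_ns = 13.8e9 * 365.25 * 24 * 3600 * 1e9
    print(2**128 / age_ns)    # ~7.8e11, i.e. roughly 800 billion values per nanosecond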
Version 4 UUIDs are preferred not because random is as good (indeed, for some use cases it is worse), but typically because leaking information about the time of creation and the MAC address of the creator is undesirable.
Because the probability of a random collision for a hash function is the same as the probability of generating two colliding random bit strings of the same length as the hash.
One thing that this shows is that many accepted cryptographically secure RNGs are in fact insufficient for shuffling decks of cards. I'm not sure how one would go about exploiting the resulting bias, but it is probably somehow exploitable.
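Presumably this is the seed-size argument: a generator seeded with 128 bits can reach at most 2^128 of the 52! possible orderings of a deck. A quick size check (my sketch, not from the comment):

    import math

    orderings = math.factorial(52)
    print(math.log2(orderings))   # ~225.6 bits needed to index every possible shuffle
    print(orderings > 2**128)     # True: a 128-bit seed cannot reach most orderings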
His intuition was that it would be between 48 and 128, and he's patting himself on the back that his calculation resulted in a number between those? Those goalposts are super far apart!
That's pretty close for what was basically an off-the-cuff estimation. He chose those two numbers because their use in the real world gives him a reasonable point of reference.
The calculated value of 74 bits is pretty close to the mid-point of his estimation.
How about this question: If all matter in the universe was dedicated to producing new bit sequences until the heat death of the universe, what is the length of the shortest sequence that would never be generated?
Yeah, I sort of assumed that the Berry paradox was going to be the point too - Anyone who found the shortest bit sequence could never display it on a web page since that would cause its transmission over the Internet, making it ineligible...