> We're looking for a specific sequence, though, not a specific number of heads in a row. We don't even know what the sequence is since it hasn't been sent yet. Is that a problem? Not at all! We're looking for some sequence of length n, and given that both 0 and 1 are equally likely, the sequence 00110 is just as likely as 11111.
Interestingly enough, this isn't true!
First, let's test on a small example: how likely are the substrings "11" and "10" to appear in binary strings of length 3? Here's a table with the matches marked.
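In place of the table, here's a quick enumeration (a plain Python sketch, not from the original post) that gives the same picture:

    from itertools import product

    # Enumerate all binary strings of length 3 and mark which contain "11" and "10".
    for bits in product("01", repeat=3):
        s = "".join(bits)
        print(s, "11" if "11" in s else "--", "10" if "10" in s else "--")
    # "11" shows up in 011, 110, 111 (three strings);
    # "10" shows up in 010, 100, 101, 110 (four strings).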
"10" can appear in four ways, but "11" can only appear in three. Why is this?
Say you're scanning through a bit string, looking for "11111". You've seen "111" so far -- that's three matches. Now you encounter a "0". Your counter resets to zero matches until you see a "1" again.
Now say you're looking for "00110". You've seen "001" so far. Just like before, you encounter a "0". You still need to reset your counter, but this time the "0" may actually be the start of a new "00110". So your counter resets to one, not zero. This means "00110" matches are easier to find, and happen more frequently!
> This means "00110" matches are easier to find, and happen more frequently!
They do not happen more frequently in a random bit stream. You are absolutely and completely wrong about this.
They only happen more frequently if you switch the problem from counting occurrences to counting bitstreams that contain these occurrences.
The reason is that the sequence 111 contains two occurrences of the substring 11. If this happens in a bitstream and you are counting bitstreams, you only get a count of 1, whereas if you are counting occurrences you would still get two.
This will occur any time a sequence can be overlapped with itself.
You're right, I didn't word that very clearly. I meant "more strings contain this substring", not "this substring occurs more times total" (which, as you say, would be the usual meaning of "matches happen more frequently").
Why care about strings containing a particular substring versus total number of occurrences? The article referenced this PDF: https://www.cs.cornell.edu/~ginsparg/physics/INFO295/mh.pdf, which is about how many coin flips it takes to see a run of N heads. That's really what the argument in the second half of my post is about, but it applies to the "how many strings contain this substring" problem too, and that one seemed simpler to draw a table for.
Seems correct. The probability of at least one 6 is one minus the probability that both aren't a 6, which is 5/6 * 5/6 = 25/36; so the result is (36-25)/36 = 11/36.
Why wouldn't it just be the probability of either die rolling a 6, which would be 1/6 + 1/6 = 1/3? Genuinely curious, I was never super good at probability calculations.
Yeah, doing a brute-force count of the possibilities, the (6,6) outcome is double-counted by the 1/6 + 1/6 method as two distinct possibilities. Or that's what it looks like is happening. Been too long since my college combinatorics class.
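A brute-force enumeration of the 36 ordered outcomes (a quick sketch, not from the thread) shows both the double count and the 11/36 answer:

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))               # 36 ordered rolls
    at_least_one_six = sum(6 in roll for roll in outcomes)        # 11 outcomes
    naive_double_count = sum(roll.count(6) for roll in outcomes)  # 12: (6,6) counted twice
    print(at_least_one_six, "/ 36")     # 11/36
    print(naive_double_count, "/ 36")   # 12/36, i.e. the naive 1/6 + 1/6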
Which suggests the probability of the event not happening would be -1/6, which suggests the quasiprobability of the event not not happening. Obviously no ambiguity there. /s
I wrote a quick Python snippet to compute the probabilities for all sequences of length 3 over 20 flips:
    [(x, sum(1 for i in range(2**20) if x in '{:020b}'.format(i)) / 2.0**20)
     for x in map('{:03b}'.format, range(8))]
The output shows a 79% chance for "000" and a 91% chance for "010". "000" and "111" are actually the least likely substrings, with "010" and "101" in second place.
Furthermore, "001" has a 97% chance of occurring. It's pretty easy to see why "001" is so much more likely than the others. Unless your string perfectly alternates 0s and 1s, you're going to get repeating pairs. What happens after you get a "00"? If there's ever a 1 again, there were at least two 0s before it.
So you're almost guaranteed a "001" after "00", unless it's near the very end of the string. You only have a 50% chance of getting "000" from these though. A 1 completely resets the search for "000", but a 0 keeps the search alive for "001".
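To put numbers on that reset effect, here's a small simulation (my sketch, not from the original post), assuming a fair coin; the average wait for "000" comes out around 14 flips versus about 8 for "001":

    import random

    def flips_until(pattern, rng):
        """Flip a fair coin until `pattern` first appears; return the flip count."""
        window, flips = "", 0
        while window != pattern:
            window = (window + rng.choice("01"))[-len(pattern):]
            flips += 1
        return flips

    rng = random.Random(0)
    trials = 100_000
    for pattern in ("000", "001"):
        avg = sum(flips_until(pattern, rng) for _ in range(trials)) / trials
        print(pattern, round(avg, 2))   # roughly 14 for "000", 8 for "001"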
But the question being tested is whether a given string is in the bits that have been transferred over the internet. The number of times it is seen is not what we're talking about here.
Wow. That's really nonintuitive to me. Like the Monty Hall problem: you have something that's obviously an irrelevant detail that can't affect the probability… except… somehow it does.
There are problems that I find even more counterintuitive than Monty Hall, such as the sister's paradox.
Your neighbour has two children, one of whom is a girl.
What is the probability that the other child is a girl?
A. One may say 1/2, but actually, using basic Bayes' theorem: P(2 girls | at least 1 girl) = P(2 girls AND at least 1 girl) / P(at least 1 girl) = (1/4) / (3/4) = 1/3.
I think this problem is actually equivalent to the Monty Hall problem. But it can get much weirder.
Your neighbour has two children, one of whom is a girl born on a Sunday.
What is the probability that the other child is also a girl?
A. Similarly, counting all possible cases, one does not get 1/2 or 1/3, but rather 13/27.
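Counting the cases explicitly reproduces both answers (a sketch of my own, not from the comment; day 0 stands in for Sunday):

    from itertools import product

    children = list(product("BG", range(7)))       # (sex, day of week) for one child
    families = list(product(children, repeat=2))   # 14 * 14 = 196 ordered pairs

    # P(both girls | at least one girl)
    with_girl = [f for f in families if any(sex == "G" for sex, _ in f)]
    both = [f for f in with_girl if all(sex == "G" for sex, _ in f)]
    print(len(both), "/", len(with_girl))           # 49 / 147 = 1/3

    # P(both girls | at least one girl born on day 0)
    with_sunday_girl = [f for f in families if ("G", 0) in f]
    both2 = [f for f in with_sunday_girl if all(sex == "G" for sex, _ in f)]
    print(len(both2), "/", len(with_sunday_girl))   # 13 / 27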
> Your neighbour has two children, one of whom is a girl.
> What is the probability that the other child is a girl?
This is oft-mentioned but with this wording it's ambiguous. Here are three possible interpretations:
1. Your neighbor has two children. We observed both and found that there was exactly one girl. What is the probability that both are girls? (Answer: 0)
2. Your neighbor has two children. We observed one and found that she was a girl. What is the probability that the other (unobserved) child is a girl? (Answer: 1/2)
3. Your neighbor has two children. We observed both and found that there was at least one girl. What is the probability that both are girls? (Answer: 1/3)
Interpretation #1 is kind of silly but it serves to illustrate the ambiguity of the plain English wording. Interpretation #2 is how most people interpret the problem, and it's bolstered by the use of the word "other" in your prompt. For there to be an "other" child, there must have been a specific girl observed originally, rather than a general observation that there was at least one girl. Interpretation #3 is how you intended the problem.
You are horribly butchering Bayes' theorem. Both your examples are incorrect because your two events ('2 girls' and 'at least 1 girl') are not observed independently of one another, which is a necessary condition for using Bayes' theorem with a valid result. Great tool but wrong job.
What? Bayes' theorem does not require anything about the events; if anything, it is most useful when the events are not independent. If the events were independent, Bayes' theorem would tell you nothing new. https://en.wikipedia.org/wiki/Bayes%27_theorem#Statement_of_...
Edit: however, now that I look into it, I should have said "conditional probability" rather than Bayes'. I always mix the two up, my bad.
I love how Wikipedia helpfully informs us that 'Biologically, according to studies of the human sex ratio, it is more likely for any one person to be born into the male sex than the female sex.' as though it is somehow relevant or has anything to do with the actual paradox being discussed...
NOTE I've since edited the Wiki page, and deleted the sentence - since it isn't relevant and there was no citation for the claim!
Actually, it's ever so slightly higher than 1/2 because there is a non-zero chance of them being identical twins. If we assume an identical twin incidence rate of 0.4%, then the actual probability would be 63/125 that the second child is also a girl.
In the first case, I guess what's happening is that when we pick two people randomly with uniform distribution, we expect, with equal probability, any combination of boys and girls: MM, MF, FM, FF. When we observe a girl, we can exclude the case of two boys, so the remaining cases to consider are MF, FM, FF, where only one of the three is two girls.
So I guess what makes this counterintuitive is that we think of the example as independent variables actually affecting each other, but what happens is really that we're drawing from a collection without replacement, and obviously the mix of the collection is going to change as we do so.
This should apply also on a humankind, world-wide scale: if you have met three women in a row, the fourth person you meet is more likely to be a man. Gambler's fallacy! Or just drawing without replacement. A roulette table is drawing with replacement. :)
Which is probably the key intuition here - the expected number of matches is exactly the same, but because you can have more matches for the purely-repeated sequence in the same string the total number of strings containing those matches is smaller.
Sure, but the question is "how likely" not "how many times". Counting the expected number of times is a much easier problem because you can use the linearity of expectation.
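Concretely, for the length-3 toy example earlier in the thread, the expected counts match even though the containment probabilities don't (a sketch of my own, not from the thread):

    from itertools import product

    strings = ["".join(bits) for bits in product("01", repeat=3)]

    def occurrences(s, pat):
        return sum(s[i:i + len(pat)] == pat for i in range(len(s) - len(pat) + 1))

    for pat in ("11", "10"):
        expected = sum(occurrences(s, pat) for s in strings) / len(strings)
        contains = sum(pat in s for s in strings) / len(strings)
        print(pat, "E[count] =", expected, "P(contains) =", contains)
    # Both patterns have E[count] = 0.5, but P(contains) differs: 3/8 vs 4/8.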
(scroll about 1/3rd way down to the "(U) The OpenSesame Attack" section for the bits relevant to this discussion...)
You don't need to enumerate every n-bit sequence, you just need to enumerate a (shorter - by 62% in Samy's 12 bit case) sequence that contains all the n-bit sequences.
Does this still apply when your search space is on the order of thousands of petabytes, and the smallest unseen sequences are, according to the linked post, around 75 bits?
However, that doesn't cover 70% of bytes. For example, software updates are often downloaded over HTTP (hopefully with digital signatures!). Debian and Debian-derived distributions distribute most of their package updates over HTTP, authenticated with PGP signatures.
Most of those packages are nonetheless compressed, which increases the variety in bit sequences, but then most of the downloads are of identical compressed files, which decreases the variety.
On the other hand, video streaming is often encrypted now, but still sometimes not encrypted. But even when it's not encrypted for confidentiality or authenticity, it's often encrypted in order to impose DRM restrictions. In any case, it's usually also heavily compressed, which again increases the variety of bit sequences transmitted even for unencrypted video.
To give a rough guess, I think the combination of encryption and compression means that the author is roughly correct with regard to information being transmitted today. Even when different people watch the same video on YouTube, YouTube is separately encrypting each stream, and nonrandom packet headers outside of the TLS sessions (and other mostly nonrandom associated traffic like DNS queries and replies) represent a very small fraction of this traffic.
It might be interesting to try to repeat the calculation if we made different assumptions so that compression was in wide use (hence the underlying messages are nearly random) but encryption wasn't (hence very large numbers of messages are byte-for-byte identical to each other). Can anybody give a rough approach to estimating the impact of that assumption?
Even if every single bit of traffic was encrypted, a huge portion of that would still be identical - TCP headers, for one thing, would share a lot of common bits for each packet. Although bandwidth reporting often ignores those, so I am not sure about the data source for world bandwidth.
Wouldn’t some combination of unique session keys, PFS algorithms, or block encryption modes make this false? When would the headers encrypt to the same ciphertext unless you were using ECB mode with the same key?
I think you misinterpreted cortesoft's point, which I could try to make clearer:
"Even if every single bit of payload traffic was encrypted, a huge portion of the traffic actually sent over the wire would still be identical - TCP headers, for one thing, would share a lot of common bits for each packet."
If you’re looking at TCP headers, you might as well go all the way to the PHY layer. Your TCP headers will get encoded by an error correcting code (good, they’ll look random again) but then the data will be split into frames, and each frame begins with an identical preamble sequence of bits.
The code itself is deterministic, but usually the bits are interleaved and scrambled with a long, time-varying pseudorandom sequence. If you pass a short repetitive sequence in, it will look random coming out.
That's interesting. I know a bit about ECC in general, but not too much about what the current state of implementations is. Do you know what the codes in use are called?
The scrambler's actually a separate block, usually immediately before or shortly after the encoder. It's common in wireless modems, which I'm more familiar with, but it seems like it's rare in wireline PHYs. Ethernet apparently uses an 8b10b encoding to maintain a similar number of 0s and 1s.
I can't speak on entropy, but turbo codes use an internal interleaver and the codewords can be quite long, so a short repetitive input sequence turns into a very random looking output sequence. As you mentioned though, two identical input sequences would still map to identical output sequences.
I like this general approach, but presumably you don't want the point where 50% of the sequences have never been transmitted (on average, assuming randomly distributed data); you want the first point where at least one hasn't been transmitted. So you should be solving for the point where the probability of a given sequence of that length not appearing, multiplied by the number of sequences of that length, equals 1.
That said, it's a logarithmic scale, so the length is probably not that much shorter than the one he came up with.
> the length is probably not that much shorter than the one he came up with
What's your intuition for what "not that much shorter" means here? I'm not seeing the math clearly, but it seems like solving for 1/2 vs solving for 1/(2^many) would make a big difference when many is large double digits. I'm not sure how big "big" is, but I'd guess it would be large enough to dwarf the chosen two decimal places and 2-bit difference over a 5 year projection.
Another factor that would move things in the same direction (toward a shorter string being the correct answer) is that, if we are presuming random bits, I think we can multiply the total number of bits transmitted by the length of the bit string. This accounts for the fact that the match doesn't have to start on any particular boundary; rather, the target can be "swept through" the entire corpus bit by bit.
Note: I'm going to use bit strings rather than bit sequences.
Given a total of B = 3.4067 x 10^22 bits sent over the internet, I don't think there is any reasonable way to say for sure what the length of the smallest string that has never been sent is, but we can say that there is definitely a 75 bit string that has never been sent.
Take all of the internet transmissions, and concatenate them, giving a string S of B bits.
Every string that has been transmitted is a substring of S.
There are at most B substrings of length n in S, and hence at most B distinct strings of length n that have been transmitted.
If we pick n such that 2^n > B, there must be at least one string of length n that has not been transmitted. 2^n > B whenever n >= 75.
Hence, if the value of B is correct, then there must be a 75 bit string that has never been sent.
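The same pigeonhole bound in a couple of lines of Python (using the article's figure for B):

    B = 3.4067e22                   # total bits assumed transmitted (from the article)
    # A B-bit string has at most B distinct substrings of length n, so as soon as
    # 2**n > B, some n-bit string has never been transmitted.
    n = 75
    print(2**(n - 1) <= B < 2**n)   # True, so 75 is the smallest length this argument guarantees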
If you concatenate to get S you may introduce substrings that have never been themselves transmitted, though of course I agree that every sequence that has been transmitted is a substring of S.
Neat!
This analysis is consistent with (and more accurate than) the OP's; he assumed that the bitstring is random, and therefore arrives at the lower bound. If we somehow account for bias in the sequence, we may find a larger n.
Relatedly, there must be a (much longer) shortest sequence of bits that it is impossible to send over the internet, because beyond a certain length the payload must be split over multiple packets, requiring additional header bits in between.
In theory, the largest IPv4 packet that can be sent is 65,535 bytes. IPv6 is the same, but also allows for "jumbograms" that can be up to 4GB long, yeesh. However, the practical limit is the maximum frame size of the underlying link. (Standard) Ethernet and PPP both top out at 1500, ATM can handle 4456. Non-standard Ethernet jumbo frames can go up as high as 9216 bytes.
But assuming you consider "send over the internet" to imply that most destinations on the open internet would be able to receive it, that means a frame size of 1500 bytes, giving you 12000 bits to play with. This leads to the curious result that it's impossible to send a sequence of more than 11870 "1"s.
The shortest header you could have is a 160-bit IPv4 header, of which the last 32 bits (the destination address) can't all be 1 because 255.255.255.255 isn't routable, and actually 127.255.255.255 isn't either, so the earliest place you can start is bit 130. After 11870 "1"s, you reach the 4-bit version field of the next packet, which can't start with a 1 because it would indicate IPv8 or higher, and I don't think that exists... or at least it hasn't seen widespread adoption.
> At what value for X is there a 50% chance there [exists] a sequence of X-bits in length that hasn't been transmitted yet?
But if I understand correctly, the question that they answered is "At what value for X is there a 50% chance that, given some specific sequence of X bits, that sequence hasn't been transmitted yet?" Wouldn't these two questions have different answers?
Edit (now that I've thought about it more): On top of that, the expected value of a random variable isn't necessarily a value that it has a 50% chance to be. So even my interpretation of this result is wrong.
2^n is half of 2^(n+1), so you'd expect that getting 50% coverage of (n+1)-bit sequences would mean that you have 100% coverage of n-bit sequences. In other words, the answer to your question is at most 1 bit lower than the answer given.
But it is slightly more complicated, because this is an example of the Coupon Collector's Problem [1]. It takes about N ln N random samples to cover a set of size N. With roughly 2^74 samples, that is only just past N ln N for N = 2^68, so coverage gives out somewhere around 68 or 69 bits.
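A rough way to locate that cutoff, under the same uniform-sampling assumption (a sketch, not from the comment):

    import math

    samples = 2.0**74                        # rough number of windows drawn
    for n in range(60, 75):
        needed = 2.0**n * math.log(2.0**n)   # coupon collector: ~N ln N draws to cover N = 2**n
        if needed > samples:
            print("coverage runs out around n =", n)   # n = 69 with these numbers
            break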
> Wouldn't these two questions have different answers?
Yes, though in practice they're close.
To answer the actual question asked: we can say that there is a length n such that not all sequences of length n can be contained in our total of B transferred bits. This is a consequence of the pigeonhole principle.
Naively, this would be B = n * S, and S = 2^n. So B = n * 2^n. Note that we can actually compress things more though, if we assume that in the worst case every window in the internet bit-corpus is unique.
That this is possible isn't immediately obvious, but consider
01, 00110, 0001011100, 0000100110101111000, 000001000110010100111010110111110000 (which, interestingly, isn't in OEIS, but related sequences: https://oeis.org/A052944 and https://oeis.org/A242323 are).
These bitstrings contain, perfectly overlapping, every bitstring of length n, for n from 1 to 5. In general, it takes 2^n + n - 1 bits to convey every possible bitstring of length n, if you manage to overlap them perfectly (if someone can prove this, please do. I thought it was Gray code/Hamiltonian path over hypercube related, but I talked myself out of that).
EDIT: Someone else mentioned De Bruijn sequences, which these are. And they are based on Hamiltonian paths, although not over hypercubes :(. And my sequence is on OEIS, as https://oeis.org/A166315, just converted to decimal. /EDIT
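A quick check of the quoted strings (my sketch, not part of the comment):

    from itertools import product

    def covers_all(s, n):
        """True if every n-bit string appears as a substring of s."""
        return all("".join(bits) in s for bits in product("01", repeat=n))

    seqs = ["01", "00110", "0001011100", "0000100110101111000",
            "000001000110010100111010110111110000"]
    for n, s in enumerate(seqs, start=1):
        print(n, len(s) == 2**n + n - 1, covers_all(s, n))   # all True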
So the real answer is just B = 2^n + n - 1, but for n > 15 or so, n - 1 is so small we can ignore it. In other words, our length n is just log2(B). Given the assumption in the article of 3.4067e22 bits, the base-2 log of that is... 74.85. This is exactly 1 more than what the article says is the point where you'll have seen half the messages. This isn't a coincidence.
Which raises an interesting question that I leave as an exercise to the reader:
Why is the length for which you cannot have conveyed all bit strings exactly 1 bit more than the point where you've conveyed about half of them?
You’re giving the range for a 50% to 100% chance of non-transmission.
But in the spirit of the original question, I'd suppose we want to find n such that the probability of non-transmission is 2^-n, so that the expected number of never-transmitted n-bit sequences is 1? Is that number similarly close?
Some estimates put internet traffic at 2^70 bits per year and growing. For any short sequence of N bits, there are approximately 2^70 such sequences transmitted per year. Let's assume that, since they are encrypted, they are uniformly distributed. So what is the sequence length for which our p(seen all) = 50%? Seems like 71 bits is about right.
The traffic growth curve is probably ~2x per year, so we add about 1 more bit each year. The sum of all internet traffic for all time is therefore worth only about 1 more bit than the current year's traffic.
For any given "random" ( "high entropy" ) string of length X, there's some non-negligible chance it's already been sent.
But it's far less likely that a ( partially degraded ) non random string is sent. Why ?
Consider this:
"the cat sat on the hat" ( probably sent )
"the cut sat on the hat" ( still probably sent )
"thx cut set19n the mkt" ( waaaay less likely to be sent )
"thKxc8t suts n x4e m-t" ( probably never sent ... until now :) )
My reasoning is like, all random strings are ( happy / random ) in the same way. They all look alike. High entropy, but low organization / structure. Because of compression and encryption, any random string probably has as good a chance to be sent as any other, so looking at random strings doesn't really get us anywhere ( but I do make a very very rough calculation at the end that says probably all 7 byte strings have been sent ).
It's going to be far easier in my opinion to find an "almost-language" string ( partially degraded, like the above examples ) that's never been sent.
Remember Googlewhacks? One search result. One tactic was putting together uncommon words. Another was misspellings.
Basically the intuition / intuitive idea I'm trying to convey is : pick any random high entropy string of given short length, and pick any language string of given short length, and they are both, in my opinion, more likely to have been sent than an "almost language" string of same length. The more degraded you make it ( up to a point, heh ) the less likely it was ever sent.
Very rough calculation about random strings
So, assuming the question is for what X is p > 0.5, and assuming that about 1 zettabyte has been sent through the net over its entire history, that's 8 x 10^21 bits, or roughly 2^72.8, so roughly every 66-bit string has probably been sent.
So roughly every 7 byte string ever possible has been sent on the internet, with room to spare. Probably.
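The same rough model in code (my sketch; it assumes about 8 x 10^21 independent, uniform n-bit windows, which is crude but fine for an order-of-magnitude answer):

    import math

    B = 8e21                        # ~1 zettabyte of traffic, in bits (the assumption above)
    for n in (56, 64, 66, 68, 72):
        p_miss = math.exp(-B / 2**n)        # chance a specific n-bit string was never sent
        expected_unseen = 2**n * p_miss     # expected number of never-sent n-bit strings
        print(n, p_miss, expected_unseen)
    # Under this model nothing of length 66 or less is expected to be missing;
    # misses start to appear from about 68 bits up.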
It'd be interesting to know the length of a sequence that will never be sent over the Internet. In B. Schneier's "Applied Cryptography" there is a discussion of 256-bit numbers, the minimal amount of energy and mass required to guess them and conclusion that "brute-force attacks against 256-bit keys will be infeasible until computers are built from something other than matter and occupy something other than space."
The length of a never-transmitted-on-the-internet sequence is probably much shorter than 256 bits, even 128 bits (as the article mentions).
It's a bit ironic, in a special paradoxical-math-like way (like [0]), that if someone discovers it and posts it online it'll instantly (depending on HTTPS vs. HTTP, other encryption, whether or not someone sees the page, etc.) be wrong by definition. It's like figuring out what to call a UFO that has been identified.
"A Googlewhack is a contest for finding a Google search query consisting of exactly two words without quotation marks that returns exactly one hit. A Googlewhack must consist of two actual words found in a dictionary.
Published googlewhacks are short-lived, since when published to a web site, the new number of hits will become at least two, one to the original hit found, and one to the publishing site."
Does anyone remember a comedian that did a comedy special many, many years ago centered around Googlewhacks? I remember he got a tattoo of a Texas ID on his arm.
If the method stands up and is considered sound, that means the shortest string is increasing at a rate of ~1 bit every 3 years.
That implies that if the Internet does not grow any more, we should be confident of a > 50% chance of a UUID not being universally unique in about 150 years.
If the traffic doesn't go up significantly, the rate at which the shortest string grows will keep slowing down. Even with only the last 5 years, you can observe it's slowing down, and that's with a hefty amount of traffic increase to back it up. 128 bits is super safe for a long, long time!
This is how my intuition went: it's probably less than 128 bits because UUIDs are 128 bits, and they're universally unique.
But what's in a name? There's no natural law constraining the UUID standard, such that they must be actually universally unique. And 128 bits isn't such an incredible bit space.
MD5 hashes are 128 bits, and prone to manipulation in favor of collisions.
Don't get me wrong, 3.402823669209385*10^38 is a huge number, and we haven't used enough passwords to occupy every value in that key space, but I still don't imagine 128 bits provides truly universally unique coverage, just pretty okay uniqueness coverage.
MD5 is prone to manipulation due to its logic, not the size of its output. It also has many very efficient collision attacks but no feasible preimage attack (AFAIK), and they shave just a few bits off 2^128 (which is still a lot): https://crypto.stackexchange.com/a/41865
At one time, UUIDs were made by combining a fine timestamp with MAC-style hardware ids, giving a value that was conceptually unique in time and space. IIRC, it was eventually realized that choosing a value at random (e.g. using true RNG) was just as good.
2^128 is about 300 undecillion. Roughly 800 billion values per nanosecond for the age of the universe.
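The arithmetic checks out (a quick sketch):

    # 2**128 values spread over the age of the universe (~13.8 billion years), per nanosecond.
    age_ns = 13.8e9 * 365.25 * 24 * 3600 * 1e9
    print(2**128 / age_ns)    # ~7.8e11, i.e. roughly 800 billion values per nanosecond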
Version 4 UUIDs are preferred not because random is as good (indeed, for some use cases it is worse), but typically because leaking information about the time of creation and the MAC address of the creator is undesirable.
Because the probability of a random collision for a hash function is the same as the probability of generating two colliding random bit strings of the same length as the hash.
One thing that this shows is that many accepted cryptographically secure RNGs are in fact insufficient for shuffling decks of cards. I'm not sure how one would go about exploiting the resulting bias, but it is probably somehow exploitable.
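Presumably this is the seed-size argument: a generator seeded with 128 bits can reach at most 2^128 of the 52! possible orderings of a deck. A quick size check (my sketch, not from the comment):

    import math

    orderings = math.factorial(52)
    print(math.log2(orderings))   # ~225.6 bits needed to index every possible shuffle
    print(orderings > 2**128)     # True: a 128-bit seed cannot reach most orderings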
His intuition was that it would be between 48 and 128, and he's patting himself on the back that his calculation resulted in a number between those? Those goalposts are super far apart!
That's pretty close for what was basically an off-the-cuff estimation. He chose those two numbers because their use in the real world gives him a reasonable point of reference.
The calculated value of 74 bits is pretty close to the mid-point of his estimation.
How about this question: If all matter in the universe was dedicated to producing new bit sequences until the heat death of the universe, what is the length of the shortest sequence that would never be generated?
Yeah, I sort of assumed that the Berry paradox was going to be the point too - Anyone who found the shortest bit sequence could never display it on a web page since that would cause its transmission over the Internet, making it ineligible...