on the principle that high entropy means it compresses badly. However, that uses each line as the dictionary rather than the entire file, so it has a little trouble with very short lines, which compress badly no matter what they contain.
It did react to one line that is valid code but indeed seems kind of high in entropy. I was also able to fool it into not detecting a high-entropy line by adding a comment of natural English to it.
I'm on the go but it would be interesting to see comparisons between the Perl command and this tool. The benefit of the Perl command is that it would run out of the box on any non-Windows machine so it might not need to be as powerful to gain adoption.
I learned Go many years ago doing some Advent of Code problems. As I solved each problem, my housemate pestered me for a look and then rewrote my solutions (each needing 10-50 lines of Go) into Ruby one-liners, all the while making fun of Go and my silly programs. I wasn’t intending to, but I ended up learning a lot of Ruby that night too.
I guess you could take all lines in the file except the one you're testing and measure the filesize, then add the line and measure again. The delta should then be more fair. You could even do this by concatenating all code files and then testing line by line across the entire repo, but that would probably be too slow.
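A rough sketch of that per-line delta idea (plain Python with zlib standing in for the compressor; the threshold is arbitrary):

    import sys
    import zlib

    def compressed_size(data: bytes) -> int:
        return len(zlib.compress(data, 9))

    path = sys.argv[1]
    lines = open(path, "rb").read().splitlines()
    base = compressed_size(b"\n".join(lines))

    for i, line in enumerate(lines):
        # Compare the file without this line to the whole file: the delta is
        # roughly "how much this line adds, given the rest of the file as context".
        without = b"\n".join(lines[:i] + lines[i + 1:])
        delta = base - compressed_size(without)
        if line and delta / len(line) > 0.6:      # arbitrary threshold
            print(f"{path}:{i + 1}: +{delta} compressed bytes: {line[:60]!r}")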
> best compression and therefore best entropy estimate
That's a good point. But the Hutter Prize is for compressing a 1 GB file. On inputs as short as a line of code, gzip doesn't do so badly, and a longer line gives it even more to work with.
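Something like this quick check shows the effect (a sketch with Python's zlib standing in for gzip; the example lines are made up):

    import zlib

    lines = [
        b"x = 1",                                           # very short
        b"API_KEY = 'd41d8cd98f00b204e9800998ecf8427e'",    # hex-ish secret (made up)
        b"for item in sorted(items): print(item.name, item.value, sep='\\t')",
    ]
    for line in lines:
        compressed = len(zlib.compress(line, 9))
        print(f"{len(line):3d} raw -> {compressed:3d} compressed  {line[:40]!r}")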
Are there any command-line tools for zip or similar that allow you to predefine a dictionary over one or more files, and then use that dictionary to compress small files?
Which would require the dictionary as a separate input when decompressing, of course?
gzip (or really DEFLATE) does actually come with a small predefined dictionary (the "fixed Huffman codes" in the RFC) which is somewhat optimised for latin letters in UTF-8, but I have not verified that this is indeed what ends up being used when compressing individual lines of source code.
You can use LLMs as compressors, and I wonder how it would go with that.
The approach is simple: Turn the file into a stream of tokens. For each token, ask a language model to generate the full set of predictions based on context, and sort based on likelihood. Look where the actual token appears in the sorted list. Low entropy symbols will be near the start of the list, and high entropy tokens near the end.
I suspect most language models would deal with your alphabet example just fine, while still correctly spotting passwords and API keys. It would be a fun experiment to try!
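A minimal sketch of that experiment, assuming GPT-2 via Hugging Face transformers (the model choice and the "surprising" cut-off are arbitrary):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def token_ranks(text):
        """For each token, report where it lands in the model's sorted predictions."""
        ids = tok(text, return_tensors="pt").input_ids[0]
        with torch.no_grad():
            logits = model(ids.unsqueeze(0)).logits[0]
        out = []
        for i in range(1, len(ids)):
            order = logits[i - 1].argsort(descending=True)   # most likely token first
            rank = (order == ids[i]).nonzero().item()
            out.append((tok.decode(ids[i]), rank))
        return out

    for token, rank in token_ranks('ALPHABET = "abcdefghijklmnopqrstuvwxyz"'):
        flag = "  <-- surprising" if rank > 1000 else ""     # arbitrary cut-off
        print(f"{rank:6d} {token!r}{flag}")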
I mean, it certainly has a low Kolmogorov complexity (which is what I would really want to be measuring somehow for this tool... note that I am not claiming that is possible: just an ideal); I am unsure how that affects the related bounds on Shannon entropy, though.
A quick Google search suggests English has about 10 bits of entropy per word. A long password like that can still have high total entropy, I suppose, but it has a low entropy density.
Maybe 10 bits is the average over the dictionary (which is what matters here), but over normal text it is significantly less. Our best current estimate for relatively high-level text (texts published by the EU) is 6 bits per word[1].
However, as our methods of predicting text improve, this number is revised down. LLMs ought to have made a serious dent in it, but I haven't looked up any newer results.
Anyway, all of this is to say that which words are chosen matters, but how they are put together matters perhaps more.
The diceware method is supposed to generate totally random words, so it should fundamentally be unpredictable unless there's a flaw in the source of randomness.
Are there any good posts about the use of entropy for tasks like that? I have been wondering for quite some time how people actually use it and whether it is at all effective, but I never actually got around to investigating the problem myself.
First of all, how to define "entropy" for text is a bit unclear in the first place. Here it's as simple as `-Sum(x log(x))` where x = countOccurences(char) / len(text). And that raises a lot of questions about how well this actually works. How long does a string need to be for this to work? Is there a ≈constant entropy for natural languages? Is there a better approach? I mean, it seems there must be: "obviously" "vorpal" must have lower "entropy" than "hJ6&:a". You and I both "know" that because 1) the latter "seems" to use a much larger character set than natural language; 2) even if it didn't, the ordering of characters matters: the former just "sounds" like a real word, despite being made up by Carroll. Yet this "entropy" everybody seems to use has no idea about any of it. Both will have exactly the same "entropy". So, ok, maybe this does work well enough for yet-another-github-password-searcher. But is there anything better? Is there a more meaningful metric of randomness for text?
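To make that concrete, here is the per-character formula in plain Python; it genuinely cannot tell the two apart:

    from collections import Counter
    from math import log2

    def char_entropy(text: str) -> float:
        """-Sum(x log2(x)) over character frequencies, as most of these tools do."""
        counts = Counter(text)
        return -sum(c / len(text) * log2(c / len(text)) for c in counts.values())

    print(char_entropy("vorpal"))   # six distinct characters -> log2(6), about 2.585
    print(char_entropy("hJ6&:a"))   # also six distinct characters -> the same value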
Dozens of projects like this, everybody using "entropy" as if it's something obvious, but I've never seen proper research on the subject.
Entropy is a measure of complexity or disorder of a signal. The interesting part is that the disorder is with respect to the proper basis or dictionary. Something can look complex in one encoding but be low entropy in the right encoding. You need to know the right basis, or figure it out from the context, to accurately determine the entropy of a signal. A much stronger way of building a tool like the OP is to have a few pre-computed dictionaries for a range of typical source texts (source code, natural language), then encode the string against each dictionary, comparing the compressibility of the string. A high entropy string like a secret will compress poorly against all available dictionaries.
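A sketch of that idea using zlib's preset-dictionary support (the dictionaries here are tiny made-up samples; a real tool would train them on a proper corpus):

    import zlib

    # Hypothetical per-domain dictionaries: in practice you would build these
    # from a corpus of real source code / prose, not from two tiny strings.
    dictionaries = {
        "code": b"def class import return self for in if else print range len",
        "prose": b"the of and to in a is that it was for on are as with",
    }

    def best_ratio(s: bytes) -> float:
        """Smallest compressed/original ratio across all dictionaries."""
        ratios = []
        for zdict in dictionaries.values():
            comp = zlib.compressobj(level=9, zdict=zdict)
            ratios.append(len(comp.compress(s) + comp.flush()) / len(s))
        return min(ratios)

    # A higher ratio means it compresses poorly against every dictionary,
    # which is what you would expect from a secret.
    print(best_ratio(b"for i in range(len(items)): print(items[i])"))
    print(best_ratio(b"q7NfP2xL9vR4tY8uW1zK6mJ3hG5dS0aB"))   # made-up secret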
I only briefly browsed the code, but this seems to be roughly what yelp/detect-secrets does.
Anyway, that doesn't really answer my question. To summarize the answers in this thread, I think PhilipRoman has captured the essence of it: strictly speaking, the idea of the entropy of a known string is nonsense. So, as I suspected, the information-theoretic definition isn't meaningfully applicable to the problem. And as other commenters like you mentioned, what we are really trying to measure is basically Kolmogorov complexity, which, strictly speaking, is incomputable, but measuring the compression rate under some well-known popular compression algorithm (allegedly) seems to be a good enough estimate, empirically.
But I think it's still an interesting linguistic question. Meaningful or not, it's well defined: so does it appear to work? Are there known constants for different kinds of text for any of these (or other) metrics? I would suspect this has been explored already, but neither I nor anybody else in this thread has apparently ever stumbled upon such an article.
bookmarking to think about later... does this hold for representing numbers as one base compared to another?
Regarding a prime as having higher entropy / less structure than say a perfect square or highly divisible number
a prime is a prime in any base, but the number of divisors will differ among non-primes; if the number is divisible by the base, then it may appear to have more structure (a smaller function necessary to derive it, Kolmogorov-style).
does prime factorization have anything to do with this? i can almost imagine choosing a large non-prime whose divisibility is only obvious with a particular base, such that the base becomes the secret key - the base of a number is basically specifying your dictionary, no?
I don't think there's any interesting difference with different bases. Usually the base you represent stuff in is a relatively small number (because using very large bases is already wildly inefficient). I think it only usually makes sense to consider constant or logarithmic bases. If your base is scaling linearly with your number then things are going to be weird.
The problem of finding factors is only complex when you're asking about relatively big factors. If you're looking for constant or log sized factors you can just do trial division and find them.
Changing the base of number representation with a random basis feels like XORing a string with a random string, which is to say you're adding entropy equal to the random string. My thinking is that for any number representation M, you can get any other number representation N given a well-chosen base. So when presented with the encoded N, the original number could be any other number with the same number of digits. But once you put reasonable bounds on the base, you lose that flexibility and end up adding negligible entropy.
Entropy of a particular string isn't a rigorous mathematical idea, since by definition the string which is known can only take one value, the "entropy" is therefore zero bits. The reason why we can distinguish non-random data from random is that only a small subset of all possible states are considered useful for humans, and since we have an idea what that subset looks like, we can try to estimate what process was used to generate a particular string.
There are of course statistical tests like https://en.wikipedia.org/wiki/Diehard_tests, which are good enough for distinguishing low entropy and high entropy data, but current pseudo-random number generators have no problem passing all of those, even though their actual "entropy" is just the seed plus approximately the complexity of the algorithm.
If you’re looking for a rigorous mathematical idea, what people are trying to measure is the Kolmogorov complexity of the code. Measuring the compressed length is a rough estimate of that value.
Yes, although (and here my understanding of Kolmogorov complexity ends) it still depends heavily on the choice of language and it seems to me like "aaaaaaaaa" is only less complex than "pSE+4z*K58" due to assuming a sane, human-centric language which is very different from the "average" of all possible languages. Which then leads me to wonder how to construct an adversarial turing-complete language which has unintuitive Kolmogorov complexities.
There’s no requirement that the K-complexity is measured in a human-centric language. Arguably all compression formats are languages too, which can be executed to produce the decompressed result. They are not designed to be human-centric at all, and yet they do a surprisingly decent job at providing an estimate (well, an upper bound) on Kolmogorov complexity, as we can see in this program.
Kolmogorov complexity conventionally refers to the Turing machine as the base for implementation. This indeed makes repeated letters significantly less complex than that other string. (If you want intuition for how much code is needed to do something on a Turing machine, learn and play around a bit with Brainfuck. It's actually quite nice for that.)
I think these solutions are all much better for finding secrets than something naive based on entropy. Yes, entropy is more general but these are well established tools that have been through the fire of many, many data sets.
It would be useful if it also trawled through the full git history of the project - a secret could have been checked in and later removed, but still exist in the history.
Because it's a security tool so trusting a binary upfront defeats the purpose. With source you at least have the option to inspect what it really does.
Certain tools are more likely to be used by people working in spaces where they should/must be less trusting.
If there was a tool (there is) to scan my platform deployment against some NCSC/NSA guidance for platform security, and I wanted to use it, I'm likely operating in a space that should consider being cautious about running random tools I find on the internet.
This is argumentum ad absurdum - there is a reason why trusting your kernel and compiler is a reasonable compromise, even though there might be security issues in them; trusting random pieces of software downloaded from the Internet is not.
Wait ... you download random compilers from the internet? Or are you asserting equivalence between getting Go from Google or Xcode from Apple and a random homebrew install?
also if you're not trying to improve the security of your product by running random binaries from the internet. I'm concerned at the inability to separate the concepts of "what it does" and "what it says it does".
The idea that whether or not it needs scrutiny is impacted by your goals with the software is... creative
Uh? OP just released a docker image and wants to release a homebrew thingy. Even assuming what you say is somehow sensible, it's not the reason, no. You're just grasping at straws.
The docker image is nothing more than “FROM scratch” and then copying in the statically linked binary. If there are 0 other dependencies, I think it would be equally easy to distribute the binary through GitHub releases. There is no need for brew.
If people want to run it isolated, the docker image is of course still a nice to have and docker hub is a convenient distribution mechanism. At the same time, an equivalent image can equally easily be created by the consumer.
Personally, due to trust, I would anyway still build it from source and run in a container, to be on the safe side.
I guess a language model like Llama 3 could model surprise on a token-by-token basis and detect the areas that are most surprising, i.e. highest entropy. Because as one example mentioned, the entire alphabet may have high entropy in some regards, but it should be very unsurprising to a code-aware language model that in a codebase you have the Base62 alphabet as a constant.
Another way to do this would be to compress the file and compare the compressed size to the uncompressed size.
Encrypted files do not compress well compared to code. I saw a PhD thesis that postulated an inverse relationship between compression efficiency and data-mining performance; this would be the opposite.
Entropy of information is basically how well it can be compressed. Random noise usually doesn't compress much at all and thus has high entropy, whereas written natural language can usually be compressed quite a bit. Since many passwords and tokens will be randomly generated or at least nonsense, looking for high entropy might pick up on them.
This package seems to be measuring entropy by counting the occurrences of each character in each line, and ranking lines with a high proportion of repeated characters as having low entropy. I don't know how closely this corresponds with the precise definition.
Source: https://github.com/EwenQuim/entropy/blob/f7543efe130cfbb5f0a...
Of course, this heuristic fails for weak passwords.
And it fails for passphrases like 'correct battery horse staple', which have a large enough total entropy to be good passwords, but have a low entropy per character.
4 diceware words is hardly a good password. It's ~51 bits of entropy, about the same as 8 random ascii symbols. It could be trivially cracked in less than an hour. Your average variable name assigned to the result of an object name with a method name called with a couple parameter names has much more entropy.
If you can crack a single 52bit password in an hour, that's suggesting you can crack a 40bit password every second. That's 1 trillion hashes per second.
350B H/s was achieved in 2012 on consumer hardware. That's over 12 years ago, and several lifetimes of GPU improvements ago.
4 diceware words is simply not appropriate for anything remotely confidential, and it is bad for the community to pretend otherwise.
If you read the sources, that's 350B _sha1_ hashes per second... While you can't be sure what hash system is being used for your passwords, any respectable system using a modern password hash is not even close to being that fast. OWASP's recommended 600000 rounds of pbkdf2 performs 1.2 million sha2 block rounds IIRC. If we assume that sha1 and sha2 are equivalent in performance, then you're looking at only 290,000 password attempts a second.
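Rough numbers, using the rates quoted above (back-of-the-envelope only):

    from math import log2

    keyspace = 7776 ** 4                     # 4 diceware words
    print(log2(keyspace))                    # ~51.7 bits

    rates = {
        "raw SHA-1 on a 2012 GPU rig": 350e9,    # ~350 GH/s
        "600k-round PBKDF2": 290e3,              # ~290k guesses/s
    }
    for label, rate in rates.items():
        seconds = keyspace / rate            # worst case; expected time is half of this
        print(f"{label}: {seconds / 3600:,.1f} hours ({seconds / 31536000:,.1f} years)")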
If the password system uses argon2 with a high memory requirement, you're in an even better position
Certainly, if we assume the system in question was designed with heightened security in mind, we will determine ourselves to be in a more secure position.
But go on, use a 52-bit password – see what I care. But don't come crying to me when an institution with even the smallest amount of funding is able to crack your vault.
So you randomly capitalize some words, add random punctuation, and put a number somewhere, and you're at 60. Add more for whatever threat model you're trying to be secure against.
Not sure; you can use the same character instead of a space and still get a few bits. Of course different ones would be better, but again, depends on how many bits you actually need.
I thought the point was to construct a password that's secure enough _and_ easy to remember for humans.
Adding random punctuation helps with the former, but might interfere with the latter. (In the extreme case, you just generate completely random strings character for character. That's the most secure, but the least memorable.)
But it didn't. It perpetuated the exceedingly common myth that 52 bits is somehow enough. This has been considered bad practice for well over a decade now. https://theworld.com/~reinhold/dicewarefaq.html
Note that in an adversarial setting this will only be effective against careless opponents.
If you properly encode your secret it will have the entropy of its surroundings.
For example you can hide a string of entropy (presumably something encrypted) in text as a biased output of an LLM. To recover it you would use the same LLM and measure deviations from next-token probabilities. This will also fool humans examining it as the sentence will be coherent.
I think the opponent in the proposed use case for this tool is the gun you’re pointing at your foot, and this tool prevents you from pulling the trigger.
What you described sounds like a very cool idea - LLM-driven text steganography, basically - but intentional obfuscation is not the problem this tool is trying to solve. To your point about secrets with entropy similar to the surrounding text, however, I wonder if this can pick up BIP39 Seed Phrases or if whole word entropy fades into the background.
I imagine a social media site full of bots chatting about nonsense. Hidden in the nonsense are humans chatting about different nonsense. This way, server costs get paid for by advertisers, but its really only bots that see the ads anyway.
I'm not really concerned with whether it would work out in a way that was beneficial for the ad network or their customers. Either they figure out a way to make it work such that they continue to be a useful pipe, or they don't and then maybe they'll have to shut down and find something more productive to do with their time. We'll have done the public a service either way.
Unfortunately I beg to differ. I worked for several companies where we, the management, clearly saw that the results were very poor (for Facebook ads, for example) but continued to invest because there was a defined budget for it, and so on. It was like this last year and 20 years ago.
these companies should be outcompeted by firms that don't blow a million dollars a month paying out to click fraudsters but alas the market is not perfectly competitive
is it a cargo cult? it works for coca cola so maybe if we just spend a little more we'll see returns...
The weights of the LLM become the private key (so it better be a pinned version of a model with open weights), and for most practical applications (i.e. unless you're willing to complicate your setup with fancy applied statistics and error correction) you'd have to use a temperature of 0 as baseline.
Then, having done all that, such steganography may be detectable using this very tool by encoding the difference between the LLM's prediction and ground truth, but searching for substrings with low entropy instead!
Use some LLM; the weights need to be known to both parties in the communication.
Producing text with the LLM means repeatedly feeding the LLM with the text-so-far to produce a probability distribution for the next token. You then use a random number generator to pick a token from that distribution.
If you want to turn this into steganography, you first take your cleartext and encrypt it with any old encryption system. The resulting bitstream should be random-looking, if your encryption ain't broken. Now you take the LLM mechanism I described above, but instead of sampling via a random number generator, you use your ciphertext as the source of entropy. (You need to use something like arithmetic coding to convert between your uniformly random-looking bitstream and the heavily weighted choices you make to sample your LLM. See https://en.wikipedia.org/wiki/Arithmetic_coding)
Almost any temperature will work, as long as it is known to both sender and receiver. (The 'temperature' parameter can be used to change the distribution, but it's still effectively a probability distribution at the end. And that's all that's required.)
I was imagining the message encoded in clear text, not encrypted form, because given the lengths required to coordinate protocol, keys, weights, and so on, I assumed there would be more efficient ways to disguise a message than a novel form of steganography. As such, I approached it as a toy problem, and considered detection by savvy parties to be a feature, not a bug; I imagined something more like a pirate broadcast than a secure line, and intentionally ignored the presumption about the message being encrypted first.
That being said, yes, some of my assumptions were incorrect, mainly regarding temperature. For practical reasons I was envisioning this being implemented with a third-party LLM (e.g. OpenAI's), but I didn't realize those could have their RNG seeded as well. There is the security/convenience tradeoff to consider, however, and simply setting the temperature to 0 is a lot easier to coordinate between sender and receiver than adding two arbitrary numbers for temperature and seed.
I misspoke, or at least left myself open to misinterpretation when I referred to the LLM's weights as a "secret key"; I didn't mean the weights themselves had to be kept under wraps, but rather I meant that either the weights had to be possessed by both parties (with the knowledge of which weights to use being the "secret") or they'd have to use a frozen version of a third party LLM, in which case the knowledge about which version to use would become the secret.
As for how I might take a first stab at this if I were to try implementing it myself, I might encode the message using a low base (let's say binary or ternary) and make the first most likely token a 0, the second a 1, and so on. To offset the risk of producing pure nonsense, I would perhaps skip tokens with too large a gulf between the probabilities of the 1st and 2nd most likely tokens.
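A toy sketch of that encoder, assuming GPT-2 via Hugging Face transformers (binary only; the skip threshold is a guess):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def encode(bits, prompt="The weather today was", max_gap=0.5):
        """Hide a bit string by repeatedly picking the 1st or 2nd most likely token."""
        ids = tok(prompt, return_tensors="pt").input_ids
        bits = list(bits)
        while bits:
            with torch.no_grad():
                logits = model(ids).logits[0, -1]
            top = torch.topk(torch.softmax(logits, dim=-1), 2)
            if top.values[0] - top.values[1] > max_gap:
                choice = top.indices[0]          # too lopsided: don't spend a bit here
            else:
                choice = top.indices[int(bits.pop(0))]
            ids = torch.cat([ids, choice.view(1, 1)], dim=1)
        return tok.decode(ids[0])

    print(encode("0110100001101001"))            # "hi" in ASCII, as an example payload

The receiver would rerun the same model over the text, apply the same skip rule, and at each remaining step read off whether the 1st or 2nd candidate was taken.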
> I was imagining the message encoded in clear text, not encrypted form, [...]
I was considering that, but I came to the conclusion that it would be an exceedingly poor choice.
Steganography is there to hide that a message has been sent at all. If you make it do double duty as a poor-man's encryption, you are going to have a bad time.
> As such, I approached it as a toy problem, and considered detection by savvy parties to be a feature, not a bug; I imagined something more like a pirate broadcast than a secure line, and intentionally ignored the presumption about the message being encrypted first.
That's an interesting toy problem. In that case, I would still suggest to compress the message, to reduce redundancy.
> If you make it do double duty as a poor-man's encryption, you are going to have a bad time.
For the serious use cases you evidently have in mind, yes, it's folly to have it do double duty, but at the end of the day steganography is an obfuscation technique orthogonal to encryption, so the question of whether to use encryption or not is a nuanced one. Anyhow, I don't think it's fair to characterize this elaborate steganography tech as a poor-man's encryption — LLM tokens are expensive!
Haha, sure, you can call it that if you want, but foolish is cousin to fun, so one application of this tech would be as a comically overwrought way of communicating subtext to an adversary who may not be able to read between the lines otherwise. Imagine using all this highly sophisticated and expensive technology just to write "you're an asshole" to some armchair intelligence analyst who spent their afternoon and monthly token quota decoding your secret message.
This is very cool, but I have a thought - I see this as a last line of defense, and I am concerned that it would give a false sense of security, leading people to be more reckless with secrets.
You could make the same argument for any tool that does not provide high security. In fact security is layered, and no single tool should be relied upon to be your one security tool. You said as much yourself: "I see this as a last line of defense," but I don't see how you conclude that this would inherently cause people to be more reckless with secrets.
The pie in the sky goal for any security org is to have a cred rotation process that is so smooth that you’re able to not worry about leaked creds because it’s so fast and easy to rotate them. If the rotation is automated and if it’s cheap and frictionless to do so, heck why not just rotate them multiple times a day.
Ehhh considering how low the security bar is, I think it is better than nothing. If you inherit a code base, make it a quick initial action to see how much pain you can expect. In practice, I expect a tool like this has so many false positives you cannot keep it as an always running action. More a manual review you run occasionally.
I hope that more secret formats adopt a GitHub-like convention where they are prefaced with an identifier string, so that you do not require heuristics to detect them.
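GitHub's tokens are a good example of this: a fixed prefix turns detection into a plain pattern match instead of an entropy guess (the prefixes and lengths below are from memory, so treat them as approximate):

    import re

    # Prefixed token formats are trivially greppable; no entropy estimate needed.
    GITHUB_TOKEN = re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,255}\b")

    line = 'token = "ghp_abcdefghijklmnopqrstuvwxyz0123456789"'
    if GITHUB_TOKEN.search(line):
        print("possible GitHub token:", line)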
I didn't know what entropy means in software, so here's the definition[0]:
----
Software entropy is a measure of the disorder or complexity of a software system. It is a natural tendency for software entropy to increase over time, as new features are added and the codebase becomes more complex.
High entropy in software development means that the code is difficult to understand, maintain, and extend. It is often characterized by:
Duplicated code: The same code or functionality is repeated in multiple places, which can make it difficult to find and fix bugs.
Complex logic: The code is difficult to follow and understand, which can make it difficult to add new features or fix bugs without introducing new ones.
Poor documentation: The code is not well-documented, which can make it difficult for new developers to understand and contribute to the codebase.
Technical debt: The code has been patched and modified over time without proper refactoring, which can lead to a tangled and cluttered codebase.
Low entropy in software development means that the code is well-organized, easy to understand, and maintain. It is often characterized by:
Well-designed architecture: The code is structured in a logical way, with clear separation of concerns.
Consistent coding style: The code follows a consistent coding style, which makes it easy to read and understand.
Comprehensive documentation: The code is well-documented, with clear explanations of the code's purpose and functionality.
Minimal technical debt: The code has been refactored regularly to remove technical debt, which makes it easy to add new features and fix bugs without introducing new ones.