on the principle that high entropy means it compresses badly. However, that uses each line as the dictionary rather than the entire file, so it has a little trouble with very short lines, which compress badly no matter what they contain.
It did react to one line that is valid code but indeed seems kind of high in entropy. I was also able to fool it into not detecting a high-entropy line by adding a comment of natural English to it.
I'm on the go but it would be interesting to see comparisons between the Perl command and this tool. The benefit of the Perl command is that it would run out of the box on any non-Windows machine so it might not need to be as powerful to gain adoption.
I learned Go many years ago doing some Advent of Code problems. As I solved each problem, my housemate pestered me for a look and then rewrote my solutions (each needing 10-50 lines of Go) into Ruby one-liners, all the while making fun of Go and my silly programs. I wasn’t intending to, but I ended up learning a lot of Ruby that night too.
I guess you could take all lines in the file except the one you're testing and measure the filesize, then add the line and measure again. The delta should then be more fair. You could even do this by concatenating all code files and then testing line by line across the entire repo, but that would probably be too slow.
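A rough sketch of that per-line delta idea (plain Python with zlib standing in for the compressor; the threshold is arbitrary):

    import sys
    import zlib

    def compressed_size(data: bytes) -> int:
        return len(zlib.compress(data, 9))

    path = sys.argv[1]
    lines = open(path, "rb").read().splitlines()
    base = compressed_size(b"\n".join(lines))

    for i, line in enumerate(lines):
        # Compare the file without this line to the whole file: the delta is
        # roughly "how much this line adds, given the rest of the file as context".
        without = b"\n".join(lines[:i] + lines[i + 1:])
        delta = base - compressed_size(without)
        if line and delta / len(line) > 0.6:      # arbitrary threshold
            print(f"{path}:{i + 1}: +{delta} compressed bytes: {line[:60]!r}")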
> best compression and therefore best entropy estimate
That's a good point. But the Hutter Prize is for compressing a 1 GB file. On inputs as short as a line of code, gzip doesn't do so badly, and a longer line gives it even more to work with.
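Something like this quick check shows the effect (a sketch with Python's zlib standing in for gzip; the example lines are made up):

    import zlib

    lines = [
        b"x = 1",                                           # very short
        b"API_KEY = 'd41d8cd98f00b204e9800998ecf8427e'",    # hex-ish secret (made up)
        b"for item in sorted(items): print(item.name, item.value, sep='\\t')",
    ]
    for line in lines:
        compressed = len(zlib.compress(line, 9))
        print(f"{len(line):3d} raw -> {compressed:3d} compressed  {line[:40]!r}")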
Are there any command-line tools for zip or similar that allow you to predefine a dictionary over one or more files, and then use that dictionary to compress small files?
Which would require the dictionary as a separate input when decompressing, of course?
gzip (or really DEFLATE) does actually come with a small predefined dictionary (the "fixed Huffman codes" in the RFC) which is somewhat optimised for latin letters in UTF-8, but I have not verified that this is indeed what ends up being used when compressing individual lines of source code.
You can use LLMs as compressors, and I wonder how it would go with that.
The approach is simple: Turn the file into a stream of tokens. For each token, ask a language model to generate the full set of predictions based on context, and sort based on likelihood. Look where the actual token appears in the sorted list. Low entropy symbols will be near the start of the list, and high entropy tokens near the end.
I suspect most language models would deal with your alphabet example just fine, while still correctly spotting passwords and API keys. It would be a fun experiment to try!
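A minimal sketch of that experiment, assuming GPT-2 via Hugging Face transformers (the model choice and the "surprising" cut-off are arbitrary):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def token_ranks(text):
        """For each token, report where it lands in the model's sorted predictions."""
        ids = tok(text, return_tensors="pt").input_ids[0]
        with torch.no_grad():
            logits = model(ids.unsqueeze(0)).logits[0]
        out = []
        for i in range(1, len(ids)):
            order = logits[i - 1].argsort(descending=True)   # most likely token first
            rank = (order == ids[i]).nonzero().item()
            out.append((tok.decode(ids[i]), rank))
        return out

    for token, rank in token_ranks('ALPHABET = "abcdefghijklmnopqrstuvwxyz"'):
        flag = "  <-- surprising" if rank > 1000 else ""     # arbitrary cut-off
        print(f"{rank:6d} {token!r}{flag}")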
I mean, it certainly has a low Kolmogorov complexity (which is what I would really want to be measuring somehow for this tool... note that I am not claiming that is possible: just an ideal); I am unsure how that affects the related bounds on Shannon entropy, though.
A quick Google search suggests English has about 10 bits of entropy per word. A long password like that can still have high total entropy, I suppose, but it has a low entropy density.
Maybe 10 bits is the average over the dictionary (which is what matters here), but over normal text it is significantly less. Our best current estimate for relatively high-level text (texts published by the EU) is 6 bits per word[1].
However, as our methods of predicting text improve, this number is revised down. LLMs ought to have made a serious dent in it, but I haven't looked up any newer results.
Anyway, all of this is to say that which words are chosen matters, but how they are put together matters perhaps more.
The diceware method is supposed to generate totally random words, so it should fundamentally be unpredictable unless there's a flaw in the source of randomness.
Are there any good posts about the use of entropy for tasks like that? I have been wondering for quite some time how people actually use it and whether it is at all effective, but I never actually got around to investigating the problem myself.
First of all, how to define "entropy" for text is a bit unclear in the first place. Here it's as simple as `-Sum(x log(x))` where x = countOccurences(char) / len(text). And that raises a lot of questions about how well this actually works. How long does a string need to be for this to work? Is there a ≈constant entropy for natural languages? Is there a better approach? I mean, it seems there must be: "obviously" "vorpal" must have lower "entropy" than "hJ6&:a". You and I both "know" that because 1) the latter "seems" to use a much larger character set than natural language; 2) even if it didn't, the ordering of characters matters: the former just "sounds" like a real word, despite being made up by Carroll. Yet this "entropy" everybody seems to use has no idea about any of it. Both will have exactly the same "entropy". So, ok, maybe this does work well enough for yet-another-github-password-searcher. But is there anything better? Is there a more meaningful metric of randomness for text?
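To make that concrete, here is the per-character formula in plain Python; it genuinely cannot tell the two apart:

    from collections import Counter
    from math import log2

    def char_entropy(text: str) -> float:
        """-Sum(x log2(x)) over character frequencies, as most of these tools do."""
        counts = Counter(text)
        return -sum(c / len(text) * log2(c / len(text)) for c in counts.values())

    print(char_entropy("vorpal"))   # six distinct characters -> log2(6), about 2.585
    print(char_entropy("hJ6&:a"))   # also six distinct characters -> the same value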
Dozens of projects like this, everybody using "entropy" as if it's something obvious, but I've never seen proper research on the subject.
Entropy is a measure of complexity or disorder of a signal. The interesting part is that the disorder is with respect to the proper basis or dictionary. Something can look complex in one encoding but be low entropy in the right encoding. You need to know the right basis, or figure it out from the context, to accurately determine the entropy of a signal. A much stronger way of building a tool like the OP is to have a few pre-computed dictionaries for a range of typical source texts (source code, natural language), then encode the string against each dictionary, comparing the compressibility of the string. A high entropy string like a secret will compress poorly against all available dictionaries.
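A sketch of that idea using zlib's preset-dictionary support (the dictionaries here are tiny made-up samples; a real tool would train them on a proper corpus):

    import zlib

    # Hypothetical per-domain dictionaries: in practice you would build these
    # from a corpus of real source code / prose, not from two tiny strings.
    dictionaries = {
        "code": b"def class import return self for in if else print range len",
        "prose": b"the of and to in a is that it was for on are as with",
    }

    def best_ratio(s: bytes) -> float:
        """Smallest compressed/original ratio across all dictionaries."""
        ratios = []
        for zdict in dictionaries.values():
            comp = zlib.compressobj(level=9, zdict=zdict)
            ratios.append(len(comp.compress(s) + comp.flush()) / len(s))
        return min(ratios)

    # A higher ratio means it compresses poorly against every dictionary,
    # which is what you would expect from a secret.
    print(best_ratio(b"for i in range(len(items)): print(items[i])"))
    print(best_ratio(b"q7NfP2xL9vR4tY8uW1zK6mJ3hG5dS0aB"))   # made-up secret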
I only briefly browsed the code, but this seems to be roughly what yelp/detect-secrets does.
Anyway, that doesn't really answer my question. To summarize the answers in this thread, I think PhilipRoman has captured the essence of it: strictly speaking, the idea of the entropy of a known string is nonsense. So, as I suspected, the information-theoretic definition isn't meaningfully applicable to the problem. And as other commenters like you mentioned, what we are really trying to measure is basically Kolmogorov complexity, which, strictly speaking, is incomputable, but measuring the compression rate under some well-known popular compression algorithm (allegedly) seems to be a good enough estimate, empirically.
But I think it's still an interesting linguistic question. Meaningful or not, it's well defined: so does it appear to work? Are there known constants for different kinds of text for any of these (or other) metrics? I would suspect this has been explored already, but neither I nor anybody else in this thread has apparently ever stumbled upon such an article.
bookmarking to think about later... does this hold for representing numbers as one base compared to another?
Regarding a prime as having higher entropy / less structure than say a perfect square or highly divisible number
a prime is a prime in any base, but the number of divisors will differ among non-primes; if the number is divisible by the base, then it may appear to have more structure (a smaller function necessary to derive it, Kolmogorov-style).
does prime factorization have anything to do with this? i can almost imagine choosing a large non-prime whose divisibility is only obvious with a particular base, such that the base becomes the secret key - the base of a number is basically specifying your dictionary, no?
I don't think there's any interesting difference with different bases. Usually the base you represent stuff in is a relatively small number (because using very large bases is already wildly inefficient). I think it only usually makes sense to consider constant or logarithmic bases. If your base is scaling linearly with your number then things are going to be weird.
The problem of finding factors is only complex when you're asking about relatively big factors. If you're looking for constant or log sized factors you can just do trial division and find them.
Changing the base of number representation with a random basis feels like XORing a string with a random string, which is to say you're adding entropy equal to the random string. My thinking is that for any number representation M, you can get any other number representation N given a well-chosen base. So when presented with the encoded N, the original number could be any other number with the same number of digits. But once you put reasonable bounds on the base, you lose that flexibility and end up adding negligible entropy.
Entropy of a particular string isn't a rigorous mathematical idea, since by definition the string which is known can only take one value, the "entropy" is therefore zero bits. The reason why we can distinguish non-random data from random is that only a small subset of all possible states are considered useful for humans, and since we have an idea what that subset looks like, we can try to estimate what process was used to generate a particular string.
There are of course statistical tests like https://en.wikipedia.org/wiki/Diehard_tests, which are good enough for distinguishing low entropy and high entropy data, but current pseudo-random number generators have no problem passing all of those, even though their actual "entropy" is just the seed plus approximately the complexity of the algorithm.
If you’re looking for a rigorous mathematical idea, what people are trying to measure is the Kolmogorov complexity of the code. Measuring the compressed length is a rough estimate of that value.
Yes, although (and here my understanding of Kolmogorov complexity ends) it still depends heavily on the choice of language and it seems to me like "aaaaaaaaa" is only less complex than "pSE+4z*K58" due to assuming a sane, human-centric language which is very different from the "average" of all possible languages. Which then leads me to wonder how to construct an adversarial turing-complete language which has unintuitive Kolmogorov complexities.
There’s no requirement that the K-complexity is measured in a human-centric language. Arguably all compression formats are languages too, which can be executed to produce the decompressed result. They are not designed to be human-centric at all, and yet they do a surprisingly decent job at providing an estimate (well, an upper bound) on Kolmogorov complexity, as we can see in this program.
Kolmogorov complexity conventionally refers to the Turing machine as the base for implementation. This indeed makes repeated letters significantly less complex than that other string. (If you want intuition for how much code is needed to do something on a Turing machine, learn and play around a bit with Brainfuck. It's actually quite nice for that.)
I think these solutions are all much better for finding secrets than something naive based on entropy. Yes, entropy is more general but these are well established tools that have been through the fire of many, many data sets.
It would be useful if it also trawled through the full git history of the project - a secret could have been checked in and later removed, but still exist in the history.
Because it's a security tool so trusting a binary upfront defeats the purpose. With source you at least have the option to inspect what it really does.
Certain tools are more likely to be used by people working in spaces where they should/must be less trusting.
If there was a tool (there is) to scan my platform deployment against some NCSC/NSA guidance for platform security, and I wanted to use it, I'm likely operating in a space that should consider being cautious about running random tools I find on the internet.
This is argumentum ad absurdum - there is a reason why trusting your kernel and compiler is a reasonable compromise, even though there might be security issues in them; trusting random pieces of software downloaded from the Internet is not.
Wait ... you download random compilers from the internet? Or are you asserting equivalence between getting Go from Google or Xcode from Apple and a random homebrew install?
also if you're not trying to improve the security of your product by running random binaries from the internet. I'm concerned at the inability to separate the concepts of "what it does" and "what it says it does".
The idea that whether or not it needs scrutiny is impacted by your goals with the software is... creative
Uh? OP just released a docker image and wants to release a homebrew thingy. Even assuming what you say is somehow sensible, it's not the reason, no. You're just grasping at straws.
The docker image is nothing more than “FROM scratch” and then copying in the statically linked binary. If there are 0 other dependencies, I think it would be equally easy to distribute the binary through GitHub releases. There is no need for brew.
If people want to run it isolated, the docker image is of course still a nice to have and docker hub is a convenient distribution mechanism. At the same time, an equivalent image can equally easily be created by the consumer.
Personally, due to trust, I would anyway still build it from source and run in a container, to be on the safe side.
I guess a language model like Llama 3 could model surprise on a token-by-token basis and detect the areas that are most surprising, i.e. highest entropy. Because as one example mentioned, the entire alphabet may have high entropy in some regards, but it should be very unsurprising to a code-aware language model that in a codebase you have the Base62 alphabet as a constant.
Another way to do this would be to compress the file and compare the compressed size to the uncompressed size.
Encrypted files do not compress well compared to code. I saw a PhD thesis that postulated an inverse relationship between compression efficiency and data-mining performance; this would be the opposite.
Entropy of information is basically how well it can be compressed. Random noise usually doesn't compress much at all and thus has high entropy, whereas written natural language can usually be compressed quite a bit. Since many passwords and tokens will be randomly generated or at least nonsense, looking for high entropy might pick up on them.
This package seems to be measuring entropy by counting the occurrences of each character in each line, and ranking lines with a high proportion of repeated characters as having low entropy. I don't know how closely this corresponds with the precise definition.
Source: https://github.com/EwenQuim/entropy/blob/f7543efe130cfbb5f0a...
Of course, this heuristic fails for weak passwords.
And it fails for passphrases like 'correct battery horse staple', which have a large enough total entropy to be good passwords, but have a low entropy per character.
4 diceware words is hardly a good password. It's ~51 bits of entropy, about the same as 8 random ascii symbols. It could be trivially cracked in less than an hour. Your average variable name assigned to the result of an object name with a method name called with a couple parameter names has much more entropy.
If you can crack a single 52bit password in an hour, that's suggesting you can crack a 40bit password every second. That's 1 trillion hashes per second.
350B H/s was achieved in 2012 on consumer hardware. That's over 12 years ago, and several lifetimes of GPU improvements ago.
4 diceware words is simply not appropriate for anything remotely confidential, and it is bad for the community to pretend otherwise.
If you read the sources, that's 350B _sha1_ hashes per second... While you can't be sure what hash system is being used for your passwords, any respectable system using a modern password hash is not even close to being that fast. OWASP's recommended 600000 rounds of pbkdf2 performs 1.2 million sha2 block rounds IIRC. If we assume that sha1 and sha2 are equivalent in performance, then you're looking at only 290,000 password attempts a second.
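Rough numbers, using the rates quoted above (back-of-the-envelope only):

    from math import log2

    keyspace = 7776 ** 4                     # 4 diceware words
    print(log2(keyspace))                    # ~51.7 bits

    rates = {
        "raw SHA-1 on a 2012 GPU rig": 350e9,    # ~350 GH/s
        "600k-round PBKDF2": 290e3,              # ~290k guesses/s
    }
    for label, rate in rates.items():
        seconds = keyspace / rate            # worst case; expected time is half of this
        print(f"{label}: {seconds / 3600:,.1f} hours ({seconds / 31536000:,.1f} years)")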
If the password system uses argon2 with a high memory requirement, you're in an even better position
Certainly, if we assume the system in question was designed with heightened security in mind, we will determine ourselves to be in a more secure position.
But go on, use a 52-bit password – see what I care. But don't come crying to me when an institution with even the smallest amount of funding is able to crack your vault.
So you randomly capitalize some words, add random punctuation, and put a number somewhere, and you're at 60. Add more for whatever threat model you're trying to be secure against.
Not sure; you can use the same character instead of a space and still get a few bits. Of course different ones would be better, but again, depends on how many bits you actually need.
I thought the point was to construct a password that's secure enough _and_ easy to remember for humans.
Adding random punctuation helps with the former, but might interfere with the latter. (In the extreme case, you just generate completely random strings character for character. That's the most secure, but the least memorable.)
But it didn't. It perpetuated the exceedingly common myth that 52 bits is somehow enough. This has been considered bad practice for well over a decade now. https://theworld.com/~reinhold/dicewarefaq.html
Note that in an adversarial setting this will only be effective against careless opponents.
If you properly encode your secret it will have the entropy of its surroundings.
For example you can hide a string of entropy (presumably something encrypted) in text as a biased output of an LLM. To recover it you would use the same LLM and measure deviations from next-token probabilities. This will also fool humans examining it as the sentence will be coherent.
I think the opponent in the proposed use case for this tool is the gun you’re pointing at your foot, and this tool prevents you from pulling the trigger.
What you described sounds like a very cool idea - LLM-driven text steganography, basically - but intentional obfuscation is not the problem this tool is trying to solve. To your point about secrets with entropy similar to the surrounding text, however, I wonder if this can pick up BIP39 Seed Phrases or if whole word entropy fades into the background.
I imagine a social media site full of bots chatting about nonsense. Hidden in the nonsense are humans chatting about different nonsense. This way, server costs get paid for by advertisers, but its really only bots that see the ads anyway.
I'm not really concerned with whether it would work out in a way that was beneficial for the ad network or their customers. Either they figure out a way to make it work such that they continue to be a useful pipe, or they don't and then maybe they'll have to shut down and find something more productive to do with their time. We'll have done the public a service either way.
Unfortunately I beg to differ. I worked for several companies where we, the management, clearly saw that the results were very poor (for Facebook ads, for example) but continued to invest because there was a defined budget for it, and so on. It was like this last year and 20 years ago.
these companies should be outcompeted by firms that don't blow a million dollars a month paying out to click fraudsters but alas the market is not perfectly competitive
is it a cargo cult? it works for coca cola so maybe if we just spend a little more we'll see returns...
The weights of the LLM become the private key (so it better be a pinned version of a model with open weights), and for most practical applications (i.e. unless you're willing to complicate your setup with fancy applied statistics and error correction) you'd have to use a temperature of 0 as baseline.
Then, having done all that, such steganography may be detectable using this very tool by encoding the difference between the LLM's prediction and ground truth, but searching for substrings with low entropy instead!
Use some LLM; the weights need to be known to both parties in the communication.
Producing text with the LLM means repeatedly feeding the LLM with the text-so-far to produce a probability distribution for the next token. You then use a random number generator to pick a token from that distribution.
If you want to turn this into steganography, you first take your cleartext and encrypt it with any old encryption system. The resulting bitstream should be random-looking, if your encryption ain't broken. Now you take the LLM mechanism I described above, but instead of sampling via a random number generator, you use your ciphertext as the source of entropy. (You need to use something like arithmetic coding to convert between your uniformly random-looking bitstream and the heavily weighted choices you make to sample your LLM. See https://en.wikipedia.org/wiki/Arithmetic_coding)
Almost any temperature will work, as long as it is known to both sender and receiver. (The 'temperature' parameter can be used to change the distribution, but it's still effectively a probability distribution at the end. And that's all that's required.)
I was imagining the message encoded in clear text, not encrypted form, because given the lengths required to coordinate protocol, keys, weights, and so on, I assumed there would be more efficient ways to disguise a message than a novel form of steganography. As such, I approached it as a toy problem, and considered detection by savvy parties to be a feature, not a bug; I imagined something more like a pirate broadcast than a secure line, and intentionally ignored the presumption about the message being encrypted first.
That being said, yes, some of my assumptions were incorrect, mainly regarding temperature. For practical reasons I was envisioning this being implemented with a third-party LLM (e.g. OpenAI's), but I didn't realize those could have their RNG seeded as well. There is the security/convenience tradeoff to consider, however, and simply setting the temperature to 0 is a lot easier to coordinate between sender and receiver than adding two arbitrary numbers for temperature and seed.
I misspoke, or at least left myself open to misinterpretation when I referred to the LLM's weights as a "secret key"; I didn't mean the weights themselves had to be kept under wraps, but rather I meant that either the weights had to be possessed by both parties (with the knowledge of which weights to use being the "secret") or they'd have to use a frozen version of a third party LLM, in which case the knowledge about which version to use would become the secret.
As for how I might take a first stab at this if I were to try implementing it myself, I might encode the message using a low base (let's say binary or ternary) and make the first most likely token a 0, the second a 1, and so on. To offset the risk of producing pure nonsense, I would perhaps skip tokens with too large a gulf between the probabilities of the 1st and 2nd most likely tokens.
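A toy sketch of that encoder, assuming GPT-2 via Hugging Face transformers (binary only; the skip threshold is a guess):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def encode(bits, prompt="The weather today was", max_gap=0.5):
        """Hide a bit string by repeatedly picking the 1st or 2nd most likely token."""
        ids = tok(prompt, return_tensors="pt").input_ids
        bits = list(bits)
        while bits:
            with torch.no_grad():
                logits = model(ids).logits[0, -1]
            top = torch.topk(torch.softmax(logits, dim=-1), 2)
            if top.values[0] - top.values[1] > max_gap:
                choice = top.indices[0]          # too lopsided: don't spend a bit here
            else:
                choice = top.indices[int(bits.pop(0))]
            ids = torch.cat([ids, choice.view(1, 1)], dim=1)
        return tok.decode(ids[0])

    print(encode("0110100001101001"))            # "hi" in ASCII, as an example payload

The receiver would rerun the same model over the text, apply the same skip rule, and at each remaining step read off whether the 1st or 2nd candidate was taken.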
> I was imagining the message encoded in clear text, not encrypted form, [...]
I was considering that, but I came to the conclusion that it would be an exceedingly poor choice.
Steganography is there to hide that a message has been sent at all. If you make it do double duty as a poor-man's encryption, you are going to have a bad time.
> As such, I approached it as a toy problem, and considered detection by savvy parties to be a feature, not a bug; I imagined something more like a pirate broadcast than a secure line, and intentionally ignored the presumption about the message being encrypted first.
That's an interesting toy problem. In that case, I would still suggest to compress the message, to reduce redundancy.
> If you make it do double duty as a poor-man's encryption, you are going to have a bad time.
For the serious use cases you evidently have in mind, yes, it's folly to have it do double duty, but at the end of the day steganography is an obfuscation technique orthogonal to encryption, so the question of whether to use encryption or not is a nuanced one. Anyhow, I don't think it's fair to characterize this elaborate steganography tech as a poor-man's encryption — LLM tokens are expensive!
Haha, sure, you can call it that if you want, but foolish is cousin to fun, so one application of this tech would be as a comically overwrought way of communicating subtext to an adversary who may not be able to read between the lines otherwise. Imagine using all this highly sophisticated and expensive technology just to write "you're an asshole" to some armchair intelligence analyst who spent their afternoon and monthly token quota decoding your secret message.
This is very cool, but I have a thought - I see this as a last line of defense, and I am concerned that it would give a false sense of security, leading people to be more reckless with secrets.
You could make the same argument for any tool that does not provide high security. In fact security is layered, and no single tool should be relied upon to be your one security tool. You said as much yourself: "I see this as a last line of defense," but I don't see how you conclude that this would inherently cause people to be more reckless with secrets.
The pie in the sky goal for any security org is to have a cred rotation process that is so smooth that you’re able to not worry about leaked creds because it’s so fast and easy to rotate them. If the rotation is automated and if it’s cheap and frictionless to do so, heck why not just rotate them multiple times a day.
Ehhh considering how low the security bar is, I think it is better than nothing. If you inherit a code base, make it a quick initial action to see how much pain you can expect. In practice, I expect a tool like this has so many false positives you cannot keep it as an always running action. More a manual review you run occasionally.
I hope that more secret formats adopt a GitHub-like convention where they are prefaced with an identifier string, so that you do not require heuristics to detect them.
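GitHub's tokens are a good example of this: a fixed prefix turns detection into a plain pattern match instead of an entropy guess (the prefixes and lengths below are from memory, so treat them as approximate):

    import re

    # Prefixed token formats are trivially greppable; no entropy estimate needed.
    GITHUB_TOKEN = re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,255}\b")

    line = 'token = "ghp_abcdefghijklmnopqrstuvwxyz0123456789"'
    if GITHUB_TOKEN.search(line):
        print("possible GitHub token:", line)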
I didn't know what entropy means in software, so here's the definition[0]:
----
Software entropy is a measure of the disorder or complexity of a software system. It is a natural tendency for software entropy to increase over time, as new features are added and the codebase becomes more complex.
High entropy in software development means that the code is difficult to understand, maintain, and extend. It is often characterized by:
Duplicated code: The same code or functionality is repeated in multiple places, which can make it difficult to find and fix bugs.
Complex logic: The code is difficult to follow and understand, which can make it difficult to add new features or fix bugs without introducing new ones.
Poor documentation: The code is not well-documented, which can make it difficult for new developers to understand and contribute to the codebase.
Technical debt: The code has been patched and modified over time without proper refactoring, which can lead to a tangled and cluttered codebase.
Low entropy in software development means that the code is well-organized, easy to understand, and maintain. It is often characterized by:
Well-designed architecture: The code is structured in a logical way, with clear separation of concerns.
Consistent coding style: The code follows a consistent coding style, which makes it easy to read and understand.
Comprehensive documentation: The code is well-documented, with clear explanations of the code's purpose and functionality.
Minimal technical debt: The code has been refactored regularly to remove technical debt, which makes it easy to add new features and fix bugs without introducing new ones.