G-3PO: A protocol droid for Ghidra, or GPT-3 for reverse-engineering (medium.com/tenable-techblog)
485 points by AlbertoGP on Jan 4, 2023 | 44 comments



A few days ago this went mostly ignored (https://news.ycombinator.com/item?id=34161642) and I was asked to re-submit it (https://news.ycombinator.com/item?id=34250150) so that it could get a second chance.

It’s a script for the reverse-engineering tool Ghidra that uses GPT-3 to decompile machine code and to write plain English explanations of what a piece of code does.

The article is quite detailed and describes both its capabilities and its limitations. The G-3PO script is open source (MIT license): https://github.com/tenable/ghidra_tools/tree/main/g3po

There was also another HN story about what at first sight looks like an alternative implementation of the same idea: “GptHidra – Ghidra plugin that asks OpenAI Chat GPT to explain functions”

https://news.ycombinator.com/item?id=34165291

This one is more recent and lacks that good write-up mentioned above. The script is smaller and it seems to have fewer features.

I suggest checking both of them.


Incredible. I had this exact idea rolling around in my head. Could something be trained to decompile binaries to source in a readable way? We have vast source code available and can build the binaries.


Ghidra and IDA Pro pseudocode are already pretty good.


Ghidra may be pretty good for some binaries (or for some audiences?), but my experience trying to get it to reverse both Go and Rust binaries has been abysmal. It fails to correctly identify string literals (which are my #1 go-to for finding "points of interest"), and the decompilation output is ... well, maybe it's helpful to someone, but not to me. I regret letting my Binary Ninja license lapse, since I'd like to see what it would have to say about the same binaries, and I've never had an IDA license, so I don't know what that's like.

As a point of comparison, I fed Ghidra 10.2.2 a copy of gojq 0.12.11 that I had lying around, and this is pretty representative of its output:

  if (DAT_0075c6c8 == (code *)0x0) {
    ppuStack_38 = (undefined **)0x45ecd7;
    FUN_00462ee0(&DAT_0075da28,local_10,iVar3,iVar4);
    *(undefined8 *)(in_FS_OFFSET + -8) = 0x123;
    if (DAT_0075da28 != 0x123) {
      ppuStack_38 = (undefined **)0x45ecf8;
      FUN_00460dc0();
    }
  }
For further comparison, I fed it actual jq and it did much better with the string literals:

    if ((((((iVar6 == 0) || (DAT_00108018 = DAT_00108018 | 1, local_58 != 0)) &&
          ((iVar6 = FUN_001045a0(pFVar20,0x72,"raw-output",pcVar13), iVar6 == 0 ||
           (DAT_00108018 = DAT_00108018 | 8, local_58 != 0)))) &&
         ((iVar6 = FUN_001045a0(pFVar20,99,"compact-output",pcVar13), iVar6 == 0 ||
          (local_5c = local_5c & 0xfffff8be, local_58 != 0)))) &&
        ((iVar6 = FUN_001045a0(pFVar20,0x43,"color-output",pcVar13), iVar6 == 0 ||
         (DAT_00108018 = DAT_00108018 | 0x40, local_58 != 0)))) &&
       (((iVar6 = FUN_001045a0(pFVar20,0x4d,"monochrome-output",pcVar13), iVar6 == 0 ||
         (DAT_00108018 = DAT_00108018 | 0x80, local_58 != 0)) &&
        ((iVar6 = FUN_001045a0(pFVar20,0x61,"ascii-output",pcVar13), iVar6 == 0 ||
         (DAT_00108018 = DAT_00108018 | 0x20, local_58 != 0)))))) {
      iVar6 = FUN_001045a0(pFVar20,0,"unbuffered",pcVar13);


IDA has a free variant. Did you try it?


I somehow thought that IDA Free was missing the decompiler, but I just downloaded 8.2.221216 macOS x86_64 and while it did a much better job at identifying the symbols in the rust binary, regrettably it then consumed 100% of the CPU and effectively locked up. So ... better, I guess? :-/


Myeah, that sounds like a bug in this variant. File a bug report with them; they'll probably release a better free version, and then you can properly test your theory. Good luck.


They are "pretty good" in the sense that they define the sequence of assembly instructions according to some C type code that may have produced them. This is a tool that gives you essentially, "this function might be doing MD5".

I haven't tried it yet, but I can see it being useful if it's somewhat accurate (that's a big if), and quite different from what Ghidra gives you in pseudocode.


> to write plain English explanations of what a piece of code does.

I could use this for a regular project for which I have the source.


Wow, I wouldn't have expected Tenable to shell out to curl, especially when the curl invocation only adds two headers and they omitted the "--fail" flag that would cause non-200 responses to return a non-zero exit code :-(

https://github.com/tenable/ghidra_tools/blob/main/g3po/g3po....


Fair point. It was a quick and dirty workaround, in the absence of `requests` in Ghidra's Jython distribution, but it turns out that `httplib` is available, and the latest commit uses that to do the HTTP request instead.
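
For anyone curious what that looks like, here is a minimal sketch of doing the request with `httplib` from Ghidra's Jython 2.7. This is illustrative rather than the exact code in g3po.py: the model name and prompt are placeholders, and the non-200 check stands in for curl's --fail.

    import httplib, json, os

    # Sketch only: POST a completion request to the OpenAI API using httplib,
    # which ships with Ghidra's Jython 2.7 (requests does not).
    body = json.dumps({
        "model": "text-davinci-003",        # placeholder model name
        "prompt": "Explain this decompiled function:\n...",
        "max_tokens": 512,
    })
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"],
    }
    conn = httplib.HTTPSConnection("api.openai.com")
    conn.request("POST", "/v1/completions", body, headers)
    resp = conn.getresponse()
    if resp.status != 200:                  # the moral equivalent of curl --fail
        raise RuntimeError("OpenAI API returned %d" % resp.status)
    print(json.loads(resp.read())["choices"][0]["text"])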


This is big. I wonder how many subprojects will spin off from it.


Very nice. And it makes me wonder what the result would be if GPT were asked to point out security problems in the code.


You can actually try adding "and indicate what security vulnerabilities are present in the code, if any" or something to that effect to the prompt, by tweaking the `EXTRA` global variable defined near the head of the script. My experience with this so far is that it tends to spew out infosec truisms that aren't closely connected with the code, and that most interesting vulnerabilities require a bit more contextual awareness to notice than this tool has available to it, but ymmv, and it's definitely worth taking a bit of time to see if you can massage the prompt to finagle useful bughunting output from the tool.
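
Concretely, that tweak is a one-line change, something along these lines. The EXTRA variable is real, but how it gets spliced into the prompt below is just my shorthand, not the script's actual template, and the decompiled snippet is a made-up placeholder.

    # EXTRA lives near the top of g3po.py; append the vulnerability question.
    EXTRA = "and indicate what security vulnerabilities are present in the code, if any"

    # Illustrative only: a stand-in for the script's real prompt template.
    decompiled_c = "int FUN_00401000(char *s) { char buf[16]; strcpy(buf, s); return 0; }"
    prompt = ("Below is some C code that Ghidra decompiled. Explain what it does "
              + EXTRA + ":\n\n" + decompiled_c)
    print(prompt)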


Thanks for boosting this!


I'm partial to Gepetto for IDA, which includes an especially hilarious trick in which it instructs ChatGPT to phrase its responses in JSON, and then uses this JSON directly to name variables in the decompilation. If the JSON is incorrect, it politely asks ChatGPT to please fix its JSON output, which usually works.

https://github.com/JusticeRage/Gepetto/blob/main/gepetto.py#...
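
The gist of the trick, sketched out (this is not Gepetto's actual code; ask_model is a stand-in for whatever function sends a prompt to ChatGPT and returns the reply):

    import json

    def suggest_variable_names(ask_model, pseudocode, max_retries=3):
        # Ask for a JSON object mapping current variable names to better ones.
        prompt = ("Suggest better names for the variables in this pseudocode. "
                  "Reply ONLY with a JSON object mapping old names to new names.\n\n"
                  + pseudocode)
        reply = ask_model(prompt)
        for _ in range(max_retries):
            try:
                return json.loads(reply)
            except ValueError:
                # The "politely ask it to fix its JSON" step.
                reply = ask_model("Please fix this so it is valid JSON, and reply "
                                  "with the JSON only:\n\n" + reply)
        return {}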


woah, that's insane... how accurate is this in your experience?


The real question is how a human should merge these results with their own reversing, honestly. I can't really trust GPT-3 to be accurate like I would actually trust the decompiler (and, as any reverser knows, you don't trust the decompiler). I think I would treat the output of this as I might a suggestion from a friend who I let glance over the code: "hmm, that might be a SHA-1?" and then I go confirm the results for myself.


I think this plugin isn't meant to decompile, but rather to explain decompiled code.

They mention that the AI's decompilation is about as good as Ghidra's (though of course less trustworthy).

The benefit is in explaining the decompiled code, they give an example where the prompt to the AI is something like "here is some code decompiled with ghidra, explain it in detail".

From the article:

"the paraphrase of disassembled or decompiled code into high-level commentary, can be assisted by automated tooling as well.

And this is just what the G-3PO Ghidra script does.

The output of such a tool, of course, would have to be carefully checked. Taking its soundness for granted would be a mistake, just as it would be a mistake to put too much faith in the decompiler. We should trust such a tool, backed as it is by an opaque LLM, far less than we trust decompilers, in fact. Fortunately reverse engineering is the sort of domain where we don’t need to trust much at all. It’s an essentially skeptical craft."


At least most of the decompiler logic comes from formal methods, which reduces possible edge cases compared to statistics. There's room for AI in reversing, but it should be a specifically trained model with carefully extracted features from the binary, not just disassembly output: graphs, debug and demangling info, types, IL analysis results, etc.

All these GPT-based plugins are just toys. There's more serious research, like this: [1][2][3]

[1] https://keenlab.tencent.com/zh/2019/12/10/Tencent-Keen-Secur...

[2] https://keenlab.tencent.com/zh/2020/11/03/neurips-2020-camer...

[3] https://keenlab.tencent.com/zh/2021/08/11/2021-binaryai-publ...


Exactly. GPT-3 shines where we can make it solve hard problems in a format where the results are easy to verify.


I've been waiting to see something like this. There's certainly room to fine-tune an LLM for this task; in that vein, I wonder whether Ghidra's pcode would produce better results? It's a bit better suited to this task in that the model wouldn't need to be tuned for each possible instruction set. Training on code compiled at different optimization levels might also produce interesting results.

You could probably also take the explanations from the LLM, convert those into embeddings, and then do semantic search over all functions in a binary. For example, searching for "get process handle and inject dll" and getting a list of prospects. It's less useful in an obfuscated binary, but for things like modding games or extending end-of-life software it could be very useful.
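
A rough sketch of that search, assuming you've already generated an English summary per function and have some embed(text) helper (e.g. a call to an embeddings API; the helper and data shapes here are assumptions, not any particular library's interface):

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    def search_functions(query, summaries, embed):
        # summaries: {function_name: LLM-generated English summary}
        # embed(text): hypothetical helper returning an embedding vector
        qv = embed(query)
        return sorted(summaries,
                      key=lambda name: cosine(qv, embed(summaries[name])),
                      reverse=True)

    # e.g. search_functions("get process handle and inject dll", summaries, embed)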


I'd never considered semantic search for code vulnerabilities.

Maybe this is the next generation of automated code scanning.

Next feature on github: "Our LLM has scanned your code and found a potential buffer overflow. Please mark as a bug or a false report"


I know there's some active work on this (using LLMs, not traditional methods), not on the binary side but on the source analysis side. See https://grit.io/, which tries to detect bugs (and maybe vulnerabilities?) and automatically submits PRs to patch them for you. I think morgante is their contact on HN.

It feels like it'd be difficult to acquire a large corpus of vulnerabilities to train on.


There are already systems for that in GitHub Enterprise. It works decently if you can massage your build to fit into their system. It is still more or less experimental for some languages.


Semmle (now GitHub CodeQL)

ShiftLeft

Oversecured

etc.

There are a few companies doing this.


My PhD research is about classifying functions in obfuscated binaries, and when I saw this I immediately wondered if it will make my work obsolete. I suspect obfuscation will give LLMs a hard time for a little while, at least until they start training on obfuscated code.

But there is still the issue of whether companies doing this kind of RE want to send their code to OpenAI's servers. If you're reverse engineering in order to determine whether you should sue another company for copyright infringement, you are probably cautious about sharing code in the first place.

For this use case there will always be a need for some kind of alternative that you can self-host without sharing your data with another party.


Very neat! I also worked on something that uses GPT-3 for reverse engineering last week. The basic idea is that right now GPT-3 is limited in how much context it can see at once. So instead, to summarize a function in context, I use the call graph to find all of its dependencies, and summarize them one by one, providing the summaries of the callees when summarizing the caller:

https://github.com/moyix/gpt-wpre
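
For anyone who wants the shape of the idea without reading the repo, here's a rough sketch (not the actual gpt-wpre code; summarize() is a stand-in for the GPT-3 call):

    def summarize_program(call_graph, decompiled, summarize):
        # call_graph: {function: [callees]}, decompiled: {function: C text},
        # summarize(code, callee_summaries): hypothetical LLM call.
        summaries = {}

        def visit(fn, in_progress=frozenset()):
            if fn in summaries:
                return summaries[fn]
            if fn in in_progress:               # break cycles (recursion)
                return "(recursive call)"
            callee_summaries = {c: visit(c, in_progress | {fn})
                                for c in call_graph.get(fn, [])}
            # Callees are summarized first, so their summaries can stand in
            # for their bodies in the caller's prompt.
            summaries[fn] = summarize(decompiled[fn], callee_summaries)
            return summaries[fn]

        for fn in call_graph:
            visit(fn)
        return summaries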


This is a cool idea! Definitely going to try giving it a whirl.



Hey guys, I'm the one who wrote the post and the Ghidra plugin. Really delighted to see it get so much traction here! I just merged a couple of PRs which should improve the tool somewhat -- one from eShuttleworth, which uses GPT-3's feedback to automatically rename the function and global variables, and another from me, which does the same for what Ghidra internally refers to as HighVariables (variables visible in the decompiled code listing, as opposed to just the assembly listing). Turns out these two things only look like similar operations from the perspective of the GUI.

Here's the public github repo: <https://github.com/tenable/ghidra_tools> Please feel free to fork it and send me PRs with any features or fixes you think might be useful.
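
For a sense of what the renaming side involves on the Ghidra end, here's a rough Jython sketch. API names are from memory of Ghidra 10.x and may differ in your version; suggest_names is a stub standing in for the GPT-3 round trip, and this is not the plugin's actual code.

    from ghidra.app.decompiler import DecompInterface
    from ghidra.program.model.pcode import HighFunctionDBUtil
    from ghidra.program.model.symbol import SourceType

    def suggest_names(high_func):
        # Stub: in the real plugin, this is where the model's suggested
        # {old_name: new_name} mapping would come from.
        return {}

    func = getFunctionContaining(currentAddress)
    ifc = DecompInterface()
    ifc.openProgram(currentProgram)
    high_func = ifc.decompileFunction(func, 60, monitor).getHighFunction()

    new_names = suggest_names(high_func)

    # Renaming the function itself is straightforward...
    if func.getName() in new_names:
        func.setName(new_names[func.getName()], SourceType.USER_DEFINED)

    # ...while HighVariables (the names in the decompiled listing) go through
    # the decompiler's symbol map; check updateDBVariable's signature against
    # your Ghidra release.
    for sym in high_func.getLocalSymbolMap().getSymbols():
        if sym.getName() in new_names:
            HighFunctionDBUtil.updateDBVariable(sym, new_names[sym.getName()],
                                                None, SourceType.USER_DEFINED)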


GPT gets confused very easily. Try this prompt:

    Explain the following C++ code:


    int sum(int n) {
       int result = 0;
       for (int i = 0; i < n; i++) {
          if (true || false) {
             return 42;
          }
       }
       for (int i = 0; i < n * 2; i++) {
           i++;
       }
       return n;
    }
Then ask it what is the result of that function when n is 5, and then when n is -1.

Note that the function looks trivial, but it has certain tricks to confuse the model, e.g. "result" is never returned; instead, either 42 or n is returned.


Nope, first try:

Explain the following C++ code:

    int sum(int n) {
       int result = 0;
       for (int i = 0; i < n; i++) {
          if (true || false) {
             return 42;
          }
       }
       for (int i = 0; i < n * 2; i++) {
           i++;
       }
       return n;
    }
This C++ function is called sum and it takes an integer argument n. It has a local variable result which is initialized to 0 but is not used in the function.

The function has two for loops. The first for loop iterates over the range 0 to n-1. On each iteration, it checks if true || false is true. Since true || false is always true, it immediately returns the value 42. This means that the first for loop will only execute once.

The second for loop iterates over the range 0 to 2n-1. On each iteration, it increments i by 2 (since i++ increments i by 1 and i is incremented again on the next line). This means that the second for loop will only execute n times.

After the second for loop, the function returns the value of n.

What is the result when n=5?

If n is 5, the function will immediately return 42 when the first for loop is executed, so the function will not reach the second for loop or the return statement at the end. Therefore, the result of calling sum(5) would be 42.

What is the result when n=-1?

If n is negative, the first for loop will not be executed at all because the loop condition i < n will be false from the start. The second for loop will also not be executed because the loop variable i will never be less than 2n, which is a negative number in this case.

Therefore, the result of calling sum(-1) would be -1, which is the value of n that was returned at the end of the function.


When I tried it with the same prompt, it explained the second for loop as

The second for loop iterates over the values of i starting at 0 and going up to, but not including, n * 2. Within this loop is a statement that increments i by 1. This statement has no effect on the loop because i is already being incremented by the loop itself.

which is incorrect.


There are some programming languages where that would be correct - for example, `for i in range(n):` in Python.
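
A quick demonstration of that point:

    # In Python, reassigning i inside the loop body doesn't change how many
    # times the loop runs, because range() hands out a fresh value each pass.
    n = 5
    count = 0
    for i in range(n * 2):
        i += 1
        count += 1
    print(count)  # 10, whereas the C++ loop above only runs n times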


It's not fully deterministic. A minor change in the prompt will cause different results.

In my case, it gave me wrong results.


That's awesome! I also wanted to combine Ghidra and GPT3, but I was unsure about its capabilities to explain code. I made a plugin where you highlight part of the listing view and it explains what is going on with the help of ChatGPT. Link for the curious: https://github.com/SourceDiver42/Ghidra-ChatGPT


> temperature=0.19

A new age of programming voodoo is upon us again. When knobs become IP...


Just because you don’t understand the knob doesn’t mean it’s voodoo or proprietary. LMs of every kind can incorporate temperature into their next-token inference and it’s a common term when describing models like these:

https://ai.stackexchange.com/questions/32477/what-is-the-tem...
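
For the curious, the knob is nothing mystical; here's a toy illustration of temperature-scaled sampling (the numbers and shapes are chosen purely for illustration):

    import math, random

    def sample_with_temperature(logits, temperature=0.19):
        # Divide the logits by the temperature before the softmax: low values
        # sharpen the distribution (more deterministic picks), high values
        # flatten it (more randomness).
        scaled = [l / temperature for l in logits]
        m = max(scaled)
        exps = [math.exp(l - m) for l in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        r = random.random()
        acc = 0.0
        for i, p in enumerate(probs):
            acc += p
            if r <= acc:
                return i
        return len(probs) - 1

    # e.g. sample_with_temperature([2.0, 1.0, 0.1]) almost always returns 0
    #      at temperature 0.19, but much less reliably at temperature 2.0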


Something about machine spirits and technopriests

Edit: tech-priests. But technopriests [1] also looks fun.

[1] https://en.m.wikipedia.org/wiki/The_Technopriests


Simulations have been like this for years.


Sure. But this is an API-level mystic-number game. At scale. For fun and profit.


If you’re going to include a bunch of screenshots of text, it might be a good idea to not compress them till they’re unreadable. Maybe even use PNGs?


Sheesh, wonder how no-one thought of this before.



