Python for Reverse Engineering 1: ELF Binaries

saagarjha · on March 22, 2019

> I’m not sure why it uses puts here? I might be missing something; perhaps printf calls puts.

It’s because you passed a constant string to printf, so the compiler decided a full printf call was not worth it and emitted a call to puts instead.

icy · on March 22, 2019

Thanks! I’d actually figured that out a little while after publishing.

billfruit · on March 22, 2019

In general, though, dealing with binary data in Python isn't particularly intuitive. Also, many Python tutorials and books fail to mention how to manipulate binary data. I feel that is one of the places where the standard library is not that rich.

civility · on March 22, 2019

I disagree. The struct and array (not Numpy) modules are pretty great at cutting up binary data. You provide a format string and it just works.
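As a minimal sketch of the "format string and it just works" point, here is how the struct module could cut up an ELF64 file header. The field names follow the ELF specification; the helper name `parse_elf64_header` and the sample values are made up for illustration, and the sketch assumes a 64-bit little-endian file.

```python
import struct

# ELF64 file header, little-endian, fields in ELF specification order.
# '<' disables native alignment padding, so the size is exactly 64 bytes.
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")

FIELDS = (
    "e_ident", "e_type", "e_machine", "e_version", "e_entry",
    "e_phoff", "e_shoff", "e_flags", "e_ehsize", "e_phentsize",
    "e_phnum", "e_shentsize", "e_shnum", "e_shstrndx",
)


def parse_elf64_header(data: bytes) -> dict:
    """Unpack the first 64 bytes of a little-endian ELF64 file into a dict."""
    if len(data) < ELF64_HEADER.size:
        raise ValueError("truncated header")
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    return dict(zip(FIELDS, ELF64_HEADER.unpack_from(data)))


# Synthetic header for demonstration (hypothetical values, not a real binary).
ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
sample = ELF64_HEADER.pack(ident, 2, 62, 1, 0x401000, 64, 0, 0, 64, 56, 0, 64, 0, 0)
header = parse_elf64_header(sample)
print(hex(header["e_entry"]), header["e_machine"])
```

To try it on a real binary, read the first 64 bytes of the file in "rb" mode and pass them to the function.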

vram22 · on March 22, 2019

Might be useful for beginners to binary file I/O in Python and reverse engineering of data formats:

DBFReader.py [1], which is part of my xtopdf toolkit [2], is a program that uses struct.unpack() multiple times to decode the fields in a DBF (XBASE) file.

[1] https://bitbucket.org/vasudevram/xtopdf/src/default/DBFReade...

[2] http://slides.com/vasudevram/xtopdf

I had first written a Pascal program to read and dump DBF file data after reverse-engineering the DBF format, based on some sketchy info I had access to, years ago. Later wrote the same program in C, and later still in Python, i.e. DBFReader.py .
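For readers who want a feel for the approach, a rough sketch of decoding the start of a DBF header with struct might look like the following. This is not the code from DBFReader.py; it assumes the commonly documented dBASE III layout for the first 12 bytes, and the function name and sample bytes are invented for the example.

```python
import struct
from datetime import date

# First 12 bytes of a dBASE III style header (assumed layout):
# version, last-update year (offset from 1900), month, day,
# record count (uint32), header length (uint16), record length (uint16).
DBF_HEADER = struct.Struct("<BBBBIHH")


def read_dbf_header(data: bytes) -> dict:
    """Decode the fixed leading fields of a DBF header."""
    version, yy, mm, dd, num_records, header_len, record_len = (
        DBF_HEADER.unpack_from(data)
    )
    return {
        "version": version,
        "last_update": date(1900 + yy, mm, dd),
        "num_records": num_records,
        "header_len": header_len,
        "record_len": record_len,
    }


# Synthetic header bytes for illustration only.
sample = DBF_HEADER.pack(0x03, 119, 3, 22, 42, 97, 25)
info = read_dbf_header(sample)
print(info)
```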

jgalt212 · on March 22, 2019

Your statement and his statement are both True (Python spelling variant). Of course, I don't think there are many books or tutorials geared towards beginners that deal with manipulating binary data.

> Also many python tutorials and books fail to mention how to manipulate binary data

billfruit · on March 22, 2019

I thought the format string is unintuitive if there are nested binary structures or arrays of nested binary structures.

civility · on March 22, 2019

I do wish they were combined. It would be nice to handle arrays of structs and structs of arrays more gracefully, and it's unfortunate how the format strings almost (but not really) agree with each other.
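To make the two cases concrete, here is a small sketch of how they can be handled today with the two modules side by side. The record layout (a uint16 id plus two float32 values) and the count-prefixed array are hypothetical formats chosen only for the example.

```python
import array
import struct
import sys

# One record: uint16 id followed by two float32 coordinates (hypothetical layout).
POINT = struct.Struct("<Hff")

# An "array of structs": three records packed back to back.
blob = b"".join(POINT.pack(i, i * 1.5, i * 2.5) for i in range(3))

# iter_unpack walks the buffer one record at a time.
points = list(POINT.iter_unpack(blob))

# A "struct of arrays": a uint32 count followed by that many uint16 values.
values = [10, 20, 30, 40]
packed = struct.pack("<I", len(values)) + struct.pack(f"<{len(values)}H", *values)

(count,) = struct.unpack_from("<I", packed, 0)
payload = array.array("H")
payload.frombytes(packed[4:4 + 2 * count])
if sys.byteorder == "big":
    payload.byteswap()  # array uses native byte order; the data is little-endian
print(points, payload.tolist())
```

Note the mismatch: struct takes the byte order in the format string, while array always uses native order and needs an explicit byteswap().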

And so long as I'm asking for ponies, it would also be nice if they handled complex numbers gracefully.
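One common workaround, sketched here under the assumption that the data is stored as interleaved little-endian float64 (real, imaginary) pairs, is to pack each complex number as two doubles. The helper names are made up for this example.

```python
import struct


def pack_complex(values):
    """Pack complex numbers as interleaved little-endian float64 (re, im) pairs."""
    flat = []
    for z in values:
        flat.extend((z.real, z.imag))
    return struct.pack(f"<{len(flat)}d", *flat)


def unpack_complex(data):
    """Inverse of pack_complex: read (re, im) pairs back into complex numbers."""
    return [complex(re, im) for re, im in struct.iter_unpack("<2d", data)]


samples = [1 + 2j, -0.5j, 3.25 + 0j]
roundtrip = unpack_complex(pack_complex(samples))
print(roundtrip)
```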

hultner · on March 22, 2019

Is it just me, or is the scrolling on this site horribly broken? Shame, because the content looks great.

bhargav · on March 22, 2019

Default behaviour seems to be overridden. I read the article and would recommend you look past the scrolling. If you are on an iDevice, reader mode will help!

Edit: Spelling

RayDonnelly · on March 22, 2019

If you haven't seen it, also check out Project LIEF. It is very good indeed. We use it for a lot of post-build binary verification in the conda ecosystem.

Windows, macOS and Linux are all supported.

https://lief.quarkslab.com/

icy · on March 22, 2019

Hi, I’m the author of this post. Feel free to ask questions, if any.

matmann2001 · on March 22, 2019

Hey. In your C code, you write to memory beyond what you malloc'd. You malloc'd 9 bytes for 'pw', but later do "pw[9] = '\0'", which accesses the 10th byte, which doesn't belong to you.
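The off-by-one can be illustrated from Python with ctypes, which checks bounds where C does not. This is only a sketch of the size arithmetic (9 characters plus a terminating NUL need 10 bytes); the password value is a placeholder, not the one from the article.

```python
import ctypes

password = b"123456789"  # 9 characters (hypothetical value)

# create_string_buffer(bytes) allocates len + 1 bytes so the NUL fits.
buf = ctypes.create_string_buffer(password)
print(len(buf), buf.raw[-1:])

# A buffer of exactly 9 bytes has valid indices 0..8; ctypes checks bounds,
# whereas the equivalent C write would silently touch memory past the block.
too_small = ctypes.create_string_buffer(9)
try:
    too_small[9] = b"\0"
    overflowed = True
except IndexError:
    overflowed = False
print(overflowed)
```

In the C code, the corresponding fix would be to request 10 bytes from malloc.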

blattimwind · on March 22, 2019

malloc allocates aligned memory [1], so technically it's correct that he writes past the allocated memory, but technically it's also impossible for that write to fail or for that write to overwrite something else.

[1] bonus point: for what kind of alignment? (The minimum is quite well specified, for C standards)

spieglt · on March 22, 2019

https://www.gnu.org/software/libc/manual/html_node/Aligned-M...

"The address of a block returned by malloc or realloc in GNU systems is always a multiple of eight (or sixteen on 64-bit systems)."

I was about to say, "what if they're on a 32-bit system and so were only allocated one 8-byte block?" but then realized that since they'd requested 9 bytes, they'd be given two 8-byte blocks, or one 16-byte block on a 64-bit system. Is that right?

spieglt · on March 22, 2019

Well, I guess alignment doesn't say anything about how large of a block is allocated.... And this is the clearest source I can find, which says 32 bytes. https://prog21.dadgum.com/179.html

blattimwind · on March 22, 2019

> Well, I guess alignment doesn't say anything about how large of a block is allocated

It tells you where something can't be, and because virtual memory is allocated in whole pages the "padding" so to speak will always be accessible.

There's also the obvious truism that if you can access something in a cache line, all addresses in the cache line are safe to access. (Vectorized algorithms frequently implicitly rely on this for short reads, IOW there is no way reading a 128 or 256 bit vector can fault if just reading the first lane would not fault).
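The page-boundary part of that argument comes down to simple address arithmetic, sketched below. The 4096-byte page size is an assumption (other sizes exist), and the sketch says nothing about what the C standard permits, only about whether a load can straddle two pages.

```python
PAGE_SIZE = 4096  # assumption: typical page size; other sizes exist


def crosses_page(addr: int, size: int, page_size: int = PAGE_SIZE) -> bool:
    """True if the byte range [addr, addr + size) spans more than one page."""
    return addr // page_size != (addr + size - 1) // page_size


# A 16-byte load aligned to 16 bytes never straddles a page boundary...
aligned_ok = not any(crosses_page(a, 16) for a in range(0, 4 * PAGE_SIZE, 16))

# ...but an unaligned 16-byte load starting near the end of a page can.
unaligned_crosses = crosses_page(PAGE_SIZE - 8, 16)
print(aligned_ok, unaligned_crosses)
```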

saagarjha · on March 22, 2019

> Vectorized algorithms frequently implicitly rely on this for short reads

This is extremely processor-dependent and you should not be writing C if you’re relying on this.

blattimwind · on March 23, 2019

> This is extremely processor-dependent

No, it's not.

> you should not be writing C if you’re relying on this.

Luckily you are in no position to tell anyone what they should or shouldn't do.

saagarjha · on March 24, 2019

Sorry, I misunderstood the context of that statement and thought you were talking about vectorized algorithms exploiting out-of-bounds reads in general, which is pretty dependent on the processor as to when it will work (depending on how page boundaries and cache lines are set up). And I didn't really mean my statement about using C in the prescriptive way you seem to have taken it: I was merely trying to say that you should probably be using assembly in this case, because you are relying on details of your processor that your compiler is likely to be unaware of and may penalize you for. For example, the vectorized string routines in libSystem do overshoot the end of the string because they use pcmpeqb, and they are written in assembly because they rely on alignment guarantees that are difficult to express in C. Plus it guarantees vectorization ;)

blattimwind · on March 24, 2019

Ah, true, it is my turn to apologize then for interpreting your post in a rather uncharitable way.

jmts · on March 23, 2019

Then one day you come back and resize the array to a multiple of the memory alignment, and BAM! Off-by-one errors, or even vulnerabilities.

Or you enable more strict build settings and BAM! You have to go back and deal with all the places your code allows you to write off the end of a buffer because you just didn't give a damn before.

saagarjha · on March 22, 2019

For Glibc on Linux, I believe this is 32 bytes. I think musl does 16 bytes, as does libSystem on macOS.

sgillen · on March 22, 2019

Still feels dirty though doesn't it? Would never want to rely on this fact..

saagarjha · on March 22, 2019

Yeah, this is undefined behavior and your compiler might bite you for it.

w0mbat · on March 22, 2019

Yes, that jumped off the page at me too, and distracted me from the rest of the article.

matmann2001 · on March 22, 2019

Especially given the topic, I kept jumping around to see if it was intentional. Like maybe they would use these RE tools to exploit it.

icy · on March 22, 2019

Ah my bad. I’ll make sure to fix it. Sorry about that.

75dvtwin · on March 22, 2019

Could you briefly outline the space/position of this framework relative to others (e.g. https://github.com/cea-sec/miasm)? I would very much appreciate it.

Also, besides the security aspect (e.g. intrusion/virus detection), I was looking at these frameworks as something "higher-level than assembler, and less hardware-architecture dependent than LLVM IR". Is there an angle where reverse engineering tools have a separate life as a better-than-assembler toolchain for low-level programming?

icy · on March 22, 2019

For starters, the purpose of this post was never to build an entire framework, like the one you’ve linked, but rather a small set of scripts to try and understand what disassemblers do under the hood. These scripts can also be tossed into some kind of automation pipeline of sorts, something like a CI/CD perhaps. There's a lot you can (potentially) do.

And your second question, I'm not sure I understand what you're attempting to convey.

monocasa · on March 22, 2019

Neat!

You can see some similar code I wrote in Rust here: https://github.com/monocasa/exeutils

icy · on March 22, 2019

Nice. I’ve been planning to rewrite `readelf(1)` in Nim, I’ll check out your code for some pointers :)

monocasa · on March 22, 2019

Word, you should check out the backing library I wrote too then.

https://github.com/monocasa/exefmt

qaq · on March 22, 2019

Wonder why security topics never get much interest on HN. It's a huge industry with a ton of VC funding going to security startups.

daeken · on March 22, 2019

Eh, it depends on the topic. Binary reversing stuff rarely gets much love, but there frankly just aren't too many people doing that stuff. Web security things get lots of love, usually -- I both launched and sold a web security class via HN, very successfully -- because there are just so many people who are interested in it; it's the bread and butter of the industry nowadays. And anything privacy-oriented or seriously pwned always gets clicks and upvotes.

But yeah, this stuff is good content but doesn't have much reach.

dang · on March 22, 2019

I'd have said it's of consistently high interest. What makes you say it isn't?

qaq · on March 22, 2019

I might be really off but it seems they rarely get more than 50 comments (unless it's some major breach).

dang · on March 22, 2019

Did you see https://news.ycombinator.com/item?id=19315273? It's just one data point but you might find it interesting.

There is a pattern where highly specialized technical posts don't get as many comments, relative to votes. Possibly reverse engineering and other security-related specialties fall into this. One can see the same thing in e.g. articles about type theory: people are interested, but don't necessarily feel qualified to add to the discussion. That's probably good if it prevents the dumb sorts of comment from getting posted, but maybe the threads would be more valuable if more users would ask questions. Then the users who know could explain, and more learning would take place.

saagarjha · on March 22, 2019

Then again Ghidra was hyped for months prior to its release.

qaq · on March 22, 2019

Good point. It might be because it was an NSA tool.

pjc50 · on March 22, 2019

Upvotes and comments are very different; even commenting beyond a certain limit counts negatively towards the article's front page position. If you want a lot of comments start an argument.

z3phyr · on March 22, 2019

Binary, firmware and hardware level security topics are academically most satisfying and fun to me. But there is a lot of mystery in these topics, given the inherent negativity and legal grey areas people have to deal with. I guess that is one of the reasons..

rhexs · on March 22, 2019

For one, the article seems to be impossible to read on an iPhone via safari.

danmg · on March 22, 2019

Add this to your iphone's rss reader:

https://damng.github.io/hackernews-rss-with-inlined-content/...

Content will be made readable and inlined into the feed.

kiddico · on March 22, 2019

It seems to break in a different way every time I reload the page.

icy · on March 22, 2019

Mind telling me which model? Could be my piss poor CSS acting up at that resolution.

xrisk · on March 22, 2019

What's the issue? I'll message the author.

benj111 · on March 22, 2019

I got here via the front page, which would seem to discredit your theory.

Anyway VC funding doesn't necessarily equate to being interesting.