Python for Reverse Engineering 1: ELF Binaries (icyphox.sh)
173 points by xrisk on March 22, 2019 | 48 comments



> I’m not sure why it uses puts here? I might be missing something; perhaps printf calls puts.

It’s because you passed a constant string with no format specifiers to printf, so the compiler decided a full printf call was not worth it and emitted a call to puts instead.


Thanks! I’d actually figured that out a little while after publishing.


In general though, dealing with binary data in Python isn't particularly intuitive. Also, many Python tutorials and books fail to mention how to manipulate binary data. I feel that is one of the places where the standard library is not that rich.


I disagree. The struct and array (not Numpy) modules are pretty great at cutting up binary data. You provide a format string and it just works.
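To make the format-string idea concrete, here is a minimal sketch (not code from the article). It unpacks the 16-byte `e_ident` prefix of an ELF header with one `struct` format, then uses `array` to slice a run of 16-bit integers. The helper name `parse_e_ident` and the sample bytes are made up for illustration.

```python
import struct
import sys
from array import array

# e_ident layout: 4-byte magic, class, data encoding, version,
# OS ABI, ABI version, then 7 padding bytes (16 bytes total).
E_IDENT = struct.Struct("<4sBBBBB7x")


def parse_e_ident(data: bytes) -> dict:
    """Decode the identification bytes at the start of an ELF file."""
    magic, ei_class, ei_data, ei_version, ei_osabi, _abi_version = (
        E_IDENT.unpack_from(data, 0)
    )
    if magic != b"\x7fELF":
        raise ValueError("not an ELF file")
    return {
        "class": {1: "ELF32", 2: "ELF64"}.get(ei_class, "unknown"),
        "data": {1: "little-endian", 2: "big-endian"}.get(ei_data, "unknown"),
        "version": ei_version,
        "osabi": ei_osabi,
    }


# Hand-built sample: a 64-bit, little-endian ELF identification block.
sample = b"\x7fELF" + bytes([2, 1, 1, 0, 0]) + bytes(7)
info = parse_e_ident(sample)

# array reads a homogeneous run of values in native byte order,
# so swap on big-endian hosts when the data is little-endian.
words = array("H")
words.frombytes(b"\x01\x00\x02\x00")
if sys.byteorder == "big":
    words.byteswap()
```

With a real binary you would pass the first 16 bytes read from the file opened in `"rb"` mode instead of `sample`.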


Might be useful for beginners to binary file I/O in Python and reverse engineering of data formats:

DBFReader.py [1], which is part of my xtopdf toolkit [2] is a program that uses struct.unpack() multiple times to decode the fields in a DBF (XBASE) file.

[1] https://bitbucket.org/vasudevram/xtopdf/src/default/DBFReade...

[2] http://slides.com/vasudevram/xtopdf

I had first written a Pascal program to read and dump DBF file data after reverse-engineering the DBF format, based on some sketchy info I had access to, years ago. Later wrote the same program in C, and later still in Python, i.e. DBFReader.py .
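For readers who have not seen the technique, here is a rough sketch of decoding a DBF header with `struct`. This is not the code from DBFReader.py. It assumes the commonly documented dBase III layout (a 32-byte file header, then 32-byte field descriptors ending at a 0x0D byte), and the function name and sample data are invented for illustration.

```python
import struct
from datetime import date

# Assumed dBase III layout: version, last-update date (YY MM DD),
# record count, header length, record length, 20 reserved bytes.
DBF_HEADER = struct.Struct("<B3BIHH20x")
# Assumed field descriptor: 11-byte name, 1-char type, 4 reserved,
# field length, decimal count, 14 reserved bytes.
DBF_FIELD = struct.Struct("<11sc4xBB14x")
FIELD_TERMINATOR = 0x0D


def parse_dbf_header(data: bytes) -> dict:
    """Decode the file header and field descriptors of a DBF file."""
    version, yy, mm, dd, num_records, header_len, record_len = (
        DBF_HEADER.unpack_from(data, 0)
    )
    fields = []
    offset = DBF_HEADER.size
    # Field descriptors repeat until the terminator byte.
    while data[offset] != FIELD_TERMINATOR:
        raw_name, ftype, length, decimals = DBF_FIELD.unpack_from(data, offset)
        name = raw_name.split(b"\x00", 1)[0].decode("ascii")
        fields.append((name, ftype.decode("ascii"), length, decimals))
        offset += DBF_FIELD.size
    return {
        "version": version,
        "last_update": date(1900 + yy, mm, dd),  # year stored as an offset from 1900
        "num_records": num_records,
        "header_len": header_len,
        "record_len": record_len,
        "fields": fields,
    }


# Hand-built sample header with two fields, for demonstration only.
sample = (
    DBF_HEADER.pack(0x03, 119, 3, 22, 2, 32 + 2 * 32 + 1, 1 + 10 + 5)
    + DBF_FIELD.pack(b"NAME", b"C", 10, 0)
    + DBF_FIELD.pack(b"AGE", b"N", 5, 0)
    + b"\r"
)
info = parse_dbf_header(sample)
```

Real DBF variants differ in details such as the version byte and extra header fields, so treat the two format strings as a starting point to check against your own files.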


Your statement and his statement are both True (Python spelling variant). Of course, I don't think there are many books or tutorials geared towards beginners that deal with manipulating binary data.

> Also, many Python tutorials and books fail to mention how to manipulate binary data


I thought the format string is unintuitive if there are nested binary structures or arrays of nested binary structures.


I do wish they were combined. It would be nice to handle arrays of structs and structs of arrays more gracefully, and it's unfortunate how the format strings almost (but not really) agree with each other.

And so long as I'm asking for ponies, it would also be nice if they handled complex numbers gracefully.
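One workaround with the standard library as it stands, sketched under assumptions: describe each level with its own `struct.Struct`, use `iter_unpack` for the array of inner structs, and store a complex number as two doubles. The table layout, the names `parse_table`, `pack_complex` and `unpack_complex`, and the sample bytes are all hypothetical.

```python
import struct
from collections import namedtuple

Header = namedtuple("Header", "magic count")
Entry = namedtuple("Entry", "tag offset size")

# Outer struct: 4-byte magic plus entry count.
HEADER = struct.Struct("<4sI")
# Inner struct, repeated `count` times: 4-byte tag, offset, size.
ENTRY = struct.Struct("<4sII")
# A complex number stored as two little-endian doubles (real, imag).
COMPLEX = struct.Struct("<2d")


def parse_table(data: bytes):
    """Decode a header followed by an array of fixed-size entries."""
    header = Header._make(HEADER.unpack_from(data, 0))
    start = HEADER.size
    end = start + header.count * ENTRY.size
    if end > len(data):
        raise ValueError("truncated entry table")
    # iter_unpack walks the slice one ENTRY-sized record at a time.
    entries = [Entry._make(fields) for fields in ENTRY.iter_unpack(data[start:end])]
    return header, entries


def pack_complex(z: complex) -> bytes:
    return COMPLEX.pack(z.real, z.imag)


def unpack_complex(buf: bytes, offset: int = 0) -> complex:
    return complex(*COMPLEX.unpack_from(buf, offset))


# Hand-built sample: a header announcing two entries.
blob = (
    HEADER.pack(b"TBL0", 2)
    + ENTRY.pack(b"text", 64, 128)
    + ENTRY.pack(b"data", 192, 32)
)
header, entries = parse_table(blob)
roundtrip = unpack_complex(pack_complex(1.5 - 2j))
```

This still leaves the nesting logic in your own code rather than in the format string, which is the gap the comments above are pointing at.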


Is it just me, or is the scrolling on this site horribly broken? Shame, because the content looks great.


Default behaviour seems to be overridden. I read the article and would recommend you look past the scrolling. If you are on an iDevice, reader mode will help!

Edit: Spelling


If you haven't seen it, also check out Project LIEF. It is very good indeed. We use it for a lot of post-build binary verification in the conda ecosystem.

Windows, macOS and Linux are all supported.

https://lief.quarkslab.com/


Hi, I’m the author of this post. Feel free to ask questions, if any.


Hey. In your C code, you write to memory beyond what you malloc'd. You malloc'd 9 bytes for 'pw', but later do "pw[9] = '\0'", which accesses the 10th byte, which doesn't belong to you.


malloc allocates aligned memory [1], so technically it's correct that he writes past the allocated memory, but technically it's also impossible for that write to fail or for that write to overwrite something else.

[1] bonus point: for what kind of alignment? (The minimum is quite well specified, for C standards)


https://www.gnu.org/software/libc/manual/html_node/Aligned-M...

"The address of a block returned by malloc or realloc in GNU systems is always a multiple of eight (or sixteen on 64-bit systems)."

I was about to say, "what if they're on a 32-bit system and so were only allocated one 8-byte block?" but then realized that since they'd requested 9 bytes, they'd be given two 8-byte blocks, or one 16-byte block on a 64-bit system. Is that right?


Well, I guess alignment doesn't say anything about how large of a block is allocated.... And this is the clearest source I can find, which says 32 bytes. https://prog21.dadgum.com/179.html


> Well, I guess alignment doesn't say anything about how large of a block is allocated

It tells you where something can't be, and because virtual memory is allocated in whole pages the "padding" so to speak will always be accessible.

There's also the obvious truism that if you can access something in a cache line, all addresses in the cache line are safe to access. (Vectorized algorithms frequently implicitly rely on this for short reads, IOW there is no way reading a 128 or 256 bit vector can fault if just reading the first lane would not fault).


> Vectorized algorithms frequently implicitly rely on this for short reads

This is extremely processor-dependent and you should not be writing C if you’re relying on this.


> This is extremely processor-dependent

No, it's not.

> you should not be writing C if you’re relying on this.

Luckily you are in no position to tell anyone what they should or shouldn't do.


Sorry, I misunderstood the context of that statement and thought you were talking about vectorized algorithms exploiting out-of-bounds reads in general, which is pretty dependent on the processor as to when it will work (depending on how page boundaries and cache lines are set up). And I didn't really mean my statement about using C in the prescriptive way you seem to have taken it: I was merely trying to say that you should probably be using assembly in this case, because you are relying on details of your processor that your compiler is likely to be unaware of and may penalize you for. For example, the vectorized string routines in libSystem do overshoot the end of the string because they use pcmpeqb, and they are written in assembly because they rely on alignment guarantees that are difficult to express in C. Plus it guarantees vectorization ;)


Ah, true, it is my turn to apologize then for interpreting your post in a rather uncharitable way.


Then one day you come back and resize the array to a multiple of the memory alignment, and BAM! Off-by-one errors, or even vulnerabilities.

Or you enable more strict build settings and BAM! You have to go back and deal with all the places your code allows you to write off the end of a buffer because you just didn't give a damn before.


For Glibc on Linux, I believe this is 32 bytes. I think musl does 16 bytes, as does libSystem on macOS.
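To show where a figure like 32 bytes can come from, here is a simplified model of how glibc's allocator is generally described as rounding a request up to a chunk size on x86-64. The constants (an 8-byte size field, 16-byte alignment, 32-byte minimum chunk) and the function names are assumptions for illustration, not values read from any particular glibc build.

```python
def chunk_size(request: int, size_sz: int = 8) -> int:
    """Model: round a malloc request up to an aligned chunk size.

    Assumes alignment of 2 * size_sz and a minimum chunk of 4 * size_sz,
    which matches the usual description of 64-bit glibc.
    """
    align_mask = 2 * size_sz - 1
    min_size = 4 * size_sz
    padded = request + size_sz + align_mask
    if padded < min_size:
        return min_size
    return padded & ~align_mask


def usable_bytes(request: int, size_sz: int = 8) -> int:
    """Model: bytes available to the caller inside the chunk."""
    return chunk_size(request, size_sz) - size_sz


nine = chunk_size(9)            # the 9-byte request from the article
nine_usable = usable_bytes(9)
twenty_five = chunk_size(25)    # first request that needs a bigger chunk
```

Under this model, a 9-byte request lands in a 32-byte chunk with 24 usable bytes, so writing index 9 stays inside the chunk. That explains why the bug goes unnoticed in practice, but it does not make the write defined behavior in C.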


Still feels dirty though doesn't it? Would never want to rely on this fact..


Yeah, this is undefined behavior and your compiler might bite you for it.


Yes, that jumped off the page at me too, and distracted me from the rest of the article.


Especially given the topic, I kept jumping around to see if it was intentional. Like maybe they would use these RE tools to exploit it.


Ah my bad. I’ll make sure to fix it. Sorry about that.


Could you briefly outline the space/position of this framework relative to others (e.g. https://github.com/cea-sec/miasm)? I would very much appreciate it.

Also, besides the security aspect (e.g. intrusion/virus detection), I was looking at these frameworks as something 'higher-level than assembler, and less hardware-architecture dependent than LLVM IR'. Is there an angle where reverse engineering tools have a separate life as a better-than-assembler toolchain for low-level programming?


For starters, the purpose of this post was never to build an entire framework, like the one you’ve linked, but rather a small set of scripts to try and understand what disassemblers do under the hood. These scripts can also be tossed into some kind of automation pipeline of sorts, something like a CI/CD perhaps. There's a lot you can (potentially) do.

As for your second question, I'm not sure I understand what you're asking.


Neat!

You can see some similar code I wrote in Rust here: https://github.com/monocasa/exeutils


Nice. I’ve been planning to rewrite `readelf(1)` in Nim, I’ll check out your code for some pointers :)


Word, you should check out the backing library I wrote too then.

https://github.com/monocasa/exefmt


Wonder why security topics never get much interest on HN. It's a huge industry with a ton of VC funding going to security startups.


Eh, it depends on the topic. Binary reversing stuff rarely gets much love, but there frankly just aren't too many people doing that stuff. Web security things get lots of love, usually -- I both launched and sold a web security class via HN, very successfully -- because there are just so many people who are interested in it; it's the bread and butter of the industry nowadays. And anything privacy-oriented or seriously pwned always gets clicks and upvotes.

But yeah, this stuff is good content but doesn't have much reach.


I'd have said it's of consistently high interest. What makes you say it isn't?


I might be really off but it seems they rarely get more than 50 comments (unless it's some major breach).


Did you see https://news.ycombinator.com/item?id=19315273? It's just one data point but you might find it interesting.

There is a pattern where highly specialized technical posts don't get as many comments, relative to votes. Possibly reverse engineering and other security-related specialties fall into this. One can see the same thing in e.g. articles about type theory: people are interested, but don't necessarily feel qualified to add to the discussion. That's probably good if it prevents the dumb sorts of comment from getting posted, but maybe the threads would be more valuable if more users would ask questions. Then the users who know could explain, and more learning would take place.


Then again Ghidra was hyped for months prior to its release.


Good point. It might be because it was an NSA tool.


Upvotes and comments are very different; even commenting beyond a certain limit counts negatively towards the article's front page position. If you want a lot of comments start an argument.


Binary, firmware and hardware level security topics are academically most satisfying and fun to me. But there is a lot of mystery in these topics, given the inherent negativity and legal grey areas people have to deal with. I guess that is one of the reasons..


For one, the article seems to be impossible to read on an iPhone via Safari.


Add this to your iPhone's RSS reader:

https://damng.github.io/hackernews-rss-with-inlined-content/...

Content will be made readable and inlined into the feed.


It seems to break in a different way every time I reload the page.


Mind telling me which model? Could be my piss poor CSS acting up at that resolution.


What's the issue? I'll message the author.


I got here via the front page, which would seem to discredit your theory.

Anyway VC funding doesn't necessarily equate to being interesting.



