Show HN: Unblob – extraction suite for 30+ file formats

jstrieb · on Jan 19, 2023

Unblob looks cool!

If you're interested in something similar that can put things back together after you've modified them, check out OFRAK:

https://github.com/redballoonsecurity/ofrak

It's designed with embedded systems in mind, but has support for all kinds of other stuff, too. It also has some very advanced binary patching capabilities.

I work on it as part of my day job.

kbd · on Jan 19, 2023

For years and years I've used `dtrx` ("do the right extraction") (https://github.com/dtrx-py/dtrx/). Maybe I should switch to unblob?

It looks like unblob has the right behavior by default that I have to alias for `dtrx`:

    alias dtrx='dtrx --one=inside'

But I'll probably want to create an alias for unblob to change default depth to 1.

nikeee · on Jan 19, 2023

I've been using unp (https://manpages.ubuntu.com/manpages/focal/man1/unp.1.html), which is just a wrapper around standard cli tools for unpacking things (tar, xz, unzip, etc). It seems pretty dated by now, good to see some replacements!

JonathonW · on Jan 19, 2023

If you’re looking for a general-purpose extractor for known, common archive formats, bsdtar is really nice these days; it’s libarchive-based and does way more than just tarballs (extracts zip, rar, and 7z as well as all the common compression formats on top of tar plus a bunch of others).

Not really in the same class of tools as unblob, but handy to have around regardless.

mintplant · on Jan 19, 2023

apack/aunpack from the atool suite for me [0]. Funny how many solutions exist for this problem. Though I think Unblob is aiming more for binwalk's niche [1].

[0] https://www.nongnu.org/atool/

[1] https://github.com/ReFirmLabs/binwalk

qkaiser · on Jan 19, 2023

The atool suite is great but only supports well-formatted files. The idea with unblob is to precisely identify valid chunks of data within arbitrary files, carve them out, then decompress/extract/convert them. You would not believe how many embedded device vendors' custom formats are just a loose aggregation of almost-standard archive and compression formats packed up in a single file :D
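To make "identify valid chunks, carve them out" concrete, here is a minimal Python sketch. It is not unblob's implementation, and the function name `carve_gzip_chunks` is made up for illustration. It scans a blob for the gzip magic, keeps only candidates that really decompress, and uses the decompressor's leftover data to work out where each stream ends.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b\x08"  # gzip ID bytes followed by the deflate method byte


def carve_gzip_chunks(blob: bytes) -> list[tuple[int, int, bytes]]:
    """Return (start, end, decompressed) for every valid gzip stream in blob."""
    chunks = []
    pos = 0
    while (start := blob.find(GZIP_MAGIC, pos)) != -1:
        decomp = zlib.decompressobj(wbits=31)  # 31 = expect a gzip wrapper
        try:
            data = decomp.decompress(blob[start:])
        except zlib.error:
            pos = start + 1  # magic bytes without a valid stream behind them
            continue
        if not decomp.eof:
            pos = start + 1  # stream never terminated, treat it as invalid
            continue
        # Whatever the decompressor did not consume lies after the chunk.
        end = len(blob) - len(decomp.unused_data)
        chunks.append((start, end, data))
        pos = end  # continue scanning after the carved chunk
    return chunks


# A made-up "firmware": vendor header, a fake magic, a real gzip stream, padding.
payload = gzip.compress(b"rootfs contents")
firmware = b"VENDOR-HDR\x00" + b"\x1f\x8b\x08junk" + payload + b"\xff" * 16
found = carve_gzip_chunks(firmware)
```

The fake magic at offset 11 is rejected because no valid stream follows it, and only the real stream is reported along with its exact start and end offsets.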

wrayjustin · on Jan 19, 2023

Awesome tool/project.

These blind (or "magic"; https://en.m.wikipedia.org/wiki/List_of_file_signatures) extractors can be handy in a very wide range of applications.

From firmware/device image hacking to reverse engineering to data recovery.

kissgyorgy · on Jan 19, 2023

Yes, we developed a framework "unblob core" which can be extended easily [1] to any use-case. For example, a separate Python package could contain format specifications for "forensic analysis" or "game assets" (these two came to my mind, because we already got requests for them).

[1]: https://unblob.org/development/#writing-handlers

wrayjustin · on Jan 19, 2023

The forensic unpacking is such a powerful use case :)

scrollaway · on Jan 19, 2023

Since you're the author and I see the tool is in Python: I'm the original author of UnityPack (https://github.com/hearthsim/unitypack - nowadays, the fork UnityPy is more powerful and maintained: https://github.com/K0lb3/UnityPy).

It's in Python and is able to deserialize Unity archives, treating them as a serialization format rather than a simple archive format. Feel free to email me if you want to integrate something like this or you have questions :)

ukuina · on Jan 19, 2023

Neat!

Possible alternative to Universal Extractor. https://github.com/Bioruebe/UniExtract2

alschwalm · on Jan 18, 2023

Looks nice! Kind of reminds me of binwalk: https://github.com/ReFirmLabs/binwalk

kissgyorgy · on Jan 19, 2023

The main difference is that binwalk goes through a file linearly, searching for patterns like magic bytes, and tries to extract everything it finds.

The problems with this:

- very noisy, finding a lot of false positives (license code, format inside another format, etc).

- very slow, trying to extract irrelevant things

- imprecise, because it finds patterns in the middle of a file, where it's actually not relevant on the first level of extraction

unblob solves these problems by being smarter about the file formats and recognizing them by their specification. For example, it unpacks format header structs and carves out files based on the information in the header (size, offset). See a simple example for NTFS [1].
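A rough Python sketch of the difference, using a made-up toy format (magic `TOY1`, then a little-endian payload size and a CRC32 of the payload). None of these names come from unblob or binwalk; they only illustrate magic-only matching versus trusting a validated header.

```python
import struct
import zlib

MAGIC = b"TOY1"
HEADER = struct.Struct("<4sII")  # magic, payload size, CRC32 of the payload


def pack_toy(payload: bytes) -> bytes:
    """Build one chunk of the hypothetical toy format."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def naive_hits(blob: bytes) -> list[int]:
    """Every offset where the magic appears, whether or not it is a real chunk."""
    hits, pos = [], blob.find(MAGIC)
    while pos != -1:
        hits.append(pos)
        pos = blob.find(MAGIC, pos + 1)
    return hits


def carve(blob: bytes) -> list[tuple[int, int]]:
    """(start, end) of chunks whose header fields are consistent with the data."""
    chunks, pos = [], 0
    while (start := blob.find(MAGIC, pos)) != -1:
        pos = start + 1
        if start + HEADER.size > len(blob):
            break  # not enough bytes left for a header
        _, size, crc = HEADER.unpack_from(blob, start)
        end = start + HEADER.size + size
        if end > len(blob):
            continue  # declared size runs past the end of the file
        if zlib.crc32(blob[start + HEADER.size:end]) != crc:
            continue  # checksum mismatch, so this is not a real chunk
        chunks.append((start, end))
        pos = end  # skip the payload: nested formats belong to the next level
    return chunks


inner = pack_toy(b"nested")
outer = pack_toy(b"license text mentioning TOY1 " + inner)
blob = b"\x00" * 8 + outer + b"TOY1 stray magic"
```

Here `naive_hits(blob)` reports four offsets (the real chunk, the magic inside the license text, the nested chunk, and the stray magic at the end), while `carve(blob)` returns only the single top-level chunk.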

We also went to great lengths to prevent unnecessary work by skipping formats inside another [2]. We are using hyperscan [3] instead of grepping byte sequences with Python, which is orders of magnitude faster. Because of this, it can also handle 4GB+ files, which binwalk cannot.

It has been used in production for a year now, and it's way more precise and faster than binwalk. We are getting fewer false positives too, and even when unblob fails to extract everything, we still get meaningful information out of firmware images where binwalk previously just failed with no output.

[1]: https://github.com/onekey-sec/unblob/blob/main/unblob/handle...

[2]: https://github.com/onekey-sec/unblob/blob/main/unblob/proces...

[3]: https://github.com/intel/hyperscan

kissgyorgy · on Jan 19, 2023

Indeed, it's a smarter alternative to binwalk, which we started because binwalk was not a good fit for us :) We should probably include a comparison somewhere in the docs.

wrayjustin · on Jan 19, 2023

Adding a comparison, or at least the clear differentiators, to the documentation would be very helpful.

As someone who uses `binwalk` extensively in a professional setting, with tooling built around `binwalk`, it would be useful to see (a) how `unblob` would integrate and (b) whether it could be a replacement or a supplement.

infotogivenm · on Jan 19, 2023

+1 on a comparison. Binwalk has served me faithfully for years but you never know what you don’t know until the day it fails

fellowmartian · on Jan 19, 2023

This looks awesome! Looking forward to trying it over binwalk! It'd be great if you could get it building on aarch64 and non-Linux systems. I've tried adding the flake to my nix dotfiles on Mac, but quickly realized you support only x86_64-linux for now.

kissgyorgy · on Jan 19, 2023

hyperscan is supported on 64-bit Intel only, but there is another project called vectorscan which supports ARM. My colleague wrote a Python wrapper for vectorscan: https://github.com/vlaci/pyperscan and the initial work has already been merged: https://github.com/onekey-sec/unblob/pull/475

phoyd · on Jan 19, 2023

I am not seeing an option to pass a password to the archive extractors. Is there any?

kissgyorgy · on Jan 19, 2023

No, not yet, however the problem came up several times. Feel free to create a GitHub issue if you would like this feature.

enos_feedler · on Jan 19, 2023

Damn I would have loved this for ripping "gamez" back in the day

arcticbull · on Jan 19, 2023

Very cool - something I don't expect to need very often but likely critical when I do. I can't help but feel there was a missed opportunity to name it 'unpackman'

snthpy · on Jan 19, 2023

Very cool!

Does anyone know of something similar for text file formats? In particular something that makes it easy to work with legacy fixed width record file formats?

phs318u · on Jan 19, 2023

Maybe not exactly what you’re after but I’ve used Flat File Extractor with great success.

EDITED to correct autocorrect.

https://ff-extractor.sourceforge.net/

snthpy · on Jan 19, 2023

No, that's pretty much what I was looking for.

I'm going to pretend I didn't see the XML example but I think I could produce CSV files with this.

Thanks!

qkaiser · on Jan 19, 2023

If you're talking about Intel HEX and Motorola S-Records, we developed unblob handlers for them. They're not public at the moment, but I can assure you it works.

saulpw · on Jan 19, 2023

VisiData should work: vd -f fixed file.txt
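If a small script is preferable to a tool, fixed-width records can also be sliced by column position in plain Python. This is only a sketch: the `LAYOUT` column positions and field names below are invented for the example and would have to come from the real record specification.

```python
import csv
import io

# Hypothetical layout: (field name, start column, end column), end exclusive.
LAYOUT = [("id", 0, 5), ("name", 5, 20), ("balance", 20, 30)]


def parse_fixed_width(lines, layout=LAYOUT):
    """Yield one dict per record, slicing each line by column and trimming padding."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        yield {name: line[start:end].strip() for name, start, end in layout}


def to_csv(records, layout=LAYOUT) -> str:
    """Serialize parsed records as CSV text with a header row."""
    out = io.StringIO()
    fieldnames = [name for name, _, _ in layout]
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


# Build sample records with explicit padding so the columns line up with LAYOUT.
sample = [
    f"{42:05d}{'Ada Lovelace':<15}{'1024.50':>10}\n",
    f"{43:05d}{'Grace Hopper':<15}{'88.00':>10}\n",
]
csv_text = to_csv(parse_fixed_width(sample))
```

The result in `csv_text` is a header line followed by one comma-separated line per record, with the padding stripped.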

mixmastamyk · on Jan 19, 2023

Thought it might help with media containers but I guess not. Can anyone recommend a good jpg png mp4 mkv parser lib in python?

kissgyorgy · on Jan 19, 2023

I'm curious, what's your use-case?

mixmastamyk · on Jan 19, 2023

I'm writing a cli photo/video manager and have all of these parsers written but png. Fun, but would be better to replace it all with a mature lib. I think I've seen a python lib that parses everything but can't remember the name.

amelius · on Jan 19, 2023

Can it rename the toplevel directory name before extracting? (My main problem with tar)

qkaiser · on Jan 19, 2023

You can provide the top-level directory where it will extract with the -e command line switch.

amelius · on Jan 19, 2023

No, sorry for being unclear, I mean if the archive contains a toplevel folder e.g. binutils-1.0.0, that I can rename it to binutils before extracting (so I don't end up with the binutils-1.0.0 on my filesystem), and preferably in such a way that I don't have to know the toplevel folder name before extracting.

jenscow · on Jan 19, 2023

    mkdir -p temp-dir && tar -xf example.tar -C temp-dir && mv temp-dir/* "$(echo temp-dir/* | sed -r 's,.*/([^/]+)-[0-9.]+,./\1,')"

amelius · on Jan 19, 2023

Yeah, exactly the kind of command-line that I would like to avoid ;)
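For comparison, the same rename can be sketched with Python's standard `tarfile` module by rewriting member names before extraction, so the original top-level name never has to be known or touch the filesystem. The helper name `extract_renamed` is made up here, and the sketch assumes the archive has exactly one top-level directory; hard-link targets inside the archive are not rewritten.

```python
import io
import tarfile
import tempfile
from pathlib import Path, PurePosixPath


def extract_renamed(archive, dest, new_top):
    """Extract a tar file object into dest, renaming its single top-level directory."""
    with tarfile.open(fileobj=archive) as tar:
        members = tar.getmembers()
        tops = {PurePosixPath(m.name).parts[:1] for m in members}
        if len(tops) != 1 or () in tops:
            raise ValueError("archive does not have exactly one top-level directory")
        for m in members:
            rest = PurePosixPath(m.name).parts[1:]
            m.name = str(PurePosixPath(new_top, *rest))  # swap the first component
        # Use the safer "data" extraction filter where this Python supports it.
        extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, members=members, **extra)


# Build a tiny in-memory archive with a versioned top-level directory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"hello\n"
    info = tarfile.TarInfo("binutils-1.0.0/README")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

with tempfile.TemporaryDirectory() as tmp:
    extract_renamed(buf, Path(tmp), "binutils")
    extracted = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*"))
```

After this runs, `extracted` lists `binutils` and `binutils/README`, with no `binutils-1.0.0` directory ever created.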

nerdponx · on Jan 19, 2023

Among other things, it's a shame that there's no dedicated standard-ish CLI tool for manipulating pathnames.

qkaiser · on Jan 19, 2023

Sadly no, unblob cannot do that. I'll look into it tho.

eigenvalue · on Jan 19, 2023

Cool. FYI, there’s a typo on the website—- it says “carving” instead of “craving”.

Centigonal · on Jan 18, 2023

Cool project!

alanbernstein · on Jan 19, 2023

Why use this over aunpack? https://linux.die.net/man/1/aunpack

kissgyorgy · on Jan 19, 2023

There are multiple reasons: 1. It already supports more formats [1] than atool. 2. It's written in Python and easily extendable by anyone interested in supporting their own formats. [2] 3. It's probably faster because of hyperscan. 4. More reasons :) [3]

[1]: https://unblob.org/formats/

[2]: https://unblob.org/development/#writing-handlers

[3]: https://unblob.org/#why-unblob

wrayjustin · on Jan 19, 2023

`unblob` is closer to `binwalk` than it is to `aunpack`

See: https://github.com/ReFirmLabs/binwalk