Show HN: Unblob – extraction suite for 30+ file formats (github.com/onekey-sec)
240 points by kissgyorgy on Jan 18, 2023 | 42 comments


Unblob looks cool!

If you're interested in something similar that can put things back together after you've modified them, check out OFRAK:

https://github.com/redballoonsecurity/ofrak

It's designed with embedded systems in mind, but has support for all kinds of other stuff, too. It also has some very advanced binary patching capabilities.

I work on it as part of my day job.


For years and years I've used `dtrx` ("do the right extraction") (https://github.com/dtrx-py/dtrx/). Maybe I should switch to unblob?

It looks like unblob has the right behavior by default that I have to alias for `dtrx`:

    alias dtrx='dtrx --one=inside'

But I'll probably want to create an alias for unblob to change the default depth to 1.


I've been using unp (https://manpages.ubuntu.com/manpages/focal/man1/unp.1.html), which is just a wrapper around standard cli tools for unpacking things (tar, xz, unzip, etc). It seems pretty dated by now, good to see some replacements!


If you’re looking for a general-purpose extractor for known, common archive formats, bsdtar is really nice these days; it’s libarchive-based and does way more than just tarballs (extracts zip, rar, and 7z as well as all the common compression formats on top of tar plus a bunch of others).

Not really in the same class of tools as unblob, but handy to have around regardless.


apack/aunpack from the atool suite for me [0]. Funny how many solutions exist for this problem. Though I think Unblob is aiming more for binwalk's niche [1].

[0] https://www.nongnu.org/atool/

[1] https://github.com/ReFirmLabs/binwalk


The atool suite is great, but it only supports well-formed files. The idea with unblob is to precisely identify valid chunks of data within arbitrary files, carve them out, then decompress/extract/convert them. You would not believe how many embedded device vendors' custom formats are just a loose aggregation of almost-standard archive and compression formats packed into a single file :D
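
To make "identify valid chunks, carve them out" concrete, here is a minimal Python sketch. It is not unblob's code: it only handles zlib streams, and the function name `carve_zlib_streams` is made up for illustration. It walks a blob, tries to decode a zlib stream wherever the two header bytes look plausible, and uses the decoder's leftover input to find the exact end of each chunk.

```python
import zlib

def carve_zlib_streams(blob: bytes):
    """Yield (start, end, payload) for each complete zlib stream found in blob."""
    offset = 0
    while offset < len(blob) - 1:
        cmf, flg = blob[offset], blob[offset + 1]
        # Cheap pre-filter from the zlib header rules: compression method 8
        # (deflate) and the 16-bit header value must be a multiple of 31.
        if (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0:
            decoder = zlib.decompressobj()
            try:
                payload = decoder.decompress(blob[offset:])
            except zlib.error:
                payload = None
            if payload is not None and decoder.eof:
                # Bytes the decoder did not consume mark where the chunk ends.
                end = len(blob) - len(decoder.unused_data)
                yield offset, end, payload
                offset = end
                continue
        offset += 1

# A made-up "vendor image": two zlib streams glued together with junk around them.
blob = b"VENDORHDR" + zlib.compress(b"kernel") + b"\x00\x00PAD" + zlib.compress(b"rootfs") + b"tail"
chunks = list(carve_zlib_streams(blob))
payloads = [payload for _, _, payload in chunks]
```

The point of the sketch is that the chunk boundaries come from actually decoding the data rather than from guessing where the next signature starts.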


Awesome tool/project.

These blind (or "magic"; https://en.m.wikipedia.org/wiki/List_of_file_signatures) extractors can be handy in a very wide range of applications.

From firmware/device image hacking to reverse engineering to data recovery.


Yes, we developed a framework "unblob core" which can be extended easily [1] to any use-case. For example, a separate Python package could contain format specifications for "forensic analysis" or "game assets" (these two came to my mind, because we already got requests for them).

[1]: https://unblob.org/development/#writing-handlers


The forensic unpacking is such a powerful use case :)


Since you're the author and I see the tool is in Python: I'm the original author of UnityPack (https://github.com/hearthsim/unitypack - nowadays, the fork UnityPy is more powerful and maintained: https://github.com/K0lb3/UnityPy).

It's in Python and is able to deserialize Unity archives, treating them as a serialization format rather than a simple archive format. Feel free to email me if you want to integrate something like this or you have questions :)


Neat!

Possible alternative to Universal Extractor. https://github.com/Bioruebe/UniExtract2


Looks nice! Kind of reminds me of binwalk: https://github.com/ReFirmLabs/binwalk


The main difference is that binwalk goes through a file linearly, searching for patterns like magic bytes, and tries to extract everything it finds.

The problems with this:

- very noisy, finding a lot of false positives (license code, format inside another format, etc).

- very slow, trying to extract irrelevant things

- imprecise, because it finds patterns in the middle of a file, where it's actually not relevant on the first level of extraction
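
The false-positive problem is easy to reproduce. The sketch below is a deliberately naive signature scan written for illustration (it is not binwalk's implementation, and the helper names are invented): it reports every occurrence of the gzip magic bytes, then checks which hits actually decode.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b\x08"

def naive_magic_scan(blob: bytes, magic: bytes = GZIP_MAGIC):
    """Return every offset where the magic bytes occur, valid stream or not."""
    hits, pos = [], blob.find(magic)
    while pos != -1:
        hits.append(pos)
        pos = blob.find(magic, pos + 1)
    return hits

def decodes_as_gzip(blob: bytes, offset: int) -> bool:
    """True if a complete gzip stream can be decoded starting at offset."""
    decoder = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # 16 selects gzip framing
    try:
        decoder.decompress(blob[offset:])
    except zlib.error:
        return False
    return decoder.eof

real = gzip.compress(b"payload", mtime=0)
# The same three bytes also show up inside unrelated data.
blob = b"junk" + GZIP_MAGIC + b" appears in this license text " + real
hits = naive_magic_scan(blob)
confirmed = [h for h in hits if decodes_as_gzip(blob, h)]
```

Here the scan reports at least two hits, but only the one at the start of the real stream survives validation. A tool that tries to extract at every hit spends time on the others.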

unblob solves these problems by being smarter about file formats and recognizing them by their specification. For example, it unpacks format header structs and carves out files based on the information in the header (size, offset). See a simple example for NTFS [1].
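
As a rough illustration of header-driven carving, here is a sketch for a toy format. Everything in it is an assumption made up for the example: the "TOYF" layout, the `calculate_chunk` name, and the CRC field. It does not reproduce unblob's handler API or the linked NTFS handler.

```python
import struct
import zlib

# Hypothetical toy format, invented for this example:
#   4-byte magic "TOYF" | uint32 payload size | uint32 CRC-32 of payload | payload
HEADER = struct.Struct("<4sII")

def calculate_chunk(blob: bytes, start: int):
    """Return (start, end) if a valid TOYF chunk begins at start, else None."""
    if start + HEADER.size > len(blob):
        return None
    magic, size, crc = HEADER.unpack_from(blob, start)
    end = start + HEADER.size + size
    # The header must be self-consistent before anything is carved.
    if magic != b"TOYF" or end > len(blob):
        return None
    if zlib.crc32(blob[start + HEADER.size:end]) != crc:
        return None
    return start, end

payload = b"firmware"
chunk = HEADER.pack(b"TOYF", len(payload), zlib.crc32(payload)) + payload
# The trailing text contains the magic too, but its "header" is nonsense.
blob = b"\xff" * 5 + chunk + b"TOYF trailing text"
valid = calculate_chunk(blob, 5)
bogus = calculate_chunk(blob, 25)
```

The size field gives the exact end offset, so the carved chunk is precise, and a stray magic whose header fields do not add up is rejected instead of being extracted.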

We also went to great lengths to prevent unnecessary work by skipping formats nested inside another [2]. We use hyperscan [3] instead of grepping byte sequences with Python, which is orders of magnitude faster. Because of this, it can also handle 4 GB+ files, which binwalk cannot.
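
One simple way to picture "skipping formats inside another" is as interval filtering over the chunks found on one level. This is a sketch of that idea only, under the assumption that chunks are plain (start, end) offsets; the `drop_nested` name is invented and the linked unblob code may work differently.

```python
def drop_nested(chunks):
    """Keep only outermost (start, end) ranges; drop ranges inside a kept one."""
    outer = []
    # Sort by start, and for equal starts put the longest range first,
    # so a container is always seen before anything it contains.
    for start, end in sorted(chunks, key=lambda c: (c[0], -c[1])):
        if outer and end <= outer[-1][1]:
            continue  # fully contained in the last kept chunk
        outer.append((start, end))
    return outer

found = [(10, 20), (0, 100), (120, 130), (30, 100), (100, 150)]
top_level = drop_nested(found)
```

The nested ranges are not lost: they are found again when the extracted contents of the outer chunk are processed on the next level. Ranges that overlap without being contained are kept by this sketch.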

It has been used in production for a year now, and it's far more precise and faster than binwalk. We get fewer false positives too, and even when unblob fails to extract everything, we still get meaningful information out of firmware images where binwalk previously just failed with no output.

[1]: https://github.com/onekey-sec/unblob/blob/main/unblob/handle...

[2]: https://github.com/onekey-sec/unblob/blob/main/unblob/proces...

[3]: https://github.com/intel/hyperscan


Indeed, it's a smarter alternative to binwalk, which we started because binwalk was not a good fit for us :) We should probably include a comparison somewhere in the docs.


Adding a comparison, or at least the clear differentiators, to the documentation would be very helpful.

As someone who uses `binwalk` extensively in a professional setting, with tooling built around `binwalk`, it would be useful to see (a) how `unblob` would integrate and (b) whether it could be a replacement or a supplement.


+1 on a comparison. Binwalk has served me faithfully for years but you never know what you don’t know until the day it fails


This looks awesome! Looking forward to trying it over binwalk! It'd be great if you could get it building on aarch64 and non-Linux systems. I've tried adding the flake to my nix dotfiles on Mac, but quickly realized you support only x86_64-linux for now.


hyperscan is supported on 64-bit Intel only, but there is another project which supports ARM, called vectorscan. My colleague wrote a Python wrapper for vectorscan: https://github.com/vlaci/pyperscan and the initial work has already been merged: https://github.com/onekey-sec/unblob/pull/475


I am not seeing an option to pass a password to the archive extractors. Is there any?


No, not yet, however the problem came up several times. Feel free to create a GitHub issue if you would like this feature.


Damn I would have loved this for ripping "gamez" back in the day


Very cool - something I don't expect to need very often but likely critical when I do. I can't help but feel there was a missed opportunity to name it 'unpackman'


Very cool!

Does anyone know of something similar for text file formats? In particular something that makes it easy to work with legacy fixed width record file formats?


Maybe not exactly what you’re after but I’ve used Flat File Extractor with great success.

EDITED to correct autocorrect.

https://ff-extractor.sourceforge.net/


No, that's pretty much what I was looking for.

I'm going to pretend I didn't see the XML example but I think I could produce CSV files with this.

Thanks!


If you're talking about Intel HEX and Motorola S-Records, we developed unblob handlers for them. They're not public at the moment, but I can assure you it works.
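
For readers unfamiliar with Intel HEX: each line is a record of the form `:LLAAAATT<data>CC`, where the checksum makes all bytes of the record sum to zero modulo 256. A minimal parser sketch follows. It is unrelated to the non-public handlers mentioned above, and the function name is made up.

```python
def parse_ihex_record(line: str):
    """Parse one Intel HEX record into (address, record_type, data)."""
    line = line.strip()
    if not line.startswith(":"):
        raise ValueError("record must start with ':'")
    raw = bytes.fromhex(line[1:])
    # Byte count, 2 address bytes, record type and checksum are always present.
    if len(raw) < 5 or len(raw) != raw[0] + 5:
        raise ValueError("byte count does not match record length")
    # All bytes including the checksum must sum to 0 modulo 256.
    if (sum(raw) & 0xFF) != 0:
        raise ValueError("bad checksum")
    address = int.from_bytes(raw[1:3], "big")
    return address, raw[3], raw[4:4 + raw[0]]

data_record = parse_ihex_record(":10010000214601360121470136007EFE09D2190140")
eof_record = parse_ihex_record(":00000001FF")
```

Record type 0 is data and type 1 is end of file. Extended address records (types 2 and 4) would need extra handling to build a full memory image.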


VisiData should work: vd -f fixed file.txt
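
If a small script is enough, fixed-width records can also be turned into CSV with the Python standard library. This is a sketch with an invented layout; the field names and widths in `LAYOUT` are assumptions and would come from the legacy format's documentation.

```python
import csv
import io

# Hypothetical record layout: (field name, width in characters).
LAYOUT = [("id", 5), ("name", 10), ("amount", 8)]

def fixed_width_to_csv(lines, layout=LAYOUT) -> str:
    """Slice each line by the layout widths and return the result as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([name for name, _ in layout])
    for line in lines:
        row, pos = [], 0
        for _, width in layout:
            row.append(line[pos:pos + width].strip())
            pos += width
        writer.writerow(row)
    return out.getvalue()

records = [
    "00001" + "Alice".ljust(10) + "00012.50",
    "00002" + "Bob".ljust(10) + "00007.00",
]
csv_text = fixed_width_to_csv(records)
```

Using the `csv` module rather than joining with commas means that fields containing commas or quotes are escaped correctly.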


Thought it might help with media containers but I guess not. Can anyone recommend a good jpg png mp4 mkv parser lib in python?


I'm curious, what's your use-case?


I'm writing a cli photo/video manager and have all of these parsers written but png. Fun, but would be better to replace it all with a mature lib. I think I've seen a python lib that parses everything but can't remember the name.
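
For the missing PNG piece, the container itself is small: an 8-byte signature followed by chunks, each made of a big-endian length, a 4-byte type, the data, and a CRC-32 over type plus data. A minimal chunk iterator sketch (the function name is made up, and it does no decoding of pixel data):

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def iter_png_chunks(data: bytes):
    """Yield (chunk_type, chunk_data) for each chunk, verifying CRCs."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 12 <= len(data):
        length, ctype = struct.unpack_from(">I4s", data, pos)
        end = pos + 12 + length
        if end > len(data):
            raise ValueError("truncated chunk")
        body = data[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack_from(">I", data, pos + 8 + length)
        if zlib.crc32(ctype + body) != crc:
            raise ValueError("CRC mismatch in chunk %r" % ctype)
        yield ctype, body
        if ctype == b"IEND":
            return
        pos = end

def make_chunk(ctype: bytes, body: bytes) -> bytes:
    """Build a chunk, used here only to create test input."""
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", zlib.crc32(ctype + body))

ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)  # 1x1, 8-bit grayscale
sample = PNG_SIGNATURE + make_chunk(b"IHDR", ihdr) + make_chunk(b"IEND", b"")
chunk_list = list(iter_png_chunks(sample))
width, height = struct.unpack_from(">II", chunk_list[0][1])
```

Metadata such as dimensions lives in `IHDR`, so a photo manager can often stop after the first chunk.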


Can it rename the toplevel directory name before extracting? (My main problem with tar)


You can set the top-level directory it extracts into with the -e command line switch.


No, sorry for being unclear, I mean if the archive contains a toplevel folder e.g. binutils-1.0.0, that I can rename it to binutils before extracting (so I don't end up with the binutils-1.0.0 on my filesystem), and preferably in such a way that I don't have to know the toplevel folder name before extracting.


    mkdir -p temp-dir && tar -xf example.tar -C temp-dir && mv temp-dir/* "$(echo temp-dir/* | sed -r 's,.*/([^/]+)-[0-9.]+,./\1,')"
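
The same idea without a temporary directory, as a Python sketch using the standard `tarfile` module. The function name is made up, and it assumes a trusted archive with exactly one top-level folder. It rewrites member names before extraction, so the versioned folder never touches the filesystem.

```python
import os
import re
import tarfile
from pathlib import PurePosixPath

def extract_without_version(archive_path, parent_dir):
    """Extract e.g. binutils-1.0.0/... into parent_dir/binutils/... and return that path."""
    with tarfile.open(archive_path) as tar:
        members = tar.getmembers()
        roots = {PurePosixPath(m.name).parts[0] for m in members if PurePosixPath(m.name).parts}
        if len(roots) != 1:
            raise ValueError("expected one top-level entry, found %s" % sorted(roots))
        root = roots.pop()
        # Strip a trailing "-<digits and dots>" version suffix, like the sed above.
        target = os.path.join(parent_dir, re.sub(r"-[0-9.]+$", "", root))
        kept = []
        for member in members:
            rest = PurePosixPath(member.name).parts[1:]
            if rest:  # skip the top-level directory entry itself
                member.name = "/".join(rest)
                kept.append(member)
        tar.extractall(target, members=kept)
    return target
```

Hard-link targets inside the archive are not rewritten by this sketch, and for untrusted archives the extraction filters available in newer Python versions are worth a look.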


Yeah, exactly the kind of command-line that I would like to avoid ;)


Among other things, it's a shame that there's no dedicated standard-ish CLI tool for manipulating pathnames.


Sadly no, unblob cannot do that. I'll look into it tho.


Cool. FYI, there’s a typo on the website—- it says “carving” instead of “craving”.


Cool project!


Why use this over aunpack? https://linux.die.net/man/1/aunpack


There are multiple reasons: 1. It already supports more formats [1] than atool. 2. It's written in Python and easily extendable by anyone interested in supporting their own formats [2]. 3. It's probably faster because of hyperscan. 4. More reasons :) [3]

[1]: https://unblob.org/formats/

[2]: https://unblob.org/development/#writing-handlers

[3]: https://unblob.org/#why-unblob


`unblob` is closer to `binwalk` than it is to `aunpack`

See: https://github.com/ReFirmLabs/binwalk



