
Gah! Another explanation that uses "Huffman Trees"! Nobody uses those; we all use Canonical Huffman, where all you have is the number of symbols per code length, letting you efficiently build tables. Yes, tables are used instead of trees. The trees are a distraction.

https://en.wikipedia.org/wiki/Canonical_Huffman_code
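
To make "tables instead of trees" concrete: from nothing but the per-symbol code lengths you can assign the codes (the counting recurrence from RFC 1951, section 3.2.2) and fill a flat decode table. A minimal Python sketch, not any particular library's code:

    # Build a flat decode table from per-symbol code lengths. No trees:
    # just count symbols per length, compute each length's first code,
    # and index the table by the next max_len bits of the stream.
    def build_decode_table(lengths):
        max_len = max(lengths)
        bl_count = [0] * (max_len + 1)
        for l in lengths:
            bl_count[l] += 1
        code, next_code = 0, [0] * (max_len + 1)
        for bits in range(1, max_len + 1):
            code = (code + bl_count[bits - 1]) << 1
            next_code[bits] = code
        table = [None] * (1 << max_len)
        for sym, l in enumerate(lengths):
            c = next_code[l]
            next_code[l] += 1
            # Every slot whose top l bits equal this code decodes to sym.
            for pad in range(1 << (max_len - l)):
                table[(c << (max_len - l)) | pad] = (sym, l)
        return table

    # Code lengths 1,2,4,4,4 for five symbols (call them A..E):
    table = build_decode_table([1, 2, 4, 4, 4])
    print(table[0b0000])  # (0, 1): a leading 0 bit decodes to symbol A
    print(table[0b1101])  # (3, 4): 1101 decodes to symbol D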



Whether or not you canonicalize your prefix code is orthogonal to whether you think of it as a tree or not. (And any prefix code can be viewed as a tree.) In fact the very article you linked says:

> The advantage of a canonical Huffman tree is that it can be encoded in fewer bits than an arbitrary tree.

To take the example from the Wikipedia article you linked, canonicalizing

    B = 0
    A = 11
    C = 101
    D = 100
into

    B = 0
    A = 10
    C = 110
    D = 111
is conceptually the same as canonicalizing the tree (treating "0" as "left" and "1" as "right"):

            ┌------┐
      ┌-----┤(root)├-----┐
      │     └------┘     │
    ┌-┴-┐              ┌-┴-┐
    │ B │            ┌-┤   ├-┐
    └---┘            │ └---┘ │
                   ┌-┴-┐   ┌-┴-┐
                 ┌-┤   ├-┐ │ A │
                 │ └---┘ │ └---┘
               ┌-┴-┐   ┌-┴-┐
               │ D │   │ C │
               └---┘   └---┘
into

            ┌------┐
      ┌-----┤(root)├-----┐
      │     └------┘     │
    ┌-┴-┐              ┌-┴-┐
    │ B │            ┌-┤   ├-┐
    └---┘            │ └---┘ │
                   ┌-┴-┐   ┌-┴-┐    
                   │ A │ ┌-┤   ├-┐  
                   └---┘ │ └---┘ │  
                       ┌-┴-┐   ┌-┴-┐
                       │ C │   │ D │
                       └---┘   └---┘
So the trees aren't merely a "distraction" IMO: apart from being useful conceptually (e.g. in proving the optimality of Huffman coding, which is how Knuth does it in TAOCP Vol. 1, Section 2.3.4.5), certain applications of Huffman's algorithm (other than compression) also have the tree structure arise naturally (Knuth gives the example of choosing the optimal order for pairwise merging of N sorted lists of different lengths; sketched below).

Sure, after using trees for understanding, you don't need to actually represent a tree in your data structures / source code / encoding, but that's another matter.
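
For that merge example: merging two sorted lists of lengths a and b costs roughly a + b operations, so under that cost model the optimal merge order falls out of exactly Huffman's greedy rule. A quick Python sketch:

    import heapq

    # Optimal pairwise merge order for N sorted lists: repeatedly merge
    # the two shortest, which is exactly Huffman's construction applied
    # to the list lengths.
    def optimal_merge_cost(lengths):
        heapq.heapify(lengths)
        total = 0
        while len(lengths) > 1:
            a, b = heapq.heappop(lengths), heapq.heappop(lengths)
            total += a + b              # cost of this one merge
            heapq.heappush(lengths, a + b)
        return total

    print(optimal_merge_cost([40, 10, 20, 30]))  # 190; left-to-right would cost 220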


What did you use to render these beautiful trees?


To retrace my steps: I searched on Google for [ascii tree generator] and, alongside many results about generating output for directory trees / folder structures (like that of tree: http://mama.indstate.edu/users/ice/tree/), I found this "Show HN" submission: https://news.ycombinator.com/item?id=21042390 . I installed and tried the linked project (https://github.com/spandanb/ascii_tree) and it kind of works, though I had several issues (posted later at https://github.com/spandanb/ascii_tree/issues/3). So I gave up and just copied the tree from that thread's top comment, by user kps (https://news.ycombinator.com/item?id=21043091), and manually edited it in Emacs to add new nodes by copying things around.

Additionally, after posting the comment to HN from desktop, I happened to notice within the 2-hour edit window that it looked awful on mobile (as does the comment I copied from: probably a bug where the default font stack used by Chrome on Android doesn't use monospace versions of the box-drawing characters), so I edited it to replace "─" with an ASCII "-". It looks less perfect on desktop, but less poorly aligned on mobile, so it's a reasonable compromise.


Thanks for explaining the steps you took. I'm always looking out for ways to approach unknown unknowns and this is a great help.


Agreed! I’ve not seen ascii art that beautiful on HN before.


Moffat and Turpin's 1997 paper "On the Implementation of Minimum Redundancy Prefix Codes" contains all the usual tricks and then some: https://github.com/tpn/pdfs/blob/master/On%20the%20Implement...


See also Charles Bloom's rants [1] describing some of the less-known ideas from the paper.

[1] https://cbloomrants.blogspot.com/2010/08/08-12-10-lost-huffm...


I'm glad the resources on the basic trees exist, since that's what got me interested. But I would love to see more resources on canonical Huffman and especially the package-merge algorithm!

The former was confusing at first but mind-blowing when it clicked for me, and the latter looks awesome but the implementation is incomprehensible to me; I guess I'm stuck allocating trees until I can figure it out!
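
For anyone else staring at package-merge, the core is smaller than it looks in the papers. A rough Python sketch of my understanding (unoptimized, since it drags symbol lists around; treat with suspicion):

    # Package-merge for length-limited codes. Each item carries the
    # leaves (symbol indices) it contains; a symbol's final code length
    # is the number of selected items its leaf appears in.
    def package_merge(freqs, max_len):
        n = len(freqs)
        assert (1 << max_len) >= n, "max_len too small for n symbols"
        leaves = sorted((f, [i]) for i, f in enumerate(freqs))
        packages = []
        for _ in range(max_len - 1):
            merged = sorted(leaves + packages)
            # Pair adjacent items into packages; an odd leftover is dropped.
            packages = [(a[0] + b[0], a[1] + b[1])
                        for a, b in zip(merged[0::2], merged[1::2])]
        lengths = [0] * n
        for _, syms in sorted(leaves + packages)[:2 * n - 2]:
            for s in syms:
                lengths[s] += 1
        return lengths

    print(package_merge([1, 1, 1, 5], max_len=3))  # [3, 3, 2, 1], plain Huffman
    print(package_merge([1, 1, 1, 5], max_len=2))  # [2, 2, 2, 2], length-limited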


Canonical Huffman can be thought of as a pre-determined tree shape that's easy to re-construct. First, decide your bit lengths from your probabilities (the most probable symbols should get the shortest code lengths, so choose wisely), and then add 1s to the start of each longer code to make it unambiguous.

If we have three symbols, A, B, and C, and we assign bit lengths of A=1, B=2, C=2 (meaning A is the most probable), then we count:

    A = 0
    B = 10
    C = 11
For code lengths A=1, B=2, C=4, D=4, E=4, we have:

    A = 0
    B = 10
    C = 1100
    D = 1101
    E = 1110
Note that all we need to send is the bit lengths (1, 2, 4, 4, 4) to the other side, and we have an algorithm to assign the bits. Though even (1, 2, 4, 4, 4) is more information than needed: we just need to send the number of symbols at each code length: (1, 1, 0, 3).

Much faster, smaller, and generally better than sending over a whole tree.
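
In code, the receiving side is just a counter walk. A Python sketch (note it assumes you also know which symbols the lengths belong to, in order; see the replies below):

    # Rebuild every code from just the per-length counts, given the
    # symbols in assignment order.
    def codes_from_counts(counts, symbols):
        codes, code, idx = {}, 0, 0
        for length, count in enumerate(counts, start=1):
            for _ in range(count):
                codes[symbols[idx]] = format(code, "0%db" % length)
                code += 1
                idx += 1
            code <<= 1  # descend one level before the next length
        return codes

    print(codes_from_counts([1, 1, 0, 3], "ABCDE"))
    # {'A': '0', 'B': '10', 'C': '1100', 'D': '1101', 'E': '1110'}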


You can even send the code lengths with at worst one bit per symbol, by sending the number of internal nodes at each level of a complete binary tree structure. This can be a concatenation of unary numbers, and there are n-1 internal nodes in a complete binary tree with n leaves.

Your example isn't a complete/perfect tree: one of the length-4 symbols could instead be length 3, giving level sequence 1,1,1,2. Converted to internal node counts (excluding the root) this is 1,1,1 and could be encoded as 0,0,0 (commas for illustration only). That's a little boring; another example is:

    A = 00
    B = 01
    C = 10
    D = 110
    E = 111
Level sequence 0,3,2; internal node counts 2,1; encoded as 10,0.

I'm not sure if this is well known, I thought I'd discovered it but there's an equivalent coding in Narimani and Khosravifard 2008, "The supertree of the compact codes."

A. You can avoid encoding the number of leaves by ending the string with the impossible sequence 0,11: that is, 1 internal node branching into at least 3 nodes at the next level.

B. Or, you can use this as a 1-to-1 mapping between binary strings and canonical trees. If you read 2m-1 1s after a row with m internal nodes, you know this row has 2m, so the final 0 isn't needed (e.g. 2,1 would just be 1,0 instead of 10,0). This can't be combined with A since there are no longer impossible sequences, and at most (on a tree with all symbols the same length) it only saves lg(n) bits, so if you want to save bits you're better off with A.
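
Here's the encoding direction as a rough Python sketch (basic scheme only, without refinement A or B):

    # Leaves-per-level ("level sequence") -> internal node counts ->
    # unary string, writing m internal nodes as (m-1) ones then a zero.
    def encode_level_sequence(leaves_per_level):
        internal, counts = 1, []      # the root is the first internal node
        for leaves in leaves_per_level:
            nodes = 2 * internal      # every internal node has two children
            internal = nodes - leaves # whatever isn't a leaf is internal
            assert internal >= 0, "not a valid level sequence"
            counts.append(internal)
        assert counts.pop() == 0, "deepest level must be all leaves"
        return "".join("1" * (m - 1) + "0" for m in counts)

    print(encode_level_sequence([1, 1, 1, 2]))  # 000  (internal counts 1,1,1)
    print(encode_level_sequence([0, 3, 2]))     # 100  (internal counts 2,1)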


Not sure I understand here - how would (1,1,0,3) disambiguate between different streams of data with bit lengths of (1,2,4,4,4) and (1,4,2,4,4)?


If you're sending just the counts then that would be in addition to an ordered list of symbols.


It's so odd that this crucial detail is always omitted from this explanation.

If you're encoding a byte stream, the receiver already knows what the set of symbols is, although not the order. Is it really more efficient to send the ordered list of symbols plus the histogram of code lengths than it is to send a length per symbol, with an implicit list of symbols? Surely not.


Like most things in compression, it depends.

But DEFLATE sends the list of 288 code lengths, with all the symbols being implicit.

It also compresses that list with another layer of repetition encoding and Huffman coding.

For example, if you were encoding DEFLATE's default Huffman table, you would first write it like (8, repeat 143, 9, repeat 111, 7, repeat 23, 8, repeat 7), and then you would compress that with a separate tiny Huffman table.

The tiny Huffman table used on the code lengths is itself encoded as a series of 3-bit code lengths, taking up 5-ish bytes, with the order of symbols predefined in the spec.
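
The repetition layer itself is easy to sketch. Rough Python, following RFC 1951's code-length symbols (16 = repeat the previous length 3-6 times, 17 = a run of 3-10 zeros, 18 = a run of 11-138 zeros); a real encoder may split runs more cleverly:

    # Turn a list of code lengths into (symbol, extra-bits value) pairs
    # using DEFLATE's run-length symbols; None means no extra bits.
    def rle_code_lengths(lengths):
        out, i, n = [], 0, len(lengths)
        while i < n:
            run = 1
            while i + run < n and lengths[i + run] == lengths[i]:
                run += 1
            if lengths[i] == 0 and run >= 3:
                take = min(run, 138)
                out.append((18, take - 11) if take >= 11 else (17, take - 3))
                i += take
            elif run >= 4:
                out.append((lengths[i], None))   # one literal, then repeats
                i, run = i + 1, run - 1
                while run >= 3:
                    take = min(run, 6)
                    out.append((16, take - 3))
                    i, run = i + take, run - take
            else:
                out.append((lengths[i], None))
                i += 1
        return out

    # DEFLATE's fixed literal/length table: 144 8s, 112 9s, 24 7s, 8 8s.
    fixed = [8] * 144 + [9] * 112 + [7] * 24 + [8] * 8
    print(len(rle_code_lengths(fixed)))  # 53 symbols instead of 288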


As long as the list of symbols is small, then yes, because the symbols are a subset of the entire character list. I imagine some implementations base-X encode the byte string first and then indicate the original and the base-X encoding instead of sending the symbol list.


Thanks! I've got the canonical stuff down pat already in my project, but I am still allocating the tree to get the bit-lengths in the first place. Once I have the package-merge algorithm up and running, the code will never allocate a tree at all for construction, plus I'll have the added benefit of length-limited codes.


Incidentally, table-based canonical Huffman works best when the "root" of the Huffman codes is stored in the more significant bits, as that saves the algorithm from having to do a bit-string reversal, and the codes remain in order in the table.

...but I believe DEFLATE goes the exact opposite way, for reasons unknown.


I believe you're getting confused. There's no bit string reversal in DEFLATE; it just changes your shift slightly. Both MSB-first and LSB-first have their advantages and disadvantages.

As always, ryg has great posts on this stuff: https://fgiesen.wordpress.com/2018/02/19/reading-bits-in-far...


I meant that with the bit ordering that DEFLATE has (the root in the LSB), you can't simply add/subtract 1 to each code to get to the next one, because the carry goes the wrong way.

With a normal (ascending) canonical code, e.g. "2 codes of length 3, 2 codes of length 4" becomes

    000
    001
    0100
    0101
when read into a register. With DEFLATE, the same codes would look like

    000
    100
    0010
    1010
so you need to pre-rotate the bits in your lookup table to match.
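
The pre-rotation is just a per-code bit reversal. A small Python sketch:

    # Reverse each canonical code's bits so it can be matched against an
    # LSB-first (DEFLATE-style) bit stream.
    def reverse_bits(code, length):
        out = 0
        for _ in range(length):
            out = (out << 1) | (code & 1)
            code >>= 1
        return out

    for code, length in [(0b000, 3), (0b001, 3), (0b0100, 4), (0b0101, 4)]:
        print(format(reverse_bits(code, length), "0%db" % length))
    # prints 000, 100, 0010, 1010 -- the DEFLATE ordering shown above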


"There's no bit string reversal in DEFALTE"

The DEFLATE RFC describes it as such.

"Huffman codes in bit-reversed order (i.e., with the first bit of the code in the relative LSB position)."



