Bitstring Matching and Construction in OCaml

Thomashuet · on Sept 22, 2020

What am I supposed to see there? What is this? Why is it interesting?

tom_mellior · on Sept 22, 2020

The examples are more interesting than the installation instructions: https://bitstring.software/examples/

rwmj · on Sept 22, 2020

And the guide too: https://bitstring.software/documentation/

Thomashuet · on Sept 22, 2020

Thanks, it's much more interesting indeed. The link should probably point there directly.

Now that I understand what it is, it's actually something that I've wanted to use before. It looks very useful.

non-entity · on Sept 22, 2020

Neat. After using Erlang for some projects, I get frustrated that more languages don't have nice interfaces for working with bitstrings.

rwmj · on Sept 22, 2020

Yes, I wrote this starting by looking at how Erlang handled bitstrings which I was very impressed by. Any similarities are not coincidental!

fanf2 · on Sept 22, 2020

Erlang's bitstring syntax has some weird endianness behaviour: it's biased to big-endian and can do weird things with non-byte-aligned little-endian strings. Here's an example based on the old 26-bit ARM program counter - https://fanf.dreamwidth.org/77938.html

Some years ago there was some nice work on adding bitdata support to Haskell, aimed at the kinds of things operating systems need to do with hardware (rather than serialization or network protocols) - http://web.cecs.pdx.edu/~mpj/pubs/bitdata.html and http://web.cecs.pdx.edu/~mpj/pubs/bytedata.html - these take the approach of bitdata being for layout of bitfields inside words, and bytedata being for detailed control of the memory layout of fields in a structure.

More recently I fleshed out the ideas at the end of that old blog post, with some thoughts on how to handle differing endianness in a more principled/algebraic way - https://fanf.dreamwidth.org/132484.html

rwmj · on Sept 22, 2020

All I can say that OCaml bitstring can handle fields of any bit width aligned to any bit and with any endianness. It will optimize the aligned cases at compile time. I don't think it suffers from the problem in your article because each field is treated separately (but TBH I wrote this 13 years ago, so you may be better looking at the sources).

It's possible to use variable-bit-length fields and alignment, eg:

  {| len : 4; i : len+1 |}

would match a 4 bit unsigned length field followed by a "len+1" (1-16) bit big-endian integer (although this cannot be optimized at compile time, so will be rather slow). Note that the i field starts at the 4th bit.

Even variable endianness is possible, eg:

  {| len : 4;
     e : 1;
     i : len+1 : endian(if e then LittleEndian else BigEndian) |}

where e is a single bit flag controlling endianness and the integer starts at the 5th bit.

Writing this drove me slightly round the bend, so I don't envy the current maintainer :-)

strmpnk · on Sept 22, 2020

I don't think this is the issue the prior article mentions about the bit string syntax. It's a notational problem paired with the combination of bit numbering and byte endianness.

The article makes it look trivial but if you note the ASCII diagram starts by labeling each byte-row with 7 not 0. Without some way to switch bit numbering (kind of like bit endianness) of the entire expression, we can't make something that looks like continuous bits in one view look continuous in the other.

A reduced example:

     7   6   5   4   3   2   1   0
   ------------------------+-------+
 0   5         X         0 |   Y   |
   ------------------------+-------+
 1  13            X              6

Here we can't say <<Y:2/little, X:14/little>> (Y is bits [0,1], X is bits [2,15]) because we need to read <<X1:6/little, Y:2/little, X2:8/little>> (X1 is bits [0,5], Y is bits [6,7], X2 is bits [8,15]).

The interpretation of the contents of those bits will be done in "little endian order" (odd term since endianness is a byte order not bit numbering concern) as the expression says but because Erlang or similar syntax has no knowledge of the bit numbering we have in this diagram, it can't automatically assume that the first bit is numbered 7 not 0.

Laying the bytes out on a line makes the notational problem more obvious:

     7   6   5   4   3   2   1   0 |  15  14  13  12  11  10  9   8
   ------------------------+-------+---------------------------------
     5         X1        0 |   Y   | 13            X2             6
   ------------------------+-------+---------------------------------

Even if we know that we interpret each number with a given endianness, the bit stream numbering here needs to be expressed separately at the top level. Generally, this is why serialized protocols prefer to start bit numbering at 0 like the written in the IPv4 header example.