
Yes, it's included in the comparison matrix: https://github.com/kstenerud/concise-encoding#-compared-to-o...



Local protobuf user here. Appreciate seeing a comparison chart. :-) It's unfortunate that it isn't documented very well, but Protobuf does have a text format [1], which I've used a lot, usually when writing test cases but also when inspecting logs. Similar to the CBE encoder spec [2], it uses variable-length encoding for ints [3] and preserves the type information. Another efficiency angle worth comparing across message types is the implementation itself, e.g. memory arenas out of the box. [4]
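For anyone unfamiliar with what "variable-length encoding for ints" means in the protobuf sense: each byte carries 7 payload bits, and the high bit says whether another byte follows, so small values only cost a byte or two. A rough Python sketch (not from any protobuf library, just illustrating the wire scheme):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a protobuf-style varint:
    7 payload bits per byte, continuation bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte: high bit clear
            return bytes(out)

def decode_varint(data: bytes) -> int:
    """Decode a varint back into an int (ignores trailing bytes)."""
    result = 0
    for shift, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:
            break
    return result
```

So 300 encodes to the two bytes `0xAC 0x02` rather than a fixed 4 or 8 bytes, which is why varints pay off for the small integers most messages are full of.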

Regarding CE, what would be the use case? APIs, data at rest, inter-service communications? If data at rest meant for analysis, then there probably are a handful more formats to compare against.

If one doesn't wish to decode the whole message into memory in order to read it, check out FlatBuffers [5], which is also supported as a message type in gRPC and is similar to what some trading systems use. There is also a FlexBuffers variant if you want something closer to JSON/BSON.

Must say, however, I found it cool that you have some Mac/iOS GitHub repos. Definitely going to take some time to check them out -- I used to develop iOS apps.

[1] https://developers.google.com/protocol-buffers/docs/referenc...

[2] https://github.com/kstenerud/concise-encoding/blob/master/cb...

[3] https://developers.google.com/protocol-buffers/docs/proto3#s...

[4] https://developers.google.com/protocol-buffers/docs/referenc...

[5] https://google.github.io/flatbuffers/flatbuffers_white_paper...


CE's primary focuses beyond security are ease of use and low friction, which are what made JSON ubiquitous:

- Simple to understand and use, even by non-technical people (the text format, I mean).

- Low friction: no extra compilation / code generation steps or special tools or descriptor files needed.

- Ad-hoc: no requirement to fully define your data types up front. Schema or schemaless is your choice (people often avoid schemas until they become absolutely necessary).

Other formats support features like partial reads, zero-copy structs, random access, finite-time decoding/encoding, etc. And those are awesome, but I'd consider them specialized applications with trade-offs that only an experienced person can evaluate (and absolutely SHOULD evaluate).

CE is more of a general purpose tool that can be added to a project to solve the majority of data storage or transmission issues quickly and efficiently with low friction, and then possibly swapped out for a more specialized tool later if the need arises. "First, reach for CE. Then, reach for XYZ once you actually need it."

This is a partially-solved problem, but the existing solutions are security holes due to under-specification (causing codec behavior variance), missing types (requiring custom secondary - and usually buggy - codecs), and lack of versioning (so the formats can't be updated). And security is fast becoming the dominant issue nowadays.


An interesting project!

Regarding some of the ASN.1 comparison characteristics, I'm not quite sure I understand -- there's a lot to read here, and it's likely I've missed something due to a lack of acquaintance with your documents/specifications. But a couple of comments:

- Cyclic data: ASN.1 supports recursive data structures.[0]

- Time zones: ASN.1 supports ISO 8601 time types, including specification of local or UTC time.[1] I'm not sure how else you might manage this, but perhaps it's not what you mean?

- Bin + txt: Again, I'm unclear on what you mean here, but ASN.1 has both binary and text-based encodings (X.693 for XML encoding rules[2], X.697 for JSON[3], and an RFC for generic string encoding rules[4]; compilers support input and output).

- Versioned: Also a little unclear to me--it seems like the intent is to capture the version of data sent across the wire relative to the schema used in its creation or else that it ties the encoding to the notation/encoding specification. ASN.1 supports extensibility (the ellipsis marker, ...[5]) and versioning,[6] but AFAIK there's nothing that forces a DER-encoded document to describe whether it's from the first release or the newest. Relative to security, it also supports various canonical encodings.

[0]: https://www.obj-sys.com/asn1tutorial/node19.html and X.680 3.8.61.

[1]: https://www.itu.int/rec/T-REC-X.680-X.693-202102-I/en (see X.680 §38 and Annex J.2.11)

[2]: https://www.itu.int/rec/T-REC-X.693/en -- X.694 governs interoperability between XSD and ASN.1 schema.

[3]: https://www.itu.int/rec/T-REC-X.697/en

[4]: https://datatracker.ietf.org/doc/rfc3641/

[5]: See X.680 3.8.41

[6]: See X.680 §3.8.95


> - Cyclic data: ASN.1 supports recursive data structures.

Not sure if I missed something, but the link was talking about self-referential types, not self-referential data. For example (in CTE):

    &a:{
        "recursive link" = $a
    }
In the above example, `&a:` means mark the next object and give it symbolic identifier "a". `$a` means look up the reference to symbolic identifier "a". So this is a map whose "recursive link" key is a pointer to the map itself. How this data is represented internally by the receiver of such a document (a table, a dictionary, a struct, etc) is up to the implementation, but the intent is for a structure whose data points to itself.
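To make the marker/reference mechanism concrete, here's a toy Python sketch (my own illustration, not code from any CE implementation) of how a decoder might patch a `$a` placeholder into an actual self-referential structure after the document is parsed:

```python
class Reference:
    """Placeholder a decoder might emit for `$a` before `&a:` is resolved."""
    def __init__(self, name):
        self.name = name

def resolve(obj, marks, container=None, key=None):
    """Walk the decoded structure, replacing Reference placeholders with
    the objects registered under the corresponding marker names."""
    if isinstance(obj, Reference):
        container[key] = marks[obj.name]
    elif isinstance(obj, dict):
        for k, v in obj.items():
            resolve(v, marks, obj, k)

# The CTE document above: `&a:` registers the map under "a",
# `$a` becomes a Reference to be patched afterwards.
doc = {"recursive link": Reference("a")}
marks = {"a": doc}
resolve(doc, marks)
assert doc["recursive link"] is doc  # the map now points to itself
```

The point is that the decoded value is genuinely cyclic data, not merely a recursive type definition.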

> - Time zones: ASN.1 supports ISO 8601 time types, including specification of local or UTC time.

Yes, this is the major failing of ISO 8601: it doesn't have true time zones. It only uses UTC offsets, which are a bad idea for so many reasons. https://github.com/kstenerud/concise-encoding/blob/master/ce...

> - Bin + txt: Again, I'm unclear on what you mean here, but ASN.1 has both binary and text-based encodings

Ah cool, didn't know about those.

> - Versioned: Also a little unclear to me

The intent is to specify the exact document formatting that the decoder can expect. For example we could in theory decide to make CBE version 2 a bit-oriented format instead of byte-oriented in order to save space at the cost of processing time. It would be completely unreadable to a CBE 1 decoder, but since the document starts with 0x83 0x02 instead of 0x83 0x01, a CBE 1 decoder would say "I can't decode this" and a CBE 2 decoder would say "I can decode this".
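That version check could be as simple as inspecting the document header before doing anything else. A minimal sketch (the function name and error handling are mine; only the `0x83` magic byte followed by a version number comes from the comment above):

```python
SUPPORTED_CBE_VERSION = 1  # pretend we're a CBE 1-only decoder

def check_header(document: bytes) -> int:
    """Inspect the CBE header (0x83 followed by the spec version) and
    refuse documents written against a spec version we don't support."""
    if len(document) < 2 or document[0] != 0x83:
        raise ValueError("not a CBE document")
    version = document[1]
    if version != SUPPORTED_CBE_VERSION:
        raise ValueError(f"cannot decode CBE version {version}")
    return version
```

A CBE 1 decoder accepts `0x83 0x01` and bails out cleanly on `0x83 0x02`, which is exactly the "I can't decode this" behavior described.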

With documents versioned to the spec, we can change even the fundamental structure of the format to deal with ANYTHING that might come up in the future. Maybe a new security flaw in CBE 1 is discovered; maybe a new data type becomes so popular that it would be crazy not to include it; etc. This avoids polluting the simpler encodings with deprecated types (see BSON) and bloating the format.



