An alternative approach might be to use an existing popular serialization format such as Protocol Buffers, Apache Thrift, or Cap'N'Proto and create or improve tools that convert between a human-readable text format and the serialized binary format.
For example:
- Protocol buffers have a text format mode: https://medium.com/@nathantnorth/protocol-buffers-text-forma...
- Thrift has readable-thrift, a human-friendly encoder and decoder: https://github.com/nccgroup/readable-thrift
- Cap'N'Proto has a `capnp` tool for encoding/decoding text representations to binary, which seems to be officially supported and documented! https://capnproto.org/capnp-tool.html
These libraries have been battle-tested by major companies in production, some of the protocols and implementations have gone through security audits, and each of these formats already has many language bindings, for example:
- Protocol Buffers third-party language bindings: https://github.com/protocolbuffers/protobuf/blob/master/docs...
- Apache Thrift language support: https://github.com/apache/thrift/blob/master/LANGUAGES.md
- Cap'N'Proto in other languages: https://capnproto.org/otherlang.html
- Protobufs is not an ad-hoc format, and being ad-hoc (no schema required) is a big reason why low-friction formats like JSON are popular. There are many use cases where formats like protobufs are clearly the superior choice, but CE doesn't target those. This is a fundamental trade-off, so you can't have both.
- readable-thrift is a diagnostic tool. You wouldn't want to be inputting data like that. I want the text format to be fully usable by non-technical people, like JSON is.
- the capn proto tool page doesn't seem to document how the text format works (or at least I couldn't find any examples). It looks more like a diagnostic tool, not a first-class citizen.
I felt that there were enough pain points, missing types, and missing security features (for example versioning) to warrant a fresh start.
> the capn proto tool page doesn't seem to document how the text format works (or at least I couldn't find any examples). It looks more like a diagnostic tool, not a first-class citizen.
Cap'n Proto's text format works pretty much exactly like Protobuf's. You can use it in all the same ways.
This looks like a very ambitious project, and I can see that you've put a lot of thought, time, and effort into it! You clearly have a lot of interesting ideas (the graph idea is really cool) and significant experience with data formats.
If this is a security-oriented application, then with cyclic data structures there is the risk of blowing out your server's memory using something like a fork bomb when processing untrusted user input (https://en.wikipedia.org/wiki/Fork_bomb).
There are some systems like Dhall that guarantee termination by putting upper bounds on computation: https://dhall-lang.org/
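To make the concern concrete: a decoder that materializes untrusted references without limits can be driven into unbounded traversal or memory growth. Here's a minimal defensive sketch in Go (the Node type, walk function, and budget are all hypothetical, not part of CE):

    package decode

    import "errors"

    // Node is a hypothetical decoded-document node.
    type Node struct {
        Children []*Node
    }

    // walk traverses untrusted data with a hard node budget and a cycle
    // check - the usual defences against cyclic/amplification bombs.
    func walk(n *Node, seen map[*Node]bool, budget *int) error {
        if n == nil {
            return nil
        }
        if seen[n] {
            return errors.New("cycle detected in untrusted input")
        }
        if *budget <= 0 {
            return errors.New("node budget exhausted: possible amplification attack")
        }
        (*budget)--
        seen[n] = true
        for _, c := range n.Children {
            if err := walk(c, seen, budget); err != nil {
                return err
            }
        }
        return nil
    }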
I'm also a bit concerned with how the different features can interact. For example, it's not super clear how to distinguish between a UTC offset (-130, or do these always have to be 4 digits?) and global coordinates (-130/-172). An attacker could also embed what looks like a comment delimiter inside a media type (eg: application/*), which would require special logic to filter out.
My concern is that the parser will become extremely complicated and require a lot of special-case logic and validation (eg: there must be at least one digit on each side of a radix point), which is more prone to errors and unexpected behaviors.
Rather than using slash delimiters, I'd recommend splitting the time formats into subfields, eg:
    { date: "2022-01-01"
      time: "21:14:10"
      offset_is_negative: true
      offset: "10:30" }
This does make the text format more verbose, but it reduces ambiguity and makes the parsing faster as well since you don't need to descend into branches and backtrack when they don't match, and also might permit more code/logic reuse.
It's also not clear how easy it is to add new data types to the grammar. Based on the project description, it seems like you're using an ANTLR parser.
Yes, I had a look at Avro as well. I've been following all of the established and nascent formats over a number of years, hoping for one that addresses my concerns, but unfortunately nothing emerged. My ambitions are actually at a much higher level; this is just to set a solid foundation for them.
Cyclic bombs are but one security concern... There are actually a LOT of them, which I try to cover in the security section ( https://github.com/kstenerud/concise-encoding/blob/master/ce... ). The security space is of course wider and more nuanced than this, but I didn't want to turn it into an entire tome, so I tried to cover the basic philosophical problems. At the end of the day, you must treat data crossing boundaries as hostile, and build your ingestors with that in mind. Sane defaults can avoid the worst of them (and CE actually REQUIRES sane defaults for a lot of things in order to be compliant), but no format can protect you completely.
A "fork bomb" using cyclic data is unlikely, unless your application code is really naive (if you're using cyclic data, you need to have a well-reasoned purpose for it, and are likely just using pointers internally - which won't blow out your memory unless you're doing something foolish when processing the resulting structs). Actually, this does give me an idea... make cyclic data disallowed by default, just to cover the common case where people don't use it and don't even want to think about it.
Re time formats: global coordinates will always start with a slash, so 12:00:00/-130/-172. UTC offsets will always start with + or -, and be 4 digits long, so 12:00:00+0130 or 12:00:00-0130.
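A minimal sketch of that disambiguation in Go, following the rules above (the function and its name are illustrative, not the reference implementation's API):

    package timeparse

    import (
        "regexp"
        "strings"
    )

    var (
        // UTC offset: sign followed by exactly 4 digits, eg: +0130, -0130.
        utcOffsetRe = regexp.MustCompile(`^[+-][0-9]{4}$`)
        // Global coordinates: always start with a slash, eg: /-130/-172.
        coordinateRe = regexp.MustCompile(`^/-?[0-9]+(\.[0-9]+)?/-?[0-9]+(\.[0-9]+)?$`)
    )

    // classifyTimeSuffix reports how the text following the seconds field
    // should be interpreted.
    func classifyTimeSuffix(s string) string {
        switch {
        case strings.HasPrefix(s, "/") && coordinateRe.MatchString(s):
            return "global coordinates"
        case utcOffsetRe.MatchString(s):
            return "utc offset"
        default:
            return "invalid" // eg: "-130" has the wrong digit count for an offset
        }
    }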
The validation rules are very specific, and that does complicate the text format a bit, but this drives to the central purpose of it: the text format is for the USER, and is not what you send to other machines or foreign systems. It's for a user to edit or inspect or otherwise interact with the data on the RARE occasions where that is necessary. So the text format doesn't need to be fast or efficient, only unambiguous and easy for a human to read.
You certainly shouldn't open an internet-connected service that accepts the text format as input (except maybe during development and debugging...). In fact, I would expect a number of CE implementations (such as for embedded systems) to only include CBE support, since you could just use a standalone command-line tool or the like to analyze the data in most cases.
Re: subfields. That would make it harder for a human to read. The text format sacrifices some efficiency for human friendliness and better UX. Parser logic re-use isn't really a priority (other than making sure it's not OBVIOUSLY bad for the parser), because text parsing/encoding is supposed to be the 0.0001% use case.
It's not super easy to add new types to the text format grammar, but that's fine because human friendliness trumps almost all, and adding new types should be done with EXTREME CARE. I've lost count of all the types I've added and then scrapped over the years. It's really hard to come up with these AND justify them!
The ANTLR grammar is actually more of a documentation thing. I've verified it in a toy parser, but it's not actually tied to the reference implementation (yet). The reference implementation is currently similar to a parser combinator, with a lot of inspiration from the golang team's JSON parser (I watched a talk by the author some time ago and was impressed).
But at the same time I'm starting to wonder if it might have been better to implement the reference implementation as just an ANTLR parser after all... leave the optimizations and ensuing complications to other implementations and keep the reference implementation readable and understandable. The binary format code is super simple, and about 1/3 the size of the text format code. The major downside of ANTLR, of course, is the terrible error reporting.
Thank you for the detailed and comprehensive explanations!
> There are actually a LOT of [security concerns], which I try to cover in the security section
If you'd like to eventually harden the binary implementations, you might also be interested in coverage-guided fuzz testing which feeds random garbage data to a method to try and find errors in it: https://llvm.org/docs/LibFuzzer.html
as well as maybe some kind of optional checksum or digital signature to ensure that the payload has not been tampered with (although perhaps this should be performed in another higher layer of the stack).
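For a Go reference implementation, the same coverage-guided idea is also available natively via `go test -fuzz`. A sketch of what such a fuzz target might look like (the cbe.Decode entry point and import path are hypothetical):

    package cbe_test

    import (
        "bytes"
        "testing"

        "example.com/cbe" // hypothetical import path
    )

    func FuzzDecode(f *testing.F) {
        f.Add([]byte{0x81, 0x00}) // seed corpus entry (arbitrary bytes)
        f.Fuzz(func(t *testing.T, data []byte) {
            // The decoder must never panic or hang on hostile input;
            // returning an error is the only acceptable failure mode.
            _, _ = cbe.Decode(bytes.NewReader(data))
        })
    }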
> make cyclic data disallowed by default, just to cover the common case where people don't use it and don't even want to think about it.
Yes, I think that making it an option which is restrictive (safe) by default would be a great idea. Or perhaps separating the more dynamic types (eg: graphs, markup, binary data) out into loadable modules could also reduce the default attack surface area.
> You certainly shouldn't open an internet connected service that accepts the text format as input (except maybe during development and debugging...)
Yes, I fully agree with this! I initially assumed that the text format could be sent from an untrusted client similar to JSON and XML, but this makes more sense.
> because text parsing/encoding is supposed to be the 0.0001% use case
I see, so the main use case of the CTE text format is rapid prototyping, and then the user should convert to the CBE binary format in production?
> It's not super easy to add new types to the text format grammar
Customizable types could be a really great way to differentiate from other serialization protocols. I did notice that the system allows the user to define custom structs which is quite useful.
Another approach would be to embed the grammar and parser into an existing language like Python, Rust, or Haskell, and let the user define their own custom types in that language. In my experience, custom types help prevent a lot of errors. For a fitness tracker IoT application, for example, you could define separate types for ip_v4 address, duration_milliseconds, temperature_celsius, heart_rate_beats_per_minute, and blood_pressure_mm_hg (for systolic and diastolic blood pressure) rather than using just floating-point or fixed-point numbers, and this could prevent many potential unit-conversion and incorrect-variable-use errors at compile time. Or you could better model your domain with custom types (eg: reuse the global coordinate data structure from the timezones implementation to create path or polygon types using repeated coordinates).
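A sketch of that idea using Go newtypes (all names here are illustrative):

    package fitness

    // Distinct types for distinct units; the compiler then rejects
    // accidental cross-unit assignments and mixed-unit arithmetic.
    type (
        DurationMs   int64   // duration_milliseconds
        TempCelsius  float32 // temperature_celsius
        HeartRateBPM uint16  // heart_rate_beats_per_minute
        MmHg         uint16  // millimetres of mercury
    )

    // BloodPressure models systolic/diastolic as a pair of MmHg values.
    type BloodPressure struct {
        Systolic  MmHg
        Diastolic MmHg
    }

    // var hr HeartRateBPM = bp.Systolic // would not compile: wrong type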
> adding new types should be done with EXTREME CARE
maybe it would make sense to create a small set of core types (kind of like a standard library), and then permit extensions via user-defined types which must be whitelisted? But pursuing that route could end up addressing a very different niche (favoring a stricter schema) in the design space.
> The major downside of ANTLR of course is the terrible error reporting.
This is a major advantage of the parser combinator approach: it is possible to design them to emit very helpful and context-aware error messages. See, for example, the examples at the end of: https://www.quanttec.com/fparsec/users-guide/customizing-err...
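The core trick is easy to sketch in Go: wrap each combinator with a label so failures report what was being parsed and where (the types and names here are illustrative):

    package combinator

    import "fmt"

    // parser consumes input from pos and returns the new position or an error.
    type parser func(input string, pos int) (int, error)

    // labeled wraps a parser so its failures name the construct being
    // parsed - this is what makes the error messages context-aware.
    func labeled(name string, p parser) parser {
        return func(input string, pos int) (int, error) {
            newPos, err := p(input, pos)
            if err != nil {
                return pos, fmt.Errorf("at offset %d: while parsing %s: %w", pos, name, err)
            }
            return newPos, nil
        }
    }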
Anyway, hope this was useful and I wish you good luck with your project!
> If you'd like to eventually harden the binary implementations, you might also be interested in coverage-guided fuzz testing which feeds random garbage data to a method to try and find errors in it: https://llvm.org/docs/LibFuzzer.html
Yes, I plan to fuzz the hell out of the reference implementation once it's done. So much to do, so little time...
> I see, so the main use case of the CTE text format is rapid prototyping, and then the user should convert to the CBE binary format in production?
CTE would be for prototyping, initial data loads, debugging, auditing, logging, visualizing, possibly even for configuration (since the config would be local and not sourced from an unknown origin). Basically: CBE when data passes from machine to machine, and CTE only where a human needs to get involved.
> Another approach would be to embed the grammar and parser into an existing language like Python, Rust, or Haskell, and let the user define their own custom types in that language.
I demonstrate this in the reference implementation by adding cplx() type support for Go as a custom type. Then people are free to come up with their own encodings for their custom needs (one could specify in the schema how to decode them). I think there's enough there as-is to support most custom needs.
> maybe it would make sense to create a small set of core types (kind of like a standard library), and then permit extensions via user-defined types which must be whitelisted?
I thought about that, but the complexity grows fast, and then you have a constellation of "conformant" codecs that have different levels of support, which means you can now only count on the minimal set of required types and the rest are useless. The fewer optional parts, the better.