I wanted to add a few details to the previous reply.
While the Raft/HybridTime implementation has its roots in Apache Kudu, the results will NOT be directly applicable to Kudu. Aside from the fact that the code base has evolved/diverged over 3+ years, there are key areas (ones very relevant to these Jepsen tests) where YugaByte DB has added capabilities or follows a different design than Kudu. For example:
-- Leader Leases: YugaByte DB doesn't use Raft consensus for reads. Instead, we have implemented "leader leases" to ensure safety when serving reads from a tablet's Raft leader (see the sketch after this list).
-- Distributed/Multi-Shard Transactions: YugaByte DB uses a home-grown protocol (https://docs.yugabyte.com/latest/architecture/transactions/t...) based on two-phase commit across multiple Raft groups. Capabilities like secondary indexes and multi-row updates use multi-shard transactions.
-- Allowing online/dynamic Raft membership changes so that tablets can be moved (such as for load-balancing to new nodes).
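To make the leader-lease point above concrete, here is a minimal sketch of the idea (the class and method names are made up for illustration; this is not YugaByte DB's actual code). The leader measures its lease from the time it sent a heartbeat that a majority later acknowledged, and serves reads locally only while that lease is in force, so a leader that has been partitioned away stops answering reads on its own once the lease runs out:

    #include <chrono>
    #include <iostream>
    #include <thread>

    // Hypothetical sketch of a leader lease (LeaderLease, OnMajorityAck and
    // CanServeRead are invented names). The leader may serve reads without a
    // Raft round trip only while it holds a time-bounded lease that a majority
    // has implicitly granted via acknowledged heartbeats.
    class LeaderLease {
     public:
      using Clock = std::chrono::steady_clock;

      explicit LeaderLease(std::chrono::milliseconds duration) : duration_(duration) {}

      // Called once a majority of followers have acknowledged a heartbeat that
      // was *sent* at 'heartbeat_sent'. Measuring from the send time keeps the
      // lease conservative under arbitrary message delays.
      void OnMajorityAck(Clock::time_point heartbeat_sent) {
        expiration_ = heartbeat_sent + duration_;
      }

      // Local (non-consensus) reads are allowed only while the lease is valid.
      bool CanServeRead() const { return Clock::now() < expiration_; }

     private:
      std::chrono::milliseconds duration_;
      Clock::time_point expiration_{};  // Default (epoch) => no lease held yet.
    };

    int main() {
      LeaderLease lease(std::chrono::milliseconds(100));
      lease.OnMajorityAck(LeaderLease::Clock::now());
      std::cout << "read allowed: " << lease.CanServeRead() << "\n";  // 1
      std::this_thread::sleep_for(std::chrono::milliseconds(150));
      std::cout << "read allowed: " << lease.CanServeRead() << "\n";  // 0 (lease expired)
    }

Measuring from the heartbeat's send time, with a monotonic clock, keeps the lease conservative: the leader's view of its lease can only be shorter than what the followers effectively granted, never longer.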
FWIW, we implemented dynamic consensus membership change in Kudu way back in 2015 (https://github.com/apache/kudu/commit/535dae), but presumably that was after the fork. We still haven't implemented leader leases or distributed transactions in Kudu, though, due to prioritizing other features. It's very cool that you have implemented those consistency features.
Thanks for correcting me on the dynamic consensus membership change. Looks like the basic support was indeed there, but several important enhancements were needed (for correctness and usability).
- Remote bootstrap (due to membership changes) has also undergone substantial changes, given that YugaByte DB uses a customized/extended version of RocksDB as the storage engine and couples Raft more tightly with the RocksDB storage engine. (https://github.com/YugaByte/yugabyte-db/blob/master/docs/ext...)
- Dynamic Leader Balancing is also new -- it proactively alters leadership in a running system to ensure each node is the leader for a similar number of tablets.
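As a toy illustration of the balancing idea just described (this is not YugaByte DB's actual algorithm, and the function names are invented): repeatedly move leadership for one tablet from the node leading the most tablets to the node leading the fewest, until every node's count is within one of the others.

    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <optional>
    #include <string>
    #include <utility>

    // Hypothetical sketch: given per-node leader counts, pick the next single
    // leadership transfer (from -> to), or nothing if already balanced.
    std::optional<std::pair<std::string, std::string>> NextLeaderMove(
        const std::map<std::string, int>& leaders_per_node) {
      auto [min_it, max_it] = std::minmax_element(
          leaders_per_node.begin(), leaders_per_node.end(),
          [](const auto& a, const auto& b) { return a.second < b.second; });
      if (max_it->second - min_it->second <= 1) return std::nullopt;  // Balanced.
      return std::make_pair(max_it->first, min_it->first);
    }

    int main() {
      std::map<std::string, int> leaders = {{"node-a", 9}, {"node-b", 4}, {"node-c", 5}};
      while (auto move = NextLeaderMove(leaders)) {
        std::cout << "transfer one leader: " << move->first << " -> " << move->second << "\n";
        --leaders[move->first];
        ++leaders[move->second];
      }
    }

A real implementation would of course rate-limit transfers and take tablet placement and in-flight operations into account; the sketch only shows the "equalize leader counts" goal.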
I'm curious if you did anything to prevent automatic rebalancing from being triggered at a "bad time" or throttled it in some way, or whether moving large amounts of data between servers at arbitrary times was not a concern.
I am also curious if you added some type of API using the LEARNER role to support a CDC-type listener interface on top of consensus.
We should really start some threads on the dev lists to periodically share this type of information and merge things back and forth to avoid duplicating work where possible. I know the systems are pretty different at the catalog and storage layers but there are still many similarities.
Yes, it does. At the core, the Raft implementation is still based on Kudu's. But these areas have been worked on actively, so the implementations may have diverged a little.
May be worth looking through the individual issues to see what applies and what doesn't:
Nope, Kudu: https://kudu.apache.org/. Although from Arrow's homepage it looks like it works with Kudu: "Apache Arrow is backed by key developers of 13 major open source projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm making it the de-facto standard for columnar in-memory analytics."
Kudu was designed to be a columnar data store, which means that if you had a schema like table PERSON(name STRING, age INT), you would store all the names together, and then all the ages together. This lets you do fancy tricks like run-length encoding on the age column, and so forth.
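A toy example of what that columnar layout and run-length encoding look like (purely illustrative, not Kudu's actual on-disk format):

    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    // Toy columnar layout for PERSON(name STRING, age INT): each column is
    // stored contiguously instead of row by row, so a column with many
    // repeated values (like age) compresses well with run-length encoding.
    struct PersonColumns {
      std::vector<std::string> names;  // All names together.
      std::vector<int32_t> ages;       // All ages together.
    };

    // Run-length encode a column of ints as (value, run length) pairs.
    std::vector<std::pair<int32_t, uint32_t>> RunLengthEncode(
        const std::vector<int32_t>& column) {
      std::vector<std::pair<int32_t, uint32_t>> runs;
      for (int32_t v : column) {
        if (!runs.empty() && runs.back().first == v) {
          ++runs.back().second;
        } else {
          runs.push_back({v, 1});
        }
      }
      return runs;
    }

    int main() {
      PersonColumns people{{"ann", "bob", "cat", "dan"}, {30, 30, 30, 41}};
      for (auto [age, count] : RunLengthEncode(people.ages)) {
        std::cout << "age " << age << " x" << count << "\n";  // "age 30 x3", "age 41 x1"
      }
    }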
There is also a Kudu RPC format which uses protocol buffers in places. But Kudu also sends some data over the wire that is not encoded as protocol buffers, to avoid the significant overhead of protocol buffer serialization / deserialization.
Apache Arrow is a separate project which started taking shape later. Essentially it was conceived as a way of allowing multiple applications to share columnar data that was in memory. I remember there being a big focus on sharing memory locally in a "zero-copy" fashion by using things like memory mapping. I never actually worked on the project, though. Developers of projects like Impala, Kudu, etc. thought this was a cool idea since it would speed up their projects. I don't know how far along the implementations are, though. Certainly Kudu did not have any integration with Arrow when I worked on it, although that may have changed.
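To illustrate the zero-copy, memory-mapping idea in isolation (this is plain POSIX mmap, not Arrow's actual API): if the bytes on disk or in a shared memory segment are already laid out in the format consumers understand, a reader can map them and use the column in place, with no deserialization step and no per-process copy of the data.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #include <cstdint>
    #include <cstdio>
    #include <iostream>

    // Generic zero-copy sketch: map a file containing raw little-endian int32
    // values and read the "column" directly out of the mapping. Any other
    // process mapping the same file shares the same page-cache pages.
    int main(int argc, char** argv) {
      if (argc < 2) { std::cerr << "usage: " << argv[0] << " <column-file>\n"; return 1; }

      int fd = open(argv[1], O_RDONLY);
      if (fd < 0) { perror("open"); return 1; }

      struct stat st;
      if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

      void* addr = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
      if (addr == MAP_FAILED) { perror("mmap"); return 1; }

      // Interpret the mapping directly as a column of 32-bit ints: no parsing,
      // no copy into a per-process buffer.
      const int32_t* column = static_cast<const int32_t*>(addr);
      size_t num_values = st.st_size / sizeof(int32_t);
      int64_t sum = 0;
      for (size_t i = 0; i < num_values; ++i) sum += column[i];
      std::cout << "sum of " << num_values << " values: " << sum << "\n";

      munmap(addr, st.st_size);
      close(fd);
      return 0;
    }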
Protocol buffers is simple and has a lot of language bindings, but its performance is poor in Java. The official PB Java libraries generate a lot of temporary objects, which is a big problem in projects like HDFS. It works better in C++, but it's still not super-efficient or anything. It's... better than ASN.1, I guess? It has optional fields, and XDR didn't...
Hasn't the runtime been implemented in Go since 1.5? A quick overview of the C content[0] shows that the C source in Go is mostly cgo (either test cases or the runtime integration[0]) and the shootout C sources (for bench tests?).