I wanted to add a few details to the previous reply.
While the Raft/HybridTime implementation has its roots in Apache Kudu, the results will NOT be directly applicable to Kudu. Aside from the fact that the code base has evolved/diverged over 3+ years, there are key areas (ones very relevant to these Jepsen tests) where YugaByte DB has added capabilities or follows a different design than Kudu. For example:
-- Leader Leases: YugaByte DB doesn't use Raft consensus for reads. Instead, we have implemented "leader leases" to ensure safety when serving reads from a tablet's Raft leader (see the sketch after this list).
-- Distributed/Multi-Shard Transactions: YugaByte DB uses a home-grown protocol (https://docs.yugabyte.com/latest/architecture/transactions/t...) based on two-phase commit across multiple Raft groups. Capabilities like secondary indexes and multi-row updates use multi-shard transactions.
-- Allowing online/dynamic Raft membership changes so that tablets can be moved (such as for load-balancing to new nodes).
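To make the leader-lease point above concrete, here is a minimal sketch of the idea (the class and method names are made up for illustration; this is not YugaByte DB's actual code). The leader measures its lease from the time it sent a heartbeat that a majority later acknowledged, and serves reads locally only while that lease is in force, so a leader that has been partitioned away stops answering reads on its own once the lease runs out:

    #include <chrono>
    #include <iostream>
    #include <thread>

    // Hypothetical sketch of a leader lease (LeaderLease, OnMajorityAck and
    // CanServeRead are invented names). The leader may serve reads without a
    // Raft round trip only while it holds a time-bounded lease that a majority
    // has implicitly granted via acknowledged heartbeats.
    class LeaderLease {
     public:
      using Clock = std::chrono::steady_clock;

      explicit LeaderLease(std::chrono::milliseconds duration) : duration_(duration) {}

      // Called once a majority of followers have acknowledged a heartbeat that
      // was *sent* at 'heartbeat_sent'. Measuring from the send time keeps the
      // lease conservative under arbitrary message delays.
      void OnMajorityAck(Clock::time_point heartbeat_sent) {
        expiration_ = heartbeat_sent + duration_;
      }

      // Local (non-consensus) reads are allowed only while the lease is valid.
      bool CanServeRead() const { return Clock::now() < expiration_; }

     private:
      std::chrono::milliseconds duration_;
      Clock::time_point expiration_{};  // Default (epoch) => no lease held yet.
    };

    int main() {
      LeaderLease lease(std::chrono::milliseconds(100));
      lease.OnMajorityAck(LeaderLease::Clock::now());
      std::cout << "read allowed: " << lease.CanServeRead() << "\n";  // 1
      std::this_thread::sleep_for(std::chrono::milliseconds(150));
      std::cout << "read allowed: " << lease.CanServeRead() << "\n";  // 0 (lease expired)
    }

Measuring from the heartbeat's send time, with a monotonic clock, keeps the lease conservative: the leader's view of its lease can only be shorter than what the followers effectively granted, never longer.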
FWIW, we implemented dynamic consensus membership change in Kudu way back in 2015 (https://github.com/apache/kudu/commit/535dae), but presumably that was after the fork. We still haven't implemented leader leases or distributed transactions in Kudu, though, due to prioritizing other features. It's very cool that you have implemented those consistency features.
Thanks for correcting me on the dynamic consensus membership change. Looks like the basic support was indeed there, but several important enhancements were needed (for correctness and usability).
- Remote bootstrap (due to membership changes) has also undergone substantial changes, given that YugaByte DB uses a customized/extended version of RocksDB as the storage engine and couples Raft more tightly with the RocksDB storage engine. (https://github.com/YugaByte/yugabyte-db/blob/master/docs/ext...)
- Dynamic Leader Balancing is also new -- it proactively alters leadership in a running system to ensure each node is the leader for a similar number of tablets.
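As a toy illustration of the balancing idea just described (this is not YugaByte DB's actual algorithm, and the function names are invented): repeatedly move leadership for one tablet from the node leading the most tablets to the node leading the fewest, until every node's count is within one of the others.

    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <optional>
    #include <string>
    #include <utility>

    // Hypothetical sketch: given per-node leader counts, pick the next single
    // leadership transfer (from -> to), or nothing if already balanced.
    std::optional<std::pair<std::string, std::string>> NextLeaderMove(
        const std::map<std::string, int>& leaders_per_node) {
      auto [min_it, max_it] = std::minmax_element(
          leaders_per_node.begin(), leaders_per_node.end(),
          [](const auto& a, const auto& b) { return a.second < b.second; });
      if (max_it->second - min_it->second <= 1) return std::nullopt;  // Balanced.
      return std::make_pair(max_it->first, min_it->first);
    }

    int main() {
      std::map<std::string, int> leaders = {{"node-a", 9}, {"node-b", 4}, {"node-c", 5}};
      while (auto move = NextLeaderMove(leaders)) {
        std::cout << "transfer one leader: " << move->first << " -> " << move->second << "\n";
        --leaders[move->first];
        ++leaders[move->second];
      }
    }

A real implementation would of course rate-limit transfers and take tablet placement and in-flight operations into account; the sketch only shows the "equalize leader counts" goal.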
I'm curious if you did anything to prevent automatic rebalancing from being triggered at a "bad time" or throttled it in some way, or whether moving large amounts of data between servers at arbitrary times was not a concern.
I am also curious if you added some type of API using the LEARNER role to support a CDC-type listener interface on top of consensus.
We should really start some threads on the dev lists to periodically share this type of information and merge things back and forth to avoid duplicating work where possible. I know the systems are pretty different at the catalog and storage layers but there are still many similarities.
Yes, it does. At the core, the Raft implementation is still based on Kudu's. But these areas have been worked on actively, so the implementations may have diverged a little.
May be worth looking through the individual issues to see what applies and what doesn't:
Nope, Kudu: https://kudu.apache.org/. Although from Arrow's homepage it looks like it works with Kudu: "Apache Arrow is backed by key developers of 13 major open source projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm making it the de-facto standard for columnar in-memory analytics."
Kudu was designed to be a columnar data store, which means that if you had a schema like table PERSON(name STRING, age INT), you would store all the names together, and then all the ages together. This lets you do fancy tricks like run-length encoding on the age column, and so forth.
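A toy example of what that columnar layout and run-length encoding look like (purely illustrative, not Kudu's actual on-disk format):

    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    // Toy columnar layout for PERSON(name STRING, age INT): each column is
    // stored contiguously instead of row by row, so a column with many
    // repeated values (like age) compresses well with run-length encoding.
    struct PersonColumns {
      std::vector<std::string> names;  // All names together.
      std::vector<int32_t> ages;       // All ages together.
    };

    // Run-length encode a column of ints as (value, run length) pairs.
    std::vector<std::pair<int32_t, uint32_t>> RunLengthEncode(
        const std::vector<int32_t>& column) {
      std::vector<std::pair<int32_t, uint32_t>> runs;
      for (int32_t v : column) {
        if (!runs.empty() && runs.back().first == v) {
          ++runs.back().second;
        } else {
          runs.push_back({v, 1});
        }
      }
      return runs;
    }

    int main() {
      PersonColumns people{{"ann", "bob", "cat", "dan"}, {30, 30, 30, 41}};
      for (auto [age, count] : RunLengthEncode(people.ages)) {
        std::cout << "age " << age << " x" << count << "\n";  // "age 30 x3", "age 41 x1"
      }
    }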
There is also a Kudu RPC format which uses protocol buffers in places. But Kudu also sends some data over the wire that is not encoded as protocol buffers, to avoid the significant overhead of protocol buffer serialization / deserialization.
Apache Arrow is a separate project which started taking shape later. Essentially it was conceived as a way of allowing multiple applications to share columnar data that was in memory. I remember there being a big focus on sharing memory locally in a "zero-copy" fashion by using things like memory mapping. I never actually worked on the project, though. Developers of projects like Impala, Kudu, etc. thought this was a cool idea since it would speed up their projects. I don't know how far along the implementations are, though. Certainly Kudu did not have any integration with Arrow when I worked on it, although that may have changed.
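To illustrate the zero-copy, memory-mapping idea in isolation (this is plain POSIX mmap, not Arrow's actual API): if the bytes on disk or in a shared memory segment are already laid out in the format consumers understand, a reader can map them and use the column in place, with no deserialization step and no per-process copy of the data.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #include <cstdint>
    #include <cstdio>
    #include <iostream>

    // Generic zero-copy sketch: map a file containing raw little-endian int32
    // values and read the "column" directly out of the mapping. Any other
    // process mapping the same file shares the same page-cache pages.
    int main(int argc, char** argv) {
      if (argc < 2) { std::cerr << "usage: " << argv[0] << " <column-file>\n"; return 1; }

      int fd = open(argv[1], O_RDONLY);
      if (fd < 0) { perror("open"); return 1; }

      struct stat st;
      if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

      void* addr = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
      if (addr == MAP_FAILED) { perror("mmap"); return 1; }

      // Interpret the mapping directly as a column of 32-bit ints: no parsing,
      // no copy into a per-process buffer.
      const int32_t* column = static_cast<const int32_t*>(addr);
      size_t num_values = st.st_size / sizeof(int32_t);
      int64_t sum = 0;
      for (size_t i = 0; i < num_values; ++i) sum += column[i];
      std::cout << "sum of " << num_values << " values: " << sum << "\n";

      munmap(addr, st.st_size);
      close(fd);
      return 0;
    }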
Protocol buffers is simple and has a lot of language bindings, but its performance is poor in Java. The official PB Java libraries generate a lot of temporary objects, which is a big problem in projects like HDFS. It works better in C++, but it's still not super-efficient or anything. It's... better than ASN.1, I guess? It has optional fields, and XDR didn't...
Hasn't the runtime been implemented in Go since 1.5? A quick overview of the C content[0] shows that the C source in Go is mostly cgo (either test cases or the runtime integration[0]) and the shootout C sources (for bench tests?).