EVCache definitely has some sharp edges and can be hard to use, which is one of the reasons we are putting it behind gRPC abstractions like this Counter one or KeyValue [1], which offer CompletableFuture APIs with clean async and blocking modalities. We are also starting to add proper async APIs to EVCache itself, e.g. getAsync [2], which the abstractions use under the hood.
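To make the "clean async and blocking modalities" concrete, here's a toy sketch of the dual-modality shape (this is not the actual Counter/KeyValue API; the class and method names are invented, and a ConcurrentHashMap stands in for the real cache):

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one async CompletableFuture code path, with the
// blocking modality derived from it so both styles share the same logic.
public class KeyValueClient {
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    // Async modality: returns immediately with a future.
    public CompletableFuture<Optional<String>> getAsync(String key) {
        return CompletableFuture.supplyAsync(() -> Optional.ofNullable(store.get(key)));
    }

    // Blocking modality: just joins the async path.
    public Optional<String> get(String key) {
        return getAsync(key).join();
    }

    public CompletableFuture<Void> putAsync(String key, String value) {
        return CompletableFuture.runAsync(() -> store.put(key, value));
    }
}
```

Deriving the blocking call from the async one (rather than the other way around) keeps a single code path and avoids tying up threads in the async case.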
At the same time, EVCache is the cheapest (by about 10x in our experiments) caching solution with global replication [3] and cache warming [4] that we are aware of. Every time we've tried alternatives like Redis or managed services, they either fail to scale (e.g. they cannot leverage flash storage effectively [5]) or cost far too much at our scale.
I absolutely agree, though, that EVCache is probably the wrong choice for most folks - most folks aren't doing 100 million operations/second with 4-region full-active replication and applications that expect p50 client-side latency under 500us. It's similar, I think, to how most folks should probably start with PostgreSQL and not Cassandra.
Throwing out a clarification: EVCache is effectively a complex memcached client plus an internal ecosystem at Netflix. You can get much of its benefit from other systems (such as memcached's internal proxy: https://docs.memcached.org/features/proxy/).
Apps plugging into it may only need a small slice of EVCache: the fetch from local-then-far, copying sets to multiple zones, etc. A greenfield client with the same backing store would be fairly trivial to build.
That all said, I wouldn't advise people to copy their method of expanding cache clusters: it's possible to add or remove one instance at a time without rebuilding and re-warming the whole thing.
Every zone has a copy, and clients always read their local zone's copy first (via pooled memcached connections), falling back only once to another zone on a miss. The key is staying in-zone, the memcached protocol, and very fast server latencies. It's been a little while since we measured, but memcached has a time to first byte of around 10us and then scales sublinearly with payload size [1]. Single-zone latency is variable but generally between 150 and 250us round trip; cross-AZ is terrible at up to a millisecond [2].
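The local-then-far read path described above can be sketched roughly like this (a toy illustration only: plain Maps stand in for per-zone memcached pools, and the zone names and class names are invented):

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of zone-aware reads: hit the local zone's replica
// first, and on a miss fall back exactly once across the remote zones.
public class ZoneFallbackReader {
    private final String localZone;
    private final Map<String, Map<String, String>> replicas; // zone -> (key -> value)

    public ZoneFallbackReader(String localZone, Map<String, Map<String, String>> replicas) {
        this.localZone = localZone;
        this.replicas = replicas;
    }

    public Optional<String> get(String key) {
        // Fast path: local zone replica (cheap, in-zone round trip).
        String local = replicas.get(localZone).get(key);
        if (local != null) return Optional.of(local);
        // Single fallback: check the remote zones once on a local miss.
        for (Map.Entry<String, Map<String, String>> e : replicas.entrySet()) {
            if (e.getKey().equals(localZone)) continue;
            String remote = e.getValue().get(key);
            if (remote != null) return Optional.of(remote);
        }
        return Optional.empty();
    }
}
```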
So combine ~200us of network round trip with ~30us of server response time and you get about 250us average latency. Of course the p99 tail is closer to a millisecond, and you have to do things like hedged requests to fight things like the hard-coded 200ms minimum TCP retransmission timer (an eternity at these latencies)... but that's a whole other can of worms.
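A minimal sketch of the hedging idea (all names here are invented, and real implementations also need error propagation, cancellation of the losing request, and hedge budgets, all of which are elided): fire the request, and if it hasn't completed within a short hedge delay, fire a second identical request and take whichever finishes first.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class Hedger {
    // Daemon timer thread so the JVM can still exit.
    private static final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });

    // Launch the primary request; after hedgeDelayMs, launch a duplicate
    // if the result is still pending, and complete with the first winner.
    public static <T> CompletableFuture<T> hedged(
            Supplier<CompletableFuture<T>> request, long hedgeDelayMs) {
        CompletableFuture<T> result = new CompletableFuture<>();
        request.get().whenComplete((v, err) -> { if (err == null) result.complete(v); });
        timer.schedule(() -> {
            if (!result.isDone()) { // still waiting: fire the hedge
                request.get().whenComplete((v, err) -> { if (err == null) result.complete(v); });
            }
        }, hedgeDelayMs, TimeUnit.MILLISECONDS);
        return result;
    }
}
```

The hedge delay is typically set around the p95-p99 latency, so the duplicate work only happens for the slowest few percent of requests.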
My experience has been that the talent density is the main difference. Netflix tackles huge problems with a small number of engineers. I think one angle of complexity you may be missing is efficiency - both in engineering cost and infrastructure cost.
Also YouTube has _excellent_ engineering (e.g. Vitess in the data space), and they are building atop an excellent infrastructure (e.g. Borg and the godly Google network). It's worth noting though that the whole Netflix infrastructure team is probably smaller than a small to medium satellite org at Google.
It is indeed similar to DynamoDB as well as the original Cassandra Thrift API! This is intentional since those are both targeted backends and we need to be able to migrate customers between Cassandra Thrift, Cassandra CQL and DynamoDB. One of the most important things we use this abstraction for is seamless migration [1] as use cases and offerings evolve. Rather than think of KeyValue as the only database you ever need, think of it like your language's Map interface, and depending on the problem you are solving you need different implementations of that interface (different backing databases).
Graphs are indeed a challenge (and Relational is completely out of scope), but the high-scale Netflix graph abstraction is actually built atop KV just like a Graph library might be built on top of a language's built in Map type.
As a long-time user and developer of databases, I would suggest isolation failures are not actually the source of most data-related bugs. Most bugs I deal with are due to other failure modes, like:
* We didn't think about how we would retry this operation when something fails or times out (idempotency)
* We didn't put the appropriate checksums in the right place (corruption)
* We didn't handle the load, often due to trying to provide stronger guarantees than the application needs, and went down causing lost operations (performance bottlenecks)
* We deployed bad software to the app or database, causing corruption that can't be repaired because we already purged the relevant commit/redo logs and snapshots.
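The checksum bullet above can be made concrete with a small sketch (the record shape is invented; CRC32 from the JDK stands in for whatever checksum the storage layer actually uses): compute a checksum at write time, store it with the record, and verify it on read so silent corruption is caught at the boundary rather than much later.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Toy record that carries its own checksum, taken at write time.
public class ChecksummedRecord {
    public final byte[] payload;
    public final long crc;

    public ChecksummedRecord(byte[] payload) {
        this.payload = payload;
        this.crc = checksum(payload);
    }

    public static long checksum(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }

    // True only if the payload still matches the write-time checksum.
    public boolean verify() {
        return checksum(payload) == crc;
    }

    public static ChecksummedRecord of(String s) {
        return new ChecksummedRecord(s.getBytes(StandardCharsets.UTF_8));
    }
}
```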
I legitimately don't understand the calls for "SERIALIZABLE is the only valid isolation level" - I can't recall ever seeing an at-scale production system pay that cost for both writes _and_ reads. Almost all applications I've seen (including banking/payment software) are fine with eventually consistent reads, as long as the staleness period is understood and reasonably bounded in time. Once you move past a single geographic datacenter, serializable writes become extremely expensive unless you can automatically home users to the appropriate leader datacenter, which most engineering teams can't guarantee.
The key is typically not isolation; it's modeling your application in an idempotent fashion that doesn't require isolation to be correct, and keeping snapshots and those idempotent operation logs for at least a good few weeks. Maybe the Java analogy would be "if you can design it to not need locks, do that".
Serializable is easy to reason about, and it also moves the distributed-systems problems into the database, where they can be handled more appropriately, imo.
It is by no means a silver bullet and depending on your application it may not be the right choice.
The whole point of the RDBMS revolution in the 70s and 80s was to try to bring about a world where developers did not have to care about how their data was stored, and could rely on consistency (and data representational independence).
The way this all should have gone down is that caching should have been something DB vendors resolved, rather than something pushed into the application tier. But the push toward three-tier architectures, OOP, and ORMs meant this wasn't feasible.
What would be ideal is a single consistent data retrieval model, which extends from the physical retrieval of relations, all the way up to the presentation layer, all one transaction, and handles caching for you. There is already caching happening within the DBMS, for example...
I'm saying the line between the two is largely of our own making. The push towards OO and component models meant a strong separation between the two layers -- this was and is accepted as the "right" way to model things. But it comes with the cost of leaky abstractions, potentially broken isolation models, and high non-essential complexity by nature of the constant transition between components.
If it wasn't for this, we could be looking at DB architectures in which application logic co-habits with the DB. This doesn't imply application logic in the DB, but means that the DB's view of the data moves its way up into the application. Where the logic gets executed isn't as much the concern as what that logic operates on, and that the data isolation model is consistent.
I am also of the opinion that the relational model, with its predicate-logic view of the world, is a richer way to model information than objects. So that's my bias.
A lot of this is straight out of the "Out of the Tarpit" paper, FWIW.
Something like Hibernate in Java will fetch data from the database once, populate objects (potentially creating cycles and complex relationships between Java objects), and then let your business logic deal with those long-running, persistent Java objects (as opposed to objects you discard right after you're done with them once you've queried the database).
This means that if you ever happen to use such an object again in another context without making a new query, you risk dealing with stale data. And this happens all the time, because querying the db is seen as "expensive" and reusing model objects is "cheap".
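The staleness hazard can be shown with a toy illustration (no actual ORM involved; plain HashMaps stand in for the database and the ORM's session cache): once an object is cached, a later writer to the underlying store is invisible to the cached copy.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleCacheDemo {
    static final Map<Long, String> database = new HashMap<>();     // source of truth
    static final Map<Long, String> sessionCache = new HashMap<>(); // long-lived objects

    static String load(long id) {
        // Reuse the cached object because hitting the db is "expensive".
        return sessionCache.computeIfAbsent(id, database::get);
    }

    public static void main(String[] args) {
        database.put(1L, "alice@example.com");
        String first = load(1L);                   // populates the session cache
        database.put(1L, "alice@new.example");     // another writer updates the row
        String second = load(1L);                  // stale: cache still has the old value
        System.out.println(first.equals(second));  // prints "true"
    }
}
```

Real ORMs offer explicit escape hatches (refresh, evict, shorter session scopes) precisely because of this; the bug is relying on the cached object without deciding how stale it is allowed to be.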
First, I am not sure the data on most in-use hardware (e.g. EC2 m5/c5/i3en etc ...) supports your conclusions. xxHash is always faster than cryptographic hashes, and single-threaded BLAKE3 is faster on every Intel machine I've come across in wide deployment. I hear similar arguments about CRC-32, and to be frank it just isn't true on most computers most people actually run things on.
Second, many languages don't properly use the hardware instructions, and when they do, they often don't use them correctly. For example, Java 8 has bog-slow SHA-1, AES-GCM, and MD5 implementations; switching to the Amazon Corretto Crypto Provider (which just uses proper native crypto) sped SHA/MD5 up by 50% and AES-GCM by ~90% on a reasonably large deployment (and although the JDK gained proper hardware instructions for AES-GCM around Java 9, I think it is still slower even after that).
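Since the point is that your JVM and provider matter, a quick-and-dirty probe like the following (not a rigorous benchmark: no JIT warmup control, single run, and the class name is invented) is enough to see what your particular setup actually delivers for a given algorithm:

```java
import java.security.MessageDigest;

public class HashProbe {
    // Hash `megabytes` MiB of zeros and report rough MB/s throughput.
    public static double mbPerSec(String algorithm, int megabytes) {
        try {
            byte[] chunk = new byte[1 << 20]; // 1 MiB buffer
            MessageDigest md = MessageDigest.getInstance(algorithm);
            long start = System.nanoTime();
            for (int i = 0; i < megabytes; i++) md.update(chunk);
            md.digest();
            double seconds = (System.nanoTime() - start) / 1e9;
            return megabytes / seconds;
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        for (String alg : new String[] {"MD5", "SHA-1", "SHA-256"}) {
            System.out.printf("%-8s %.0f MB/s%n", alg, mbPerSec(alg, 256));
        }
    }
}
```

For real comparisons, use a harness like JMH; this only shows the shape of the measurement.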
That being said, like I disclaimed at the top of the benchmark, your particular hardware and your particular language matter a lot.
Agree with everything you say, except that the post didn't mention non-cryptographic hashing algos that can be driven that hard. xxHash [1] (and especially XXH3) is almost always the fastest hashing choice, as it is both fast and widely supported across languages.
Sure, there are some other fast ones out there, like CityHash [2], but there aren't good Java/Python bindings that I'm aware of, and I wouldn't recommend using it in production given its lack of widespread use versus xxHash, which is used internally by LZ4 and in databases all over the place.
>> or even just a high number of rounds of SHA-512
> Please god no :-)
Heh, I didn't mean it as a recommendation per se, but I'm pretty sure Linux uses repeated rounds of SHA-512 for password hashing on at least some common distros. At least on my Ubuntu Focal machine, /etc/shadow appears to be using SHA-512.
This is exactly the kind of stuff I wrote this post about. The first (exactly once) is actually just at-least-once with deduplication based on a counter. To reliably process the events, however, you need to make your downstream idempotent as well. Think of it like this: your event processor might fail, so even if you only receive the message "once", if you can fail while processing it you still have to think about retries. In my opinion it would be explicitly better for the event system to provide "at least once" and intentionally duplicate events on occasion to test your processor's ability to handle duplicates.
The second (the lock) is actually a lease, from what I can tell, and writing code in the locked body that assumes true mutual exclusion is pretty dangerous (see the linked post from Martin [1] for why).
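The first pattern, at-least-once delivery plus consumer-side deduplication, can be sketched like this (the event shape and ids are made up for illustration):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// "Exactly once" as at-least-once plus dedup: remember which event ids
// have already been applied and drop redeliveries. A real system would
// persist the seen-set (or a high-water-mark counter) with the state.
public class DedupingConsumer {
    private final Set<Long> seen = new HashSet<>();
    private final List<String> applied = new ArrayList<>();

    // Returns true if the event was applied, false if it was a duplicate.
    public boolean process(long eventId, String payload) {
        if (!seen.add(eventId)) {
            return false; // redelivery: already applied, safe to drop
        }
        applied.add(payload);
        return true;
    }

    public List<String> applied() { return applied; }
}
```

Note the dedup state and the application of the event need to be updated atomically (e.g. in one transaction) or the processor can still double-apply after a crash between the two steps.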
> In my opinion it would be explicitly better for the event system to provide "at least once" and intentionally duplicate events on occasion to test your processor's ability to handle duplicates.
It conveys a false sense of correctness. Usually the system doing the processing has to use higher level or external methods of providing idempotency.
For example TCP implements "exactly once processing" by your definition but you probably still want Stripe to include idempotency keys in their charge API so you don't pay twice.
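The idempotency-key pattern mentioned above looks roughly like this on the server side (an illustrative sketch only, not Stripe's actual implementation; names and shapes are invented): the first request's result is stored under the caller-supplied key, and any retry with the same key gets that stored result back instead of a second charge.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ChargeServer {
    private final Map<String, Long> completed = new ConcurrentHashMap<>(); // key -> charge id
    private long nextChargeId = 1;

    // amountCents validation and persistence are elided in this sketch.
    public synchronized long charge(String idempotencyKey, long amountCents) {
        // A retry with the same key replays the original charge id:
        // the network can deliver the request twice, but we bill once.
        return completed.computeIfAbsent(idempotencyKey, k -> nextChargeId++);
    }
}
```

The client generates the key before the first attempt, so even a retry after a timeout (where it never learned whether the first attempt landed) is safe.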
Great points, transactions are certainly useful in helping developers think about state transitions. I think some of the ~snark might come from my personal struggles with trying to convey why wrapping non idempotent state transitions in "BEGIN TRANSACTION ... COMMIT" doesn't immediately make the system reliable. I completely agree transactions make understanding the state transitions easier and that is valuable.
I do think CRDTs or idempotency/fencing tokens are also a valuable way to reason about state transitions, and they can provide much lower latency in a global distributed system.
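As one concrete example of the CRDT style of reasoning, here is a minimal grow-only counter (G-Counter) sketch: each replica increments only its own slot, and merge takes the per-replica maximum, so merges are commutative, associative, and idempotent, and concurrent updates converge without coordination.

```java
import java.util.HashMap;
import java.util.Map;

public class GCounter {
    private final String replicaId;
    private final Map<String, Long> slots = new HashMap<>(); // replica -> local count

    public GCounter(String replicaId) { this.replicaId = replicaId; }

    // Each replica only ever bumps its own slot.
    public void increment() {
        slots.merge(replicaId, 1L, Long::sum);
    }

    // The counter's value is the sum across all replicas' slots.
    public long value() {
        return slots.values().stream().mapToLong(Long::longValue).sum();
    }

    // Merging a peer's state is safe to repeat or reorder: take per-slot max.
    public void merge(GCounter other) {
        other.slots.forEach((id, n) -> slots.merge(id, n, Math::max));
    }
}
```

Because merge is idempotent, replicas can gossip state to each other as often as they like with no locking and no risk of double counting.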
[1] https://netflixtechblog.com/introducing-netflixs-key-value-d...
[2] https://github.com/Netflix/EVCache/blob/11b47ecb4e15234ca99c...
[3] https://www.infoq.com/articles/netflix-global-cache/
[4] https://netflixtechblog.medium.com/cache-warming-leveraging-...
[5] https://netflixtechblog.com/evolution-of-application-data-ca...