Author here. Pleasantly surprised to see the article here.
Some context behind the article. I studied CRDTs for a few months and noticed that different CRDT designs use logical clocks in different and clever ways, and I hadn't seen anyone walk through all of those uses in one place. My attempt with this article was to dredge up those flavors of logical clocks into one article and give them names for future reference.
(To respond to a couple of other comments, I ignored atomic (and gps-based) clocks in this discussion, as indicated in my footnote 3).
> Conflict-free replicated data types (CRDTs), which are cooperating data structures that can mutate data while being disconnected from each other, use logical clocks as their foundation to preemptively deal with conflicts.
Not all CRDTs use logical clocks. You can implement a counter CRDT, for instance, without having to express when in time each edit occurred. It doesn't matter if +1 happened before -3. Both of those events just need to be added together in the end.
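For concreteness, here's a minimal sketch of a state-based PN-counter in Python (names are illustrative, not from any particular library). The point is that merge is entry-wise max, which is commutative, associative, and idempotent, so replicas can apply and merge updates in any order and still agree on -2 for the +1/-3 example:

    # Minimal sketch of a state-based PN-counter (illustrative names).
    # Each replica tracks per-replica increment and decrement totals;
    # merge takes entry-wise maxima, so delivery order and duplication
    # don't matter.
    class PNCounter:
        def __init__(self, replica_id):
            self.replica_id = replica_id
            self.incs = {}  # replica id -> increments originated there
            self.decs = {}  # replica id -> decrements originated there

        def increment(self, n=1):
            self.incs[self.replica_id] = self.incs.get(self.replica_id, 0) + n

        def decrement(self, n=1):
            self.decs[self.replica_id] = self.decs.get(self.replica_id, 0) + n

        def value(self):
            return sum(self.incs.values()) - sum(self.decs.values())

        def merge(self, other):
            for rid, n in other.incs.items():
                self.incs[rid] = max(self.incs.get(rid, 0), n)
            for rid, n in other.decs.items():
                self.decs[rid] = max(self.decs.get(rid, 0), n)

    # +1 on one replica, -3 on another: merging in any order gives -2.
    a, b = PNCounter("a"), PNCounter("b")
    a.increment(1)
    b.decrement(3)
    a.merge(b)
    assert a.value() == -2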
Nice write-up! Have you looked at conflict-free replicated relations?
We have been discussing Lamport clocks and CRDTs in the context of building local-first applications in our LocalFirstWeb discord https://lfw.dev.
Causality approximated with temporal proximity is a nice way of putting it. I've been trying to wrap my mind around an append-only DAG to represent user actions, so that multi-account, multi-user, and multi-device experiences can run offline and reach an eventually consistent state when synced.
I have implemented a JSON-like CRDT (a tree) using this nice tutorial [1], but never an append-only DAG.
I'm curious how two edges meeting the same vertex will make it difficult to resolve the conflict automatically? Is it that two users will create the same vertex (concurrently) with slightly different attributes, and attribute-merging requires human-in-the-loop?
Of course if the DAG is not append-only, but supports, say, "moving" of edges/branches, it is a totally different problem. Martin has something to say about this [2].
> what will happen [if] we pick minimum values of the corresponding vector clock entries? The resulting vector clock will describe a point in (logical) time seen by both emitters. If we'll fold this minimum over all most recent timestamps received from all replicas, what we get is the timestamp that describes the point in time seen by everyone - so called stable timestamp.
>
> This timestamp serves us as a check point - since we know that all events prior to it have been acknowledged by all replicas.
this is not correct, as there is no reliable definition of "all replicas" -- nothing ensures that all replicas in the system are represented in the accounting for a given entity
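To make the quoted fold concrete, here's a sketch with made-up values; note it bakes in exactly the assumption objected to above, a fixed and fully known replica set:

    # Sketch of the min-fold from the quote. Illustrative only; it
    # assumes `latest` really does cover all replicas in the system.
    from functools import reduce

    def entrywise_min(a, b):
        return {k: min(a.get(k, 0), b.get(k, 0)) for k in sorted(set(a) | set(b))}

    # Most recent vector clock received from each replica, ours included.
    latest = {
        "r1": {"r1": 7, "r2": 3, "r3": 5},
        "r2": {"r1": 6, "r2": 4, "r3": 5},
        "r3": {"r1": 7, "r2": 2, "r3": 6},
    }

    stable = reduce(entrywise_min, latest.values())
    print(stable)  # {'r1': 6, 'r2': 2, 'r3': 5}
    # Everything at or before this point has been seen by every replica
    # in `latest`, so it can be checkpointed or garbage-collected -- but
    # only if `latest` truly represents "all replicas".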
> CRDT: Replicated state machines allow for generic fault-tolerant systems. They do this by reaching consensus on appends to a command log. But if the commands are commutative, they could be applied in any order. There is a family of optimizations on Paxos for concurrent commutative commands, called Generalized Paxos. In the extreme case, if all operations are commutative, then you have this nice situation where you don't need consensus at all, and replicas can converge on the same state, eventually. These data structures with commutative operations are called CRDTs.
My thoughts on how clocks, sync and CRDT are related. Not an authority on the subject, so could be quite wrong.
FWIW, at this point, the initialism has kind of become the name. It's like POSIX -- I doubt a lot of people know what it stands for, but it doesn't matter, because there's only one POSIX and it's the portable operating system interface.
Not expanding posix will get you a lot fewer blank-eyed stares than not expanding crdt (not that the expansion is particularly useful to someone not already in the space, but it's better than nothing)
Apropos, there was an initialism that had "commutative" in it, was that also crdt or something else?
Curious to know whether, in your opinion, the use of verifiable delay functions and proof of history represents a more reliable and scalable approach to decentralized time-ordering.
Need more time to think this through. A few comments:
1. The problem space being solved in proof-of-history needs special mention. It assumes byzantine faults. (My article does not).
2. If I understood it correctly, the idea behind proof-of-history is that you perceive the flow of time in "windows". Each window is linked to the state of the previous window (ala blockchain) and has enough randomness built into it that predicting window attributes is a very-low-probability case. When a new event is generated you declare that event to be part of a certain window. You could not have fraudulently backdated your event because the future windows already considered the original future window state (i.e., the state before you inserted your event). You cannot future-date your event because you cannot predict the randomness.
3. At the outset, this is a clever idea. But you mentioned "scalable", I wonder how you would deal with the order of events that are rightfully binned to the same window. Wouldn't you end up with "concurrent events" and find yourself back at square one?
4. At some level, this design can be classified as a causally consistent system. In a non-Byzantine world, causally consistent systems can tolerate network partitions. But in this world, you assume the systems have access to the windowing system of time flow.
Apologies if I grossly misinterpreted the article.
Yes, 1 seems to be the most important difference, with 4 as one of its most important consequences. It's definitely intended to be a solution to Byzantine faults (or double payments as they would show up in practice if the transactions are financial!).
I think your summary in 2 is accurate. The answer to scalability (3) seems to be that verifiable delay functions are always costly to solve the first time, but can be checked in parallel -- like a maze that takes time to solve, but whose solution can be checked instantly simply by looking to see whether it crosses any walls.
It seems to me like a reasonable technological solution to time-ordering transactions in a decentralized ledger subject to Byzantine faults, but I'm not an expert here. It's at the core of the Solana layer 1 blockchain. Solana took it on the chin with the FTX collapse because FTX had chosen Solana as its main exchange and made a heavy investment in its ecosystem. Wondering whether that was an exogenous shock that makes Solana a relatively good investment opportunity at the moment. The capability of doing verifications in parallel means verification of 100k transactions per second, which is on par with what the credit card companies can do.
net TPS for a distributed BFT system is going to be bottlenecked by the cost of information exchange between participating nodes, which is a function of protocol design and, ultimately, network bandwidth/latency
any basic server-class hardware can saturate normal network capabilities with no special effort: a single core and a few hundred megabytes of system memory are all you need to push 1Gbps, probably even 10Gbps
but solana defines hardware requirements for validator nodes https://docs.solana.com/running-validator/validator-reqs that stipulate, at minimum, 12 physical cores, 128GB of RAM, and independent PCIe NVME SSDs for specific types of persisted state
even accounting for the additional work you need to do in a trustless/open/BFT/etc. protocol, these requirements are orders of magnitude beyond what should be necessary, and pretty clearly indicate that solana is very much at the "naive prototype" stage, technically
also, fwiw, solana is not doing the claimed theoretical maximum ~700k TPS, or even 100K TPS, but just 3k TPS, as reported by https://explorer.solana.com
Can't speak for op, but can speak as just an observer - probably not. Blockchains are ultimately CP systems, not AP. From that link - "Data can be inserted into the sequence by appending the data to the previous generated state. The state, input data, and count are all published. Appending the input causes all future output to change unpredictably."
Meaning there is a single source of truth. That single source is decentralized (because blockchain), but it still requires a quorum to agree to accept a bit of data, and to treat the new state as the valid state. I.e., multiple competing parallel writes must be serialized across a network of computers, which is always more expensive than if those competing writes could proceed independently in parallel.
Lamport clocks etc. are AP. That model fully expects nodes to make decisions autonomously and to synchronize after the fact. And data can be lost without anyone realizing it.
(The above is all a bit of a simplification, but fundamentally, they're solving different problems)
Blockchains as commonly deployed seem to be more AP to me: you can get into split-brain and while there's a way to select the canonical form after the partition is healed, that's after the fact. Whereas e.g. Paxos won't even let you do anything without knowing it has quorum.
alas, because blockchains allow open participation, there is no reliable definition of split-brain, or network membership, or anything at all, really
because there is no canonical definition of the set of nodes which represent a source of truth for the state of the blockchain
the blockchain state is a function of game-theoretical (read: unreliable and non-deterministic) algorithms that assume the result provided by a majority (typically 2/3) of participating nodes is valid
they are I guess CP based on these assumptions (and others), but those assumptions are probabilistic, they absolutely can be violated, and those violations create forks of the chain which invalidate even the weakest consistency models
and nobody seems to care
the notion that blockchains are "trustless" is a complete fiction, all state necessarily exists in the context of some trusted authority, decentralization just means that this authority is not well defined, arbitrary from the perspective of its downstream consumers, unaccountable
Yeah, I should probably say "they're treated as CP systems", in that they purport CP guarantees, when in reality they adhere to almost none of them (but, they also adhere to none of the traditional expectations of an AP system, i.e., a way to heal partitions and restore some form of consistency).
> However, because of clock drifts and/or assumptions around network time delays, timestamps from conventional clocks are not always mutually comparable, and therefore events cannot be reliably ordered using timestamps from conventional clocks.
This isn’t quite right. CockroachDB and Yugabyte both use conventional clocks to reliably order events. Spanner uses GPS to do so (conventional timestamps, but maybe not technically a conventional clock).
* Spanner creates very tight bounds on the clock synchronization (through atomic clocks/GPS) and ... just waits out the length of the bound.
* CockroachDB seems to do lamport-clocks-with-real-timestamps for linearizability (track the highest seen timestamp for causality chains). For preventing consistency violations with reads, they also track the bounds on the clock and potentially attempt to read again.
So they agree with the overall message (don't just compare conventional timestamps) but work around it using the known uncertainty of the clocks they have available.
My professor used to introduce logical clocks with "as we usually can't use atomic clocks on all nodes" because of that.
That's an interesting link; it leads to a 1991 Liskov paper, which might seem out-of-date, but apparently was very foundational to the whole concept. A little searching turned up this modern discussion of it:
Seems the bottom line is: "Since clock synchronization can fail occasionally, it is most desirable for algorithms to depend on synchronization for performance but not for correctness."
Any code that reads get_atomic() and then acts on the result is really broken if you have to wait for garbage collection, wait for your process/thread/coroutine/... to be scheduled, wait for a nearly invisible pause as a cache line is refreshed, .... The time you're using the atomic timestamp is at some future moment, and if that delay matters for your algorithm you need to track your uncertainty just like Spanner/Cockroach do (you just get significant benefits because the uncertainty is low).
If your get_atomic() isn't also capturing or resetting an internal hardware timer atomically, it's kinda broken code anyway. It's important to remember that you're getting data from the firmware of an atomic clock (or GPS receiver), and that they will go to great lengths to compensate for latency variations, so not doing it in the sampling code would be nuts (or just not using isochronous comms). Then the local clock/timer can be calibrated from that, and you can have near clock-cycle-correct timing. Typically they only send out 1 pulse per second, so anything finer than that is locally interpolated. We used 1ms internal intervals with an accuracy of ~1ppm.
Last time I worked on something like this was 20 years ago, but it was a Pentium box. The clock firmware converted a 5MHz sine wave output to 1PPS, but could also output a synthetic UTC over LAN in cases where there was no other connectivity. I guess I'm saying this is really specialized code for whatever hardware/OS you're using.
All I was pointing out is that worst-case latency on a throughput-optimized OS can be unbounded, and it's not totally unexpected to see 1ms+ delays between those two lines of code, even if the clock implementation is flawless. Even in a RTOS you have jitter from cache misses and whatnot, just to a lesser degree. Code sensitive to clocks not being correct is often sensitive no matter how small the deviance is, and smaller deviations simply make bugs less frequent. Treating that point-in-time estimate as anything other than an estimate from the recent past can lead to code that looks flawless but occasionally breaks.
The GPS clock is really used to provide long-term stability for the computer's internal oscillator, which provides fine short-term stability. Your calls to gettimeofday() are handled by the internal oscillator, not by asking the GPS unit directly. Additionally, NTP daemons are impressive in their ability to get the "right answer" on current internal oscillator frequency as affected by the entire system. With 86400 very accurate data samples per day, you can do a lot!
In a distributed system, the absolute value of the time is important, since that's what your transaction timestamps are based on. There is a lot to do to ensure that your time offset from UTC ends up being correct; antenna cable length compensation, oscillator quantization compensation, etc. These are not strictly necessary for Spanner but can make microsecond-level differences which are quite noticeable.
"The API directly exposes
clock uncertainty, and the guarantees on Spanner’s timestamps depend on the bounds that the implementation provides. If the uncertainty is large, Spanner slows down to
wait out that uncertainty. Google’s cluster-management
software provides an implementation of the TrueTime
API. This implementation keeps uncertainty small (generally less than 10ms) by using multiple modern clock
references (GPS and atomic clocks)."
Basically, there is some tradeoff between clock synchronization perfection and transaction processing speed that can be made. You can build dedicated hardware that's clocked by a GPSDO and provide an API to get hardware timestamps, and maybe process transactions a little faster. Your good old C++ program running on Linux with an internal oscillator adjusted by NTP with a GPS PPS input still beats communicating between datacenters to agree on event ordering, though. Light is slow!
> In a distributed system, the absolute value of the time is important
Only for a co-variant data cohort is total temporal order necessary for correctness. It doesn't need to be system wide. Distinct (data independent) process groups can be partially ordered.
There really is no such thing as "now" outside of a shared frame (or intersecting frames) of observation.
FYI the general technique Cockroach is using is commonly called Hybrid Logical Clocks, though Cloudera call their very similar idea Virtual Time as I recall.
> perfectly synchronized clocks are a holy grail of sorts for distributed systems research. They provide, in essence, a means to absolutely order events, regardless of which node an event originated at
two events that occur on two physically discrete nodes at exactly the same time have no well-defined order, even if their clocks are synchronized
> before a node is allowed to report that a transaction has committed, it must wait 7ms. Because all clocks in the system are within 7ms of each other, waiting 7ms means that no subsequent transaction may commit at an earlier timestamp, even if the earlier transaction was committed on a node with a clock which was fast by the maximum 7ms. Pretty clever.
so sending information between two antipodes on earth takes 66ms in one direction, 132ms round-trip, minimum. the speed of light dictates this lower limit
if two transactions are made on those two nodes at exactly the same time, there is no objective order between the two. you can choose an order, but only with knowledge of both transactions, and that information physically cannot traverse space-time in 7ms
so it's really not clear to me how this works. as an optimization, sure -- maybe there is some optimistic concurrency control that will fail transactions that conflict in the way i've described?
yes but those guesses have "smudge windows" which are functions of the physical distance between relevant nodes
if you want to make assertions about order between events from a set of nodes, and you want to use per-node physical clocks to determine that order, then even if those clocks are perfectly synchronized, order can only be decided when the light-cones of all nodes intersect, which is the maximum distance between any two nodes times the speed of light
7ms works for up to 2098km, that's in the best case
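(Quick back-of-the-envelope check of those figures, for anyone who wants to verify the arithmetic:)

    # speed of light in km per millisecond
    c_km_per_ms = 299_792.458 / 1000
    print(7 * c_km_per_ms)       # ~2098.5 km: one-way reach of light in 7ms
    print(20_000 / c_km_per_ms)  # ~66.7 ms: one-way time between antipodes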
Strictly speaking, atomic clocks by themselves wouldn’t fully solve the problem either, because they tick at slightly different rates in different locations, due to variations in the gravitational field.
This is why reading the original Lamport paper is useful. This distributed nature of things isn’t due to bad clocks or software or anything. It is an irreducible part of the physics of our world, with relativity. Perfect clocks won’t help with space like separations and different velocities or gravitational force. You have to build communication into it. The causality relationship is inherently a partial order.
This isn’t really true: it’s not got anything to do with relativity specifically, it’s just because of the finite and varying time for messages to travel between points in a distributed system. So it’s about physics, but not about relativistic physics. A Newtonian universe with a finite speed of light would have the same phenomenon.
That said, the analogy with causality in relativity and light cones etc is useful, and it’s no coincidence that Lamport wrote a book on general relativity before he turned to distributed systems.
What makes relativity special is that everybody agrees on the speed of light. So Alice is passing Bob at say 0.1% of c, she detonates a big firecracker or other bright sudden event, the light from that event forms a big bubble that expands away from her in all directions at speed c.
If you imagine that light is like sound, that it propagates through a medium called the luminiferous ether, then maybe Bob is at rest relative to the luminiferous ether. If so, he thinks that the bubble is centered on a fixed point in space, and Alice is just a smidge closer to one of the sides of the bubble than she is to the other. Alice, on this account, sees the center of the bubble drifting “backwards” relative to her motion and agrees that she is closer to one edge than the other. If they were right next to each other when the light burst went off, Alice agrees that Bob is the true center of the light bubble.
Special relativity says that there is no ether: Bob thinks that this expanding light bubble is indeed expanding from a fixed point and that Alice is closer to this side than that, and he's right... but Alice thinks that this bubble is expanding out uniformly with herself at the center, and she is also 100% right. According to her, Bob is closer to one side of the bubble than the other, and Alice is the true center.
This has a bunch of consequences. The first one is that it makes the Zeno paradox into a real life thing. To outrun a light beam, you first have to accelerate to half the speed of the light beam, at which point if you look to see how fast it's going away from you, it is receding at speed c still, so you accelerate again to half the speed of the light beam and it is still moving at speed c away from you, and you can never catch up to it. Bob watching this must agree that Alice never catches up to the leading edge, even though he sees her dumping tremendous amounts of energy into her motion. So nobody can accelerate faster than the speed of light, it takes infinite energy to get there. (The converse of this has now become engineering reality, all of our particle accelerators dump huge amounts of energy into small particles and operate under the simplifying assumption that they all max out at speed c.)
The Newtonian universe with finite speed of light is the first situation, the relativistic universe is the second.
No, a Newtonian universe with finite speed of light is consistent. People were able to do physics in that model for two centuries. Special relativity flows from the much stronger and more radical assumption that the speed of light is constant in all frames of reference, which is what the Michelson-Morley experiment showed to be true in our universe.
there is no sense of simultaneity that crosses coordinate systems. there are three relationships between events a and b: a is before b, b is before a, and neither. to get a total ordering, you have to arbitrarily and non-physically pick a or b.
in that Newtonian universe, good-enough clocks and timestamps could be used to break the tie. good-enough clocks will never be enough in our universe: there is no canonical way to say which of two events separated in a spacelike way is first. in a newtonian universe you can.
When a system overheats the CPU can speed up, the clocks drift, and if you're producing volumes of events you will get corruption. Wouldn't ever use a system clock... You can write a hybrid logical clock in about forty lines of code that can scale as much as these wildly expensive appliances that these big tech companies shill.
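A rough sketch of such a hybrid logical clock, in the spirit of the Kulkarni et al. HLC paper; this is illustrative only, not CockroachDB's actual implementation, and time.time_ns() stands in for the local physical clock:

    import time

    class HLC:
        def __init__(self):
            self.l = 0  # logical time: max physical time observed so far
            self.c = 0  # counter to break ties within the same l

        def now(self):
            """Timestamp a local or send event."""
            pt = time.time_ns()
            if pt > self.l:
                self.l, self.c = pt, 0
            else:
                self.c += 1
            return (self.l, self.c)

        def recv(self, remote):
            """Merge an (l, c) timestamp carried on a received message."""
            rl, rc = remote
            pt = time.time_ns()
            if pt > self.l and pt > rl:
                self.l, self.c = pt, 0
            elif rl > self.l:
                self.l, self.c = rl, rc + 1
            elif self.l > rl:
                self.c += 1
            else:  # same l on both sides
                self.c = max(self.c, rc) + 1
            return (self.l, self.c)

    # (l, c) pairs compare lexicographically, preserving causality even
    # when the physical clocks involved drift relative to each other.
    clock = HLC()
    t1 = clock.now()
    t2 = clock.recv((t1[0] + 5_000_000, 0))  # message from a clock 5ms fast
    assert t2 > t1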
It really depends on how you define "reliability".
If a system never orders events wrong but immediately stops processing events when enough nodes have their clocks out of sync, which is the case with all systems using conventional clocks, can it still be considered reliable?
I suppose you can always order events by the time it takes for light to (physically or hypothetically) travel to a central server plus the server's local time.
Huh, this is a really nice writeup of the logical time part of the foundations of distributed systems lecture I used to assist in. I always wondered how much they are actually used in real systems.
I haven't seen basic Lamport Clocks used much but sequence numbers and various kinds of vector clock are widely used in eventually consistent systems (fashionable to call this CRDT nowadays).
Edit: perhaps git uses a kind of Lamport clock, but with linked lists of hashes rather than numbers as the values.
Right. The flavors of Lamport clocks I stated in the article are used in CRDTs designs I studied.
While CRDTs are eventually consistent, I wouldn't dismiss them as such without qualification. They are causal-consistent when offline and sequential-consistent when online. (This duality is why CRDTs have been hard for me to wrap my head around.)
Indeed, certain properties of CRDTs are invariant to network state. However, it is worth pointing out that in ops-based CRDT “implementations”, you deal with the local-ops case differently from the remote-ops case. That is, while the properties are invariant, how you produce them is different.
So I was on a quest to understand the “essence” of CRDTs, not just understand them well enough to practically use them. Atomic broadcast and Raft were easy enough for me to wrap my head around (although quite challenging to implement). But not CRDTs.
I found common statements about CRDTs that they ensure ops to be commutative to be superficial. A slightly deeper characteristic was that ops-based CRDTs are just causally-linked ops (aka causal tree). But what about state-based? Finally, when I realized CRDTs are dual-consistent and that’s what makes any data structure a CRDT, that was a moment of epiphany for me.
so an important "eureka" observation about CRDTs is that the op-based model is theoretical, useful for proofs insofar as any op-based CRDT can be translated to a state-based CRDT, but not something that can actually exist in practice
all practical CRDTs are state-based CRDTs
that's because it's not possible to assert a causal order for arbitrary operations on arbitrary data structures (in useful contexts)
CRDTs don't _ensure_ ops are commutative (and associative, and idempotent) but rather they _require_ that ops are commutative (and associative, and idempotent)
and it's definitely not the case that any data structure is a CRDT, it's possible to translate many data structures to CRDTs, but that translation rarely preserves the operations in full fidelity
> CRDTs don't _ensure_ ops are commutative (and associative, and idempotent) but rather they _require_ that ops are commutative (and associative, and idempotent)
I disagree. You can create a CRDT flavor of data structure whose ops are not commutative. For example, a Set's add and delete operations. These are not commutative. You cannot switch the order of the ops for meaningfully processing them. However, you can create a CRDT Set. You do that by adding metadata to the ops, and having the instances always process them in the only order that makes sense even if such instances receive the ops in a different order. In that sense, you are "ensuring" ops are behaving like they are commutative and not "requiring" them to be so.
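A sketch of one way that metadata trick can look, along the lines of an observed-remove set; names and shapes are illustrative. Adds carry a unique tag, a remove names the tags it observed, and replicas converge no matter the order in which ops arrive:

    import uuid

    class ORSet:
        def __init__(self):
            self.entries = {}        # element -> set of live tags
            self.tombstones = set()  # tags known to be removed

        def add(self, elem):
            tag = uuid.uuid4().hex
            self.entries.setdefault(elem, set()).add(tag)
            return ("add", elem, tag)      # op to broadcast

        def remove(self, elem):
            tags = self.entries.pop(elem, set())
            self.tombstones |= tags
            return ("remove", elem, tags)  # op to broadcast

        def apply_add(self, elem, tag):
            if tag not in self.tombstones:  # the remove may arrive first
                self.entries.setdefault(elem, set()).add(tag)

        def apply_remove(self, elem, tags):
            self.tombstones |= tags
            live = self.entries.get(elem, set()) - tags
            if live:
                self.entries[elem] = live
            else:
                self.entries.pop(elem, None)

        def contains(self, elem):
            return elem in self.entries

    # Tombstones grow without bound in this toy; real designs
    # garbage-collect them, e.g. using the stable-timestamp idea
    # discussed elsewhere in this thread.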
> it's definitely not the case that any data structure is a CRDT
I could have worded my statement better. I meant any data structure that has the aforementioned duality property is a CRDT. Not that any data structure unconditionally can be translated into a CRDT.
> an important "eureka" observation about CRDTs is that the op-based model is theoretical,
I do not understand your statement. Perhaps you could elaborate. My understanding is that a CRDT is op-based or state-based depending on what is "communicated" between the instances. If ops are communicated, then it is op-based CRDT, whereas if states (or delta-states) are communicated, then it is state-based.
At least in that sense, op-based model is NOT theoretical. Perhaps you have a different point in mind that I fail to observe.
> > CRDTs don't _ensure_ ops are commutative (and associative, and idempotent) but rather they _require_ that ops are commutative (and associative, and idempotent)
> I disagree. You can create a CRDT flavor of data structure whose ops are not commutative
i think it's important to reiterate (for other third-party readers) that this is factually incorrect
a data structure with operations that are not commutative (or associative or idempotent) is not a (state-based) crdt by definition
> I disagree. You can create a CRDT flavor of data structure whose ops are not commutative. For example, a Set's add and delete operations. These are not commutative. You cannot switch the order of the ops for meaningfully processing them. However, you can create a CRDT Set.
set add (union) is commutative, set delete is not. so a set with only the add (union) operation is a CRDT, but a set that supports both add and delete is not a CRDT, at least not without very specific caveats.
you can create an add-only CRDT set, or an add-remove CRDT set, or a variety of other CRDT sets that support specific operations with specific caveats. but you can't create a CRDT set that is a fully-fledged set, with all the operations that sets generally provide.
> You do that by adding metadata to the ops, and having the instances always process them in the only order that makes sense even if such instances receive the ops in a different order.
so this is the crux of the issue, I think -- "having the instances always process [ops] in the same order" is basically not possible in any real-world network
first, because it's not possible to decide what the "correct" set of ops actually is, nodes are always subject to partitions, faults, etc. which prevent reliable dissemination of knowledge, and plus all the stuff about light cones and etc.
second, because "the same order" implies a specific total ordering of events, which is unknowable (see prior comments)
> you are "ensuring" ops are behaving like they are commutative and not "requiring" them to be so
converting a non-commutative operation to a commutative operation in a distributed system requires reliable delivery, which no network provides
the whole point of CRDTs is that they give you a formally strong version of eventual consistency that holds even in the face of (unavoidably) unreliable delivery
> My understanding is that a CRDT is op-based or state-based depending on what is "communicated" between the instances. If ops are communicated, then it is op-based CRDT, whereas if states (or delta-states) are communicated, then it is state-based.
this is true in the abstract, but the issue is that "what is communicated" is not a given, it's subject to the choices you make when you encounter a network fault -- or in the CAP model, a partition
CRDTs are tools for eventually consistent (AP) systems, which means that you have to keep making forward progress if there are partitions, which means that message delivery between nodes is not reliable, it can always fail
for state-based CRDTs if you fail to deliver a message it's fine, the information in that message is not lost forever, it will be included in the next message, and (if the partition is eventually healed) the ultimate state will converge. this is also true for delta states
but for op-based CRDTs if you fail to deliver a message it's not fine, the information in that message is lost forever, it won't be included in the next message, and (even if the partition is eventually healed) the ultimate state will not converge
> "having the instances always process [ops] in the same order" is basically not possible in any real-world network
By having 1) causal order (e.g. using what the article refers to as a Lamport Causal Clock) and 2) a deterministic sorting function to sort ops that happened concurrently (from the perspective of causal order), we can derive a total order (see the sketch below).
It’s absolutely possible and used.
And with those two properties, almost any data structure can be turned into a (op-based) CRDT.
That is to say, thoughtlede has it correct in their comments above.
We don’t assume “reliable delivery” in AP or eventually consistent systems. We assume “once all messages have been delivered…”
So if you have all messages and the two properties above, a total order can be derived.
You’re correct to say that causal order != total order as such but with the use of correct primitives, like Lamport Causal Clocks, we can get a nice and clean linear order of events :)
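A sketch of those two properties, assuming each op already carries a Lamport timestamp and an origin id (that per-op metadata is the assumption here; field names are illustrative):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Op:
        lamport: int   # Lamport clock value at the originating replica
        origin: str    # replica id, used only to break ties
        payload: str

    def total_order(ops):
        # Causally related ops already differ in their lamport values;
        # concurrent ops are ordered arbitrarily but identically on
        # every replica, by origin id.
        return sorted(ops, key=lambda op: (op.lamport, op.origin))

    # Replicas holding the same set of ops, received in different
    # orders, compute the same sequence.
    ops = [Op(2, "b", "set x=blue"), Op(2, "a", "set x=green"), Op(1, "a", "init")]
    assert total_order(ops) == total_order(list(reversed(ops)))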
"correct primitives" do not by themselves provide a linear order of events
a
/ \
b c
\ /
d
b and c are concurrent updates, how do you resolve d?
it's a trick question, you can't resolve d without information loss, unless you bring in additional knowledge from the application
you can resolve d with information loss by defining some heuristic for ordering concurrent updates, a.k.a. last-writer-wins, basically picking one of those updates deterministically
that gets you a total order, but it's cheating: whichever concurrent updates you don't choose are lost, and that violates consistency for any replica(s) that are based on that state
> "correct primitives" do not by themselves provide a linear order of events
Review the description of Lamport Causal Clock in the article. Note that it carries “additional info” (additional to the example diagram). This “additional info” is what establishes the structure needed for total order.
> whichever concurrent updates you don't choose are lost, and that violates consistency
They’re not lost! The concurrent updates not chosen are still part of the “list of operations”, but the update to the data/value itself may not be observable if the subsequent update (the update that was sorted to be first) updates the same data/value (eg. both operations update the same key in a key-value structure). If the two operations update different data/value, then both updated values are observable. This isn’t cheating, rather it works exactly as expected: it is eventually consistent.
we are only talking about updates to a specific value here, obviously updates to independent values are trivial to resolve
it's possible to construct a CRDT such that concurrent updates are merged without data loss to a single "list of operations" maintained in the object, but that's not true in general
resolving conflicts with the lww strategy, or variants of that strategy that order concurrent events by e.g. node ID, is indeed eventually consistent at the highest level, but it provides no meaningful consistency guarantees to users, because it allows "committed" writes to be lost
Can you elaborate on what you mean by this? I was arguing that it’s possible, as the original argument was “this is not possible in a real system and is only theoretical”.
> provide no meaningful consistency guarantees to users, because they allows "committed" writes to be lost
If I set the (shared) value to green and you set it to blue, what is the expected observed value? What if you set it to green and I set it blue, what is the observed value? More importantly, what is the consistency that was lost?
so there are multiple threads of conversation here
first, there is no straightforward way to model "a value" as a CRDT, precisely for the reason you point out: how to resolve conflicts between concurrent updates is not well-defined
concretely, if you set v=green and I set v=blue, then the expected observed value is undefined, without additional information
there are various ways to model values as CRDTs, each has different properties, the simplest way to model a value as a CRDT is with the LWW conflict resolution strategy, but this approach loses information
example: say your v=green is the last writer and wins, then I will observe (in my local replicas) that v=blue for a period of time until replication resolves the conflict, and sets (in my local replicas) v=green. when v changes from blue to green, the whole idea that v ever was blue is lost. after replication resolves the conflict, v was never blue. it was only ever green. but there was a period of time where i observed v as blue, right? that observation was invalid. that's a problem. consistency was lost there.
--
second
> almost any data structure can be turned into a (op-based) CRDT.
yes, in theory. but op-based CRDTs only work if message delivery between nodes is reliable. and no real-world network provides this property (while maintaining availability).
> there was a period of time where i observed v as blue, right? that observation was invalid.
Not invalid. The observation was correct at that time and place, meaning, in the partition that it was observed. This is the P in AP.
It seems to me that you’re talking about and arguing for synchronization, that is, consensus about the values (=state), which takes the system to CP as opposed to AP.
> there is no straightforward way to model "a value"
I would recommend looking into the “Register” CRDT data structure.
there is no single definition of a crdt "register"
i know this because i've implemented many versions of crdt "registers", with different properties, at scale
there are lww registers and multi-value registers, the former provides deterministic (but arbitrary) conflict resolution which loses information, the latter provides deterministic and non-arbitrary conflict resolution without losing information but with a more complex API
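Illustrative sketches of the two flavors (for brevity, the multi-value version here keys concurrency on timestamp equality, where a real multi-value register compares vector clocks):

    class LWWRegister:
        """Deterministic but arbitrary: highest (timestamp, node_id)
        wins, and the losing write vanishes from history."""
        def __init__(self):
            self.value, self.stamp = None, (0, "")

        def set(self, value, timestamp, node_id):
            if (timestamp, node_id) > self.stamp:
                self.value, self.stamp = value, (timestamp, node_id)

    class MVRegister:
        """Keeps every concurrent write and makes the application
        resolve the conflict, at the cost of a more complex API."""
        def __init__(self):
            self.values, self.stamp = set(), 0

        def set(self, value, timestamp):
            if timestamp > self.stamp:
                self.values, self.stamp = {value}, timestamp
            elif timestamp == self.stamp:
                self.values.add(value)  # concurrent: keep both

    r = LWWRegister()
    r.set("green", 10, "you")
    r.set("blue", 10, "me")
    print(r.value)   # green -- the blue write is simply gone

    m = MVRegister()
    m.set("green", 10)
    m.set("blue", 10)
    print(m.values)  # {'green', 'blue'} -- the conflict is surfaced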
> The observation was correct at that time and place, meaning, in the partition that it was observed. This is the P in AP.
this is not what partition means
partition means that different subsets of nodes in a single system have different views on reality
if v=blue was true only during a partition, and is no longer represented in the history of v when the partition heals, then this value is not legal, violates serializability, is incorrect, etc.
> there are lww registers and multi-value registers
Use a single value register then in place where I said “register”.
> partition means that different subsets of nodes in a single system have different views on reality
Partition means that nodes of a (single) network are (temporarily) not connected.
In the example discussed here, blue and green were in different partitions thus had “different view” to the state. Once synchronized, their view on the state is consistent and both observe both values, blue being the latest state.
> is no longer represented in the history of v when the partition heals
Please review the above discussion again and the original article. You keep saying this but it’s shown in both that it is not true.
"last writer wins" does not preserve state from losing (earlier) writers/writes
after synchronization, all nodes share a consistent view of state (yes) but that state has no history, it only contains the net final value (blue) as the singular and true state
this is not complicated, i think you're out of your element
> but for op-based CRDTs if you fail to deliver a message it's not fine, the information in that message is lost forever, it won't be included in the next message, and (even if the partition is eventually healed) the ultimate state will not converge
Can't you just have every node keep a history of ops, and when nodes communicate with each other they can compare clocks to know which ops to re-deliver? We should also be able to enforce idempotency this way.
basically this makes peer connections stateful, but maintaining that state accurately is very difficult, especially when considering topology changes in the system
in fact if you can manage that state correctly, you've solved a problem that's roughly the same as the problem that CRDTs solve
(in other words, you're almost certainly not gonna solve that problem correctly)
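The grandparent's idea, roughly sketched; as noted above, keeping the per-peer state accurate across membership and topology changes is the genuinely hard part, which this toy ignores (it also assumes in-order delivery per origin):

    class Replica:
        def __init__(self, rid):
            self.rid = rid
            self.seq = 0
            self.log = []    # [(origin, seq, op)] in arrival order
            self.seen = {}   # origin -> highest seq applied from it

        def local_op(self, op):
            self.seq += 1
            self.log.append((self.rid, self.seq, op))
            self.seen[self.rid] = self.seq

        def missing_for(self, peer_seen):
            # Ops the peer hasn't acknowledged. Re-delivery is
            # idempotent because (origin, seq) identifies each op.
            return [e for e in self.log if e[1] > peer_seen.get(e[0], 0)]

        def receive(self, entries):
            for origin, seq, op in entries:
                if seq > self.seen.get(origin, 0):  # drop duplicates
                    self.log.append((origin, seq, op))
                    self.seen[origin] = seq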
I’ve been wondering off and on if the end game for borrow checkers is a language that can model causality in data operations.
This withdrawal was predicated on checking records A and B at point X in the timeline and approving it. Meanwhile auto withdrawal happened 20 ms later and that’s why we let him overdraft.
Active Directory is the most widely used multi-master distributed database in the world, and it uses update sequence numbers.
An update sequence number (USN) is a 64-bit number in Active Directory that increases as changes occur. Local counters on every domain controller assign USNs.
Whenever an object is changed, its USN is incremented. When replication occurs, only the version of the object with the greatest USN is retained.
Local counters for USNs are considered reliable because they never decrease or "run backward." USNs are also always unique, making it easy to ensure domain controllers never use the same USN at the same time.
The documentation that Microsoft puts out regarding its "multi-master" features basically advises against using "multi-master" mode, and instead leveraging the Flexible Single Master Operation (FSMO) roles to ensure consistency across the domain and forest, because its conflict resolution is not comprehensive [1]. FSMO roles were created decades ago as a transition from Windows NT's Primary Domain Controller (PDC) model, to avoid some of these complexities around changes in a large distributed system. It's a decent solution that has its own pros/cons, but AD is not something I would use as a good example of a multi-master distributed database.
FSMO roles are only used for the tasks which are not suited to multi-master replication. The vast majority of data can be mutated on any domain controller.
conflicts from concurrent modifications of the same object on different domain controllers are resolved first by local timestamp (unreliable) and then by GUID order (arbitrary), and in both cases it's possible for someone to make a change to an object that is reflected in local reads for a period of time, but then "erased" and forgotten after replication. tl;dr: this approach provides basically no meaningful consistency at all
my understanding is that conflict resolution is done by all domain controllers comparing the USN of each object and replicating the object with the highest USN.
lww is arbitrary, because (a) it's unknowable if some local state is actually valid, and (b) resolving conflicts is destructive
conflicts cannot be resolved in general without additional information from the application
in an lww system if a bit of data loses in an lww contest, that data was not just stale, it actually becomes invalid
if you set a=1 and then do a bunch of operations based on the idea that a=1, and i set a=2 and then do a bunch of operations based on the idea that a=2, and lww resolves that my a=2 wins over your a=1, then after that merge, your a=1 is not just stale, it's entirely invalid, erased from history. at no point did a ever equal 1. anything you did with that assumption is similarly invalid. and lww provides no way to manage the causality implications of that decision.
it's fine for many use cases but it's not a general-purpose solution.
Things like this are interesting for a couple of reasons. Synchronizing clocks in the network and even attaching the current time to packets is supported with the correct hardware (PPS). This is extremely reliable (when ordering is important, not necessarily for keeping time).
What makes solutions like this article interesting is that this is at the application level, while there are synchronization problems you can solve at the hardware level when you control the hardware. Since moving things to the cloud, it’s as though we’ve forgotten about these technologies that have been around for over 20 years.
I find most of this stuff is an attempt to apply a mental model of the universe that is not logically consistent with the way the universe actually works.
Basically, what I'm saying is if the universe can live without strict ordering of events as a distributed system, chances are whatever thing you are designing can also live without ordered events.
For some reason people always point to financial transactions, yet there's absolutely no ordering to financial transaction events. E.g., if I buy something on the weekend, usually my statement says I bought it on Monday; sometimes the transaction only becomes visible a day later, but it might still say it occurred on Monday even though it's Sunday.
Instead banks have a better solution, if your bank account goes below zero, they charge you extra money to go extra below zero.
Given Einstein proved time is relative and not absolute, I wonder at what time resolution all attempts to treat time as absolute fall apart. The maximum distance between two surface nodes on this planet is about 20,000 km, translating to about 67 milliseconds at the speed of information.
All discussion here assumes 17th-century Newtonian physics.
Much of the rule of law is based on cause and effect. Most of the people who get hung up on time stamps are trying to reconstruct a chain of events over time. And like a lot of concurrency problems, the solution seems to be to solve a different problem instead. Such as causal chaining.
We made an error here because we were aware of three facts but not aware of a fourth. The answer is wrong but the data is consistent. That’s important for finding bugs. It’s also important for establishing intent.
If I bonk you with a bat, it might be an accident. If I had just found out you’ve been sleeping with my wife, people should be asking a lot more questions. If I don’t know that, then there’s no causal link between these events.
Which happened first can be in the eye of a third-party observer versus a first party. It doesn’t matter which happened first; it matters what the observers saw and whether they can agree. The relativistic example here says that observers will never agree.
If an event that I didn’t observe happened first, it didn’t influence my decisions. It might influence yours though, but when things get grey we give benefit of the doubt.
If I communicate with you then that's enough to determine data causality, which is a weaker version of scientific causality. That does not mean that you necessarily shall do something as a result of my communication, so we shouldn't be talking in terms of you bonking someone with a bat as a result of receiving a communication.
In scientific causality, A shall cause B to the degree to which A is explanatory for B.
Most of us here are really trying to sort out bugs and data consistency issues. If I’m trying to figure out who deserves the last seat on the airplane or the last item on clearance or whether to charge you an overdraft fee, chronology doesn’t help that much, because I don’t really have one. I have a fiction that looks like one.
Time is not real, it is a measurement. Space is real. Einstein's time is a quantum model of change over space, but really only proves that where you stand is relative to where I stand, and our relativity changes if we move through space.
As I have just proven, Einstein's relativity is really a confabulation of what is obvious, resulting in people getting lost in that model universe. It is only useful for taking measurements within that model; everything else is worshiping light, the speed of.
> Time is not real, it is a measurement. Space is real.
In what particular way is space real that time is not?
> Einstein's time is a quantum model of change over space, but really only proves that where you stand is relative to where I stand, and our relativity changes if we move through space.
I think you might have misinterpreted the vocabulary: in the "principle of relativity", the word "relativity" refers to the relation between two coordinate systems, not the relation between two points in space.
> As I have just proven, Einstein's relativity is really a confabulation of what is obvious, resulting in people getting lost in that model universe.
All you have done is made assertions, without evidence or reasoning. You can certainly do that, but how is it supposed to prove anything? What is the argument?
1: The clock is real, the minute is abstract. The real and abstract are the essence of mathematical relativity, which is how we derive "irrational constants" like the relative measurement of one thing (radius) to another (circumference) making a ratio (PI). PI can never truly be defined, because it's not real: it can only be measured more or less accurately. If time and the clock were both the same with regard to being real, there would be nothing to measure, or no reason to measure: time would be the clock. But we know a clock tells time, which is a measurement: days into hours into minutes into seconds. If a day is time, and so is an hour, time is no more real than PI, which describes all radii to arbitrary precision. Time describes change of space over arbitrary intervals; it's a derivative. The lightyear is a further derivative of speed over time, which wouldn't make sense if time wasn't describing a distance of space, which is what a lightyear is.
2: More than one point in space makes a coordinate system, and more than one point is a prerequisite of relativity. (Source?) Plus, within the questionable physics models which allow for singularities, the notional difference between a coordinate system and a point is conspicuously broken in order to explain other things. But that's how we get vague terms like spacetime, descriptions for weird parts of the model.
3: I did prove it, and you can reproduce it yourself by going to the nearest mountain range and yodeling. In other words, I don't need math proofs from within abstruse models to prove my point, so my argument is more sound scientifically. I proved "relativity" is nothing more than understanding the difference in change over space, which is a very obvious/observable thing, perhaps the most basic of all material existence, comporting to the idea that no two things can be one, or "occupy the same space".
Time measures change over space, a thing you witness firsthand when your yodel returns in echo from the mountain. If space did not change from point A to point B, there would be no distance, no coordinate system, and our yodel would not echo from anywhere. Without change over space there is no time, so that tells you which one is primary. The measurement of the echo, taken in time units, aka intervals, gives useful information. It does not however indicate anything of some model's out-of-bounds parameters, like what if the mountain had infinite gravity?
The measure of an echo could not prove a black hole, even by the very definition of a black hole, for an echo would not return (infinite inches of time til return). Ergo no measurement can prove a black hole. Only a model taken to mathematical infinity "proves" a black hole, and that is not real science, it's just math. Math proofs and scientific falsity are not the same side of a single coin. The observation yields math, not the other way around, or else I could say look here in my notebook, it's a real black hole!
> Time measures change over space, a thing you witness firsthand when your yodel returns in echo from the mountain. If space did not change from point A to point B, there would be no distance, no coordinate system, and our yodel would not echo from anywhere.
I'm still not following what you mean. What is "space"? You claim it is real, in the same sense that a physical clock is real. Is it some kind of physical thing in the world? Is it the general idea that two objects can occupy different positions?
What does it even mean for "space" to "change"? Does it refer to the existence of different positions? Does it refer to the idea that a physical object can change which position it occupies? Why does the concept of distance depend on the existence of "change over space"? We can make the obvious observation of transmitting a sound to a distant object and receiving an echo back. But I don't see how to derive your concept of "change over space" from such an observation.
Further, why is "change over space" in particular the quantity that "time" measures? Normally, I'd think that "time" refers only to the intrinsic relationship between different events that allows us to place them in an ordered sequence. What does this necessarily have to do with "change over space"?
I don't mind; I'm mainly curious as to whether their ideas are self-consistent, more than anything else. Any coherent system of beliefs has to be either aligned with experimental evidence (perhaps with unconventional terminology), independent of experimental evidence (i.e., unfalsifiable), contrary to experimental evidence, or self-contradictory. It's an interesting exercise to tease out which one it is, yet often unsuccessful, if the other person gets tired of your incessant questions. (In the worst case, they simply refuse to let their beliefs cohere at all, falling back to unexplainable mysteries that it's your fault for not already understanding.)
Well, I'm afraid you beg the charity of your readers' minds when you post ideas contrary to popular belief on a public forum. If begging others' charity is something to be avoided, then the alternative is to keep your ideas to yourself.
"Rational discourse" is the demand of people who cannot grok and need it explained within their epistemology. It is pointless, for they bring no conviction but disbelief and biases; and it is endless because it gives them not food for thought but "rations for argument", which is all their lazy intellect desires.
Spacetime is a colloquial term for a derivative calculation or projection which only applies within the model. It has also not proven very germane as a unit, or it would be used more like a unit, rather than as description of a picture. The term seems to relate to the rendering of space in the model via geometry, where the model is based on singularities, which are the antithesis of space, and so is the model. We experience space, but we only find singularities from within a model.
That we pretend the relative speed of light is constant, and that this proves black holes and big bangs, is pseudo-science taking a model beyond its working boundaries, where its calculations have no meaning; to wit, there is a physical analogue in optical aberration, where the rim of the lens refracts asymptotically to the point of unreliable measurement (not a singularity, unless of course you model it as such, in the abstract).
The real question is why so much energy is used staring at model aberration. The "Uncertainty paradox" has everyone believing the model is real, and even asking the question "at what point does the model become real". The model is never real, and never becomes real; it stays a model forever. Whether the cat is dead or not is simply a measurement where, if you mistake the paradox for theory, you don't realize the probability of dead or alive given decay is nothing more than a measurement taken ex post facto, and not proof that the real world accords with the model.
> However, because of clock drifts and/or assumptions around network time delays, timestamps from conventional clocks are not always mutually comparable, and therefore events cannot be reliably ordered using timestamps from conventional clocks.
I'm not saying this distributed logical clock work isn't worth doing, but some combination of a low-stratum NTP server and PTP is good enough for most people, I would think. Cloud solutions are mixed, though.