
> Why replicate data at the VM/disk level when those disks are already provided as a fully redundant system?

Azure disk replication is for the durability of the data, not the availability of the data from a Kafka perspective.



Many systems might just be using Kafka to drive async batch transaction processing (think: sending emails, charging credit cards), and therefore don't care at all about availability.


From a read perspective I would agree, but from a write perspective, what happens if the partition is not available? (genuinely asking)


Think Unix pipelines: writes that can't immediately complete block the producer. (Probably with a bit of an in-memory buffer, but still.) Once enough un-ACKed messages pile up, the producer stops consuming from its own input end until the consumer on its output end comes back and starts pumping messages again. The whole pipeline from the blocked stage back to the start receives backpressure and temporarily stalls out — which is fine, because, again, async batch processing.
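
To make that concrete, here's a toy Go sketch of the blocking behaviour (the message names, buffer size, and timings are purely illustrative): the producer's send stalls as soon as the bounded buffer between it and a slow consumer fills up.

    // Toy sketch of pipeline backpressure via a bounded channel.
    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // Bounded buffer between producer and consumer: once it holds 3
        // un-ACKed messages, the producer's send blocks (backpressure).
        queue := make(chan string, 3)

        // Slow consumer: drains one message per 100ms.
        go func() {
            for msg := range queue {
                time.Sleep(100 * time.Millisecond)
                fmt.Println("consumed:", msg)
            }
        }()

        // Producer: stalls on `queue <- msg` whenever the buffer is full,
        // so it stops pulling from its own input until the consumer catches up.
        for i := 0; i < 10; i++ {
            msg := fmt.Sprintf("msg-%d", i)
            queue <- msg // blocks when the bounded buffer is full
            fmt.Println("produced:", msg)
        }
        close(queue)
        time.Sleep(time.Second) // let the consumer finish draining
    }

Run it and you'll see the "produced" lines quickly fall into lockstep with the consumer's pace once the buffer fills: that's the stall propagating backwards.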

And yes, this means that you need to have logic all the way back at the original sender (the one triggering the async message-send as part of some synchronous business-logic), to be able to refuse / abort / revert the entire high-level business-logic operation if the async-message-send's message-accept fails. (A user shouldn't be considered signed up if you can't remember to send them a verification email; a subscription should not be created if you can't remember to charge the card; etc.)
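
Something like this, as a rough Go sketch (createUser, deleteUser, and enqueueVerificationEmail are hypothetical stand-ins for your user store and the broker's accept-and-buffer endpoint, not any real API):

    // Hedged sketch: the original sender aborts/reverts the whole business
    // operation when the broker refuses to accept the async send.
    package signup

    import (
        "context"
        "errors"
        "fmt"
    )

    // Hypothetical helpers standing in for a user store and broker client.
    func createUser(ctx context.Context, email string) (string, error) { return "user-123", nil }
    func deleteUser(ctx context.Context, userID string) error          { return nil }
    func enqueueVerificationEmail(ctx context.Context, id, email string) error {
        // Stub that simulates a refusal, e.g. backpressure reaching the edge.
        return errors.New("broker queue full")
    }

    func SignUp(ctx context.Context, email string) error {
        userID, err := createUser(ctx, email)
        if err != nil {
            return fmt.Errorf("create user: %w", err)
        }
        // Synchronous "accept and buffer" call to the broker. If it refuses
        // (unavailable, or its bounded queue is full), revert and fail the
        // signup: a user isn't signed up if we can't remember to send the
        // verification email.
        if err := enqueueVerificationEmail(ctx, userID, email); err != nil {
            _ = deleteUser(ctx, userID) // best-effort revert
            return fmt.Errorf("signup aborted: %w", err)
        }
        return nil
    }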

In essence, you can think of this as "semi-async": each stage is doing a synchronous RPC call to an "accept and buffer this batch of async messages" endpoint on the broker — which might synchronously fail (if the broker is unavailable, or if the consumer of a bounded-size(!) queue has blocked to create backpressure, the queue has therefore filled, and the broker has in turn stopped accepting writes to that queue).

With such an API, rather than pretending that there's some magic reliable-delivery system you can "fire and forget" messages onto, these failures get bubbled up to the caller on the send side, like any other failure of a synchronous RPC call.
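
The broker side of that contract can be as simple as a bounded queue that refuses batches instead of buffering them without limit. A minimal Go sketch, with made-up types (this is not Kafka's API):

    // Minimal sketch of an "accept and buffer" endpoint backed by a bounded queue.
    package broker

    import "errors"

    var ErrQueueFull = errors.New("queue full: consumer is applying backpressure")

    type Queue struct {
        buf chan []byte // bounded; capacity fixed at construction
    }

    func NewQueue(capacity int) *Queue {
        return &Queue{buf: make(chan []byte, capacity)}
    }

    // Accept tries to buffer a batch of messages without blocking. If the
    // bounded queue fills partway through, the remaining messages are refused
    // and the caller is told how many were accepted.
    func (q *Queue) Accept(batch [][]byte) (accepted int, err error) {
        for _, msg := range batch {
            select {
            case q.buf <- msg:
                accepted++
            default:
                return accepted, ErrQueueFull // refuse rather than pretend
            }
        }
        return accepted, nil
    }

The point of returning ErrQueueFull instead of blocking forever is that the send-side RPC handler can turn it straight into a synchronous error for the caller, which is exactly the bubbling-up described above.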

Take this to its fullest extent, and you get Google's "effectively synchronous" RPC philosophy, where you have event brokers for routing and discoverability (think k8s Services), but async messages are always queued in either the sender process's [bounded] outbox or the recipient process's [bounded] inbox, with no need for a broker-side queue, because everything is designed with backpressure + graceful handling of potential accept failure in mind, including the initial clients knowing to retry pushing the initial message-send. (If you're familiar with the delivery semantics of Golang channels — it's basically that, but distributed rather than process-internal. There's a reason that particular language feature came out of a language designed at Google.)
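
As a toy, single-process illustration of that channel analogy (the recipient's bounded inbox is just a buffered channel here, and the timeout/retry policy is invented for the example): the sender either gets its message accepted or finds out it wasn't and retries.

    // Toy version of "bounded inbox + client-side retry"; no broker-side queue.
    package main

    import (
        "fmt"
        "time"
    )

    // trySend attempts delivery into the recipient's bounded inbox, giving up
    // after a timeout so the caller can decide to retry or abort.
    func trySend(inbox chan<- string, msg string, timeout time.Duration) bool {
        select {
        case inbox <- msg:
            return true // the bounded inbox accepted the message
        case <-time.After(timeout):
            return false // accept failure: caller must retry or abort
        }
    }

    func main() {
        inbox := make(chan string, 2) // recipient's bounded inbox

        // Slow recipient.
        go func() {
            for msg := range inbox {
                time.Sleep(200 * time.Millisecond)
                fmt.Println("handled:", msg)
            }
        }()

        for i := 0; i < 5; i++ {
            msg := fmt.Sprintf("event-%d", i)
            for !trySend(inbox, msg, 50*time.Millisecond) {
                fmt.Println("inbox full, retrying:", msg) // client-side retry
                time.Sleep(100 * time.Millisecond)
            }
        }
        close(inbox)
        time.Sleep(2 * time.Second)
    }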

---

Mind you, there's also the "truly async" batch-processing semantics — the kind ATMs have, where even if the initial client doing a synchronous operation (think: withdrawing cash) can't get in contact with the server/broker to push the async message-sends, it just appends the message to a big ol' local log file and proceeds as if the async sends already succeeded. Later, when it comes back online, it dumps its whole built-up log of messages to the broker, and all events in the log are inherently accepted; there are, however, higher-level semantics that might generate additional revert events in response to some of them (e.g. if the ATM user overdrew their account), which get fed back into the system. But you, as the initial producer of messages, don't have to worry about collating those against your messages or anything like that.
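
Sketched in Go, with a made-up log path and a hypothetical pushToBroker callback, that store-and-forward pattern looks roughly like this:

    // Rough sketch of "append locally, proceed, replay to the broker later".
    package atmlog

    import (
        "bufio"
        "fmt"
        "os"
    )

    const logPath = "/var/lib/atm/pending-events.log" // hypothetical location

    // RecordEvent appends the event locally and returns immediately, treating
    // the async send as already "accepted".
    func RecordEvent(event string) error {
        f, err := os.OpenFile(logPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
        if err != nil {
            return err
        }
        defer f.Close()
        if _, err := fmt.Fprintln(f, event); err != nil {
            return err
        }
        return f.Sync() // make sure the event survives a crash/power loss
    }

    // Replay dumps the accumulated log to the broker once we're back online.
    // All events are inherently accepted; any compensating events (e.g. an
    // overdraft reversal) are produced downstream, not by this client.
    func Replay(pushToBroker func(event string) error) error {
        f, err := os.Open(logPath)
        if err != nil {
            if os.IsNotExist(err) {
                return nil // nothing pending
            }
            return err
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            if err := pushToBroker(scanner.Text()); err != nil {
                return err // stop and keep the log; retry later
            }
        }
        if err := scanner.Err(); err != nil {
            return err
        }
        return os.Truncate(logPath, 0) // log fully drained
    }

In practice you'd also want the events to be idempotent (or track a replay offset), since a partial replay will re-send the earlier events the next time around.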


Managed disks reside on redundant, highly available storage.



