Async message-oriented systems vs. REST for inter-microservice communications

derefr · on Feb 12, 2023

> Back-pressure (e.g. slowing down the entry-points) can easily be introduced if queues becomes too large.

...which presumably includes load-shedding to stop misbehaving components from overloading the queues; at which point, unless you want clients to just lose track of the things they wanted done when they get a "we're too busy to handle this right now" response, you've essentially circled back around to clients having to use a client with REST-like "synchronous/blocking requests with retry/backpressure" semantics — just where the requests that are being synchronously-blocked on are "register this as a work-item and give me an ID to check on its status" rather than "do this entire job and tell me the result."

And if you're doing that, why force the client to think in terms of async messaging at all? Just let them do REST, and hide the queue under the API layer of the receiver.

stolsvik · on Feb 12, 2023

Well, yes - there is nothing with Mats that you cannot do with any other communication form, if you code it up. When you say "register this as a work-item and give me an ID to check on its status", you've implemented a queue, right?

The intention is that Mats gives you an easy way to perform async message-oriented communications. Somewhat of a bonus, you can also use it for synchronous tasks, using the MatsFuturizer, or MatsSocket. A queue can handle transient peaks of load much better than direct synchronous code. It is also quite simple to scale out. But if you do get into problems of getting too much traffic for the system to process, you will have to handle that - and Mats does not currently have any magic for performing e.g. load shedding, so you're on your own. (I have several thoughts on this. E.g. monitor the queue sizes, and deny any further initiations if the queues are too large).

Wrt. synchronous comms, Mats do provide a nice feature, where you can mark a Mats Flow as "interactive", meaning that some human is waiting for the result. This results in the flow getting priority on every stage it passes through - so that if it competes with internal, more batchy processes, it will cut the lines.

derefr · on Feb 12, 2023

> A queue can handle transient peaks of load much better than direct synchronous code.

Whether a workload is being managed upon creation using a work queue within the backend, has nothing to do with the semantics of the communications protocol used to talk about the state of said workload. You can arbitrarily combine these — for example, DBMSes have the unusual combination of having a stateful connection-oriented protocol for scheduling blocking workloads, but also having the ability to introspect the state of those ongoing workloads with queries on other connections.

My point is that clients in a distributed system can literally never do "fire and forget" messaging anyway — which is the supposed advantage of an "asynchronous message-oriented communications" protocol over a REST-like one. Any client built to do "fire and forget" messaging, when used at scale, always, always ends up needing some sort of outbox-queue abstraction, where the outbox controller is internally doing synchronous blocking retries of RPC calls to get an acknowledgement that a message got safely pushed into the queue and can be locally forgotten.

And that "outbox" is a leaky abstraction, because in trying to expose "fire and forget" semantics to its caller, it has no way of imposing backpressure on its caller. So the client's outbox overflows. Every time.

This is why Google famously switched every internal protocol they use away from using message queues/busses with asynchronous "fire and forget" messaging, toward synchronous blocking RPC calls between services. With an explicitly-synchronous workload-submission protocol (which may as well just be over a request-oriented protocol like HTTP, as gRPC is), all operational errors and backpressure get bubbled back up from the workload-submission client library to its caller, where the caller can then have logic to decide the business-logic-level response that is most appropriate, for each particular fault, in each particular calling context.

Message queues are the quintessential "smart pipe", trying to make the network handle all problems itself, so that the nodes (clients and backends) connected via such a network can be naive to some operational concerns. But this will never truly solve the problems it sets out to solve, as the policy knowledge to properly drive the decision-making for the mechanism that handles operational exigencies in message-handling, isn't available "within the network"; it lives only at the edges, in the client and backend application code of each service. Those exigencies — those failures and edge-case states — must be pushed out to the client or backend, so that policy can be applied. And if you're doing that, you may as well move the mechanism to enforce the policy there, too. At which point you're back to a dumb pipe, with smart nodes.

jgilias · on Feb 12, 2023

Is there something I can read about Google switching to sync RPC? Like a blog post or something like that?

Thanks!

anentropic · on Feb 13, 2023

> it has no way of imposing backpressure on its caller. So the client's outbox overflows. Every time

I don't see why you couldn't implement back-pressure here

and the OP article also states:

> Back-pressure (e.g. slowing down the entry-points) can easily be introduced if queues becomes too large.

stolsvik · on Feb 12, 2023

"Not everybody is Google"

These concepts has worked surprisingly well for us for nearly a decade. We're not Google-sized, but this architecture should work well for a few more orders of magnitude traffic.

Also, you can mix and match. If you have some parts of your system with absolutely massive traffic, then don't use this there, then.

Note that we very seldom use "fire and forget" (aka "send(..)"). We use the request-replyTo paradigm much more. Which is basically the basic premise of Mats, as an abstraction over pure "forward-only" messaging.

bcrosby95 · on Feb 12, 2023

RPC works for just non Google and Google scale. This is one of the times where, IMHO, you can skip the middle section. Novices resort to RPC, Google resorts to RPC, and in the mid tier you have something where messaging can step in.

Why not skip it? Use RPC like a novice. If it becomes problematic, start putting in compensating measures.

TedDoesntTalk · on Feb 13, 2023

What about systems where the clients push more messages per second than consumers can process?

lenkite · on Feb 13, 2023

Internal Mailbox/Queue on the receiving side. Caller gets back a transaction id that they can query for progress or caller provides a webhook for confirmation/failure. If mailbox is full, callee immediately responds with 429.

zeendo · on Feb 13, 2023

To quote the gp

    hide the queue under the API layer of the receiver

derefr · on Feb 12, 2023

> Note that very we use "fire and forget" very seldom (aka "send(..)"). We use the request-replyTo paradigm much more. Which is basically the basic premise of Mats, as an abstraction over pure "forward-only" messaging.

That doesn't help one bit. You're still firing-and-forgetting the request itself. The reply (presumably with a timeout) ensures that the client doesn't sit around forever waiting for a lost message; but it does nothing prevent badly-written request logic from overloading your backend (or overloading the queue, or "bunging up" the queue such that it'll be ~forever before your backend finishes handling the request spike and gets back to processing normal workloads.)

> If you have some parts of your system with absolutely massive traffic, then don't use this there, then.

I'm not talking about massive intended traffic. These problems come from failures in the architecture of the system to inherently bound requests to the current scale of the system (where autoscaling changes the "current scale of the system" before such limits kick in.)

So, for example, there might be an endpoint in your system that allows the caller to trigger logic that does O(MN) work (the controller for that endpoint calls service X O(M) times, and then for each response from X, calls service Y O(N) times); where it's fully expected that this endpoint takes 60+ seconds to return a response. The endpoint was designed to serve the need of some existing internal team, who calls it for reporting once per day, with a batch-size N=2. But, unexpectedly, a new team, building a new component, with a new use-case for the same endpoint, writes logic that begins calling the endpoint once every 20 seconds, with a batch-size of 20. Now the queues for the services X and Y called by this endpoint are filling faster than they're emptying.

No DDoS is happening; the requests are quite small, and in networking terms, quite sparse. Everything is working as intended — and yet it'll all fall over, because you've chosen yourself into a protocol where there's no inherent, by-default mechanism for "the backend is overloaded" to apply backpressure to make new requests from the frontend stop coming (as it would in a synchronous RPC protocol, where 1. you can't submit a request on an open socket when it's in the "waiting for reply" state; and 2. you can't get a new open socket if the backend isn't calling accept(2)); and you didn't think that this endpoint would be one that gets called much, so you didn't bother to think about explicitly implementing such a mechanism.

stolsvik · on Feb 12, 2023

Relying on the e.g. Servlet Container not being able to handle requests seems rather bad to me. That is a very rough error handling.

We seem to have come to the exact opposite conclusions wrt. this. Your explanations are entirely in line with mine, but I found this "messy" error handling to be exactly what I wanted to avoid.

There is one particular point where we might not be in line: I made Mats first and formost not for the synchronus situation, where there is a user waiting. This is the "bonus" part, where you can actually do that with the MatsFuturizer, or the MatsSocket.

I first and foremost made it for internal, batch-like processes like "we got a new price (NAV) for this fund, we now need to settle these 5000 waiting orders". In that case, the work is bounded, and an error situation with not-enough-threads would be extremely messy. Queues solves this 100%.

I've written some about my thinking on the About page: https://mats3.io/about/

derefr · on Feb 12, 2023

> Relying on the e.g. Servlet Container not being able to handle requests seems rather bad to me. That is a very rough error handling.

It's one of those situations where the simplest "what you get by accident with a single-threaded non-evented server" solution, and the most fancy-and-complex solution, actually look alike from a client's perspective.

What you actually want is that each of your backends monitors its own resource usage, and flags itself as unhealthy in its readiness-check endpoint when it's approaching its known per-backend maximum resource capacity along any particular dimension — threads, memory usage, DB pool checked-out connections, etc. (Which can be measured quite predictably, because you're very likely running these backends in containers or VMs that enforce bounds on these resources, and then scaling the resulting predictable-consumption workload-runners horizontally.) This readiness-check failure then causes the backend to be removed from consideration as an upstream for your load-balancer / routing target for your k8s Service / etc; but existing connected flows continue to flow, gradually draining the resource consumption on that backend, until it's low enough that the backend begins reporting itself as healthy again.

Meanwhile, if the load-balancer gets a request and finds that it currently has no ready upstreams it can route to (because they're all unhealthy, because they're all at capacity) — then it responds with a 503. Just as if all those upstreams had crashed.

> Your explanations are entirely in line with mine, but I found this "messy" error handling to be exactly what I wanted to avoid.

Well, yes, but that's my point made above: this error handling is "messy" precisely because it's an encoding of user intent. It's irreducible complexity, because it's something where you want to make the decision of what to do differently in each case — e.g. a call from A to X might consider the X response critical (and so failures should be backoff-retried, and if retries exceeded, the whole job failed and rescheduled for later); while a call from B to X might consider the X response only a nice-to-have optimization over calculating the same data itself, and so it can try once, give up, and keep going.

> I made Mats first and formost not for the synchronus situation, where there is a user waiting.

I said nothing about users-as-in-humans. We're presumably both talking about a Service-Oriented Architecture here; perhaps even a microservice-oriented architecture. The "users" of Service X, above, are Service A and Service B. There's a Service X client library, that both Service A and Service B import, and make calls to Service X through. But these are still, necessarily, synchronous requests, since the further computations of Services A and B are dependent on the response from Service X.

Sure, you can queue the requests to Services A and B as long as you like; but once they're running, they're going to sit around waiting on the response from Service X (because they have nothing better to be doing while the Service X response-promise resolves.) Whether or not the Service X request is synchronous or asynchronous doesn't matter to them; they have a synchronous (though not timely) need for the data, within their own asynchronous execution.

Is this not the common pattern you see for inter-service requests within your own architecture? If not, then what is?

If what you're really talking about here is forward-only propagation of values — i.e. never needing a response (timely or not) from most of the messages you send in the first place — then you're not really talking about a messaging protocol. You're talking about a dataflow programming model, and/or a distributed CQRS/ES event store — both of which can and often are implemented on top of message queues to great effect, and neither of which purport to be sensible to use to build RPC request-response code on top of.

ec109685 · on Feb 13, 2023

Unless you are super careful, the result of taking services out of rotation as they become overloaded, very often results in cascading failures as each server in turn falls over because more increasing amount of traffic is directed at healthy ones.

You need to be super careful that a service never allows more concurrency than it can handle and fast fails requests when it is overloaded. Otherwise, tcp queues back up and the server ends up only working on requests that are too old to be useful.

stolsvik · on Feb 12, 2023

To your latter part: This is exactly the point: Using messages and queues makes the flows take whatever time it takes. Settling of the mentioned orders are not time critical - well, at least not in the way a request from a user sitting on his phone logging in to see his holdings is time critical. So therefore, if it takes 1 second, or 1 hour, doesn't matter all that much.

The big point is that none of the flows will fail. They will all pass through as fast as possible, literally, and will never experience any failure mode resulting from randomly exhausted resources. You do not need to make any precautions for this - backpressure, failure handling, retries - as it is inherent in how a messaging-based system works.

Also, if a user logs into the system, and one of the login-flows need a same service as the settling flows, then that flow will "cut the line" since they are marked "interactive".

TedDoesntTalk · on Feb 13, 2023

> You do not need to make any precautions for this - backpressure, failure handling, retries - as it is inherent in how a messaging-based system works.

In reality, there are time limits on fulfillment of requests. Even in your previous example of an order execution / fulfillment system, the orders must execute while the exchange is still open. I can’t think of a system that is truly unbounded in time, but maybe one exists.

stolsvik · on Feb 13, 2023

This would not work for a exchange. What I work with is an UCITS mutual funds system. Once per day, we get a new price for each fund (the NAV, Net Asset Value). We now need to settle all orders, subscriptions and redemptions, waiting for that NAV. This is of course time critical, but not in the millisecond-sense: As long as it is done within an hour or two, it is all ok.

I believe this holds for very many business processes. If you get a new shipment of widgets, and you can now fulfill your orders waiting for those widgets, it does not really matter if it takes 1 second, or in a very bad outlier day, occasionally 2 hours.

Realize that the point is that this settling, or order fulfillment, will go as fast as possible. Usually within seconds or maybe minutes. However, if you suddenly get a large influx, or the database goes down for a few minutes, this will only lead to a delay - there is nothing you need code up extra to handle such problems. Also, you can scale this very simply, based on what holds you back (services, or database, or other external systems, or IO). It will not be the messaging by itself!

tacticus · on Feb 12, 2023

> you've implemented a queue, right?

Yeah you have just without having to run it on top of a shared DB. message queues are just shared DBs with some extra ordering

stolsvik · on Feb 13, 2023

> message queues are just shared DBs with some extra ordering

Completely agree. And I have stated that elsewhere in these threads. I mention it here: https://mats3.io/using-mats/matsfactory/#connecting-to-the-c...

And I have a feature-issue that explores implementing Mats using DBs: https://github.com/centiservice/mats3/issues/15

tass · on Feb 12, 2023

This is where I always end up. You can have queues which give you certain benefits, but there’a a lot of stuff to be built on top to make it as operationally simple as http.

stolsvik · on Feb 12, 2023

I will argue that this simplicity is exactly what Mats provides. At least that is the intention.

revskill · on Feb 12, 2023

I don't see the code on webpage to explain things. Simplicity means you can explain complex things with simple code.

Because English is ambigous and subjective. Just use code ?

stolsvik · on Feb 12, 2023

There is code here: https://mats3.io/docs/message-oriented-rpc/ .. and here: https://mats3.io/docs/mats-flow-initiation/ .. and here: https://mats3.io/docs/sync-async-bridge/ .. and here: https://mats3.io/docs/springconfig/ .. and here: https://mats3.io/background/what-is-mats/ .. and here: https://mats3.io/using-mats/endpoints-and-initiations/

.. and on the github page here: https://github.com/centiservice/mats3/blob/main/README.md

.. and you are advised to explore the code here: https://mats3.io/docs/explore/

cerved · on Feb 12, 2023

yes but there's sadly no code in what you posted

peoplefromibiza · on Feb 12, 2023

> And if you're doing that, why force the client to think in terms of async messaging at all? Just let them do REST, and hide the queue under the API layer of the receiver.

because REST is stupid

REST is request -> response, single connection, one direction only, which is a very limited way to model messaging .

There is more than one communication mode and bidirectional messaging is a thing.

REST also offers no control whatsoever over the communication channel, so you are stuck with the configuration set on the server side

which might or might not be correct for your use case

See RSocket for an example of a message driven protocol which solves most of the shortcomings of REST

on the bright side REST is also stupid simple

which is why is so widely deployed, it doesn't require thinking

> response, you've essentially circled back around to clients having to use a client with REST-like

no, because you did not block there waiting for the timeout which defaults to 30 seconds for HTTP

and even if you abandon on the client side, the server will still process the request, there's no way to abort it once it's been started.

naasking · on Feb 12, 2023

> REST is request -> response, single connection, one direction only, which is a very limited way to model messaging

You seem to be implying that being limited is a bad thing. Constraints are important to keep problems tractable.

peoplefromibiza · on Feb 13, 2023

> You seem to be implying that being limited is a bad thing

it is, when the limits are a bad thing.

Imagine if houses could only be built in the shape of a cube, because someone decided that the cube is the perfect shape for something else entirely.

HTTP has its place, REST has its place, messaging is a lot more than REST though.

It's always been, even before HTTP was invented.

Using REST for everything is the textbook example of if all you have is a hammer, everything looks like a nail

naasking · on Feb 13, 2023

This just seems like a list of platitudes lacking an actual justification. The fact is, synchronous request/reply semantics is Turing complete, as are async messaging semantics. I could can opt-in to sending an async message via a standard request/reply, but I can't opt out of it if async messages are the default. Async messages introduce DoS vulnerabilities, and even partially mitigating these introduces non-determinism, so opt-in should be the default.

hbrn · on Feb 13, 2023

Async processing is also a great way to sweep failures under the rug.

A lot of those failures won't even be detected, because you don't have proper monitoring, whereas in sync world you fail early and loudly.

And instead of "we had a downtime" you can sugarcoat it as "we had a processing delay". What's not to love?

stolsvik · on Feb 18, 2023

As the article points out, I feel that messaging really shines when it comes to failures: Instead of an - at best - WARN or ERROR log line amongst millions of log lines, the failing message "pops out" of the messaging fabric, into a Dead Letter Queue (DLQ).

You shall obviously set up monitoring of the MQ and all its DLQs. Once you've done that, you will literally never have a failing processing flow that isn't caught. A successfully initiated Mats Flow will either run to completion, or it will DLQ. Guaranteed.

The MatsBrokerMonitor is a Mats-specific such monitor (compared to the generic monitor ActiveMQ comes with). The pages here are a bit lacking, but it at least explains a bit: https://mats3.io/docs/matsbrokermonitor/

hbrn · on Feb 20, 2023

The major problem with dead letter queues is that people naturally end up mixing recoverable and non-recoverable errors together, which results in all errors being ignored.

Proper monitoring typically requires application specific details and logic, which queues, being a dumb bus, typically don't have.

E.g. your client has a scheduled downtime every midnight, so you want your monitoring to not alert on messages produced by account X between 00:00 and 02:00. Oh, and ideally these exclusion rules are set up by non-technical folks (maybe even by customer himself).

klabb3 · on Feb 12, 2023

In my experience the amount of serialized (network blocking) calls needed under the request reply paradigm always grows over time as the application gets larger.

At least this limitation can cause massive complexity once perf optimizations are needed. I think that’s important to factor in when we’re talking about the issues with large and resilient systems in either paradigm.

Personally I like message passing because it’s more true to the underlying protocol (TCP or UDP) and actually interops quite well (all things considered) with request-reply systems - it just requires two separate messages and a request id which is standard practice in request-response anyway. The inverse is not true though: we have like 10 different hacky solutions in the last decade for sending server initiated messages to clients.

naasking · on Feb 13, 2023

Request/reply is also message passing, it's just synchronous rather than asynchronous. To reduce blocking while preserving synchronous semantics, you can immediately return a future as your reply.

stolsvik · on Feb 18, 2023

Totally agree that HTTP can also be considered message passing, just synchronous!

Wrt. returning a Future, this is what the MatsFuturizer does, giving you a sync-async bridge into the "Mats fabric": https://mats3.io/docs/sync-async-bridge/

charrondev · on Feb 12, 2023

The system I’m currently on is currently moving a lot of work into queues. Some operations, like “change the criteria of this rank” could be anywhere between 5 seconds (if the number of users of the criteria to evaluate are small) or 10+ hours if we need re-evaluate the rules against 10m+ users.

In this case we write our jobs as generators that can be paused, serialized and picked up again later. We give the job 5 seconds synchronously, then if it passes that time, queue and job and let the client know a job has been registered.

The users account holds the IDs of the jobs as well as some basic information about the the tasks they have queued. There is a rest endpoint to return the current status of the jobs and information about them (what are they doing, what’s their progress, how much work remains).

The client will negotiate a web socket connection with a different service to be notified whenever progress is made on the job and the client can then check the endpoint for the latest status.

latchkey · on Feb 12, 2023

That 5 seconds is going to bite you.

There is going to be some sort of stall in the future that causes all of your jobs to hit that 5 seconds and everything is going to start to back up and cause other problems up the line that are really hard to test for in advance.

You're better off designing a system that doesn't rely on some arbitrary number of seconds (why not 4 or 6 seconds?) to begin with.

naasking · on Feb 12, 2023

Yes, non-determinism is the bane of distributed systems. It should be minimized whenever possible.

convolvatron · on Feb 13, 2023

in general I've been a lot more successful making fewer assumptions than trying to enforce them

charrondev · on Feb 15, 2023

The system doesn’t rely on the 5s FWIW worth and is capable of running entirely through the distributed queue.

The 5s check is just there to improve UX on small user actions.

latchkey · on Feb 15, 2023

Let's say that ALL of the jobs suddenly take up 6s. They ALL get queued up.

Now you have a queue problem and you have more to debug. You have to figure out why the queue is filling up and why things are taking longer than 5s. You also have an issue where the execution might die (machine crashes) before the job makes it into a queue.

You're better off just going straight to queue with the jobs and remove the added complexity of a 5s rule.

Supermancho · on Feb 12, 2023

> you've essentially circled back around to clients having to use a client with REST-like "synchronous/blocking requests with retry/backpressure" semantics

Yes, they both do the same thing. That's not even the starting point of the discussion. The implementation from HTTP to a message queue (mailbox system) is the discussion point.

Having the caller (who needs work done) wait to be informed when the work is done (or not done) is less deterministic than telling the callee how long before the work doesn't matter anymore. The callee gives back a transaction ID/is provided a callerID or is unavailable, and the caller knows (very quickly) it's not going to get done or knows where to look for the work (or abandon it). Either way, it allows for optimization on both sides.

stolsvik · on Feb 18, 2023

> .. telling the callee how long before the work doesn't matter anymore.

That is interesting - this is actually a feature of JMS, which is employed in Mats: You can say that a Mats Flow is nonPersistent (i.e. "not all that important, really"), and in the same call you can also say how many seconds until timeout.

<https://mats3.io/javadoc/mats3/0.19/api/io/mats3/MatsInitiat...>

This is employed by the MatsFuturizer if you use the "nonEssential" invocation: <https://mats3.io/javadoc/mats3/0.19/modern/io/mats3/util/Mat...>

The point is that if there are too big queues, bunching up even more work on them which will not be processed in time does not make any sense: If the Future on the caller side have already timed out, it makes no sense at all that the callee should process the work and return the answer.

naasking · on Feb 12, 2023

> And if you're doing that, why force the client to think in terms of async messaging at all? Just let them do REST, and hide the queue under the API layer of the receiver.

Yes, exactly. And on top of that, async messaging implicitly introduces DoS vulnerabilities exactly because of the buffering required. At least with sync messaging exposing a queue in the API layer, you opt-into this vulnerability.

stolsvik · on Feb 12, 2023

As mentioned here: https://mats3.io/background/system-of-services/

.. Mats is meant to be an inter-service communication solution.

It is explicitly not meant to be your front-facing endpoints. If you are DoS'ed, it would be from your own services. Of course, that might still happen, but then things would not have been much better if you used sync comms.

It is true that you can bridge from sync to the async world of Mats using the MatsFuturizer (https://mats3.io/docs/sync-async-bridge/), but then you still have your e.g. Servlet Container as the front-facing entity.

(Also check out https://matssocket.io/, though)

rkangel · on Feb 12, 2023

If I understand this right, this is basically the Erlang/Elixir OTP programming model, but across microservices rather than across a single (potentially distributed) VM. To be clear - that is a good thing.

One of the core concepts of OTP (effectively the Erlang standard library) is the GenServer. A GenServer processes incoming messages, mutates state if appropriate and sends responses. The OTP machinery means that this "send a message and wait for a response" is just a straight function call with return value to the caller. OTP takes care of all the edge cases (like when the process at the other end goes away half way through). This means that your code is just a straight series of synchronous function calls, which may be sending messages underneath to do things or get data, but you don't have to care. It's a lovely system to work in, and makes complicated systems feel simple.

The elements communicating are in Erlang terminology 'processes' - but not OS processes, they are instead lightweight userspace schduled things - very lightweight to create. Erlang has built in distribution that allows you to connect multiple running machines, and then the same message passing works across network boundaries. You're still limited to the BEAM VM though. This is the 'full' microservice version of that.

fy20 · on Feb 13, 2023

I'm pretty sure there should be a meme here, that any new async messaging system ends up reinventing Erlang/OTP.

stolsvik · on Feb 13, 2023

There is: https://news.ycombinator.com/item?id=34767858

stolsvik · on Feb 13, 2023

Thanks, I'll take that Erlang-comparison as praise!

(I actually mention the actor model as inspiration here: https://mats3.io/background/what-is-mats/#attempt-at-a-conde...)

guhcampos · on Feb 12, 2023

Oh yes, another "this thing I sell is an actual silver bullet" post.

Message busses are great. RPC is too. There are use cases for both. Saying one is "better" than the other is silly, and in this case, a shame.

There are loads of message passing libraries out there, based on all kinds of backend, from RabbitMQ, to NATS, to Redis, to Kafka. This does not innovate over anything, it's just shameless marketing.

stolsvik · on Feb 12, 2023

This is unfair. I made Mats so that I could use messaging in a simpler form. Nothing else.

Mats is an API that can be implemented on top of any queue-based message broker - which excludes Kafka. But it definitely includes ApacheMQ (which is what we use), Artemis and hence RedHat's MQ (which the tests runs), and RabbitMQ (whose JMS implementation is too limited to directly be used, but I do hope to implement Mats on top of it at some point). Probably also NATS. Probably also Apache Pulsar, which I just recently realized have a JMS client.

You could even implement it on top of ZeroMQ, and implement it on top of any database, particularly Postgres since it has those "queue extensions" NOTIFY and SKIP LOCKED.

edit: I actually have an feature-issue exploring such an implementation: https://github.com/centiservice/mats3/issues/15

latchkey · on Feb 12, 2023

> ApacheMQ

I hope ActiveMQ Artemis is better than the 'classic' version and that is what you're using. The last time I used it probably a decade ago now, there were so many issues with it, that it was a complete train wreck at scale. I would be very hesitant to pick that one up again.

mdaniel · on Feb 12, 2023

Making your own license is the new JavaScript framework, I guess: https://github.com/centiservice/mats3/blob/v0.19.4-2023-02-1...

Pet_Ant · on Feb 12, 2023

Not open source.

> Noncompete

> Any purpose is a permitted purpose, except for providing to others any product that competes with the software.

stolsvik · on Feb 12, 2023

I do not say that it is Open Source either.

From the front page: "Free to use, source on github. Noncompete licensed - PolyForm Perimeter."

Feel free to comment on this: Is this a complete deal breaker for all potential users?

marginalia_nu · on Feb 12, 2023

Honestly I'm pretty annoyed by the "how dare you give the source away under other terms than the ones I would prefer"-type reactions that crop up from time to time. It's an incredibly entitled attitude and is not a good look for the open source community in general.

Like by all means, share code with GPL or Apache or MIT or whatever, but don't get mad when someone selects another license, including non-free ones with weird incompatibilities.

jagged-chisel · on Feb 12, 2023

Those kinds of complaints are indeed entitled. At the same time, there's no problem pointing out that fewer people and organizations can select a dependency with an unconventional, unknown license.

You're welcome to license your projects however you see fit. But when you get to a point that no one is using your stuff, you have to be ready to hear "it's the license."

lazyasciiart · on Feb 12, 2023

Your comment is “how dare you complain about licensing”. What you are responding to is “huh, weird license, won’t use, that’s a shame”.

tinco · on Feb 12, 2023

Well yeah of course, it's a direct contradiction of the rising tide lifts all boats principle. Do you think Kubernetes would have any traction at all if it had a clause that it couldn't be used on AWS?

If it can't be adopted by the industry as a whole, then it can't be considered an industry standard. It wouldn't fly at my organization anyway, not even looking up what PolyForm is.

mejutoco · on Feb 12, 2023

I see your point, and could not avoid thinking if ElasticSearch would have more revenue if AWS could not offer it directly.

mixedCase · on Feb 12, 2023

Do you believe the fat ugly monster that is ElasticSearch would've had anywhere near its current adoption rates if it had a non-OSI license from the start?

It would've been completely overshadowed by some other Lucene-based wrapper or maybe some even better alternative would've come along earlier.

indymike · on Feb 12, 2023

I built on Elastic early on over Solr and several others because it was open source and seemed to be better. I would have selected a different Lucene wrapper if I had known where Elastic was going.

mejutoco · on Feb 12, 2023

Algolia does pretty well I believe. I could be wrong.

stavros · on Feb 12, 2023

As far as I can see, this doesn't say it can't be used on AWS, it only says Amazon can't launch its own service that uses this software to compete with itself. It's too short to really tell what "compete" entails, though.

stolsvik · on Feb 12, 2023

You are correct, this is meant as an AWS/GCP/Azure preventor. ElasticSearch situation. That is, AFAIU, the intention of the license I adopted. The "examples" part spells it pretty directly out, as I also try to do here: https://centiservice.com/license/

You may definitely use it anywhere you like.

mdaniel · on Feb 12, 2023

I end up regretting it every time I weigh into this mess, but that line of reasoning drives me so crazy that this just baits me into it

It is some yeowsers level hubris of every one of these folks who adopts some "source available" license because "Amazon gonna use our software for freeeeee, we go brokkkke". My company goes head-to-head with an existing Amazon closed-source offering and our stance is that if we can't out-customer-service, out-price, and generally make something more awesome than Amazon, that's on us, not because Amazon took our software and somehow ... forked it? innovated in a way we couldn't by using it?

In the meantime, during the days up until Amazon Armageddon Day(tm), you don't have any bugfixes from the software engineers in the trenches trying to use your software, the cutesy license carte-blanc rules out its use in a non-trivial number of shops that would otherwise, and it ends up generating a lot of threads during every discussion which aren't "wow, that's awesome you made transactional JMS -- I can't wait to try that out in my use case!" where you do shine

I am so sick of people pointing to the NOT OUT OF BUSINESS Elastic as the case study of "Amazon took our stuff, now we broke". AWS offers managed Kafka, also, and Confluent seems to be doing just fine. We use the Apache licensed Kafka not because we hate Confluent, but because it's infinitely easier to deploy a non-locked-in docker image than to deal with licensing keys in our deployment strategy. We similarly avoid Amazon Managed Kafka because it's pricing is stupid and the kinds of risks it drives down are not our risks

An alternative viewpoint of AWS taking your software is "wow, what a market validation! Now come get the 5 versions newer release from the experts who built it."

stavros · on Feb 12, 2023

I'm really in favor of something like that. AWS using your own FOSS software to choke your revenue stream is a blight on FOSS, so good for you for using that license.

stolsvik · on Feb 12, 2023

Thank you!

mixedCase · on Feb 12, 2023

Can't speak for all potential users, but the license is in fact a complete deal-breaker for me and any client I've worked with given the FOSS tools available in the ecosystem.

But then, there's also the "Java-only" which is a complete deal-breaker in any client I've worked with doing {micro,}services.

Then there's the "what the hell does this actually do" deal-breaker when trying to explain it to some decision makers, and the "we already have queues and K8s to solve all of those issues" deal-breaker when explaining it to most fellow SWEs/SREs.

stolsvik · on Feb 12, 2023

Hahaha, that's rough! :-)

I'll tell you one thing: "What the hell does this actually do?!" is extremely spot on! I am close to amazed to how hard it is to explain this library. It really does provide value, but it is evidently exceptionally hard to explain.

I first and foremost believe that this is due to the massive prevalence of sync REST/RPC style coding, and that messaging is only pulled up as a solution when you get massive influxes of e.g. inbound reports - where you actually want the queue aspect of a message broker. Not the async-ness.

I've tried to lay this out multiple times, e.g. here: https://mats3.io/docs/message-oriented-rpc/, and in the link for this post itself.

mixedCase · on Feb 13, 2023

I think I got it, but I'm inclined to believe I'm missing something.

From what I understood, it boils down to "distributed CPS" (continuation-passing style), where the state instead of being automatically closed over it is explicitly assigned to, and the continuation execution is distributed.

The problem about this specific approachh from where I see it is that something like this should be a few days' work to implement using an off-the shelf persistent queue (which could be abstracted away), something like Protobuf for defining your shared state and auto-generating serializers/deserializers, and then some reusable glue code to push/pull from the queue, and it would support multiple languages almost out of the box.

With that said, I am not sure if I'd want to define pipelines as a shared "mutable" state passed around, it sounds like classic object oriented design which makes it easy to create buggy code. It'd be much more robust to clearly define several steps which very strict, well-defined inputs and outputs every step of the way. You're paying for the copies in each serialization anyway (and much more, since this is a Remote Procedure Call), so what reason could there be for choosing a fixed mutable model?

stolsvik · on Feb 13, 2023

I think you're onto it, but not quite? The "distributed CPS" explanation is quite good.

There is no "shared mutable state" per se. The "state object" is passed through the stages of the same endpoint, not shared with stages of other invoked endpoints. It is meant to emulate the method local variables that you "get for free" if you code within a method. If you have String firstname set before you invoke HTTP endpoint X, you of course have that firstname available after the HTTP invocation returns, right? This is so obvious that one really doesn't think about it. When you throw a message onto a queue (and then the processor goes back listening on its incoming queue), you of course loose that String firstname, unless you pass it along (But the next service might not need it, so he then just have to pass it on further, so that the downstream service that actually needs it, can get it). But with Mats' state object, if you assigned the firstname to the state object, it is "magically present" on the next stage too.

It is passed within the message, this is what I mean by "messaging with a call stack": The call stack (called MatsTrace) holds the Reply-queues (where an Endpoint's Reply should go to), and the state object which will appear for that replied-to stage. There is no external storage for this.

The point of Mats is to leverage normal developers' innate understanding of normal, straight-down, sequential method-invocation-based coding. You can reason exactly like that, while still being able to code fully async messaging.

I love the Loom project, and I feel that there is a massive comparison here: Instead of having to embrace async/await, or Promises, or whatnots, Loom instead "hacks" the threading model. The result is exactly the same as if you were using async/await, but with a bit more overhead, you can code linearly, not having to warp your brain around the mess of async/await/Promises. This is exactly the same rationale for me making Mats.

delusional · on Feb 12, 2023

Well computing is all about redefining problems in terms of other atoms. A messaging service is really just a series of ALU operations and memory writes, which is then a series of nand gates.

It seems incredibly muddy to me what "competing" would mean in that sense. If I make something with this it could be argued that my system built on top of MATS is just immaterial configuration that was intended to be done by the user. That the authors intention was for the end user to use MATS themselves, and that I'm therefore in competition with the product.

A non programming example would be hammers and houses. You could imagine that if I build you a house, you'd be less likely to need to buy a hammer (to build your own) making my house competition for the hammer.

I wouldn't touch this at all.

jagged-chisel · on Feb 12, 2023

It's muddy/unknown enough that no one in a commercial enterprise can entertain shipping a service using your project.

jgilias · on Feb 12, 2023

Your license would not be a dealbreaker for me in an SME commercial setting. AGPL would be a dealbreaker.

stolsvik · on Feb 12, 2023

Thanks a bunch! Seriously. And I agree that AGPL is pretty harsh - I have a feeling that this is typically used in a "try before you buy" situation, where there is a commercial license on the side.

lazyasciiart · on Feb 12, 2023

That comment is quoting from the Polyform license. If it doesn’t represent your position, you may have made a bad choice in license.

stolsvik · on Feb 12, 2023

I was referring to the "not open source". I edited my comment to be more specific.

indymike · on Feb 12, 2023

Yes, it is a deal breaker for me.

stolsvik · on Feb 12, 2023

Not entirely my own: https://polyformproject.org/licenses/perimeter/1.0.0/

https://centiservice.com/license/

mrkeen · on Feb 12, 2023

> Transactionality: Each endpoint has either processed a message, done its work (possibly including changing something in a database), and sent a message, or none of it.

Sounds too good to be true. Would love to hear more.

stolsvik · on Feb 12, 2023

Well, okay, you're right! It is nearly true, though. :) I've written a bit about it here: https://mats3.io/using-mats/transactions-and-redelivery/

weatherlight · on Feb 12, 2023

“Virding's First Rule of Programming:

Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang.” ― Robert Virding

stolsvik · on Feb 13, 2023

:-)

I actually mention the actor model as inspiration for Mats here: https://mats3.io/background/what-is-mats/#attempt-at-a-conde...

samsquire · on Feb 12, 2023

Thanks for this.

I love the idea of breaking up a flow into separately scheduled but still linear message flow.

I wrote about a similar idea in ideas2

https://github.com/samsquire/ideas2#84-communication-code-sl...

The idea is that I enrich my code with comments and a transpiler schedules different parts of the code to different machines and inserts communication between blocks.

I read about how Zookeeper algorithm for transactionality and robustness to messages being dropped, which is interesting reading.

https://zookeeper.apache.org/doc/r3.4.13/zookeeperInternals....

How does Mats compare?

LMAX disruptor has a pattern where you split up each side of an IO request into two events, to avoid blocking in an handler. So you would always insert a new event to handle an IO response.

robertlagrant · on Feb 12, 2023

Having done a reasonable amount of messaging code in my time, I would say the final form of this sort of thing might look more like Cadence[0] than anything like this.

[0] https://github.com/uber/cadence

stolsvik · on Feb 12, 2023

Cadence is a workflow management system. As is Temporal, Apache Beam, Airbnb Airflow, Netflix Conductor, Spotify Luigi, and even things like Github Actions, Google Cloud Workflows, Azure Service Fabric, AWS SWF, Power Automate.

A primary difference is that those are external systems, where you define the flows inside that system - the system then "calling out" to get pieces of the flow done.

Mats is an "internal" system: You code your flows inside the service. It is meant to directly replace synchronously calling out to REST services, instead enabling async messaging but with the added bonus of being as simple as using REST services.

But yes, I see the point.

MuffinFlavored · on Feb 12, 2023

Is GitHub Actions really similar enough to Temporal/Cadence to be included in the list?

lorendsr · on Feb 15, 2023

You could call them both workflow management systems. One of the differences is Temporal/Cadence uses code to define workflows instead of YAML. It's a large enough difference that Temporal is defining their category as "durable execution":

https://temporal.io/blog/building-reliable-distributed-syste...

stolsvik · on Feb 12, 2023

Hmm. Maybe not. But they sure have much in common: You define a set of things that should be done, triggered by something - either a schedule, an event (oftentimes a repository event, but it doesn't have to), or from another Github action.

robertlagrant · on Feb 13, 2023

I think that definition could encompass everything from CPU interrupts to having a human secretary : - )

jeffbee · on Feb 12, 2023

If you pretend that your message bus has zero producer impedance and costs nothing then this analysis makes great sense. If you have ever operated or paid for this type of scheme in the real world then you will have some doubts.

stolsvik · on Feb 12, 2023

I guess you'd say the same about cloud functions and lambdas, then? To which I agree.

Paying per message would require the message cost to be pretty small. Might want to evaluate setting up a broker yourself if the cost starts getting high.

ay · on Feb 12, 2023

Makes me think of https://grugbrain.dev/:

Microservices

grug wonder why big brain take hardest problem, factoring system correctly, and introduce network call too

seem very confusing to grug

Animats · on Feb 13, 2023

OK, so they are using an "async" system to simulate a thread-oriented system with blocking. You can do that, but why?

The primitive they've created is roughly equivalent to QNX messaging. That's a synchronous send message and wait for reply system, like REST. It just has a lot less overhead, and can be used both on the same machine and remotely. QNX lets you have multiple clients talking to the same server, and multiple servers serving the same connection port.

People keep re-inventing this, but not improving it much. For this to really work well, the message passing has to be integrated with the CPU dispatcher, so that every interprocess call doesn't put you back to the end of the line for the CPU. That's what nobody seems to get right.

vore · on Feb 13, 2023

What happens when a high priority process RPCs to a low priority process to avoid priority inversion? Does the low priority process inherit the caller's priority for the duration of the call?

stolsvik · on Feb 13, 2023

Mats has two levels of priority: "ordinary" and "interactive".

JavaDoc for interactive: <https://mats3.io/javadoc/mats3/0.19/api/io/mats3/MatsInitiat...>

Notice that this flag follows the entire Mats Flow, so that the initiated flow gets a "cut the line"-card for all queues it goes through.

Animats · on Feb 13, 2023

In QNX, yes, In this new thing, I don't know.

That's a key feature of QNX, because it's a hard real time system. Priorities are taken very seriously. Process priority is also RPC priority, so higher priority processes get priority on services, such as file I/O.

stolsvik · on Feb 13, 2023

> they are using an "async" system to simulate a thread-oriented system with blocking. You can do that, but why?

The "simulation" is purely visual, or rather "cognitive load reduction"-wise. It is explained multiple time throughout the pages om https://mats3.io, for example here: https://mats3.io/docs/message-oriented-rpc/, and here: https://mats3.io/using-mats/endpoints-and-initiations/ (with multiple examples, both pure Java and Spring-based definitions, also comparing to an ordinary REST controller).

The clue is to be able to use the simple mental model of sequential steps, with "RPC calls" intertwined. However, this is only on a superficial level, to make the developer's reasoning simple and familiar - what really happens is that your Mats Endpoint is really multiple stages, where each stage is an independent little message processor. To achieve this, Mats implements "Messaging with a Call Stack", where you get a state object which "magically" follows you through the stages, simulating the stack variables you'd have if it actually was a proper method.

It works surprisingly well.

> That's a synchronous send message and wait for reply system, like REST.

This you get if you employ the MatsFuturizer: https://mats3.io/docs/sync-async-bridge/ This is a "tack on"-tool to the otherwise fully async nature of Mats/messaging.

> For this to really work well, the message passing has to be integrated with the CPU dispatcher

It sounds like you are 100% set on speed. This is not really what Mats is after - it is meant as a inter-service communcation system, and IO will be your limiting factor at any rate. Mats sacrifices a bit of speed for developer ergonomics - the idea is that by easily enabling fully async development of ISC in a complex microservice system, you gain back that potential loss from a) actually being able to use fully async processing (!), and b) the inherent speed of messaging (it is at least as fast as HTTP, and you avoid the overhead of HTTP headers etc.

It is mentioned here, "What Mats is not": https://github.com/centiservice/mats3#what-mats-is-not

lowbloodsugar · on Feb 13, 2023

> Service Location Discovery is avoided, as messages only targets the logical queue name, without needing information about which nodes are currently consuming from that queue.

Something is talking to something over a network. They need to get sent somewhere so the “where” must be known. If the “where” is “a load balancer” then that works for REST too. If the “where” is a mesh discovery protocol then that works for REST too. Ultimately you service is going to send a packet. The network card requires a destination IP.

stolsvik · on Feb 13, 2023

In a message broker architecture, it is typically the clients connecting to the broker. Thus, as you correctly point out, they need to know where to connect. But that is the same value for every single client. Using the failover transport of ActiveMQ, we have defined amq1, amq2 and amq3 - so it'll cycle through these until one of them answers.

Once you're in, you only refer to the queue names. Which in Mats effectively are direct references to the endpoint name, typically in a Java-style class.method way, e.g. "OrderService.placeOrder".

rajeshp1986 · on Feb 13, 2023

The author lost me at the first few lines : > In a microservice architecture one needs to communicate between the different services. The golden hammer for such communication is REST-ish services employing JSON over HTTP.

Are people really relying on REST-ish inter-services communications? I have seen very large systems and most of the systems are async/promised based or have an underlying message-queue implementation or pub-sub like Kafka.

tsimionescu · on Feb 13, 2023

Yes, REST or something like gRPC are probably the most common ways to do this. For example, the Kubernetes API server that sits at the center of Kubernetes' communication model is a REST server.

stolsvik · on Feb 13, 2023

Yes, as far as I have come to understand, REST and similar protocols like gRPC are definitely the most common inter-service communications solutions.

Witness e.g. Netflix's stack with Hystrix, and the popular Kubernetes add-on Istio service mesh with its Envoy service proxy. And as the sibling comment points out, Kubernetes' own control layer is REST.

Messaging is an alternative, which I believe is underused due to its completely different mental model which is harder to reason about. Mats tries to make that simple - by "emulating" that you are working with something like REST, by introducing "Messaging with a call stack" and the state object.

More here: https://mats3.io/docs/message-oriented-rpc/, here: https://mats3.io/background/what-is-mats/ and much more here: https://mats3.io/using-mats/endpoints-and-initiations/

edit: Btw: "async/promised based" - this is not the point. It is true that you can code REST with "async/await" or Promise-semantics. However, you have still then bound the process to a piece of memory on a server. If the server goes down, all the processes it held, be it blocking in a thread, or in some Promise object, goes down with it. With Mats, the process "lives in the message". As long as you got it onto the message broker (i.e., the initiation went OK), it will never disappear: It will either go to completion, or it will pop out into a DLQ (Dead Letter Queue), where it can be inspected, you can jump to your logging system to find its processing, and you can - importantly - reissue the message and thus restart the flow from where it stopped, if the underlying problem is now cleared (e.g. a database that was down, or some data that (due to a bug) was in some erroneous state which now is fixed)

I write about that here: https://mats3.io/using-mats/endpoints-and-initiations/, search for "A comment about the meaning of asynchronous".

You can read about the MatsBrokerMonitor, a Mats-specific tool for inspecting the message broker: https://mats3.io/docs/matsbrokermonitor/. It gives more details about Mats flows than e.g. ActiveMQ's standard inspection GUI gives. I should however really work on that text, and throw in some images. You can clone and run the development-server if you want to see it in action: https://github.com/centiservice/matsbrokermonitor/blob/main/...

weatherlight · on Feb 12, 2023

chuckles in erlang

stolsvik · on Feb 13, 2023

:-)

As mentioned a few places, the actor model is actually part of the inspiration behind Mats: https://mats3.io/background/what-is-mats/#attempt-at-a-conde...

drkrab · on Feb 12, 2023

nitwit005 · on Feb 12, 2023

> Messaging naturally provides high availability, scalability, location transparency, prioritization, stage transactionality, fault tolerance, great monitoring, simple error handling, and efficient and flexible resource management.

What is "stage transactionality"? If I do a Google search for it, I just find this page.

stolsvik · on Feb 12, 2023

Hehe, okay. It was meant to mean "Each stage is processed in a transaction". Kinda hard to get down into a list. But my wording evidently didn't make anything clearer!

If you read a few more pages, then it should hopefully become clearer. This page is specifically talking about it: https://mats3.io/using-mats/transactions-and-redelivery/ - but as it is one of the primary points of why I made Mats, it is mentioned multiple places, e.g. here: https://mats3.io/background/what-is-mats/

This is not Mats-specific - it is directly using functionality provided by the message broker, via JMS.

hbrn · on Feb 12, 2023

1. Anything that is connected to user interface should be synchronous by default.

2. You can't predict which parts of your system will be connected to user interface.

3. Here's the worst part: async messaging is viral. A service that depends on async service becomes async too.

You should be very cautions introducing async messaging to your systems. The only parts that should be allowed to be async are the ones that can afford to fail.

I spend good amount of time trying to work around these dumb enterprise patterns when building products on top of async APIs. You are literally forced to build inferior products just because someone thought that async messaging is so great. It's great for everybody, except the final user.

Async processing is not a virtue, it's a necessity for high load/high throughput systems.

The reason SOA failed many years ago is precisely the async message bus.

stolsvik · on Feb 12, 2023

We clearly do not agree.

Wrt. sync processing when using Mats: https://mats3.io/docs/sync-async-bridge/

But my better solution is instead to pull the async-ness all the way out to the client: https://matssocket.io/

Also, I have another take on the SOA failure, mentioned here: https://mats3.io/about/

It was definitely not because of async, at least as I remember it.

SpaghettiX · on Feb 12, 2023

I appreciate some events can be asynchronous for clients, for example: actions taken by other users, or events generated by the system. However, I do think implementation details (using async in the server) should be encapsulated from clients: when users save a new document, it's much easier for the client to receive a useful albeit delayed response, rather than "event submitted", wait for the result on a stream. Of course, other relevant clients may need to hear about that event too. The service architecture should not affect / make-life-harder for clients.

Therefore I think disagree with both parent and grandparent comments. Use each when they make sense, not "synchronous by default" (grandparent comment, though I do think there are good points made), or "asynchronous based on service architecture" (parent comment).

> But my better solution is to pull the async-ness all the way out to the client: https://matssocket.io/

Is that a solution that you use? I took a look at matssocket https://www.npmjs.com/package/matssocket, it currently has 2 weekly downloads. :thinking:.

stolsvik · on Feb 12, 2023

To make a point out of it: This is not event based in the event sourcing way of thinking. It is using messages. You put a message on a queue, someone else picks it up. Mats implements a request/reply paradigm on top ("messaging with a call stack").

In the interactive, synchronous situation, you do not "wait for an event" per se. You wait for a specific reply. When using the MatsFuturizer (https://mats3.io/docs/sync-async-bridge/), it is extremely close to how you would have used a HttpClient or somesuch.

MatsSocket: The Dart/Flutter implementation is used in a production mobile app. For the Norwegian market only, though.

The JS implementation is used in an internal solution.

Would have been really nice with a bit more usage, yes. It is actually pretty nice, IMHO! ;-)

toast0 · on Feb 12, 2023

> Async processing is not a virtue, it's a necessity for high load/high throughput systems.

> 1. Anything that is connected to user interface should be synchronous by default.

If everything UI is synchronous, you prevent users from acheiving high throughput. Sometimes that's fine, but sometimes it's not.

It's simple to wait for a response to a request sent via asynchronous messaging. It's not simple to split a synchronous API into send and receive parts. However, REST is HTTP and there's lots of async HTTP libraries out there.

ako · on Feb 13, 2023

Here’s someone who disagrees, AWS CTO Werner Vogels: https://youtu.be/RfvL_423a-I His entire Re:invent keynote was on topic of async systems, and why you need it.

hbrn · on Feb 13, 2023

So the argument is that synchrony taken to extreme is ridiculous, therefore we need asynchrony? Sorry, but that's a textbook manipulation. Almost like he's trying to sell us something...

No wonder AWS UX is so terrible.

zmmmmm · on Feb 12, 2023

This reminds me more of Apache Camel[0] than other things it's being compared to.

> The process initiator puts a message on a queue, and another processor picks that up (probably on a different service, on a different host, and in different code base) - does some processing, and puts its (intermediate) result on another queue

This is almost exactly the definition of message routing (ie: Camel).

I'm a bit doubtful about the pitch because the solution is presented as enabling you to maintain synchronous style programming while achieving benefits of async processing. This just isn't true, these are fundamental tradeoffs. If you need a synchronous answer back then no amount of queuing, routing, prioritisation, etc etc will save you when the fundamental resource providing that is unavailable, and the ultimate outcome that your synchronous client now hangs indefinitely waiting for a reply message instead of erroring hard and fast is not desirable at all. If you go into this ad hoc, and build in a leaky abstraction that asynchronous things are are actually synchronous and vice versa, before you know it you are going to have unstable behaviour or even worse, deadlocks all over your system and the worst part - the true state of the system is now hidden in which messages are pending in transient message queues everywhere.

What really matters here is to fundamentally design things from the start with patterns that allow you to be very explicit about what needs to be synchronous vs async (building on principles of idempotency, immutability, coherence, to maximise the cases where async is the answer).

The notion of Apache Camel is to make all these decisions a first class elements of your framework and then to extract out the routing layer as a dedicated construct. The fact it generalises beyond message queues (treating literally anything that can provide a piece of data as a message provider) is a bonus.

[0] https://camel.apache.org/

hummus_bae · on Feb 12, 2023

> The ultimate outcome that your synchronous client now hangs indefinitely waiting for a reply message instead of erroring hard and fast is not desirable at all.

Async frameworks don't eliminate the possibility of long running processes, that continue to process long after responding a request - this is still possible with specific libraries/frameworks, they'll only take away the synchronous interface and provide as asychonous one instead.

It is also important to note that error handling will be different between these 2 paradigms and it's important, whatever the most suitable one is, to acknowledge this since it forces us (developers) to handle the potential errors differently depending on the approach we choose.

zmmmmm · on Feb 12, 2023

I think you're stating exactly my point?

The pitch of MATS is that it let's:

> developers code message-based endpoints that themselves may “invoke” other such endpoints, in a manner that closely resembles the familiar synchronous “straight down” linear code style

In other words, they want to encourage you to feel like you are coding a synchronous workflow while actually coding an asynchronous one. You are pointing out that error handling needs to be different b/w these paradigms and you are correct, but that is only the start of it. A framework that papers over the differences is at very high risk of just creating a massive number of leaky abstractions that don't show up in the happy scenario but come back and bite you heavily when things go wrong.

(I'm saying this as a long time user of Camel which models this exact concept heavily and also experiences many of these issues)

stolsvik · on Feb 12, 2023

Hmm. I want to distance this library pretty far from Camel!

Wrt. "papering over": Not really. I make it "feel like" you're coding straight down, sequential, linear, as if you're coding synchronously.

But if you look at the examples, e.g. at the very start of the Walkthrough: https://mats3.io/docs/message-oriented-rpc/, you'll understand that you are actually coding completely message-driven: Each stage is a completely separate little "server", picking up messages from one queue, and most often putting a new message onto another queue.

It is true that the error handling is very different. Don't code errors! You cannot throw an exception back to the caller. You can however make for "error return" style DTOs, but otherwise, if you have an actual error, it'll "pop out" of the Mats Fabric and end up on a DLQ. This is nice! It is not just a WARN or ERROR log-line in some log that no-one will see until way later, if ever: It immediately demands your attention.

I wrote quite a long answer to something similar on a Reddit-thread a month ago: https://www.reddit.com/r/programming/comments/1059jpv/messag...

tacticus · on Feb 12, 2023

> It immediately demands your attention.

it immediately demands the attention of the ops team that you've thrown all problems at while hoping that the magic queue never stops being magic.

while the client is sitting there going "WTF, where's my shit?"

stolsvik · on Feb 13, 2023

Ops team?! We, the developers, are monitoring the DLQs: It is our mistakes, and we must fix them. Operations' only role is to keep the VMs the ActiveMQ is running on active, and the database it uses responsive.

If you used it in a synchronous fashion for a end-user, you are correct: He'll get a timeout - but the message will blink red on the DLQ, giving the devs exact info as to what has happened. As opposed to a 500 or something similarly bad, which probably in the best case was logged somewhere, and hopefully someone is tallying up the error codes every few weeks..

I fail to see how messaging is worse in this case.

tacticus · on Feb 13, 2023

I fail to see how it's better.

stolsvik · on Feb 13, 2023

Sorry about that. If you have any questions that could shine light on things, I am happy to answer.

zmmmmm · on Feb 12, 2023

> Hmm. I want to distance this library pretty far from Camel!

I'm curious what makes you distinguish it heavily from Camel? From everything you say it sounds to me like you are building Camel - or at least the routing part of it :-)

stolsvik · on Feb 13, 2023

While these ideas was brewing, I ventured down into many libraries, Camel was one of them. It did not solve any of my problems.

Mats is messaging with a call stack. One messaging endpoint can invoke another messaging endpoint, and get the reply supplied to its next stage, with the state from the previous stage magically present: https://mats3.io/docs/message-oriented-rpc/#isc-using-mats3 and other pages.

Camel has, AFAIK, nothing of the kind.

zmmmmm · on Feb 13, 2023

It's interesting how we can have such different takes on it. Still everything you say maps naturally onto Camel for me.

Your example using Camel would look something like:

    from('rest:/api/myprocess') // using REST API endpoint as an example starting point
      .inOut()
      .to('activemq:main')
      .to('activemq:mid')
      .to('activemq:leaf')
    .end()

Then you would define the handlers:

   from('activemq:main')
      .transform { e ->
            return new State(...)
      }
      
   from('activemq:mid')
      .transform { e ->
            e.body.number1 = 
      }

All the processing stages are happening asynchronously and results are passed back and forth by Camel. If things error out half way through you get a printout from Camel of the whole workflow including each step, where exactly it failed and what the content of the state was at that point (and it'll throw the lost message onto DLQ if you want to re-do the process).

stolsvik · on Feb 13, 2023

Mats does not explicitly define a set "route" through a bunch of stages. I view Camel more like a Workflow system, which is pointed out elsewhere in the discussion threads here - the workflows are defined external to the actual processing steps.

With Mats, you define Endpoints, which can be generic like a REST endpoint can be generic: "AccountService.getAccountListForCustomer" - which can be used (by invocation) by sevaral other endpoints.

Also, IIRC, the process you show with Camel is now effectively thread bound - or at least JVM bound. If the node running that process goes down, it takes all unfinished, mid-flow, processes with it. The steps are not transactional.

With Mats, the "life" of the process lives in the message. You can literally bring down your entire system from GCP, and rebuild it on Azure, and then when you start it up again, the mid-flow processes will just continue and finish as if nothing happened - as long as you brought along the state from the message broker (in addition to your databases). Viewed "from within the process", the only thing visible is that there was a bit more latency between one step and the next than usual. AFAIR, you cannot get anything like this with Camel. The idea is of course not the ability to move clouds, but that the result is exceptionally robust and stable.

awill88 · on Feb 12, 2023

Clickbait imho because there are less charged ways to express the point of what is essentially a sales pitch comparing a generic synchronous web service.

Why risk the cognitive dissonance by contriving the term “synchronous REST-based systems”

What? Lol.

stolsvik · on Feb 12, 2023

The original title was "Why messaging is much better than REST for inter-microservice communications" - @dang / HN changed it.

I feel the change is wrong in at least two ways: The linked article literally argues that messaging is much better ("superior"), and it is specifically arguing for the use in inter-service communications: "The arguments here are for service-mesh internal IPC/RPC communications. For e.g. REST endpoints facing the user/client, there is a “synchronous-2-Mats-bridge” called the MatsFuturizer"

You can read about that bridge here: https://mats3.io/docs/sync-async-bridge/

volf_ · on Feb 14, 2023

I made a very similar project in Rust that seems to mimic this idea: https://github.com/volfco/boxcar

The core idea I had was to decouple the connection from the execution of the RPC. Mats3 looks to be doing a lot more than what I've done so far, but it's nice to see similar ideas out there to take inspiration from.

lp4vn · on Feb 12, 2023

I think this article is kind of misleading.

You use messaging for asynchronous communication and REST for synchronous communication. The article makes me believe that using REST for synchronous communication is a kind of deprecated alternative in front of message passing.

stolsvik · on Feb 13, 2023

Not sure when you saw the HN post, but the original title was "Why messaging is much better than REST for inter-microservice communications" - @dang / HN changed it.

The article specifically points out that this is for inter-service communication. However, you can also use the same system for synchronous communication out the your end user clients, then using the MatsFuturizer: https://mats3.io/docs/sync-async-bridge/

mangatmodi · on Feb 13, 2023

Not sure about the claim that async has better Transactionality then REST.

>Transactionality: Each endpoint has either processed a message, done its work (possibly including changing something in a database), and sent a message, or none of it.

stolsvik · on Feb 13, 2023

The big point with messaging is that you have rollback, and retries. Mats leverages this.

If Stage N in the total process has picked up a message, starts to process it, and then something (temporarily) fails (or the node crashes), then it will roll back whatever DB operations you have done up till then, and roll back that it has picked up the message.

The MQ will now reissue the same message, and it will be picked up again. This time, things work out, a new message is produced, and the entire processing of this stage is committed.

So, either you have received a message, done your DB-stuff, and sent an outgoing message, or you have done none of that.

I do not see you easily do that with REST. Or at least you will have to code quite a bit to get such nice semantics. With messaging and Mats, you get it entirely for free.

To be fair, it is not entirely true. There are two separate transactions going here, and it can fail in an annoying way. I write about this here: https://mats3.io/using-mats/transactions-and-redelivery/

lnenad · on Feb 13, 2023

But rollback and retries are specific to the systems executing the commands, not the interface that is invoking them. If you have an RPC based transaction (command1, command2, command3) that fills a list with rollback commands and in case of a failure run through the list of rollback commands that does the same thing as putting them in a queue. If you don't have rollbacks in the system (ie stuff was written in the db during the first command) the queue isn't gonna help with that.

stolsvik · on Feb 13, 2023

I might not have gotten across clearly wrt. how this works.

It is each stage that is transactional. If the stage processing fails (or the node crashes), both the DB transactiona, and the messaging transaction, is rolled back.

It is then retried. ActiveMQ has a default of 1 delivery, and 6 redelivery attempts. If those 7 fails, the message is assumed "poison", and is put on the DLQ.

But this means that a Mats Flow is a series of transactions, each is individually handled. As you probably allude to, you cannot roll back the entire flow - it is just the particular step that is rolled back. Thus, if the message ends up on the DLQ, you have a mid-way process, where the steps in front are already done and committed, while this stage, and any downstream, are not yet done.

The message is however on the DLQ. If the problem was e.g. an temporary database failure, which now is resolved, you can now just reissue the DLQ (move it from the DLQ back to its queue), and the process will continue as if nothing had happened.

ngrilly · on Feb 12, 2023

How is it different from NATS.io that solves most of the problems listed? (except the transactional aspect but I’m not convinced it’s a good thing to have the same tool do everything)

stolsvik · on Feb 12, 2023

I see references to NATS multiple times, but I fail to see how it solves what Mats aims to solve?

Mats could be implemented on top of NATS, i.e. use NATS as a backend, instead of JMS. (We use ActiveMQ as the broker)

adamckay · on Feb 12, 2023

The articles notes about async messaging architectures being superior to REST-based systems seems rather disingenuous, in my opinion, as it's seemingly only considering the most basic REST API deployed on a single node as the alternative.

For example:

> High Availability: For each queue, you can have listeners on several service instances on different physical servers, so that if one service instance or one server goes down, the others are still handling messages.

This is negated in a REST-based system with the use of an API gateway / simple load balancer and multiple upstream nodes.

> Location Transparency [Elastic systems need to be adaptive and continuously react to changes in demand, they need to gracefully and efficiently increase and decrease scale.]: Service Location Discovery is avoided, as messages only targets the logical queue name, without needing information about which nodes are currently consuming from that queue.

Fair enough, service discovery is another challenge, but it's not hugely complex with modern API gateways and arguably no more complex with running and maintaining a message queue with associated workers. You've also got a risk in a distributed messaging system used by multiple teams that one service publishes messages into a queue that has been deprecated and has no consumers listening anymore.

> Scalability / Elasticity: It is easy to increase the number of nodes (or listeners per node) for a queue, thereby increasing throughput, without any clients needing reconfiguration. This can be done runtime, thus you get elasticity where the cluster grows or shrinks based on the load, e.g. by checking the size of queues.

Same as HA, solved with a load balancer.

> Transactionality: Each endpoint has either processed a message, done its work (possibly including changing something in a database), and sent a message, or none of it.

> Resiliency / Fault Tolerance: If a node goes down mid-way in processing, the transactional aspect kicks in and rolls back the processing, and another node picks up. Due to the automatic retry-mechanism you get in a message based system, you also get fault tolerance: If you get a temporary failure (database is restarted, network is reconfigured), or you get a transient error (e.g. a concurrency situation in the database), both the database change and the message reception is rolled back, and the message broker will retry the message.

These seem to be arguing the same point, and perhaps this is solved in the Mats library but as a general advantage of async message queues over synchronous REST calls, the message broker retrying the message or messages being lost isn't a given - they're difficult to get entirely right in both architectures.

> Monitoring: All messages pass by the Message Broker, and can be logged and recorded, and made statistics on, to whatever degree one wants.

>Debugging: The messages between different parts typically share a common format (e.g. strings and JSON), and can be inspected centrally on the Message Broker.

Centralising via an API gateway can also offer these.

stolsvik · on Feb 12, 2023

Well. My point is that messaging inherently have all these features, without needing any other tooling.

The combination of transactionality and retrying is hard to achieve with REST, don't you think? It is actually pretty mesmerizing how our system handles screwups like a database going down, or some nodes crashing, or pretty much any failure: The flows might stop up for a few moments, but once things are back in place, all the flows just complete as if nothing happened. I shudder when thinking of how we would have handled such failures if we used sync processing.

The one big deal is the concept of "state is on the wire": The process/flow "lives in the message" - not as a transient memory-bound concept on the stack of a thread.

tacticus · on Feb 12, 2023

> inherently have all these features, without needing any other tooling.

Providing your queuing service magically scales. and you can just scale that infinitely

If your queue is the bottle neck (yes it happens) how do you know there are more nodes getting added to it? How do you rebalance the topic over multiple new nodes in the queue?

> I shudder when thinking of how we would have handled such failures if we used sync processing.

easily. oh look i got an error and retried a few seconds later and it went to a new isolated backend running separately somewhere else or after it started again.

stolsvik · on Feb 13, 2023

> If your queue is the bottle neck (yes it happens) how do you know there are more nodes getting added to it? How do you rebalance the topic over multiple new nodes in the queue?

In a message queuing system, there is no rebalancing - you just add nodes, which then also get messages, typically in a round-robin fashion. This is not Kafka.

But of course, you can max out your message broker. As you can with a network switch for that matter.

> oh look i got an error and retried a few seconds later

So now you need a retrying mechanism. Which is provided as a basic feature of a messaging system. And with the transactionality, you do not get the problem of wondering whether the first attempt went halfway through, or not.

tacticus · on Feb 13, 2023

> In a message queuing system, there is no rebalancing - you just add nodes, which then also get messages, typically in a round-robin fashion.

And hope that the DNS\config propagates to the calling client and that the library can reliably add and remove nodes.

> Retrying

Wanna bet it's a built in feature in your messaging system?

Cause the whole publish op can fail at which point you have to retry sending it. or it published but you lost the ack (seen with AQ more than once) so now you send that message again and hope you remembered that whole duplicate message handling

stolsvik · on Feb 13, 2023

> And hope that the DNS\config propagates to the calling client and that the library can reliably add and remove nodes.

This is not how it works. A messaging broker client connects to the message broker. The client then creates receivers for one or multiple queues. If the broker has multiple clients receiving from the same queue, it uses a round-robin dispatch to the clients.

I fail to see how DNS or config, or "reliably add and remove" factors in here?

> Wanna bet it's a built in feature in your messaging system?

Yes?

This is not using pub/sub, but queuing. (There is an option of using publish/subscribe, but that is only meant for special cases like updating a GUI, or "broadcasting" an event like "invalidate cache for user X")

Message Queuing can be transactional. Mats leverages this. I have written about it multiple places both on this HN discussion, and on the webpage.

SNosTrAnDbLe · on Feb 13, 2023

How is this different from a standard enterprise service bus style architecture using ActiveMQ or Kafka ?

stolsvik · on Feb 13, 2023

I believe this is detailed a few places on that website, e.g. here: https://mats3.io/docs/message-oriented-rpc/

“Messaging with a call stack”, “invokable, message-based async endpoints”

eBombzor · on Feb 12, 2023

Why is this better over Kafka?

stolsvik · on Feb 12, 2023

As far as I understand, Kafka is positioning itself to be the leading Event Sourcing solution.

I view event sourcing to be fundamentally different from message passing. For a long time I tried to love event sourcing, but I see way to many problems with it. The primary problem I see is that you then end up with a massive source of events, which any service can subscribe to as they see fit. How is this different from having one gigantic spaghetti database? Also, event migrations over time.

RPC and messaging feels to me to be much clearer separated: I own the Accounts, and you own the Orders. We explicitly communicate when we need to.

I see benefits on both sides, but have firmly landed on not event sourcing.

revskill · on Feb 12, 2023

In our production apps, all network issues is resolved by simple rate-limiter.

tatersolid · on Feb 14, 2023

That rate limiter doesn’t solve anything for your users.

eikenberry · on Feb 12, 2023

Reminds me of the old Protocol vs. API debate.

http://wiki.c2.com/?ApiVsProtocol=

stolsvik · on Feb 12, 2023

Isn't that more of the difference between JMS and AMQP?

captainbeef · on Feb 13, 2023

How does Mats compare with dapr?

stolsvik · on Feb 13, 2023

I am not entirely sure. I had bunched dapr together with Istio, as a "service-mesh" layer utilizing side-car pods to mesh the different pieces of your service.

Just browsed it again, and there seems to be a bit more tooling in place. But I fail to see something like Mats in its feature list, though.

Mats is about endpoints which can invoke other service's endpoints. It is "messaging with a call stack", simplifying how you can code messaging-based inter-service communications. Pub/sub is not the same, and (synchronous) REST/gRPC is what Mats explicitly is an alternative to.

eterps · on Feb 12, 2023

"better than RPC" would be a more accurate title.

stolsvik · on Feb 12, 2023

Well, I actually call Mats "Message-Oriented Async RPC".