Durable Execution maintains program state in a journal, and this can lead to complications when you update service code in a way that makes the existing journal no longer match it. Limiting how long handlers run makes this immutability problem a lot easier to deal with. But wait - doesn't that lose one of the most interesting properties of durable execution: the ability to create business processes that operate over long durations ("code that sleeps for a month")? The insight is that you can break up a long-running handler into multiple short-running handlers if the handler can remember state. That way it knows where to continue from on the next invocation. With this approach, you only need to make sure that your handler state stays compatible across handler upgrades, which is usually a much simpler problem to solve. Restate provides consistent handler state, so you can design your handlers whichever way you prefer.
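To make the "many short handlers plus remembered state" idea concrete, here is a minimal sketch in plain Python. It does not use Restate's SDK: `STORE`, `ProcessState`, and `handle` are hypothetical names, and the in-memory dict only stands in for whatever consistent per-key state the runtime provides. Each call does one short step and records where the next call should pick up.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for durable per-key handler state (not Restate's API).
@dataclass
class ProcessState:
    step: str = "send_welcome"
    log: list = field(default_factory=list)

STORE = {}

def handle(key: str) -> str:
    """Run exactly one short step, remember where to continue, return the next step."""
    state = STORE.setdefault(key, ProcessState())
    if state.step == "send_welcome":
        state.log.append("welcome sent")
        # In a real system a delayed call or timer would re-invoke the handler later.
        state.step = "send_reminder"
    elif state.step == "send_reminder":
        state.log.append("reminder sent")
        state.step = "done"
    # Once the step is "done", further invocations are no-ops.
    return state.step
```

No single invocation here lives longer than one step, so a code upgrade between steps only has to understand the stored `ProcessState`, not a half-replayed journal.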
Yes, Restate supports registering different versions of your handler code. Restate requires that the code for a given version remains available for as long as there is an in-flight invocation of that version. Breaking your handler/long-running process up into multiple steps can shorten this time tremendously. Then you only need to make sure that the handler/process state is forward-compatible. For example, using Protobuf for your handler/process state makes this fairly straightforward.
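As a rough illustration of forward-compatible state, the sketch below uses JSON and a dataclass rather than Protobuf, purely to stay self-contained. `OrderStateV2`, its fields, and `decode` are made-up names for this example. The point is that the newer version only adds an optional field with a default, so state written by the older version still decodes.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class OrderStateV2:
    order_id: str
    attempts: int = 0
    coupon: Optional[str] = None  # added in v2; optional, so v1 state still loads

def decode(raw: str) -> OrderStateV2:
    """Decode stored state, ignoring unknown keys and defaulting missing ones."""
    data = json.loads(raw)
    known = {f.name for f in fields(OrderStateV2)}
    return OrderStateV2(**{k: v for k, v in data.items() if k in known})
```

Ignoring unknown keys also lets older code read state written by a newer version, which is roughly the behaviour Protobuf gives you for unknown fields.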
The idea of a handler sleeping for a month is neat. But I do wonder how you test new versions of your code, because it potentially needs to handle months of different historical “paused” states. That’s not something you just write a unit test for, especially as that unit test would only be valid until those states were consumed.
I think you do it just like you handle compatibility between services – you never remove parameters; you only ever add new optional ones if you have to. This way a message from the past will be compatible with a future handler, same as if you have a caller that depends on you which is using an outdated client/API definition.
But you are right; it's very hard to reason about testing such systems, since you may have accumulated state which causes your handler logic to behave differently. The problem exists in service architectures in general, though; it's just very hard to miss with intentionally delayed processing.
How does Restate handle the scalability challenges, especially when dealing with the state management of long-running processes, without compromising performance?
It's a hard problem; we are basically building a distributed database from scratch to solve it. You have to pick a position on the DB-design trade-off curve where you can store state with low latency and consistently, and where the state is very ephemeral (long-running handlers aside, it should generally only live for a few minutes). But more than anything, the 'trick' is storing service state, execution state (i.e. the journal), and communication between services all in the same distributed log. That means you can commit all three atomically.
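The single-log idea can be pictured with a toy model. This is only a sketch under heavy assumptions, not how Restate is implemented: `Record`, `Log`, and `replay` are hypothetical, and the "log" is an in-memory list. Each record bundles a state update, a journal entry, and outgoing messages, so one append commits all three together.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Record:
    state_updates: dict   # changes to service state
    journal_entry: str    # execution-state (journal) entry
    outbox: tuple = ()    # messages destined for other services

@dataclass
class Log:
    records: list = field(default_factory=list)

    def append(self, record: Record) -> int:
        # One append is the commit point: all three parts become visible together.
        self.records.append(record)
        return len(self.records) - 1

def replay(log: Log):
    """Rebuild state, journal, and pending messages purely from the log."""
    state, journal, messages = {}, [], []
    for r in log.records:
        state.update(r.state_updates)
        journal.append(r.journal_entry)
        messages.extend(r.outbox)
    return state, journal, messages
```

Because every view is derived from the same ordered records, there is no window in which the state has been updated but the journal entry or the outgoing message is missing.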