
A lot of the GenServer information floating around explains code_change/3, no? That's commonly what you want: a way to handle state propagation when process code is updated in a running system.
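
For anyone who hasn't seen it, here's a minimal sketch of the shape, with a made-up module and a map-based state where the new version adds a max field:

    -module(counter_srv).
    -behaviour(gen_server).
    -export([start_link/0, init/1, handle_call/3, handle_cast/2, code_change/3]).

    start_link() ->
        gen_server:start_link({local, counter_srv}, ?MODULE, [], []).

    init([]) ->
        {ok, #{count => 0, max => 100}}.

    handle_call(get, _From, State = #{count := N}) ->
        {reply, N, State}.

    handle_cast(bump, State = #{count := N}) ->
        {noreply, State#{count := N + 1}}.

    %% Run while the process is suspended for the upgrade. Old state maps
    %% lack the max key, so it gets filled in on the way up.
    code_change(_OldVsn, State, _Extra) when not is_map_key(max, State) ->
        {ok, State#{max => 100}};
    code_change(_OldVsn, State, _Extra) ->
        {ok, State}.

The callback only runs when something actually triggers it (the release handler, or a manual sys:change_code call); merely loading new code does not.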

Most people are probably running web services or something similar, and might as well shift machines in and out of a cluster, or wait for old processes to wind down on their own because the new code is backwards compatible with the one already running, and so on.

It can also be relatively hard to do without causing damage to the system. Those who need it and can manage it probably don't need it marketed to them.



Someone posted a reply and then deleted it while I was writing a response, and it irks me that the effort might have been wasted, so here's the gist of it:

"Is it just that people are more comfortable with blue-green deploys, or are blue-green deploys actually better?"

It depends. If you can do a blue-green shift where you gradually add 'fresh' servers/VMs/processes and drain the old ones, that's likely to be the most convenient and robust option in many organisations. On the other hand, if you rely on long-running processes in a way where changing their PIDs breaks the system, then you pretty much need to update them with this kind of hot patching.

"Does Erlang offer any features to minimize damage here?"

The BEAM allows a lot of things in this area, on pretty much every level of abstraction. If you know what you're doing and you've designed your system to fit the provided mechanisms, the platform gives you a lot of support for hot patching without sacrificing robustness and uptime. But it adds another layer of possible bugs and risks: it's not just your usual network and application logic that might cause a failure, your handling of updates might itself be a source of catastrophe.
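
At the lowest level you can drive it by hand from a shell with the sys module. A sketch, assuming a registered gen_server named my_server whose callback module is also my_server:

    %% Stop the process from picking up new requests, load the new object
    %% code, run its code_change/3, then let it continue where it left off.
    ok = sys:suspend(my_server),
    {module, my_server} = code:load_file(my_server),
    ok = sys:change_code(my_server, my_server, undefined, []),
    ok = sys:resume(my_server).

The higher-level version of the same thing is an OTP release upgrade, where appup/relup instructions let release_handler do this suspend/load/change_code/resume dance for you across the node.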

In practice you need to think long and hard about how to deploy, and test thoroughly under very production-like conditions. It helps that you can know for sure what production looks like at any given time: the BEAM VM can tell you exactly what processes it runs, what the application and supervisor trees look like, hardware resource consumption and so on. You can use this information to stage fairly realistic tests with regard to load and whatnot, so if your update for example affects performance and unexpected bottlenecks show up, you might catch them before they reach your users.
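
A taste of that introspection from an attached shell, using only standard calls (my_app_sup is a hypothetical registered supervisor):

    erlang:system_info(process_count).      %% how many processes are alive
    erlang:memory().                        %% memory usage broken down by category
    application:which_applications().       %% what this node is actually running
    supervisor:which_children(my_app_sup).  %% children of one supervisor
    process_info(whereis(my_app_sup), message_queue_len).
    observer:start().                       %% or just watch it all in the GUI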

And as anyone who has updated a profitable, non-trivial production system directly can tell you, like a lot of PHP devs of ye olden times, it takes a rather strong stomach even when it works out fine. When it doesn't, you get scars that might never fade.


This is also a reply to that deleted comment, because I had to type it all out, and also got to go outside and have my European two-hour lunch break while doing it.

If you have any kind of state in a gen_server and that state or the assumptions about it have changed, you need to write that code_change thingy that migrates the state both ways between two specific versions. If by some chance this function is bugged, then the process is killed (which is okay), so you need to nail down the supervision tree to make things restartable without getting into restart loops. Remember writing database migrations for Django or whatever the ORM of the day was? Now do that, but for the in-memory structures you have.
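
The restart-loop part is mostly about picking sane restart intensity in the supervisor, so that a persistently bugged child escalates instead of flapping forever. A sketch with made-up names:

    -module(my_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, my_sup}, ?MODULE, []).

    init([]) ->
        %% More than 3 restarts within 10 seconds and the supervisor gives up
        %% and crashes itself, pushing the problem up the tree.
        SupFlags = #{strategy => one_for_one, intensity => 3, period => 10},
        Child = #{id => counter_srv,
                  start => {counter_srv, start_link, []},
                  restart => permanent,
                  shutdown => 5000,
                  type => worker},
        {ok, {SupFlags, [Child]}}.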

Now, while a function is running it can't be updated, of course, so you need gen_server to call back into your module from the outside. If you like to store function references instead of process references in your state, you need to figure out which version you will actually be calling.
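
Concretely: an external fun is looked up at call time, so it follows upgrades, while an anonymous fun is tied to the exact version of the module that created it and stops working once that version is purged. An illustration with a hypothetical formatter module:

    Doc = "hello",
    %% External fun: re-resolved on every call, keeps working across
    %% upgrades as long as formatter still exports render/1.
    Render = fun formatter:render/1,
    %% Local fun: pinned to the defining module's current version; stash
    %% this in gen_server state across an upgrade and it refers to code
    %% that may no longer exist after the old version is purged.
    Wrap = fun(X) -> {wrapped, X} end,
    Render(Doc),
    Wrap(Doc).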

If you change the arity of your record (add or remove fields), then old record values no longer match your patterns.
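
A sketch of that case, with a hypothetical record that grows a field between versions; since records are just tuples, the old state shows up in code_change/3 as a tuple that is one element short:

    %% Old version:  -record(state, {count, name}).
    %% New version:
    -record(state, {count, name, max}).

    %% Upgrade: the old state is a 3-tuple and won't match #state{} anymore,
    %% so rebuild it by hand. Downgrade ({down, Vsn}): strip the field again.
    code_change({down, _Vsn}, #state{count = Count, name = Name}, _Extra) ->
        {ok, {state, Count, Name}};
    code_change(_OldVsn, {state, Count, Name}, _Extra) ->
        {ok, #state{count = Count, name = Name, max = 100}};
    code_change(_OldVsn, State, _Extra) ->
        {ok, State}.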

Since updates are not atomic, you will have two versions of the code running at the same time, potentially sending messages that the old/new code does not expect, and neither old nor new code should bug out. And if they do bug out, you had better have been smart enough to figure out how to recover, and to have actually tested that.
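
One cheap mitigation is catch-all clauses in the callbacks, so a message shape introduced (or retired) by the other version gets logged and dropped instead of killing the process; it trades a loud crash for availability, so it's a judgement call. Continuing the hypothetical counter server:

    handle_cast(bump, State = #{count := N}) ->
        {noreply, State#{count := N + 1}};
    handle_cast(Unknown, State) ->
        %% A cast this version doesn't know about: log it and keep running.
        logger:warning("ignoring unexpected cast: ~p", [Unknown]),
        {noreply, State}.

    handle_info(Unknown, State) ->
        logger:warning("ignoring unexpected message: ~p", [Unknown]),
        {noreply, State}.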

Then there is this thing: the VM only keeps two versions of a module loaded, so if something from version V-2 is somehow still running after the update to V-1 and you start updating to the latest V, the V-2 code has to be purged first and whatever is still running it gets killed.
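
You can at least check for stragglers before purging; a sketch with a hypothetical my_mod:

    %% Processes still executing (or holding funs from) the old version of
    %% my_mod. code:purge/1 would kill them; code:soft_purge/1 refuses and
    %% returns false instead, so you can decide what to do.
    Stragglers = [P || P <- erlang:processes(),
                       erlang:check_process_code(P, my_mod)],
    case Stragglers of
        [] -> code:purge(my_mod);
        _  -> {still_running_old_code, Stragglers}
    end.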

You can deal with all that of course, and Erlang gives you tools and recipes to make it work. Sometimes you have to make it work, because restarting and losing state is not an option. Also, it's probably fun to deal with complex things.

Or you could just do the stupid thing that is good enough and let it crash and restart instead of figuring out ten different things that could go wrong. Or take a 15-minute maintenance window while your users are all asleep (yes, not everybody is doing critical infra that runs 24/7 like a Discord group full of game memes). Or just do blue-green and sidestep it all completely.



