More

mononcqc · 2024-09-09T15:22:13 1725895333

Two years after writing A Pipeline Made of Airbags, I ended up prototyping a minimal way to do hot code loading from kubernetes instances by using generic images and using a sidecar to load pre-built software releases from a manifest in a way that worked both for cold restarts and for hot code loading: https://ferd.ca/my-favorite-erlang-container.html

It's more or less as close to a middle-ground as I could imagine at the time.

mononcqc · on Jan 27, 2023

I'm not quite sure how Erlang's world is totalizing. It has ways to ship things in a very integrated manner, but I have shipped and operated Erlang software that was containerized the same as everything else, in the same K8s cluster as the rest, with the same controls as everything else, with similar feature flags and telemetry as the rest, using the same Kafka streams with the same gRPC (or Thrift or Avro) messages as the rest, invisibly different from other applications in the cluster to the operator in how they were run aside from generating stack traces that look different when part of it crashes.

That it _also_ ships with other ways of doing things in no way constrains or limits your decisions, and most modern Erlang (or Elixir) applications I have maintained ran the same way.

You still get message passing (to internal processes), supervision (with shared-nothing and/or immutability mechanisms that are essential to useful supervision and fault isolation), the ability to restart within the host, but also from systemd or whatever else.

None of these mechanisms are mutually exclusive so long as you build your application from the modern world rather than grabbing a book from 10-15 years ago explaining how to do things 10-15 years ago.

And you don't _need_ any of what Erlang provides, the same way you don't _need_ containers (or k8s), the same way you don't _need_ OpenTelemetry, the same way you don't _need_ an absolutely powerful type system (as Go will demonstrate). But they are nice, and they are useful, and they can be a bad fit to some problems as well.

Live deploys are one example of this. Most people never actually used the feature. Those who need it found ways (and I wrote one that fits in somewhat nicely with modern kubernetes deployments in https://ferd.ca/my-favorite-erlang-container.html) but in no way has anyone been forced to do it. In fact, the most common pattern is people wanting to eventually use that mechanism and finding out they had not structured their app properly to do it and needing to give it a facelift. Because it was never necessary nor totalizing.

Erlang isn't the only solution anymore, that's true, and it's one of the things that makes its adoption less of an obvious thing in many corners of the industry. But none of the new solutions in the 2023 reality are also mutually exclusive to Erlang. They're all available to Erlang as well, and to Elixir.

And while the type system is underpowered (and there are ongoing area of research there -- I think at least 3-4 competing type systems are being developed and experimented with right now), that the syntax remains what it is, I still strongly believe that what people copied from Erlang were the easy bits that provide the less benefit.

There is still nothing to this day, whether in Rust or Go or Java or Python or whatever, that lets you decompose and structure a system for its components to have the type of isolation they have, a clarity of dependency in blast radius and faults, nor the ability to introspect things at runtime interactively in production that Erlang (and by extension, languages like Elixir or Gleam) provide.

I've used them, I worked in them, and it doesn't compare on that front. Regardless of if Erlang is worth deploying your software in production for, the approach it has becomes as illuminating as the stacks that try and push concepts such as lack of side-effects and purity and what they let you transform in how you think about problems and their solutions.

That part hasn't been copied, and it's still relevant to this day in structuring robust systems.

mononcqc · on Sept 30, 2022

I tried it with Erlang. Their "Sample Code 2" does not generate working code (lowercase variable assignments), and their "Sample Code 3" does not even generate valid syntax, flat out leaving bits of typescript untranslated.

mononcqc · on May 18, 2022

The Erlang 'maybe' expression expands on what 'with' allows in Elixir, mostly because the 'with' construct allows a list of conditional patterns and then a general 'do' block, whereas the Erlang 'maybe' allows mixed types of expressions that can either be conditional patterns or any normal expression weaved in together at the top level.

It is therefore a bit more general than Elixir's 'with', and it would be interesting to see if the improvement could feed back into Elixir as well!

The initial inspiration for the 'maybe' expression was the monadic approach (Ok(T) | Error(T)) return types seen in Haskell and Rust, and the first EEP was closer to these by trying to mandate the usage of 'ok | {ok, T}' matches with implicit unwrapping.

For pragmatic reasons, we then changed the design to be closer to a general pattern matching, which forced the usage of 'else' clauses for safety reasons (which the EEP describes), and led us closer to Elixir's design, which I felt was inherently more risky in the first drafts (and therefore I now feel the Erlang design is riskier as well, albeit while being more idiomatic).

So while I did get inspiration from Elixir, and particularly its usage of the 'else' clause for safety reasons, it would possibly be reductionist to say that "the good ideas were stolen from Elixir." The good ideas were stolen from Elixir, but also from Rust, Haskell, OCaml, and various custom libraries, which have done a lot of interesting work in value-based error handling that shouldn't be papered over.

I still think these type-based approaches represent a significantly positive inspiration that we could ideally move closer to, if it were possible to magically transform existing code to match the stricter, cleaner, more composable patterns that they offer.

In the end I'm hoping the 'maybe' expression still provides significantly nicer experiences in setting up business logic conditions in everyday code for Erlang user, and it is of course impossible to deny that I got some of the form and design caveats from the work done in the Elixir language already :)

Also as a last caveat: I am not a member of the Erlang/OTP team. The design however was completed and refined with their participation (and they drove the final implementation whereas I did the proof of concept with Peer Stritzinger and wrote the initial EEP), but the stance expressed in my post here is mine and not the one of folks at Ericsson.

_randyr · on May 18, 2022

> the 'with' construct allows a list of conditional patterns and then a general 'do' block, whereas the Erlang 'maybe' allows mixed types of expressions that can either be conditional patterns or any normal expression weaved in together at the top level.

This seems slightly incorrect to me. You can write expressions in Elixir's with macro too, by simply swapping the arrow for an equals sign. For example, this is perfectly valid Elixir code:

    with {:ok, x} <- {:ok, "Example"},
         IO.puts(x),
         len = String.length(x) do
      IO.puts(len)
    end

Did you mean something else?

mononcqc · on May 18, 2022

See https://news.ycombinator.com/item?id=31425298 for a response, since this is a duplicate. TL:DR; I had never seen it and had no idea it was possible because I don't recall seeing any documentation or post ever mentioning it! Ignorance on my part.

OkayPhysicist · on May 18, 2022

The with statement in Elixir already allows for abitrary expressions between the with and the do. I'm not sure what I'm missing here.

mononcqc · on May 18, 2022

You're right. After all these years (and even writing a book that had Elixir snippets in it) I had never seen a single example showing it was possible and did not know it could do it.

Well there you go, I guess the pattern is equivalent but incidental.

mononcqc · on April 16, 2022

To illustrate the point, let's imagine that 5 years ago, a driver in the linux kernel was written in such a way that interplay with a new disk drive could corrupt data when power failed. The disk on its own (and with other OSes) is fine, and the linux kernel in all other cases are fine.

An open-source database is being used and operated as a service by a vendor, which a SaaS company relies on to provide a feature that your organization uses to manage data on behalf of users.

We now have a chain that includes: users <- customer organisation <- SaaS vendor <- DB as a service vendor <- OSS DB maintainers <- Linux maintainers <- Driver writers <- Hardware vendors.

There is suddenly a power outage at the DB as a service vendor (because of an unmaintained powerline falling over) and their UPCs appear not to be functional for yet unknown reason (cost cutting or supply chain issues during covid time may receive some blame). Your users lose their data regardless.

What is the bug? Who is at fault? Is it the engineer? The team who wrote the code? The QA folks? The organization that hired them? Who should fix the issue? Who should be on charge with repairing data corruption? Whose backups should be trusted most? The least? Have you been lenient in your usage of a SaaS vendor? Has the SaaS vendor been lenient in the services they use? Which actors can be considered liable from a legal standpoint? Which actors can be considered liable from an ethical or moral standpoint? Are your customers the one who made a bad decision contracting you? Can there be more than one party responsible? Do any of these answers changed based on whether the power loss is caused by an act of god or bad maintenance? Based on which jurisdiction you're in? How do you define honest mistakes? Negligence? A bug is a bug because the software did not meet the expectations that were set. Were the expectations reasonable? Who should have managed them?

Events happen. The meaning we attach to them is of course based on expectations and standards and the environment and context, but the way we build our explanation, the ways we attach blame and accountability varies. You can sometimes decide to assign accountability to individuals, sometimes to systems, sometimes both. Sometimes only some or sometimes neither.

So sure, you can point at the actual technical lines of code and say "these aren't doing what they should", but if you do this in a vacuum without also wondering who decided what these lines should be doing and what pressures were at play when they were written, are you necessarily learning a lot about how events unfold and how they might unfold in the future?

A systemic perspective will yield different reactions than one based on personal engineer responsibility, which will be different from one that looks at it from an insurer's point of view, which will be different from one which looks at it from an education point of view, etc. So the lens you take to look at the events surrounding the error and the interpretation you make of it are absolutely crucial to the corrections and learnings that follow.

blueflow · on April 16, 2022

Nothing you wrote is about errors in itself, you exclusively talk about the social interpretations/construction of it. These are things that every adult needs to be able to improvise and express on-the-fly, nothing 'static' that could be fruitfully discussed independently of specific occurrences.

Late edit: So, when interpreted in my context, its utter crap! Your article is shit! You explicitly invited your readers to disregard cross-checking against your own intentions[1], saying they "don't matter". Its basically a free card for me to project my prejudices onto you, without you having anything to conter it. You are a fragile individual who does not think things through.

[1] https://ferd.ca/inclusiveness-in-language-for-outsiders-look..., Section "Death of the Author"

mononcqc · on April 16, 2022

So this disagreement is an interesting example of error being a construction in itself: I will posit that you are wrong or not interpreting my point of view properly, and you will say that I am wrong or possibly not explaining things properly.

There is an objectively quantifiable disagreement. But its nature (and even whether it is desirable or not) are possibly camped in subjective terms. Of course you could argue that I am objectively wrong — though trying to prove that with my own writings is risky since we’ve established I’m not a trustworthy source — but that in itself does not resolve the overall disagreement from existing.

This sort of situation can also happen in software where an ambiguous specification yields two distinct compliant implementations that nevertheless do not work together.

c3534l · on April 16, 2022

This doesn't seem mysterious or confusing to me. What happened wasn't the launch of a new random power outage feature - it was a problem that needs to be fixed.

mononcqc · on April 16, 2022

Fixing the bug does not necessarily fix anything else it caused though, and none of the questions in my post are answered by “fix it”.

Here’s another one: should the incident actually have an impact on anyone of the companies employees yearly reviews? Either positive or negative? Why?

c3534l · on April 16, 2022

It seems like you're conflating blame with bugs.

mononcqc · on April 16, 2022

Well if all you have is bugs but you never look into how they get there and how they happen, you’re not going to improve your process or tools much.

Also blame and accountability and responsibility are all subtly different.

nvr219 · on April 16, 2022

Yours os is a pos

mononcqc · on Aug 9, 2021

Because that 60% from coal is more or less shared across everyone, but flights are, on the global scale, a luxury item. In many cases, taking a transatlantic flight emits more CO2 equivalent than your average person commuting for a year.

So even though aviation is a small part of it all, it is one of the individual actions able to have outsized impact, usually for leisure or at least often for non-essential reasons.

api · on Aug 9, 2021

That's quite false.

I live in Southwestern Ohio and get probably 75% of my power from coal. My pot of coffee emits far more CO2 than someone in Portland, who gets 0% from coal and most of their power from hydro. It's the same pot of coffee. Our cost of energy is similar and our standard of living is similar.

Where you get your power greatly affects how much CO2 you emit doing the exact same things. All energy is not created equal.

There are two huge barriers to doing anything about this problem. The first is denialists and those with vested interests in fossil fuels promoting denialism. The second is well-intentioned people who understand that there is a problem over-complicating the issue, promoting misunderstandings, or promoting the idea that this can't be solved without massive decreases in standard of living.

The latter, which I term "abstinence based environmentalism" by analogy with abstinence-based sex ed, will work about as well as politely telling teenagers not to have sex. If you tell people they need to become poorer to save the planet, they'll ignore you... especially if they are already poorer than you in which case you look like a hypocrite.

This problem is actually pretty simple. The following steps won't solve it 100% but they'll go pretty far.

(1) Phase out coal for electricity generation in favor of... almost anything else except maybe oil shale.

(2) Push electric vehicles, not because all EVs categorically emit less carbon than gasoline cars but because it's a hell of a lot easier to replace a few point source power plants than it is to replace a vast fleet of millions of internal combustion engines. (That and your typical EV is indeed better... even if your electricity is 100% coal an EV is generally no worse than an ICE car due to the superior efficiency of large power plants and the high embodied energy of gasoline.)

(3) Continue to subsidize renewable energy and grid-scale storage.

(4) At least stop shutting down perfectly good nuclear power plants before renewables are in place to replace them, and at best put some serious funding behind next-generation nuclear efforts. Fusion is also grossly under-funded. Ignore the "it'll always be N years away" idiots. There has been substantial progress even with very limited funding available.

There is no point in quibbling about small contributors like aviation (<5%) while we are still burning shitloads of coal and coal is far easier to replace than jet fuel. You don't solve a problem like this by making the solution maximally inconvenient.

mononcqc · on Aug 9, 2021

I'm aware of this. I'm living in a place where power is over 99% renewable and makes use of no gas for utilities, work from home and drive fewer than 5,000km/y as a household, and eat a low-meat diet.

The biggest gains to be had are obviously systemic, and what I do as a consumer is far more limited in its scope and impact. I still limit my flights, no longer attend in-person conferences, try to travel more local, because what else am I going to do? I'm aware this is like putting out a cigarette when the whole town's already on fire, but I can't deal with the dissonance otherwise. It's still an individual luxury that can have an oversized impact compared to everything else I do.

Advocating for it is not going to be sufficient at all, but it's still the most impact I can have when all the big stakeholders who have to fix their powergrid are not even in countries I live in.

andrekandre · on Aug 9, 2021

  > It's still an individual luxury that can have an oversized impact compared to everything else I do.

as an individual, isnt that plane going to fly wether im on it or not...?

ive ridden on many long intercontinental flights where a huge amount (almost half) of seats were unfilled..

api · on Aug 10, 2021

The unit of capital allocation in airlines is one plane, and cancelling a flight or adding a new one probably requires some advance notice to airports and other agencies not to mention time for maintenance, flight crews, etc. to ready it or take it out of operation. Then there's pilots' unions etc.

So no, your decision does not immediately affect the number of planes flying, but if many people fly less the net effect will be fewer planes in the air after some time delay.

mononcqc · on May 22, 2021

Generally this is going to be true of newer experiences that offer a shift in perspectives.

mononcqc · on Jan 14, 2021

as one of the rebar3 co-authors, you'd have gotten rebar3 regardless of hex: we couldn't use the packages in there, and there were no packages that existed. We even entirely broke rules about how package versioning works compared to what Hex expects.

However, we didn't spit on having a package manager (that wasn't a bad lazy index hosted on a github repo), and it became a very interesting bridge across communities that we don't regret working with. Our hope now is to try and make it possible to use more Elixir libraries from the Erlang side, but the two languages' build models make that difficult at times.

mononcqc · on Sept 25, 2020

Some services may require gigabytes of state to be downloaded to work with acceptable latency on local decisions, and that state is replicated in a way that is constantly updated. These could include ML models, large routing tables, or anything of the kind. Connections could be expected to be active for many minutes at a time (not everything is a web server serving short HTTP requests that are easy to resume), and so on.

Changing the instance means having to re-transfer all of the data and re-establish all of that state, on all of the nodes. You could easily see draining of connections take 15-20 minutes, and booting back and scaling up to be taking 15-20 minutes as well, if you can do it for _all_ the instances at once (which may not be a guarantee, and you could need to stagger things to be more cost-effective).

You start with each deploy taking easily over an hour. If you deploy 2-3 times a day and that your peak times line up with these, you can more than double your operating cost just to deploy, and that can take more than 4 figures to count.

Some of the systems we maintained (not those we necessarily live deployed to, but still required rolling restarts) required over 5,000 instances and could not just be doubled in size without running into limits in specific regions for given instance types.

If a blue/green deploy takes a couple minutes, you're probably not having a workload where this is worth thinking about that much.

theptip · on Sept 26, 2020

This sounds like an interesting usecase.

Did you look at any of the node-based options like checkpointing the state to a file on the node, and loading that into your newly started pods? Or using read-many persistent volumes? (Not sure if you needed to write to the state file from every process too?)

(This doesn’t help with connections of course, that’s a bit more thorny.)

toast0 · on Sept 26, 2020

> checkpointing the state to a file on the node, and loading that into your newly started pods

For some types of my nodes, the majority of the state was tcp connections and associated processes. I don't think there's a generally available system that is capable of transferring tcp connection state (although, I'd love to build one! if you've got a need, funding, and a flexible time table), which would be a prerequisite to moving the process state. All of those connections need to be ended, and clients reconnect to another server (where they might need to do it again, if they don't get lucky and get a new server to begin with).

The other nodes with more traditional state had up to half a terrabyte of state in memory, and potentially more on disk, and a good deal of writes. That's seven minutes to transfer state on 10G ethernet, assuming you can use the whole bandwidth and sending and receiving that data is faster than the network.

Although, in my experience, we didn't tend to explicitly replicate disk based storage for new nodes, all of our disk based data was transient, so replacing nodes meant writing to new nodes, read from new and old, and retiring the old nodes when their data had all been fetched and deleted, or the data retention cap was missed.

I/O Volume meant networked filesystems would be a big stretch. You could probably do something with dual-ported SAS drives, and redundant pairs of machines on the same rack, but then that pair with both go down when that rack has an unforseen problem, plus good luck getting dual-ported SAS drives hooked up properly when you're in someone else's bare metal managed hosting.

(Yeah OK, maybe we had big performance requirements, but hotloading works just as well for stuff that fits on a single redundant pair, or even a single server in a pinch)

theptip · on Sept 26, 2020

Interesting case study, thanks for sharing!

mononcqc · on Sept 25, 2020

Author here.

For the first system, it was deprecated without replacement, and just let to run by managers and people who had moved on to other teams (but used to work on it) who did the minimal maintenance required, and former employees were given emergency contracts in weird circumstances to deal with things. Roughly 3-4 years after, they finally replaced it after 2-3 attempts at rewrites that had failed before. The old design with minimal maintenance for years finally approached limits to how it could be scale without bigger redesigns; I consider this to be extremely successful.

It wasn't exactly bulldozed nearly as much as declared "done" and abandoned without adequate replacements while major parts of the business was just being rewritten to use Go and a more "standard" stack. Obviously these migrations always start with something easier and by replacing components that are huge pain points to its contemporaries, and you're left with more legacy stuff in the end that is much harder to replace done in the final pushes. I felt that the blog post would have veered off point if the whole thing became about that, though.

The people on these teams left in part because the hiring budget was redirected towards hiring on the new projects. The idea was that everything could be done in Go for these stacks (by normalizing on tools and libraries developed in another project and wanting to have one implementation for both the private and shared platforms), and the rewrites were to start with Ruby components.

You knew working on the Erlang side of things that no feature development would ever take place again, that no new hands would be hired to help, and that you would be stuck on call 24/7 with no relief for years. All efforts were redirected to Go and getting rid of Ruby, and your stuff fell in between the cracks. I was one of the people who left on the long tail there. After my departure, I was brought back on a lucrative part-time contract as a sort of retainer for years to help them in case of major outages (got 1 or 2 in 4 years) since that was the only way they could get expert knowledge once they drove us all away.

I'm still on good terms with the people there, it's just that "maintaining a self-declared unmaintained legacy stack without budget or headcount until we get to rewrite it in many years" is not where any of us wanted to drive our careers.

Interestingly, we tried very hard to add new developers. We wrote manuals, tooling, a book on operating these systems (see https://www.erlang-in-anger.com), wanted to set up internal internships so developers from other teams could come and work with us for a while, etc. Whereas our team was very willing, internal politics (which I can't easily get in a public forum) made it unworkable and most attempts were turned down. These things were not always purely business decisions, and organizational dynamics can be very funny things.

geofft · on Sept 25, 2020

Thanks, that's helpful. I think it is impressive, and actually kind of a selling point, that the system maintained itself without formal staffing. I was going to object that you didn't build a high-reliability system, you built a high-reliability system that worked as long as you had a couple of experts staffing it, but it sort of sounds like that's not actually what happened.

On the other hand, it seems like a fairly common problem that using neat but niche technology makes it hard to hire for it and get continued development. My own employer has a pretty nice system in Clojure that is well-maintained and has people working on it but nonetheless is always on the wrong side of things because it's using a language (and a development and release workflow, in turn) that nobody else is using.

Is the problem that we should build systems that have all the advantages of the neat tech we like but still look like the more boring and less complicated things, operationally? Was it that you had the backstop in your Erlang system but it wasn't enough to convince the business that it could be treated like a normal system? Or was it separate (and could it have been shaped to look like the more boring thing)?

Alternatively, if we take the goal of reliability engineering broadly as building systems that work despite both dysfunctional computer systems and dysfunctional human systems, it seems like that argues in favor of picking the approach that doesn't line you up to be on the wrong side of organizational politics. I don't like this conclusion - I'm a huge fan of solving problems with small but well-applied amounts of technical expertise - but I don't really have a good sense of how to avoid it. (Maybe the answer is more blog posts like yours influencing expectations.)

mononcqc · on Sept 25, 2020

I think it's a problem of "not being the tech of choice". Obscure languages have a higher cost in that you can't just out-source the work of training people to the rest of the industry as easily.

But you'll also get issues with staffing and getting people interested in working in mainstream languages that are less cool than they used to be, frameworks that are on older versions and hard to migrate, deployed through systems that aren't as nice on your resume than newer ones, or on platforms that aren't seen as positively.

I don't have a very clear answer to give about why Erlang specifically wasn't seen positively. The VP of Eng at the time (now the github CTO) saw Erlang very positively (https://twitter.com/jasoncwarner/status/1287383578435780608) but I know that some specific product people didn't like it, and so on. To some extent a lot of the work pushing us aside was just done by very eager Go developers who just started doing work on replacing our components with new ones on the other side of the org, and then propagating that elsewhere.

Whether the roadmap or other policies ended up kneecapping our team on purpose or accidentally is not something I can actually know. I kept pushing for years to improve things for our team, but at some point I got tired and left for a different sort of role.