
I wonder how many iterations we will need before engineers are happy with a workflow solution. Netflix had multiple solutions before Maestro, such as Metaflow. Uber built multiple solutions too. Amazon had at least a dozen internal workflow engines. It's quite curious why engineers are so keen on building their own workflow engines.

Update: I just find it really interesting that many individuals at many companies like to build workflow engines. This is not a deriding comment towards anyone, or Netflix in particular. To me, such an observation is worth some friendly chitchat.




The issue is that "workflow orchestration" is a broad problem space. Companies need to address a lot of disparate issues, so any solution ends up being a giant, heavily opinionated product with a lot of associated functionality as it grows into a big monolith. This is why, almost universally, folks are never happy.

In reality there are five main concerns:

1. Resource scheduling -- "I have a job or collection of jobs to run... allocate them to the machines I have."

2. Dependency solving -- if my jobs have dependencies on each other, perform the topological sort so I can dispatch things to my resource scheduler.

3. API/DSL for creating jobs and workflows -- I want to define a DAG... sometimes static, sometimes on the fly.

4. Cron-like functionality -- I want to be able to run things on a schedule or ad-hoc.

5. Domain awareness -- if doing ETL I want my DAGs to be data aware; if doing ML/AI workflows then I want to be able to surface info about what I'm actually doing with them.
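To make #2 and #3 concrete, here's a minimal sketch in plain Python (stdlib only, 3.9+; the job names and DAG are made up for illustration) of declaring a static DAG and dispatching it in dependency order:

    from graphlib import TopologicalSorter

    # Hypothetical DAG: each job maps to the set of jobs it depends on.
    dag = {
        "extract": set(),
        "transform": {"extract"},
        "load": {"transform"},
        "report": {"load", "transform"},
    }

    # Concern 2: topological sort, so each job runs after its dependencies.
    for job in TopologicalSorter(dag).static_order():
        # Concern 1 is where a real engine would hand this off to a
        # resource scheduler; here we just "dispatch" inline.
        print(f"dispatching {job}")

Everything past this point -- retries, scheduling, data awareness -- is where the engines balloon.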

No one solution does all these things cleanly. So companies end up building their own, or hacking around off-the-shelf stuff, to deal with the downsides of existing solutions. Hence it's a perpetual cycle of everyone being unhappy.

I don't think that you can just spin up a startup to deliver this as a "solution". This needs to be solved with an open source ecosystem of good pluggable modular components.


The issue indeed is that "workflow orchestration" is a broad problem space. I would argue that the solution is not this:

> I don't think that you can just spin up a startup to deliver this as a "solution". This needs to be solved with an open source ecosystem of good pluggable modular components.

But rather more specialized tools that solve specific issues.

What you describe just sounds like a better-implemented version of Airflow, or of the 100+ other systems actively trying to be this today (Flyte, Dagster, Prefect, Argo Workflows, Kubeflow, NiFi, Oozie, Conductor, Cadence, Temporal, Step Functions, Logic Apps; your CI system of choice has its own; need I continue? That's not even scratching the surface). Most of those have some sort of "plugin" ecosystem for custom code, in varying degrees of robustness.

For what it is worth, everyone and their mom thinks they can build this orchestrator and wants to be it. The problem is so generic and casts such a wide net that you end up with annoying-to-use building blocks, because everyone wants to architecture-astronaut themselves into being the generic workflow orchestration engine. It's the ultimate system design trap: something so fundamentally easy to grok and conceptualize that you can PoC one in hours or days, but with near-infinite possibilities of what you can do with it, resulting in near-infinite edge cases.

Instead, I'd rather companies just focus on the problem space it lends itself to. Instead of Dagster saying "Automate any workflow" and trying to capture that whole space, just make building blocks for data engineering workflows and get really good at that. Instead of GitHub Actions being a generic "workflow engine", just make it really good at CI workflow building blocks.

But we can't have it that way. Because then some architecture astronaut will come around and design a generic workflow engine for orchestrating your domain specific workflow engines and say that you no longer need those.

Actually, I think I just convinced myself that what you are suggesting actually IS the right way. If companies just said "we will provide an Airflow plugin" instead of building their own damn Airflow, this would be easy. But we won't ever have that either. What we really need is some standards around this. Like if the CNCF got tired of it all and said "this is THE canonical and supported engine for Kube workflows; bring your plugins here if you want us to pump you up." That might work. They've usually had better luck putting people in lockstep in the Kube ecosystem than Apache historically has for more general FOSS stuff, probably because the problem space there is more limited.


Great insight, appreciate this. I would also point out logging/event sourcing for "free" observability.


Metaflow sits on top of Maestro, and neither replaces the other

> ...Users can use Metaflow library to create workflows in Maestro to execute DAGs consisting of arbitrary Python code. from https://netflixtechblog.com/orchestrating-data-ml-workflows-...

The orchestration section in this article (https://netflixtechblog.com/supporting-diverse-ml-systems-at...) goes into detail on how Metaflow interplays with Maestro (and Airflow, Argo Workflows & Step Functions)


We rolled our own workflow engine, and it nearly sank one of our unrelated projects because it had so many bugs and was so inflexible.

I’m starting to think workflow engines are somewhat of a design smell.

It’s enticing to think you can build this reusable thing once and use it for a ton of different workflows, but besides requiring more than one asynchronous step, these workflows have almost nothing in common.

Different data, different APIs, different feedback required from users or other systems to continue.


> workflow engines are somewhat of a design smell

Probably so, but the real design smell seems to be thinking of a workflow engine as a panacea for sustainable business process automation.

You have to really understand the business flow before you automate it. You have to continuously update your understanding of it as it changes. You have to refactor it into sub-flows or bigger/smaller units of work. You have to have tests, tracer-bullets, and well-defined user-stories that the flows represent.

Else your business flow automation accumulates process debt. Just as much as a full-code-based solution accumulates technical debt.

And, just like technical debt, it's much easier (or at least more interesting) to propose a rewrite or framework change than it is to propose an investment in refactoring, testing, and gradual migrations.


It’s likely because we haven’t yet found a workflow engine/orchestrator that’s capable of handling diverse tasks while still being easy to understand and operate.

It’s really easy to build a custom workflow engine and optimize it for specific use cases. I think we haven’t yet seen a convergence simply because this tool hasn’t yet been built.

Consider the recent rise of tools that quickly dominated their fields: Terraform (IaC) and Kubernetes (distributed compute). Both systems are hella complex, but they solve hard problems. Generic workflow engines are complex to understand, difficult to operate, and offer a middling experience, so many folks don’t even bother.


Slurm? Airflow?


Airflow is notoriously hard to operate: https://news.ycombinator.com/item?id=31482217


Naming things, cache invalidation, and workflow engines? :)

https://github.com/meirwah/awesome-workflow-engines


No, it's just the two things: naming things, cache invalidation, and off-by-one errors.


It inherently asks for a custom implementation, because workflows are almost just how you'd have to code and run everything anyway. Conceptually: why wouldn't we want to reconnect to any work currently in progress, just like in a videogame where, if we lose connection for a split second, we want to keep going where we left off? So we must save the current step persistently and make sure we can resume work and never lose it.

Workflow engines also do no magic: they still just run code, and if it fails in a place we didn't manually checkpoint (i.e. by making it into a separate persistable task/workflow/function/action/transaction), we still lose data. So at that point, why not just code this way everywhere, whether it's running in a "workflow engine" or not? Before "workflow engines" we already had DB transactions, but those were mostly for our own benefit, so we don't mess up the DB with partial inserts.

That said, what I've seen so far in open source workflow engines is that they don't let you work with user input easily; it's sad how they all start a new thread and then just block it while waiting for the user to send something. This is obviously not how you'd code a CRUD operation, and in my opinion it's a huge drawback of current workflow engines. If this were solved, I think we should literally do everything as a workflow. Every form submission could offer to let the user continue where they left off, since we saved all their data, so "they can reconnect to their game" (to revive the videogame metaphor I started with).
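For what it's worth, the checkpoint-and-resume core is easy to sketch without any engine at all. A toy, file-based version (the step names and the JSON state file are invented for illustration; a real system would want a database and idempotent steps):

    import json, os

    STATE_FILE = "workflow_state.json"

    def load_state():
        return json.load(open(STATE_FILE)) if os.path.exists(STATE_FILE) else {}

    def step(name, fn, state):
        if name in state:             # already checkpointed: skip on resume
            return state[name]
        result = fn()                 # crash here and only this step re-runs
        state[name] = result
        with open(STATE_FILE, "w") as f:
            json.dump(state, f)       # persist after every completed step
        return result

    state = load_state()
    order = step("create_order", lambda: {"id": 42}, state)
    step("charge_card", lambda: "charged order %s" % order["id"], state)

Run it, kill it midway, run it again: completed steps are skipped, which is exactly the "reconnect to the game" behavior. The engines mostly add retries, timers, and distribution on top of this core.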


I wrote my own because I wanted to learn about DAGs and toposort, and I had some ideas about what nodes and edges in a workflow mean (i.e., does data flow over the edges, or do the edges just represent the sequence in which things run? Is a node a bundle of code? Does it run continuously, or run then exit?). I almost ended up with reflow, a functional-programming approach based on Python, similar to Nextflow, but I found the whole functional approach extremely challenging to reason about and debug.
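That edge-semantics question is a real fork in the design space. A contrived sketch of the two interpretations (hypothetical functions, using the stdlib toposort):

    from graphlib import TopologicalSorter

    def toposort(dag):
        return TopologicalSorter(dag).static_order()

    # Interpretation A: data flows over edges -- each node is a function
    # of its upstream nodes' outputs.
    def run_dataflow(dag, funcs):
        results = {}
        for node in toposort(dag):
            inputs = [results[dep] for dep in dag[node]]
            results[node] = funcs[node](*inputs)
        return results

    # Interpretation B: edges only encode ordering -- nodes take no inputs
    # and share data out of band (files, object stores, databases).
    def run_sequenced(dag, funcs):
        for node in toposort(dag):
            funcs[node]()  # the edge just means "runs after"

Roughly, Airflow leans toward B (XComs aside), while Nextflow-style dataflow systems lean toward A.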

Often what happens is that the workflow engine is tailored to a specific problem, and then other teams discover the engine and want to use it for their projects, but they need some additional feature, which sometimes completely up-ends the mental model of the engine itself.


We all have different use-cases. We also have a workflow engine at work, but that's because we wanted immediate execution: submit-to-execute time can be 100 ms on our system, which also makes it work well for short jobs. Usually the task coordinator overhead on these things is greater than that.


These things tend to be fairly complex and require lots of integration with various services to get working. I think it's a little more organic to start by building something simple and progressively add to it than to implement a full engine from scratch (unless there are people around with experience).


It's because Netflix pretends to be a tech company to get the high market cap.

So they hire tons of engineers who have nothing to do but rearchitect the mess their microservices have created.

Then there are others who create observability and test harnesses for all of that.

When Pornhub and other porn sites can deliver orders of magnitude more data across the world with much simpler systems, you know it's all bullshit.


To be fair, when Netflix started, they were solving legitimate problems that a major streaming provider would have.

In the time since, those problems have been solved and are now offered as a service by most cloud providers (for a hefty fee, of course).


>When Pornhub and other porn sites can deliver orders of magnitude more data across the world with much simpler systems, you know it's all bullshit

When is that, exactly? https://www.statista.com/chart/15692/distribution-of-global-...


What is the methodology of the report?

Just one of the questions I have regarding this: China has nearly 1.4 billion people, and barely any of them use the services listed here; instead, they have their own video platforms. And you're telling me that none of those platforms generates at least as much traffic as Prime Video? I doubt it.


I found the report the statistic is from [0]. But note that it says "by app," so I don't think it's actually all traffic, just the top apps. Their reported source is data from 300m customers in different regions.

[0] https://www.sandvine.com/hubfs/Sandvine_Redesign_2019/Downlo...


“Other” in your diagram is mostly porn


Even supposing that "Other" is just pornhub and nothing else, that's less than one order of magnitude more than Netflix.


isn't it like 30%?


> When Pornhub and other porn sites can deliver orders of magnitude more data across the world with much simpler systems, you know it's all bullshit.

That's nothing. My dedicated server delivers two orders of magnitude greater traffic than Pornhub (and everything in the Mindgeek network really). And I don't even need the cloud. Just better engineering.


> why engineers are so keen on building their own workflow engines

Because all the existing ones suck.

(We built our own tiny one too. We need tight integration with systemd jobs and cgroups, and existing solutions don't do that.)
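The shape of that integration is roughly the following -- a hypothetical wrapper (unit name and resource limits invented for illustration, though systemd-run and the MemoryMax/CPUQuota properties are real systemd features):

    import subprocess

    def run_job(name, argv, mem="512M", cpu="50%"):
        # Launch the job as a transient systemd unit so systemd owns the
        # cgroup, enforces the limits, and cleans up after the job exits.
        cmd = [
            "systemd-run",
            f"--unit=job-{name}",
            "--collect",                       # drop the unit once it's done
            "--wait",                          # block until the job exits
            f"--property=MemoryMax={mem}",
            f"--property=CPUQuota={cpu}",
            *argv,
        ]
        return subprocess.run(cmd).returncode

    run_job("nightly-backup", ["/usr/bin/rsync", "-a", "/data/", "/backup/"])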



