
What sort of operational complexities did you run into when using Airflow?

As for the idempotency of workflows, so much of that comes down to how you write your DAG files. Having read through both sets of docs, I'd say they both pay lip service to idempotent workflows, but the heavy lifting of actually making your workflows idempotent is up to you.
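To give one concrete flavor of that heavy lifting, here's a minimal sketch (the callable and directory layout are made up): key every write to the schedule interval and overwrite that partition wholesale, so re-running the same interval replaces its own output instead of appending duplicates. In Airflow 1.x, 'ds' is the execution date stamp the framework passes to a PythonOperator callable when provide_context=True.

    import os
    import shutil

    def load_partition(ds, **context):
        # One output partition per schedule interval, named by execution date.
        out_dir = "/data/events/dt={}".format(ds)
        # Overwrite rather than append: a re-run of the same interval
        # produces the same end state, which is the idempotency you want.
        if os.path.isdir(out_dir):
            shutil.rmtree(out_dir)
        os.makedirs(out_dir)
        with open(os.path.join(out_dir, "events.csv"), "w") as f:
            f.write("id,ts\n")  # stand-in for the real extract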



Airflow requires task queues (e.g. celery), message broker (e.g. rabbitmq), a web service, a scheduler service, and a database. You also need worker clusters to read from your task queues and execute jobs. All those workers need every library or app that any of your dags require. None of these things is necessarily a big deal on its own, but it all adds up to a sizable investment in time, complexity, and cost to get everything up and running.
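To make the dependency point concrete, here's a sketch of why worker environments balloon (Airflow 1.x-style imports; the DAG and callable are hypothetical): any Celery worker that might pick up this task needs pandas installed, and the same goes for every import in every other dag.

    from datetime import datetime

    import pandas as pd  # must be installed on every worker in the cluster

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def build_report(**context):
        # Whatever this function imports has to exist on whichever worker
        # happens to pull the task off the queue.
        pd.DataFrame({"n": range(10)}).to_csv("/tmp/report.csv", index=False)

    dag = DAG("daily_report", start_date=datetime(2019, 1, 1),
              schedule_interval="@daily")

    PythonOperator(task_id="build_report", python_callable=build_report,
                   provide_context=True, dag=dag)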

You'll probably also want some sort of shared storage to deploy dags. And then you have to come up with a deployment procedure to make sure that dags aren't upgraded piecemeal while they are running. To be fair, though, that is a problem with Luigi too (or probably any distributed workflow system).
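One way to avoid piecemeal upgrades (a sketch of a hypothetical deploy helper, not anything Airflow or Luigi ships; assumes a POSIX filesystem): push each release to a versioned directory and atomically flip a symlink, so the scheduler and workers never read from a half-updated dags folder. Tasks already running on old code are still on their own, though.

    import os

    def deploy(release_dir, dags_link="/srv/airflow/dags"):
        """Repoint the dags folder at a new release via atomic symlink swap."""
        tmp_link = dags_link + ".tmp"
        if os.path.lexists(tmp_link):
            os.remove(tmp_link)
        os.symlink(release_dir, tmp_link)
        # os.replace maps to rename(2), which swaps the link atomically.
        os.replace(tmp_link, dags_link)

    deploy("/srv/airflow/releases/2019-06-01")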

Luigi, IMHO, is more "fire and forget" when it comes to atomicity/idempotency, because wherever possible, it relies on the medium of the output target for those guarantees. The target class, ideally, abstracts all that stuff away, and the logic of a task can often be re-used with a variety of input/output targets. I can easily write one base task and not care whether the output target is going to be a local filesystem, a remote host (via ssh/scp/sftp/ftp), Google Cloud Storage, or S3. With Airflow I always feel like I'm invoking operators like "CopyLocalFileToABigQueryTableButFTPItFirstAndAlsoWriteItToS3" (I'm exaggerating, but still... :P).
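Here's roughly what that looks like in practice (a sketch; the task, bucket, and paths are made up, but LocalTarget and S3Target are real Luigi classes): the run() body only ever touches self.output(), so swapping the target class swaps the storage medium, and Target.open('w') writes to a temporary location and promotes it on close, which is where the atomicity comes from.

    import luigi
    from luigi.contrib.s3 import S3Target

    class ExtractUsers(luigi.Task):
        """Identical run() logic no matter where the output lives."""
        use_s3 = luigi.BoolParameter(default=False)

        def output(self):
            if self.use_s3:
                return S3Target("s3://my-bucket/users.csv")
            return luigi.LocalTarget("/tmp/users.csv")

        def run(self):
            # open('w') goes to a temp file/object and is only promoted
            # to the real path when the task finishes cleanly.
            with self.output().open("w") as f:
                f.write("id,name\n1,alice\n")

    if __name__ == "__main__":
        luigi.build([ExtractUsers()], local_scheduler=True)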


> Airflow requires task queues (e.g. celery), message broker (e.g. rabbitmq), a web service, a scheduler service, and a database. You also need worker clusters to read from your task queues and execute jobs.

All of these are supported, but the scheduler is pretty much the only hard requirement.

Source: been running Airflow for the last two years without a worker cluster, without having celery/rabbitmq installed, and sometimes without even an external database (i.e., just a plain sqlite file).
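For reference, the smallest-footprint setup looks something like this (a hypothetical wrapper script; the config keys and CLI commands are real Airflow 1.x names, the paths are made up): SequentialExecutor runs tasks inside the scheduler process, so there's no broker, no workers, and the metadata DB is a plain sqlite file.

    import os
    import subprocess

    env = dict(
        os.environ,
        AIRFLOW__CORE__EXECUTOR="SequentialExecutor",
        AIRFLOW__CORE__SQL_ALCHEMY_CONN="sqlite:////home/me/airflow.db",
    )

    subprocess.run(["airflow", "initdb"], env=env, check=True)    # one-time metadata DB setup
    subprocess.run(["airflow", "scheduler"], env=env, check=True) # the scheduler alone runs the tasks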



