ETL is a funny space. At least in the "Enterprise" world, it's dominated by Ab Initio, which is crazy expensive.
They seem to have been coasting for quite some time, too. Their website is probably the worst site I've ever seen for such an expensive piece of software. You can't even tell what it is, how to buy it, or even how to contact them. https://www.abinitio.com/en/
What's your source for the fact that Ab Initio is dominating?
Other than that, there are several other tools in the Enterprise Analytics space that fall into a similar pattern, like Alteryx or Collibra. But from their perspective it makes perfect sense, I guess. When your sales are done by relationship building and there isn't much competition once you're in, there isn't really a need to boast a fancy website or make an effort.
If anyone has a good resource on how enterprise IT procurement is done or the dynamics around it, I'd love to read up on that.
My suspicion is that their customers are mostly companies that use Teradata, because it has a fair amount of Teradata specific features. Probably not good news for their future, but lucrative for now.
Well, I have a lot of experience with IT procurement. IT is more about cost and risk mitigation than technology. The last thing IT wants is to try new software that might blow up in the hands of their unprepared outsourced helpdesk support people. So the absolute first thing you want to show is robust and cheap support, or better yet, a way to make sure your product works with their existing helpdesk setup.

Another thing is cost, and how your product helps them run things with less money. They could not care less about the quality of the product, because they will not be the ones using it.

In general, you sell by telling a story that fits with whatever story they are telling the business. So if IT is telling the business that cloud is the next thing and you sell on-premise, you won't go anywhere.
Just anecdotal experience. By "dominating", I mean in non-tech US Fortune 500 companies. They have very few employees, and the software is very expensive.
"When your sales is done by relationship building and there isn't much competition once you're in, there isn't really a need to boast a fancy website or make an effort."
Sure, but try and find a phone number or email address on their site. They've taken coasting to a whole new level.
More than 90% of the enterprises I've worked with use one of the enterprise tools listed in Gartner's Magic Quadrant, typically from the Leader quadrant and sometimes from the Niche quadrant. Here are a few examples: https://duckduckgo.com/?q=gartner+magic+quadrant+data+integr...
What's actually mind-boggling to me, and I wonder if it's a bit of over-engineering, is people going for complex setups (oh, it's just Airflow scripts with k8s and a little bit of systemd services and configs plus some shell scripts) when there are COTS tools that do more for less engineering cost. Yes, these carry a price tag, but it's usually far less than paying engineers to babysit a tool with a ton of moving parts...
Or, you can look at the (say) 6 month sales cycle for an enterprise platform, and the internal political wrangling that may be required to get approval for the line item, as significant hurdles to moving forward with an ETL project. There are some legitimate, and some less valid, reasons for many engineers' (and engineering-driven orgs') bias against commercial options; and yes, sometimes this results in over-engineering.
Airflow can do some things many commercial tools cannot, though, so for some it is the right option.
FWIW, Ab Initio gave me the worst interview experience of my life. We ended up with 24 hours of interviews over the course of two weeks, then they dropped me as soon as we started discussing compensation. (This was almost 20 years ago; perhaps they've changed since then.)
I have found that using KillMode=mixed is very useful when running Airflow with Celery workers. Coupled with TimeoutStopSec, it allows the system to shut down gracefully: the workers stop receiving new jobs but finish their current jobs before exiting, which is nice for auto scaling or spot instances on AWS (coupled with EC2 lifecycle hooks).
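For illustration, here is a minimal sketch of how those settings might fit into a worker unit file. The service name, paths, and user are assumptions for the example, not taken from the thread:

```ini
# /etc/systemd/system/airflow-worker.service (illustrative)
[Unit]
Description=Airflow Celery worker
After=network.target

[Service]
User=airflow
ExecStart=/usr/local/bin/airflow celery worker
# mixed: SIGTERM goes only to the main process, so child worker
# processes can drain their in-flight jobs instead of dying instantly.
KillMode=mixed
# Give running jobs up to 10 minutes to finish; anything still alive
# in the unit's cgroup after that gets SIGKILL.
TimeoutStopSec=600

[Install]
WantedBy=multi-user.target
```

With an EC2 lifecycle hook that runs `systemctl stop airflow-worker` before terminating a spot instance, this gives the drain-then-exit behaviour described above.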
In some cases we need to do an update-deploy-restart while a DAG is still running (not even the one being updated). Then several minutes or hours later the child processes raise a segfault and the jobs they were working on fail, requiring restarting any of those jobs. I imagine a graceful shutdown would allow the job to finish up and the DAG to continue with the remaining jobs.
My team and I tried Airflow but found it didn't fit well with our analytics workflow. For instance, you must rewrite your Jupyter notebook into an Airflow DAG, doing basically the same work twice. We use dask and will soon deploy dask.distributed. I have yet to figure out where Airflow actually fits in the BI/data science architecture.
Not sure what you mean by "BI/data science architecture" but Airflow is essentially a scheduler and orchestrator for data processing jobs.
These activities are often managed by cron, or more often by advanced scheduler tools (depending on the vendor), so it's quite a core part of any architecture that needs to e.g. load/reload/refresh data periodically.
If the requirement is simply to connect notebooks to a data lake, then the only scheduling required is to load the data lake, and something like Airflow may be overkill for this, depending on what/how the data is processed and loaded.
I mean the same thing you mean. My issue with Airflow is that it's complicated and doesn't adapt well to cloud computing. Dask runs on AWS EMR and EKS, Kubernetes, etc. Unfortunately, orchestration is a lot more complicated than it looks: parallel executions, retries, logs, status tracking, email notifications. Airflow doesn't really tackle all orchestration work.
Airflow Maintainer here: what you are describing is exactly what Airflow takes care of (or should).
I wonder what your issue is/was? Notebooks are supported by means of a Papermill operator (equivalent to how Netflix operationalizes notebooks) or PythonOperator/BashOperator which would just wrap around your notebook.
However, to parallelize tasks Airflow needs to know a bit more, which is why you might have found you needed to break up your notebook into individual tasks that combine into a DAG. Is that what you meant?
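To make the "Airflow needs to know a bit more" point concrete: once a notebook is split into named tasks with declared dependencies, a scheduler can group them into waves that run in parallel. A minimal sketch in plain Python (task names are made up for the example):

```python
def execution_waves(deps):
    """Group tasks into waves that can run in parallel.

    `deps` maps each task to the set of tasks it depends on. Tasks in
    the same wave have no dependencies on each other, which is exactly
    the parallelism a scheduler can only exploit after a monolithic
    notebook has been broken into individual tasks.
    """
    remaining = {task: set(upstream) for task, upstream in deps.items()}
    waves = []
    while remaining:
        # A task is ready when all of its upstream tasks have run.
        ready = sorted(t for t, upstream in remaining.items() if not upstream)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        for t in ready:
            del remaining[t]
        for upstream in remaining.values():
            upstream.difference_update(ready)
    return waves
```

A single-cell notebook is one opaque task and yields one wave; declaring the dependencies lets independent branches (e.g. feature building and reporting) run side by side.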
With dask we code the workflow in the notebook and run it in the notebook. We don't have to fiddle with operators, as every task is Python code. Dask is easy to install, which is important since each analyst has to be able to test the workflows before sending them to production. Finally, by programming our own scheduler we can build the things we need. For instance, we are able to listen to SQL tables and API changes and trigger work based on that. Anyway, I am sure I could make Airflow work too, but it's a harder fit than dask.
I used Airflow heavily for ETL (with Hive, Presto, and custom operators) for three years, but I have no experience with the dask executor. Could you share a gentle introduction to it?
(Sorry if it's a little off topic)
We were close to adopting Airflow in our company but we were let down by a detail: the scheduler isn't natively highly available. There was an article from Clairvoyant about how to make it HA, but it didn't look safe at all. That was a serious issue for us, and in the end we went for NiFi. Has anyone else had this problem?
If you have idempotent tasks, which is a best practice, it is possible to run Airflow in HA, even active/active. It might occasionally schedule a task twice, which should be caught, but is in any case mitigated by idempotency.
If you are looking for more enterprise support you can reach out to Astronomer (disclaimer: I'm an advisor to them) or use the Google Cloud hosted version (Cloud Composer). Both are great products.
I'm guessing I misunderstand what is meant by a "Workflow"?
My assumption is that these are managing a state machine, where workflow is stand-in for "Business Process"? If it's doing that sort of job, I'd expect timers and loops?
However, it seems these are aimed at data conversion pipelines?
Heyo! Data guy here. Airflow and its DAG-managing peers are important for us.
Data transformations are one thing. For us, it’s the most important thing. Our data warehouse runs as a massive DAG of nightly batched transformations over app-generated data.
We also use DAG-managing tools to call external APIs and get new data (eg for weather and geocoding) and batched ML training/inference pipelines too.
Why something like Airflow? Dependencies are easier to manage reliably. If you have hundreds or thousands of nodes in your DAG, then it is a lifesaver to be able to easily 1) run many threads of independent nodes; 2) re-run on failures; and 3) find nodes impacted by failure.
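Point 3 above, finding the nodes impacted by a failure, is just a downstream traversal of the DAG. A small sketch of the idea in plain Python (edge names are made up for the example):

```python
from collections import defaultdict, deque

def impacted_nodes(edges, failed):
    """Return every node downstream of `failed` in a DAG.

    `edges` is a list of (upstream, downstream) pairs. The result is
    the set of nodes whose inputs are now stale and must be re-run
    once the failed node is fixed.
    """
    children = defaultdict(list)
    for up, down in edges:
        children[up].append(down)
    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

With hundreds or thousands of nodes this is what lets the tool re-run only the affected subgraph instead of the whole nightly batch.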
Sorry, I most definitely didn't want to make light of the problem!
Pulling data from all the various teams' locally created data stores and external systems to push to analytics is definitely a large problem.
I was trying to figure out if these are aimed at data transformation pipelines, or state management systems - I've got state management problems, not data transformation problems.
Slightly different problems, but both fit with "Workflow".
I don't understand your question. Perhaps the answer is that workflows naturally require a data processing task to spawn a collection of child tasks when it finishes, and conversely to spawn a child task only after a collection of parent tasks has finished executing. This requirement to fork and join tasks ends up being modelled as a directed acyclic graph of processing tasks.
No. The tasks inside a workflow, concretely, would be things like Spark job execution, SQL query execution, download a CSV from the internet to HDFS and load it as a Hive table, etc. Think fancy cron that deals correctly with failures in multistage processes.
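The "fancy cron that deals correctly with failures" behaviour boils down to running stages in order and retrying the stage that failed rather than restarting the whole pipeline. A minimal sketch (stage names and retry policy are invented for the example):

```python
import time

def run_pipeline(stages, max_retries=3, backoff_s=0.0):
    """Run (name, callable) stages in order, retrying each stage.

    A transient failure in one stage (a flaky download, a busy
    database) is retried with linear backoff; already-completed
    stages are never re-run. Exhausting the retries raises, so a
    real scheduler could alert and later resume from this stage.
    """
    completed = []
    for name, fn in stages:
        for attempt in range(1, max_retries + 1):
            try:
                fn()
                completed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    raise RuntimeError(
                        "stage %r failed after %d attempts" % (name, attempt)
                    )
                time.sleep(backoff_s * attempt)
    return completed
```

Plain cron gives you none of this: a failed multistage job either reruns everything or nothing, with no per-stage retry or resume point.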
The number of pipelines and executions is a function of the complexity of your application, and independent of the number of records processed by the batch jobs within those workflows.
I would have thought a scripting language would have been a better choice than unit files. The script engine would take care of verifying the script, starting the processes in the correct order, and allowing/restricting access to other components, negating the need for directives such as After= or PrivateTmp=.
For managing a complex set of daemons, systemd is the thing.
I only wish that units integrated with system monitoring so you could have conditions (e.g. something like StopWhenLoadThreshold=40 or StartWhenDiskFreeThreshold=99%).