ETL is a funny space. At least in the "Enterprise" world, it's dominated by Ab Initio, which is crazy expensive.
They seem to have been coasting for quite some time, too. Their website is probably the worst site I've ever seen for such an expensive piece of software. You can't even tell what it is, how to buy it, or even how to contact them. https://www.abinitio.com/en/
What's your source for the fact that Ab Initio is dominating?
Other than that, there are several other tools in the Enterprise Analytics space that fall into a similar pattern, like Alteryx or Collibra. But from their perspective it makes perfect sense, I guess. When your sales are done by relationship building and there isn't much competition once you're in, there isn't really a need to boast a fancy website or make an effort.
If anyone has a good resource on how enterprise IT procurement is done or the dynamics around it, I'd love to read up on that.
My suspicion is that their customers are mostly companies that use Teradata, because it has a fair amount of Teradata specific features. Probably not good news for their future, but lucrative for now.
Well, I have a lot of experience with IT procurement. IT is more about cost and risk mitigation than technology. The last thing IT wants is to try new software that might blow up in the hands of their unprepared outsourced helpdesk support people. So the absolute first thing you want to show is robust and cheap support, or better yet, a way to make sure your product works with their existing helpdesk setup.

Another thing is cost, and how your product helps them run things with less money. They could not care less about the quality of the product, because they will not be the ones using it.

In general, you sell by telling a story that fits with whatever story they are telling the business. So if IT is telling the business that cloud is the next thing and you sell on-premise, you won't go anywhere.
Just anecdotal experience. By "dominating", I mean in non-tech US Fortune 500 companies. They have very few employees, and the software is very expensive.
"When your sales is done by relationship building and there isn't much competition once you're in, there isn't really a need to boast a fancy website or make an effort."
Sure, but try and find a phone number or email address on their site. They've taken coasting to a whole new level.
More than 90% of the enterprises I've worked with use one of the enterprise tools listed in Gartner's Magic Quadrant, typically from the Leader quadrant and sometimes from the Niche quadrant. Here are a few examples: https://duckduckgo.com/?q=gartner+magic+quadrant+data+integr...
What's actually mind-boggling to me, and I wonder if it's a bit of over-engineering, is people going for complex setups (oh, it's just Airflow scripts with k8s and a little bit of systemd services and configs plus some shell scripts) when there are COTS tools that do more for less engineering cost. Yes, these carry a price tag, but it's usually far less than paying engineers to babysit a tool with a ton of moving parts...
Or, you can look at the (say) 6 month sales cycle for an enterprise platform, and the internal political wrangling that may be required to get approval for the line item, as significant hurdles to moving forward with an ETL project. There are some legitimate, and some less valid, reasons for many engineers' (and engineering-driven orgs') bias against commercial options; and yes, sometimes this results in over-engineering.
Airflow can do some things many commercial tools cannot, though, so for some it is the right option.
FWIW, Ab Initio gave me the worst interview experience of my life. We ended up with 24 hours of interviews over the course of two weeks, then they dropped me as soon as we started discussing compensation. (This was almost 20 years ago; perhaps they've changed since then.)
I have found that using KillMode=mixed is very useful when running Airflow with Celery workers. Coupled with TimeoutStopSec, it allows the system to shut down gracefully: the workers stop receiving new jobs but finish their current jobs before exiting, which is nice for auto scaling or spot instances on AWS (coupled with EC2 lifecycle hooks).
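For illustration, here is a minimal sketch of how those settings might fit into a worker unit file. The service name, paths, and user are assumptions for the example, not taken from the thread:

```ini
# /etc/systemd/system/airflow-worker.service (illustrative)
[Unit]
Description=Airflow Celery worker
After=network.target

[Service]
User=airflow
ExecStart=/usr/local/bin/airflow celery worker
# mixed: SIGTERM goes only to the main process, so child worker
# processes can drain their in-flight jobs instead of dying instantly.
KillMode=mixed
# Give running jobs up to 10 minutes to finish; anything still alive
# in the unit's cgroup after that gets SIGKILL.
TimeoutStopSec=600

[Install]
WantedBy=multi-user.target
```

With an EC2 lifecycle hook that runs `systemctl stop airflow-worker` before terminating a spot instance, this gives the drain-then-exit behaviour described above.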
In some cases we need to do an update-deploy-restart while a DAG is still running (not even the one being updated). Then several minutes or hours later the child processes raise a segfault and the jobs they were working on fail, requiring restarting any of those jobs. I imagine a graceful shutdown would allow the job to finish up and the DAG to continue with the remaining jobs.
My team and I tried Airflow but found it didn't fit well with our analytics workflow. For instance, you must rewrite your Jupyter notebook into an Airflow DAG, doing basically the same work twice. We use dask and will soon deploy dask.distributed. I have yet to figure out where Airflow actually fits in the BI/data science architecture.
Not sure what you mean by "BI/data science architecture" but Airflow is essentially a scheduler and orchestrator for data processing jobs.
These activities are often managed by cron, or more often by advanced scheduler tools (depending on the vendor), so it's quite a core part of any architecture that needs to e.g. load/reload/refresh data periodically.
If the requirement is simply to connect notebooks to a data lake, then the only scheduling required is to load the data lake, and something like Airflow may be overkill for this, depending on what/how the data is processed and loaded.
I mean the same thing you mean. My issue with Airflow is that it's complicated and doesn't adapt well to cloud computing. Dask runs on AWS EMR and EKS, Kubernetes, etc. Unfortunately, orchestration is a lot more complicated than it looks: parallel executions, retries, logs, status tracking, email notifications. Airflow doesn't really tackle all orchestration work.
Airflow Maintainer here: what you are describing is exactly what Airflow takes care of (or should).
I wonder what your issue is/was? Notebooks are supported by means of a Papermill operator (equivalent to how Netflix operationalizes notebooks) or PythonOperator/BashOperator which would just wrap around your notebook.
However, to parallelize tasks Airflow needs to know a bit more, which is why you might have found you needed to break up your notebook into individual tasks that combine into a DAG. Is that what you meant?
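To make the "Airflow needs to know a bit more" point concrete: once a notebook is split into named tasks with declared dependencies, a scheduler can group them into waves that run in parallel. A minimal sketch in plain Python (task names are made up for the example):

```python
def execution_waves(deps):
    """Group tasks into waves that can run in parallel.

    `deps` maps each task to the set of tasks it depends on. Tasks in
    the same wave have no dependencies on each other, which is exactly
    the parallelism a scheduler can only exploit after a monolithic
    notebook has been broken into individual tasks.
    """
    remaining = {task: set(upstream) for task, upstream in deps.items()}
    waves = []
    while remaining:
        # A task is ready when all of its upstream tasks have run.
        ready = sorted(t for t, upstream in remaining.items() if not upstream)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        for t in ready:
            del remaining[t]
        for upstream in remaining.values():
            upstream.difference_update(ready)
    return waves
```

A single-cell notebook is one opaque task and yields one wave; declaring the dependencies lets independent branches (e.g. feature building and reporting) run side by side.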
With dask we code the workflow in the notebook and run it in the notebook. We don't have to fiddle with operators, as every task is Python code. Dask is easy to install, which is important since each analyst has to be able to test the workflows before sending them to production. Finally, by programming our own scheduler we can build the things we need. For instance, we are able to listen to SQL tables and API changes and trigger work based on that. Anyway, I am sure I could make Airflow work too, but it's a harder fit than dask.
I used Airflow heavily for ETL (with Hive, Presto, and custom operators) for three years, but I have no experience with the dask executor. Could you share a gentle introduction to it?
(Sorry if it's a little off topic)
We were close to adopting Airflow in our company but we were let down by a detail: the scheduler isn't natively highly available. There was an article from Clairvoyant about how to make it HA, but it didn't look safe at all. That was a serious issue for us, and in the end we went for NiFi. Has anyone else had this problem?
If you have idempotent tasks, which is a best practice, it is possible to run Airflow in HA, even active/active. It might occasionally schedule a task twice, which should be caught, but is in any case mitigated by idempotency.
If you are looking for more enterprise support you can reach out to Astronomer (disclaimer: I'm an advisor to them) or use the Google Cloud hosted version (Cloud Composer). Both are great products.
I'm guessing I misunderstand what is meant by a "Workflow"?
My assumption is that these are managing a state machine, where workflow is stand-in for "Business Process"? If it's doing that sort of job, I'd expect timers and loops?
However, it seems these are aimed at data conversion pipelines?
Heyo! Data guy here. Airflow and its DAG-managing peers are important for us.
Data transformations are one thing. For us, it’s the most important thing. Our data warehouse runs as a massive DAG of nightly batched transformations over app-generated data.
We also use DAG-managing tools to call external APIs and get new data (eg for weather and geocoding) and batched ML training/inference pipelines too.
Why something like Airflow? Dependencies are easier to manage reliably. If you have hundreds or thousands of nodes in your DAG, then it is a lifesaver to be able to easily 1) run many threads of independent nodes; 2) re-run on failures; and 3) find nodes impacted by failure.
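Point 3 above, finding the nodes impacted by a failure, is just a downstream traversal of the DAG. A small sketch of the idea in plain Python (edge names are made up for the example):

```python
from collections import defaultdict, deque

def impacted_nodes(edges, failed):
    """Return every node downstream of `failed` in a DAG.

    `edges` is a list of (upstream, downstream) pairs. The result is
    the set of nodes whose inputs are now stale and must be re-run
    once the failed node is fixed.
    """
    children = defaultdict(list)
    for up, down in edges:
        children[up].append(down)
    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

With hundreds or thousands of nodes this is what lets the tool re-run only the affected subgraph instead of the whole nightly batch.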
Sorry, I most definitely didn't want to make light of the problem!
Pulling data from all the various teams' locally created data stores and external systems to push to analytics is definitely a large problem.
I was trying to figure out if these are aimed at data transformation pipelines, or state management systems - I've got state management problems, not data transformation problems.
Slightly different problems, but both fit with "Workflow".
I don't understand your question. Perhaps the answer is that workflows naturally require a data processing task to spawn a collection of child tasks when it finishes, and conversely to spawn a child task only after a collection of parent tasks has finished executing. This requirement to fork and join tasks ends up being modelled as a directed acyclic graph of processing tasks.
No. The tasks inside a workflow, concretely, would be things like Spark job execution, SQL query execution, download a CSV from the internet to HDFS and load it as a Hive table, etc. Think fancy cron that deals correctly with failures in multistage processes.
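The "fancy cron that deals correctly with failures" behaviour boils down to running stages in order and retrying the stage that failed rather than restarting the whole pipeline. A minimal sketch (stage names and retry policy are invented for the example):

```python
import time

def run_pipeline(stages, max_retries=3, backoff_s=0.0):
    """Run (name, callable) stages in order, retrying each stage.

    A transient failure in one stage (a flaky download, a busy
    database) is retried with linear backoff; already-completed
    stages are never re-run. Exhausting the retries raises, so a
    real scheduler could alert and later resume from this stage.
    """
    completed = []
    for name, fn in stages:
        for attempt in range(1, max_retries + 1):
            try:
                fn()
                completed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    raise RuntimeError(
                        "stage %r failed after %d attempts" % (name, attempt)
                    )
                time.sleep(backoff_s * attempt)
    return completed
```

Plain cron gives you none of this: a failed multistage job either reruns everything or nothing, with no per-stage retry or resume point.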
The number of pipelines and executions is a function of the complexity of your application, and independent of the number of records processed by the batch jobs within those workflows.
I would have thought a scripting language would have been a better choice than unit files. The script engine would take care of verifying the script, starting the processes in the correct order, and allowing/restricting access to other components, negating the need for directives such as After= or PrivateTmp=.
For managing a complex set of daemons, systemd is the thing.
I only wish that units integrated with system monitoring so you could have conditions (e.g. something like StopWhenLoadThreshold=40 or StartWhenDiskFreeThreshold=99%).