
I remember one time I was working as a Data & Analytics Lead (almost a Chief Data Officer, but without the title) at a company where I no longer work, and I was "challenged" by our parent company's CDO about our data tech stack and operations. Just for context, my team at the time was me working as the lead and main Data Engineer, plus 3 Data Analysts that I was coaching/teaching to grow into DEngs/DScientists.

At the time we were mostly a batch data shop, based on Apache Airflow + K8S + BigQuery + GCS on Google Cloud Platform, with BigQuery + GCS as the central datalake techs for analytics and processing. We still had RT capabilities thanks to some Flink processes running in the K8S cluster, plus time-critical (time, not latency) processes running in microbatches of minutes for NRT. It was pretty cheap and sufficiently reliable, with both Airflow and Flink having self-healing capabilities at least at the node/process level (and even at the cluster/region level should we need it and be willing to increase the costs), while also leaving room for changes down the road, like moving off BQ if the costs scaled up too much.
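Just to illustrate the shape of it, a typical daily pipeline was a plain Airflow DAG loading files landed in GCS into BigQuery. A minimal sketch of that pattern (bucket, dataset and table names here are placeholders for illustration, not our actual ones):

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
        GCSToBigQueryOperator,
    )

    # Daily batch load from the GCS datalake into BigQuery.
    # All names below are made up for illustration only.
    with DAG(
        dag_id="daily_events_to_bq",
        schedule_interval="@daily",
        start_date=datetime(2021, 1, 1),
        catchup=False,
    ) as dag:
        GCSToBigQueryOperator(
            task_id="load_events",
            bucket="example-datalake-bucket",
            source_objects=["events/{{ ds }}/*.parquet"],
            destination_project_dataset_table="analytics.events",
            source_format="PARQUET",
            write_disposition="WRITE_APPEND",
        )

Boring, but that boredom is exactly what kept the operational load manageable for a team of our size.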

What they wanted us to implement was, according to them, the industry "best practices" circa 2021: a Kafka-based datalake (KSQL and co.), at least 4 other engines (Trino, Pinot, Postgres and Flink) and an external object storage, with most of the stuff running inside Docker containers orchestrated by Ansible across N compute instances manually controlled from a bastion instance. For some reason, they insisted on having a real-time datalake based on Kafka. It was an insane mix of cargo cult, FOMO, high operational complexity and low reliability in one package.

I resisted the idea until my last second in that place. I met up with some of my team members for drinks a few months after my departure, and they told me the new CDO was already convinced that said "RT-based" datalake was the way forward. I still shudder every time I remember the architectural diagram, and I hope they didn't end up following that terrible advice.

tl;dr: I will never understand the cargo cult around real-time data and analytics, but it is a thing that appeals to both decision makers and "data workers". Most businesses and operations (especially those whose main focus is not IT itself) won't act or decide in hours, but rather in days. Build around your main use case and then make exceptions, not the other way around.



I agree that is a great approach - build around the main use cases and then make exceptions. I think a lot of companies have legitimate use cases for real-time analytics (outside of their internal decision making), but as you mention, they preemptively optimize for the aspiration, which leads them towards unnecessary tool and tech sprawl. For example, a marketplace application that shows you the quantity of an item currently available -- you as a consumer use that information to make a decision in seconds, so it's a great use case. Internally, the org probably uses that data for weekly or quarterly forecasting. I've seen use cases like that lead to a "let's make everything real-time" push, but not every use case benefits equally from real-time.


> they told me the new CDO was already convinced that said "RT-based" datalake was the way to go forward

Is the desire for an "RT-based datalake" itself misplaced? Or just that the implementation isn't up to the job? Nobody _wants_ slow data, and reports that are usually fine with a T+1 delay can become time critical (for example, a "what's selling?" report on Black Friday).


Well, it's an engineering decision, so there is no direct answer.

But as an engineering manager, I at least have to ask and answer the following question: even when nobody wants _slow_ data, how fast is fast enough? I don't see decision makers choosing or thinking better with 10 min latency vs 20 min latency, as they are not looking at the reports all the time, even during big events like Black Friday (they have meetings and stuff, you know, and so do their supporting analyst teams).

For more time-critical matters (e.g. real-time BI or real-time automatic micro-decision making for fraud detection), as I said, we did have the capability to either run more frequent microbatches or do RT processing using Flink connected directly to our app backend's messaging system (ironically Confluent, i.e. Kafka as a service). But that is very different from using a complex real-time log like Kafka, running on "pet" servers, as the cornerstone of your data platform and then propagating said data to different engines/datastores (at least 4, as I said) for downstream processing. That's a lot of moving parts running in a low-reliability environment.
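To make that concrete, the Flink side was basically a streaming job reading straight from the backend's Kafka topics. A minimal PyFlink sketch of that pattern (assuming the Kafka SQL connector jar is on the Flink classpath; the broker, topic, schema and threshold are made up for illustration):

    from pyflink.table import EnvironmentSettings, TableEnvironment

    # Streaming Table API job; broker address, topic and fields are hypothetical.
    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # Source: the backend's Kafka topic, consumed directly.
    t_env.execute_sql("""
        CREATE TABLE orders (
            order_id STRING,
            amount DOUBLE,
            ts TIMESTAMP(3),
            WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'orders',
            'properties.bootstrap.servers' = 'broker:9092',
            'properties.group.id' = 'fraud-check',
            'scan.startup.mode' = 'latest-offset',
            'format' = 'json'
        )
    """)

    # Sink: 'print' here just for the sketch; in practice an alerting
    # topic or a table in the warehouse.
    t_env.execute_sql("""
        CREATE TABLE suspicious_orders (
            order_id STRING,
            amount DOUBLE
        ) WITH ('connector' = 'print')
    """)

    # Flag large orders in near real time.
    t_env.execute_sql("""
        INSERT INTO suspicious_orders
        SELECT order_id, amount FROM orders WHERE amount > 10000
    """).wait()

The point being that Kafka stays a transport and a source for a specific job, not the system of record for the whole data platform.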

Overengineering is a thing, and I think it was my responsibility at the time to limit the level of complexity given the reality of the business and the resources we had in the team, even if that meant 20 minutes of latency for a business report. That's my point, and why I think a Kafka-based stack was a bad decision. YMMV obviously.



