
I built out a Druid backend for interactive graphing & aggregation of web traffic and application security metrics a few years back. Users could choose arbitrary filters, aggregations, and time slicing. This was a second system replacing a Spark cluster running over timeseries events in Cassandra, which didn't scale well in practice. Tuning and debugging the Spark queries and Cassandra performance was an endless time sink.

Druid worked really well for almost all use cases, reliably getting sub-second results against many billions of records even on a pretty modest deployment. Being able to use arbitrary JavaScript functions in queries was fantastic; we could do things like filtering on IP subnets or case-insensitive prefix/suffix matching as needed.
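
For a sense of what that looks like, here's a rough sketch of a native query with a JavaScript filter (the datasource, dimension, and broker names are made up, not the production setup; JavaScript also has to be enabled cluster-wide via druid.javascript.enabled=true):

    import requests

    # Count requests from a 10.1.x.x subnet, hour by hour, using a
    # JavaScript filter for the prefix match. Names are illustrative.
    query = {
        "queryType": "timeseries",
        "dataSource": "web_traffic",
        "granularity": "hour",
        "intervals": ["2019-06-01/2019-06-02"],
        "filter": {
            "type": "javascript",
            "dimension": "client_ip",
            "function": "function(ip) { return ip.indexOf('10.1.') === 0; }",
        },
        "aggregations": [{"type": "count", "name": "requests"}],
    }

    resp = requests.post("http://broker:8082/druid/v2", json=query)
    print(resp.json())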

The Docker setup that Druid ships with is deceptively simple; getting to a production installation took real effort. My thoughts are:

- Build templating and code generation for each component's many config files early on, so you can edit constants in a single place, have all of the sundry config files update to reflect them, and manage per-host overrides in a sane, version-controlled way.

- Druid will use as much RAM as you can throw at it, but in a pinch, reading directly from fast NVMe storage is pretty good.

- If you have realtime data ingestion, you will also have to build tooling to re-ingest older data that has changed or needs to be amended. This will end up looking like a 'lambda architecture'.
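
For a rough idea of what that re-ingest tooling boils down to, the sketch below submits a native parallel batch task to the Overlord that overwrites one day's segments with corrected data (datasource, S3 path, schema, and interval are all made up):

    import requests

    # Overwrite one amended day by submitting a parallel batch task
    # to the Overlord. All names and paths here are hypothetical.
    task = {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": "web_traffic",
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["client_ip", "path", "status"]},
                "granularitySpec": {
                    "segmentGranularity": "day",
                    "queryGranularity": "minute",
                    "intervals": ["2019-06-01/2019-06-02"],  # the day being amended
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {"type": "s3", "uris": ["s3://corrections/2019-06-01.json.gz"]},
                "inputFormat": {"type": "json"},
                "appendToExisting": False,  # replace the interval, don't append
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }

    resp = requests.post("http://overlord:8090/druid/indexer/v1/task", json=task)
    print(resp.json())  # returns a task id you can poll for status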



If you had to do it again today, what would you do?

Especially interested in this part:

> If you have realtime data ingestion, you will also have to build tooling to re-ingest older data that has changed or needs to be amended. This will end up looking like a 'lambda architecture'.


I'm not a data guy, can you explain this part a bit more if you have time?

> If you have realtime data ingestion, you will also have to build tooling to re-ingest older data that has changed or needs to be amended. This will end up looking like a 'lambda architecture'.


Lambda architecture for data processing, as popularized by Nathan Marz et al. [0], has two components: the Batch layer and the Stream layer. At a high level, Batch accepts staleness in exchange for quality, whilst Stream optimises for freshness at the expense of quality [1].

I believe what GP means by Lambda is that you'd need a system that batch-processes the data to be amended or changed (reprocessing older data) but stream-processes whatever is required for real-time [2].
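
As a toy sketch of that shape (nothing Druid-specific; the cutoff and keys are invented for illustration): the batch layer periodically recomputes everything up to a cutoff, the speed layer keeps a running view of the recent tail, and serving merges the two.

    from datetime import datetime, timedelta, timezone

    # Toy Lambda-style serving merge, not a real framework.
    BATCH_CUTOFF = datetime.now(timezone.utc) - timedelta(hours=6)

    batch_counts = {}     # hour -> count, recomputed from the full history each night
    realtime_counts = {}  # hour -> count, maintained incrementally by the stream job

    def requests_for_hour(hour_start):
        """Old hours come from the accurate-but-stale batch view,
        recent hours from the fresh-but-approximate streaming view."""
        source = batch_counts if hour_start < BATCH_CUTOFF else realtime_counts
        return source.get(hour_start, 0)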

An alternative is the Kappa architecture, initially proposed by Jay Kreps [3][4], co-creator of Apache Kafka.

---

[0] https://www.amazon.com/dp/1617290343

[1] https://en.wikipedia.org/wiki/Lambda_architecture

[2] https://speakerdeck.com/druidio/real-time-analytics-with-ope...

[3] https://engineering.linkedin.com/distributed-systems/log-wha...

[4] https://dataintensive.net/


The sources are good and thorough, but very long. Here's an OK summary of the Kappa proposal: https://milinda.pathirage.org/kappa-architecture.com/

In theory this sounds great, but you have to account for processing capacity.

While compute is getting cheaper, one of the key reasons the streaming layer in Lambda sacrifices quality for throughput is compute capacity (as well as timing). If you have to feed already-stored data through the same streaming pipe, you either have to have a lot of excess capacity, be willing to pay for that additional burst, or accept latency in your results (assuming you can keep up with your incoming workload and not lose data). There is no free lunch.
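
A back-of-envelope with made-up numbers shows how quickly that bites:

    # Made-up numbers: replaying history through the same streaming pipe
    # only gets whatever capacity live traffic leaves over.
    live_rate = 60_000            # events/sec arriving right now
    pipeline_capacity = 100_000   # events/sec the stream job can sustain
    history_days = 30             # amount of older data to re-process

    headroom = pipeline_capacity - live_rate      # 40k events/sec left for the backfill
    backlog = live_rate * 86_400 * history_days   # ~155 billion historical events
    catch_up_days = backlog / headroom / 86_400   # ~45 days to drain it

    print(f"~{catch_up_days:.0f} days to replay {history_days} days of history")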


Here's a related article: https://medium.com/open-factory/state-of-the-m-art-big-data-...

An excerpt from the article:

> Furthermore, the big data tools can be combined using a growing number of data processing architectures — Lambda and Kappa, among others.


Thanks so much for the comment, it was very helpful!


This sounds kind of like Splunk. Any similarity?



