
I built out a Druid backend for interactive graphing & aggregation of web traffic and application security metrics a few years back. Users could choose arbitrary filters, aggregations, and time slicing. This was a second system replacing a Spark cluster running over timeseries events in Cassandra, which didn't scale well in practice. Tuning and debugging the Spark queries and Cassandra performance was an endless time sink.

Druid worked really well for almost all use cases, reliably getting sub-second results against many billions of records even on a pretty modest deployment. Being able to use arbitrary JavaScript functions in queries was fantastic; we could do things like filtering on IP subnets or case-insensitive prefix/suffix matching as needed.
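
For a sense of what that looks like, here's a rough sketch of a native query with a JavaScript filter (the datasource, dimension, and broker names are made up, not the production setup; JavaScript also has to be enabled cluster-wide via druid.javascript.enabled=true):

    import requests

    # Count requests from a 10.1.x.x subnet, hour by hour, using a
    # JavaScript filter for the prefix match. Names are illustrative.
    query = {
        "queryType": "timeseries",
        "dataSource": "web_traffic",
        "granularity": "hour",
        "intervals": ["2019-06-01/2019-06-02"],
        "filter": {
            "type": "javascript",
            "dimension": "client_ip",
            "function": "function(ip) { return ip.indexOf('10.1.') === 0; }",
        },
        "aggregations": [{"type": "count", "name": "requests"}],
    }

    resp = requests.post("http://broker:8082/druid/v2", json=query)
    print(resp.json())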

The Docker setup that Druid ships with is deceptively simple; getting to a production installation took real effort. My thoughts are:

- Build templating and code generation for each component's many config files early on, so you can edit constants in a single place, have all of the sundry config files update to reflect them, and manage per-host overrides in a sane, version-controlled way.

- Druid will use as much RAM as you can throw at it, but in a pinch, reading directly from fast NVMe storage is pretty good.

- If you have realtime data ingestion, you will also have to build tooling to re-ingest older data that has changed or needs to be amended. This will end up looking like a 'lambda architecture'.
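
For a rough idea of what that re-ingest tooling boils down to, the sketch below submits a native parallel batch task to the Overlord that overwrites one day's segments with corrected data (datasource, S3 path, schema, and interval are all made up):

    import requests

    # Overwrite one amended day by submitting a parallel batch task
    # to the Overlord. All names and paths here are hypothetical.
    task = {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": "web_traffic",
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["client_ip", "path", "status"]},
                "granularitySpec": {
                    "segmentGranularity": "day",
                    "queryGranularity": "minute",
                    "intervals": ["2019-06-01/2019-06-02"],  # the day being amended
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {"type": "s3", "uris": ["s3://corrections/2019-06-01.json.gz"]},
                "inputFormat": {"type": "json"},
                "appendToExisting": False,  # replace the interval, don't append
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }

    resp = requests.post("http://overlord:8090/druid/indexer/v1/task", json=task)
    print(resp.json())  # returns a task id you can poll for status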



If you had to do it again today, what would you do?

Especially interested in this part:

> If you have realtime data ingestion, you will also have to build tooling to re-ingest older data that has changed or needs to be amended. This will end up looking like a 'lambda architecture'.


I'm not a data guy, can you explain this part a bit more if you have time?

> If you have realtime data ingestion, you will also have to build tooling to re-ingest older data that has changed or needs to be amended. This will end up looking like a 'lambda architecture'.


Lambda architecture for data processing, as popularized by Nathan Marz et al. [0], has two components: the Batch layer and the Stream layer. At a high level, Batch accepts staleness in exchange for quality, whilst Stream optimises for freshness at the expense of quality [1].

I believe what GP means by Lambda is that you'd need a system that batch-processes the data to be amended or changed (reprocessing older data) but stream-processes whatever is required for real-time [2].
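
As a toy sketch of that shape (nothing Druid-specific; the cutoff and keys are invented for illustration): the batch layer periodically recomputes everything up to a cutoff, the speed layer keeps a running view of the recent tail, and serving merges the two.

    from datetime import datetime, timedelta, timezone

    # Toy Lambda-style serving merge, not a real framework.
    BATCH_CUTOFF = datetime.now(timezone.utc) - timedelta(hours=6)

    batch_counts = {}     # hour -> count, recomputed from the full history each night
    realtime_counts = {}  # hour -> count, maintained incrementally by the stream job

    def requests_for_hour(hour_start):
        """Old hours come from the accurate-but-stale batch view,
        recent hours from the fresh-but-approximate streaming view."""
        source = batch_counts if hour_start < BATCH_CUTOFF else realtime_counts
        return source.get(hour_start, 0)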

An alternative is the Kappa architecture, initially proposed by Jay Kreps [3][4], co-creator of Apache Kafka.

---

[0] https://www.amazon.com/dp/1617290343

[1] https://en.wikipedia.org/wiki/Lambda_architecture

[2] https://speakerdeck.com/druidio/real-time-analytics-with-ope...

[3] https://engineering.linkedin.com/distributed-systems/log-wha...

[4] https://dataintensive.net/


The sources are good and thorough, but very long. Here's an OK summary of the Kappa proposal: https://milinda.pathirage.org/kappa-architecture.com/

In theory this sounds great, but you have to account for processing capacity.

While compute is getting cheaper, one of the key reasons the streaming layer in Lambda sacrifices quality for throughput is compute capacity (as well as timing). If you have to feed already-stored data through the same streaming pipe, you either have to have a lot of excess capacity, be willing to pay for that additional burst, or accept latency in your results (assuming you can keep up with your incoming workload and not lose data). There is no free lunch.
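
A back-of-envelope with made-up numbers shows how quickly that bites:

    # Made-up numbers: replaying history through the same streaming pipe
    # only gets whatever capacity live traffic leaves over.
    live_rate = 60_000            # events/sec arriving right now
    pipeline_capacity = 100_000   # events/sec the stream job can sustain
    history_days = 30             # amount of older data to re-process

    headroom = pipeline_capacity - live_rate      # 40k events/sec left for the backfill
    backlog = live_rate * 86_400 * history_days   # ~155 billion historical events
    catch_up_days = backlog / headroom / 86_400   # ~45 days to drain it

    print(f"~{catch_up_days:.0f} days to replay {history_days} days of history")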


Here's a related article: https://medium.com/open-factory/state-of-the-m-art-big-data-...

An excerpt from the article:

> Furthermore, the big data tools can be combined using a growing number of data processing architectures — Lambda and Kappa, among others.


Thanks so much for the comment, it was very helpful!


This sounds kind of like Splunk. Any similarity?



