
Interestingly, Twitter's in-house system did that too, but they now seem to think it was a mistake because it increased latency.


It all depends. For example, with Apache Pulsar, tailing readers are served from an in-memory cache in the serving layer (the Pulsar brokers), and only catch-up readers end up being served from the storage layer (Apache BookKeeper). This is a little different from DistributedLog, which always required going to BookKeeper for reads.
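Concretely, which path a reader hits comes down to where it starts in the topic. A minimal sketch with the Pulsar Java client (the service URL and topic name are placeholders): a reader starting at MessageId.latest tails the broker's cache, while one starting at MessageId.earliest forces reads from BookKeeper.

    import org.apache.pulsar.client.api.*;

    public class TailVsCatchUp {
        public static void main(String[] args) throws Exception {
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650") // placeholder URL
                    .build();

            // Tailing reader: starts at the latest message, so the broker
            // can serve it from its in-memory cache.
            Reader<byte[]> tailing = client.newReader()
                    .topic("persistent://public/default/my-topic")
                    .startMessageId(MessageId.latest)
                    .create();

            // Catch-up reader: starts at the earliest message, so the
            // broker must fetch older entries from BookKeeper.
            Reader<byte[]> catchUp = client.newReader()
                    .topic("persistent://public/default/my-topic")
                    .startMessageId(MessageId.earliest)
                    .create();

            System.out.println(new String(catchUp.readNext().getData()));

            tailing.close();
            catchUp.close();
            client.close();
        }
    }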

Apache BookKeeper can also add latency for catch-up readers, on top of the extra hop, because entries from many ledgers (and in Pulsar each topic's data lives in its own ledgers) are interleaved in each bookie's entry log files, so some of the benefit of sequential reads is lost. BookKeeper mitigates this by writing to disk in large batches and sorting entries by ledger before flushing, so messages of the same topic end up close together, but catch-up reads still involve more seeking than reading one contiguous per-topic log.
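A toy sketch of that mitigation (an illustration of the idea, not BookKeeper's actual code): buffer entries from many ledgers, sort by ledger id, then flush in one sequential pass so same-ledger entries land contiguously on disk.

    import java.util.*;

    class SortedFlushSketch {
        // In Pulsar, one ledger holds data for one topic, so grouping
        // entries by ledger id effectively groups them by topic.
        record Entry(long ledgerId, long entryId, byte[] payload) {}

        static List<Entry> sortForFlush(List<Entry> writeBuffer) {
            List<Entry> sorted = new ArrayList<>(writeBuffer);
            sorted.sort(Comparator
                    .comparingLong(Entry::ledgerId)
                    .thenComparingLong(Entry::entryId));
            return sorted; // written to the entry log in one sequential pass
        }
    }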

Also, BookKeeper gives you a nice separation of disk IO: writes land on a dedicated journal while reads are served from the ledger storage, and the two paths can be put on different disks, so you can scale reads and writes independently to a certain extent.
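For example, a bookie can be pointed at separate disks for the two paths. A minimal sketch using BookKeeper's ServerConfiguration (the mount points are placeholders; the same settings exist in bk_server.conf as journalDirectory and ledgerDirectories):

    import org.apache.bookkeeper.conf.ServerConfiguration;

    public class BookieDiskLayout {
        public static void main(String[] args) {
            ServerConfiguration conf = new ServerConfiguration();
            // Write path: the journal absorbs incoming writes sequentially.
            conf.setJournalDirName("/mnt/fast-ssd/journal");
            // Read path: ledger storage on separate disks serves reads,
            // including catch-up readers.
            conf.setLedgerDirNames(new String[] {
                    "/mnt/disk1/ledgers",
                    "/mnt/disk2/ledgers"
            });
        }
    }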

For all those reasons, I would have loved to see Twitter evaluate Apache Pulsar and compare its performance profile with Apache Kafka's.


Streamlio published OpenMessaging benchmark results comparing Apache Kafka and Apache Pulsar here: https://streaml.io/pdf/Gigaom-Benchmarking-Streaming-Platfor...



