Skimming the paper, I found it hard to get a handle on exactly what kind of abstraction and guarantees this system implements (I am a co-author on one of the cited papers, so I'm reasonably familiar with this space). Maybe it is easier to understand for people who are already familiar with Storm.
Most notably, I couldn't determine whether the system is stateful, or what kind of guarantees (if any) are provided regarding stateful processing.
For example, the paper says Heron is used to compute real-time active user counts. That implies that the system needs some way to keep track of and "remember" how many unique users it has seen in the last N hours or whatever. How does Heron model this state and how does it guarantee (if it does) that a crashing node will not lose its accumulated state?
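To make concrete what I mean by "accumulated state", here's a toy sketch in plain Java (entirely hypothetical, nothing to do with Heron's actual API) of the kind of windowed unique-user state such a computation has to keep. If this map lives only in a worker's heap, a crash wipes out the whole window:

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    // Hypothetical sketch (not Heron's API): the accumulated state behind a
    // real-time "unique active users in the last N hours" computation.
    public class WindowedUniqueUsers {
        private final long windowMillis;
        // userId -> last time we saw them
        private final Map<String, Long> lastSeen = new HashMap<>();

        public WindowedUniqueUsers(long windowMillis) {
            this.windowMillis = windowMillis;
        }

        public void observe(String userId, long timestampMillis) {
            lastSeen.put(userId, timestampMillis);
        }

        public int uniqueUsers(long nowMillis) {
            // Evict users not seen within the window, then count what's left.
            Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
            while (it.hasNext()) {
                if (nowMillis - it.next().getValue() > windowMillis) {
                    it.remove();
                }
            }
            return lastSeen.size();
        }
    }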
In my experience this is the hardest part, by far, of stream processing, so when I see any work in this area it's the first thing I am curious to learn about. A system that guarantees strong consistency (i.e. accurate counts) even in the presence of node crashes is way, way harder to get right (and a lot more expensive, resource-wise) than one that assumes it's ok to lose a little bit of data.
It looks like Heron implements only at-most-once and at-least-once semantics, so maybe that is my answer there. You need exactly-once semantics to get robust and reliable answers, and you need to guarantee that state changes are atomic with the exactly-once semantics.
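For reference, the standard trick for making state changes atomic with exactly-once semantics (roughly what Trident's transactional state does; the names below are my own) is to persist the batch/transaction id together with the value in a single atomic write, so a replayed batch can be detected and skipped:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the "atomic state change" idea (hypothetical names, modeled
    // on Trident's transactional state): the (txId, count) pair is one record,
    // so the two are always updated together.
    public class TransactionalCounter {
        private static class Entry {
            final long txId;
            final long count;
            Entry(long txId, long count) { this.txId = txId; this.count = count; }
        }

        // Stands in for a durable store.
        private final Map<String, Entry> store = new HashMap<>();

        public void applyBatch(String key, long txId, long delta) {
            Entry current = store.get(key);
            if (current != null && current.txId >= txId) {
                return; // replayed batch: already applied, skip (the "de-dupe")
            }
            long base = (current == null) ? 0 : current.count;
            store.put(key, new Entry(txId, base + delta));
        }

        public long get(String key) {
            Entry e = store.get(key);
            return e == null ? 0 : e.count;
        }
    }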
Of course some systems are ok with their output degrading a little when nodes crash. It's not the end of the world if the active user count is a little off. But beware of tolerating this too much -- the bad thing about allowing data loss is that it tends to come in storms (no pun intended). Once something goes wrong, the answers can be way off. The error is not bounded in most cases I've encountered.
From the paper, "The design of Heron allows supporting exactly once semantics, but the first version of Heron does not have this implementation. One reason for tolerating the lack of exactly once semantics is that Summingbird [8] simultaneously generates both a Heron query
and an equivalent Hadoop job, and in our infrastructure the
answers from both these parts are eventually merged".
It would be interesting to see how Heron compares to Spark with respect to performance. I keep hearing Storm is slower than Spark; does Heron now catch up to, and exceed, Spark in terms of performance?
I have run into the author in other contexts, and if he says Heron is massively faster than Storm, I believe it (without quotable proof). Storm has a wide variety of failure modes (I've presented on "when to use Storm" at a few conferences), such as the difficulty of exactly-once processing (Trident attempted to solve that).
The linked blog posts show that it is faster than Storm. The OP was asking about Spark, though. That's a fair question, since there is a good deal of overlap between the typical Storm use cases and Spark use cases.
Agreed -- the paper mentions "there are a number of issues with respect to making these systems [Spark and Samza] work in its current form at our scale" but unfortunately does not go into any detail.
This could be a very good thing for Apache Storm, depending on how Twitter handles it.
Just to clarify, the Storm version mentioned in the paper and blog post is not an official Apache release and doesn't include many performance improvements included in the newer releases of Apache Storm. There are a lot, and many more on the horizon.
That being said, the performance numbers look impressive, even though there is no way to confirm them since no code or benchmarks have been published. IMHO, until that happens, there's not much to see here (not that I doubt the numbers -- I'd just like to see proof/code).
My hope is that Twitter is dedicated to the projects it has open-sourced, and this is not a case of NIH, but rather an honest effort on Twitter's part to contribute back to the open source community.
@haberman:
Storm implements exactly-once processing through a higher-level API called Trident, which I like to call Storm's "Streams API" since it's not unlike Java 8's Streams API (and it was largely inspired by Cascading). Trident processes data in configurable micro-batches, as opposed to one-at-a-time, which gives it an advantage in terms of throughput, but at the cost of latency. Trident topologies "compile" down to Core spout/bolt topologies (the Trident API has a planner implementation that figures that out -- not unlike an SQL query planner). A sketch of what Trident code looks like follows below.
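For the curious, this is essentially the word-count example from the Trident tutorial (package names are from the pre-Apache backtype.storm releases that were current when the Heron paper came out):

    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import storm.trident.TridentState;
    import storm.trident.TridentTopology;
    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.operation.builtin.Count;
    import storm.trident.testing.FixedBatchSpout;
    import storm.trident.testing.MemoryMapState;
    import storm.trident.tuple.TridentTuple;

    public class TridentWordCount {
        // A Trident "function": one input tuple in, zero or more tuples out.
        public static class Split extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                for (String word : tuple.getString(0).split(" ")) {
                    collector.emit(new Values(word));
                }
            }
        }

        public static TridentTopology buildTopology() {
            // Test spout that replays a fixed set of sentences in small batches.
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                    new Values("the cow jumped over the moon"),
                    new Values("four score and seven years ago"));
            spout.setCycle(true);

            TridentTopology topology = new TridentTopology();
            // Declarative, micro-batched pipeline; the Trident planner compiles
            // this down to ordinary spouts and bolts.
            TridentState wordCounts = topology.newStream("spout1", spout)
                    .each(new Fields("sentence"), new Split(), new Fields("word"))
                    .groupBy(new Fields("word"))
                    .persistentAggregate(new MemoryMapState.Factory(),
                            new Count(), new Fields("count"));
            return topology;
        }
    }

With a transactional spout and state backend, that persistentAggregate is where the exactly-once de-dupe happens.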
The Storm Core API provides at-least-once semantics through an acking mechanism described here [1]. The Trident API builds on top of that to support exactly-once semantics by essentially doing a de-dupe [2].
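A minimal sketch of that acking contract in a Core API bolt (standard backtype.storm API; the bolt itself is a made-up example): emit downstream tuples anchored to the input, then ack or fail the input so the spout knows whether to replay it.

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import java.util.Map;

    public class SplitBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            try {
                for (String word : input.getString(0).split(" ")) {
                    // Anchoring: ties the emitted tuple to the input in the
                    // tuple tree, so a downstream failure propagates back.
                    collector.emit(input, new Values(word));
                }
                collector.ack(input);   // input fully processed at this bolt
            } catch (Exception e) {
                collector.fail(input);  // triggers replay from the spout
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }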
I'm not sure exactly why they don't claim to support this, since Trident is built on top of Storm's Core API.
Assuming you are referring to Spark streaming, forget about any benchmarks you may have seen. Either can be faster than the other depending on how you configure it, and what your use case is. See my presentation on the subject here [3]. With either, you can configure yourself into a corner and screw your performance.
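Just to give a flavor of the knobs involved (the values below are made up purely for illustration): Storm's max spout pending caps the number of in-flight tuples, and Spark Streaming's micro-batch interval trades latency against throughput. Either one can swing a benchmark by itself.

    import backtype.storm.Config;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Illustrative tuning knobs only -- the numbers here are arbitrary, and
    // each one can shift a benchmark result dramatically in either direction.
    public class TuningKnobs {
        public static void main(String[] args) {
            // Storm: bound the number of un-acked tuples in flight per spout
            // task. Too low starves the topology; too high blows up latency.
            Config stormConf = new Config();
            stormConf.setMaxSpoutPending(1000);
            stormConf.setNumWorkers(4);

            // Spark Streaming: the micro-batch interval sets a floor on
            // latency but amortizes scheduling overhead, helping throughput.
            SparkConf sparkConf = new SparkConf()
                    .setAppName("benchmark").setMaster("local[2]");
            JavaStreamingContext jssc =
                    new JavaStreamingContext(sparkConf, Durations.seconds(1));
            // ...attach streams, then jssc.start() in a real job.
        }
    }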
Performance tuning distributed systems is a mysterious art. As is benchmarking. Unfortunately, that fact is frequently exploited for "benchmarketing" purposes. Don't trust any benchmark but your own unless it is fully open-sourced (including configuration).
But what about services that Twitter replicates? The latest applications victimized by Twitter's in-house team are photo-sharing platforms like TwitPic and YFrog.
Back in May, we first heard that Twitter planned to launch its own image-posting tool. Up until that point, third-party apps were giving Twitter that function. Now that Twitter's own photo tool has launched, it's virtually declaring war on the developers who were responsible for creating what has been a very popular aspect of the site.
But you're right: if you take the time to develop an app using their API, it might just get their attention, and then they'll develop their own version and cut you completely out.
The crucial difference here is between using a totally open-sourced technology like Storm (or Heron, if they release it) and being a consumer of a public API with a proprietary implementation, or making use of their platform/users/data.
Twitter can absolutely block you from their API. Or come out with a competing product that integrates with their own systems better than your product does. But if you're using Storm (or Scalding or whatever), there's really nothing they can do to screw with that.
I started looking at Finagle, Finatra, and Scrooge. I love all three projects, but at the same time, the documentation is really, really bad.
If you have the same requirements as Twitter, nice -- but if you detour even a little bit, you are going to have a shitty time. For example, for Thrift, I wanted to use the buffered codec instead of the framed one, and there was not a single document explaining how to do it. I spent ~2 hours perusing unit tests to find an approach, and I still don't know whether it's correct.