Skimming the paper, I found it hard to get a handle on exactly what kind of abstraction and guarantees this system implements (I am a co-author on one of the cited papers, so I'm reasonably familiar with this space). Maybe it is easier to understand for people who are already familiar with Storm.
Most notably, I couldn't determine whether the system is stateful, or what kind of guarantees (if any) are provided regarding stateful processing.
For example, the paper says Heron is used to compute real-time active user counts. That implies that the system needs some way to keep track of and "remember" how many unique users it has seen in the last N hours or whatever. How does Heron model this state and how does it guarantee (if it does) that a crashing node will not lose its accumulated state?
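To make concrete what I mean by "accumulated state", here's a toy sketch in plain Java (entirely hypothetical, nothing to do with Heron's actual API) of the kind of windowed unique-user state such a computation has to keep. If this map lives only in a worker's heap, a crash wipes out the whole window:

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    // Hypothetical sketch (not Heron's API): the accumulated state behind a
    // real-time "unique active users in the last N hours" computation.
    public class WindowedUniqueUsers {
        private final long windowMillis;
        // userId -> last time we saw them
        private final Map<String, Long> lastSeen = new HashMap<>();

        public WindowedUniqueUsers(long windowMillis) {
            this.windowMillis = windowMillis;
        }

        public void observe(String userId, long timestampMillis) {
            lastSeen.put(userId, timestampMillis);
        }

        public int uniqueUsers(long nowMillis) {
            // Evict users not seen within the window, then count what's left.
            Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
            while (it.hasNext()) {
                if (nowMillis - it.next().getValue() > windowMillis) {
                    it.remove();
                }
            }
            return lastSeen.size();
        }
    }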
In my experience this is the hardest part, by far, of stream processing, so when I see any work in this area it's the first thing I am curious to learn about. A system that guarantees strong consistency (i.e. accurate counts) even in the presence of node crashes is way, way harder to get right (and a lot more expensive, resource-wise) than one that assumes it's ok to lose a little bit of data.
It looks like Heron implements only at-most-once and at-least-once semantics, so maybe that is my answer there. You need exactly-once semantics to get robust and reliable answers, and you need to guarantee that state changes are atomic with the exactly-once semantics.
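For reference, the standard trick for making state changes atomic with exactly-once semantics (roughly what Trident's transactional state does; the names below are my own) is to persist the batch/transaction id together with the value in a single atomic write, so a replayed batch can be detected and skipped:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the "atomic state change" idea (hypothetical names, modeled
    // on Trident's transactional state): the (txId, count) pair is one record,
    // so the two are always updated together.
    public class TransactionalCounter {
        private static class Entry {
            final long txId;
            final long count;
            Entry(long txId, long count) { this.txId = txId; this.count = count; }
        }

        // Stands in for a durable store.
        private final Map<String, Entry> store = new HashMap<>();

        public void applyBatch(String key, long txId, long delta) {
            Entry current = store.get(key);
            if (current != null && current.txId >= txId) {
                return; // replayed batch: already applied, skip (the "de-dupe")
            }
            long base = (current == null) ? 0 : current.count;
            store.put(key, new Entry(txId, base + delta));
        }

        public long get(String key) {
            Entry e = store.get(key);
            return e == null ? 0 : e.count;
        }
    }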
Of course some systems are ok with their output degrading a little when nodes crash. It's not the end of the world if the active user count is a little off. But beware of tolerating this too much -- the bad thing about allowing data loss is that it tends to come in storms (no pun intended). Once something goes wrong, the answers can be way off. The error is not bounded in most cases I've encountered.
From the paper, "The design of Heron allows supporting exactly once semantics, but the first version of Heron does not have this implementation. One reason for tolerating the lack of exactly once semantics is that Summingbird [8] simultaneously generates both a Heron query
and an equivalent Hadoop job, and in our infrastructure the
answers from both these parts are eventually merged".
It would be interesting to see how Heron compares to Spark with respect to performance. I keep hearing Storm is slower than Spark; does Heron now catch up to, and exceed, Spark in terms of performance?
I have run into the author in other contexts, and if he says Heron is massively faster than Storm, I believe it (without quotable proof). Storm has a wide variety of failure modes (I've presented on "when to use Storm" at a few conferences), such as the difficulty of exactly-once processing (Trident attempted to solve that).
The linked blog posts show that it is faster than Storm. The OP was asking about Spark, though. That's a fair question, since there is a good deal of overlap between the typical Storm use cases and Spark use cases.
Agreed -- the paper mentions "there are a number of issues with respect to making these systems [Spark and Samza] work in its current form at our scale" but unfortunately does not go into any detail.
This could be a very good thing for Apache Storm, depending on how Twitter handles it.
Just to clarify, the Storm version mentioned in the paper and blog post is not an official Apache release and doesn't include many performance improvements included in the newer releases of Apache Storm. There are a lot, and many more on the horizon.
That being said, the performance numbers look impressive, even though there is no way to confirm them since no code or benchmarks have been published. IMHO, until that happens, there's not much to see here (not that I doubt the numbers -- I'd just like to see proof/code).
My hope is that Twitter is dedicated to the projects it has open-sourced, and this is not a case of NIH, but rather an honest effort on Twitter's part to contribute back to the open source community.
@haberman:
Storm implements exactly-once processing through a higher-level API called Trident, which I like to call Storm's "Streams API" since it's not unlike Java 8's Streams API (and it was largely inspired by Cascading). Trident processes data in configurable micro-batches, as opposed to one-at-a-time, which gives it an advantage in terms of throughput, but at the cost of latency. Trident topologies "compile" down to Core spout/bolt topologies (the Trident API has a planner implementation that figures that out -- not unlike an SQL query planner). A sketch of what Trident code looks like follows below.
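For the curious, this is essentially the word-count example from the Trident tutorial (package names are from the pre-Apache backtype.storm releases that were current when the Heron paper came out):

    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import storm.trident.TridentState;
    import storm.trident.TridentTopology;
    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.operation.builtin.Count;
    import storm.trident.testing.FixedBatchSpout;
    import storm.trident.testing.MemoryMapState;
    import storm.trident.tuple.TridentTuple;

    public class TridentWordCount {
        // A Trident "function": one input tuple in, zero or more tuples out.
        public static class Split extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                for (String word : tuple.getString(0).split(" ")) {
                    collector.emit(new Values(word));
                }
            }
        }

        public static TridentTopology buildTopology() {
            // Test spout that replays a fixed set of sentences in small batches.
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                    new Values("the cow jumped over the moon"),
                    new Values("four score and seven years ago"));
            spout.setCycle(true);

            TridentTopology topology = new TridentTopology();
            // Declarative, micro-batched pipeline; the Trident planner compiles
            // this down to ordinary spouts and bolts.
            TridentState wordCounts = topology.newStream("spout1", spout)
                    .each(new Fields("sentence"), new Split(), new Fields("word"))
                    .groupBy(new Fields("word"))
                    .persistentAggregate(new MemoryMapState.Factory(),
                            new Count(), new Fields("count"));
            return topology;
        }
    }

With a transactional spout and state backend, that persistentAggregate is where the exactly-once de-dupe happens.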
The Storm Core API provides at-least-once semantics through an acking mechanism described here [1]. The Trident API builds on top of that to support exactly-once semantics by essentially doing a de-dupe [2].
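A minimal sketch of that acking contract in a Core API bolt (standard backtype.storm API; the bolt itself is a made-up example): emit downstream tuples anchored to the input, then ack or fail the input so the spout knows whether to replay it.

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import java.util.Map;

    public class SplitBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            try {
                for (String word : input.getString(0).split(" ")) {
                    // Anchoring: ties the emitted tuple to the input in the
                    // tuple tree, so a downstream failure propagates back.
                    collector.emit(input, new Values(word));
                }
                collector.ack(input);   // input fully processed at this bolt
            } catch (Exception e) {
                collector.fail(input);  // triggers replay from the spout
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }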
I'm not sure exactly why they don't claim to support this, since Trident is built on top of Storm's Core API.
Assuming you are referring to Spark streaming, forget about any benchmarks you may have seen. Either can be faster than the other depending on how you configure it, and what your use case is. See my presentation on the subject here [3]. With either, you can configure yourself into a corner and screw your performance.
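Just to give a flavor of the knobs involved (the values below are made up purely for illustration): Storm's max spout pending caps the number of in-flight tuples, and Spark Streaming's micro-batch interval trades latency against throughput. Either one can swing a benchmark by itself.

    import backtype.storm.Config;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Illustrative tuning knobs only -- the numbers here are arbitrary, and
    // each one can shift a benchmark result dramatically in either direction.
    public class TuningKnobs {
        public static void main(String[] args) {
            // Storm: bound the number of un-acked tuples in flight per spout
            // task. Too low starves the topology; too high blows up latency.
            Config stormConf = new Config();
            stormConf.setMaxSpoutPending(1000);
            stormConf.setNumWorkers(4);

            // Spark Streaming: the micro-batch interval sets a floor on
            // latency but amortizes scheduling overhead, helping throughput.
            SparkConf sparkConf = new SparkConf()
                    .setAppName("benchmark").setMaster("local[2]");
            JavaStreamingContext jssc =
                    new JavaStreamingContext(sparkConf, Durations.seconds(1));
            // ...attach streams, then jssc.start() in a real job.
        }
    }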
Performance tuning distributed systems is a mysterious art. As is benchmarking. Unfortunately, that fact is frequently exploited for "benchmarketing" purposes. Don't trust any benchmark but your own unless it is fully open-sourced (including configuration).
But what about services that Twitter replicates? The latest applications victimized by Twitter's in-house team are photo-sharing platforms like TwitPic and YFrog.
Back in May, we first heard that Twitter planned to launch its own image-posting tool. Up until that point, third-party apps were giving Twitter that function. Now that Twitter's own photo tool has launched, it's virtually declaring war on the developers who were responsible for creating what has been a very popular aspect of the site.
But you're right: if you take the time to develop an app using their API, it might just get their attention, and then they'll develop their own version and cut you completely out.
The crucial difference here is between using a totally open-sourced technology like Storm (or Heron, if they release it) and being a consumer of a public API with a proprietary implementation, or making use of their platform/users/data.
Twitter can absolutely block you from their API. Or come out with a competing product that integrates with their own systems better than your product does. But if you're using Storm (or Scalding or whatever), there's really nothing they can do to screw with that.
I started looking at Finagle, Finatra, and Scrooge. I love all three projects, but at the same time, the documentation is really, really bad.
If you have the same requirements as Twitter, nice -- but if you detour even a little bit, you are going to have a shitty time. For example, for Thrift, I wanted to use the buffered codec instead of the framed one, and there was not a single document explaining how to do it. I spent ~2 hours perusing unit tests to find an approach, and I still don't know whether it's correct.