
I'd say the difference is in the type of transaction isolation guarantees each system provides. DBSP can process multiple diffs in parallel, and when it's done it outputs a single diff that captures the effects of all the input diffs. DD can additionally attribute each output diff to a specific input diff by assigning each input diff and matching output diff a logical timestamp. This has a cost in terms of complexity and runtime overhead, but it allows strong isolation of concurrent transactions.


But as gz09 said, both DD and DBSP are data-parallel architectures that can evaluate queries concurrently on multiple threads or multiple machines.


The paradox of IVM (incremental view maintenance) is that the concept has been around for a long time; there are hundreds of papers on this topic; hardly anyone would disagree that it's a useful feature for a database, yet no modern database has a satisfactory implementation.


The backend for this app is literally 300 lines of SQL + Rust -- very cool.


The code for it can be found here, in case anyone is looking for it: https://github.com/feldera/techdemo-spreadsheet


[I am the author of the blog] It's been fun working on this demo. FGA is a very cool concept, but building an efficient FGA engine is hard: you basically need to solve a graph reachability problem for each auth request.

So I tried a different approach: precompute all authorization decisions ahead of time and incrementally update the computation in real time. As the post explains, there's no free lunch: there's a space/time tradeoff involved, but overall I think it's very promising.
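
To give a flavor of what that precomputation can look like, here's a minimal SQL sketch (the schema, group_members and group_grants, is made up and far simpler than a real FGA model): the view enumerates all (user, document) pairs a user may view, and an incremental engine keeps it up to date as either input table changes.

    -- Hypothetical schema, for illustration only: derive "who can view
    -- which document" from group membership and per-group grants.
    CREATE VIEW user_can_view AS
    SELECT m.user_id, g.document_id
    FROM group_members m
    JOIN group_grants g ON m.group_id = g.group_id;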


Very interesting! Many years ago I implemented a mandatory access control system with complex access rules. In a similar fashion, I had to precompute authorisation, as it was just too damn slow to do it all on the fly. Not as complicated as yours, but same principle.


The computational complexity of running an analytical query on a database is, at best, O(N), where N is the size of the database. The computational complexity of evaluating queries incrementally over streaming data with a well-designed query engine is O(delta), where delta is the size of the *new* data. If your use case is well served by a database (i.e., can tolerate the latency), then you're certainly better off relying on the more mature technology. But if you need to do some heavy-weight queries and get fresh results in real-time, no DB I can think of can pull that off (including "real-time" databases).


The correct way to think about the problem is in terms of evaluating joins (or any other queries) over changing datasets. And for that you need an engine designed for *incremental* processing from the ground up: algorithms, data structures, the storage layer, and of course the underlying theory. If you don't have such an engine, you're doomed to build layers of hacks and still fail to do it well.

We've been building such an engine at Feldera (https://www.feldera.com/), and it can compute joins, aggregates, window queries, and much more fully incrementally. All you have to do is write your queries in SQL, attach your data sources (stream or batch), and watch results get incrementally updated in real-time.
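
As a rough illustration (table and column names are made up), an incremental query in this style is just ordinary SQL declared as a view; the engine maintains the result as new rows arrive or old ones are retracted:

    -- Illustrative only: revenue per customer, kept up to date
    -- incrementally instead of being recomputed from scratch.
    CREATE VIEW revenue_by_customer AS
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id;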


Is it related to Differential Dataflow / Timely Dataflow? https://github.com/TimelyDataflow/differential-dataflow


We have our own formal model called DBSP: https://docs.feldera.com/papers

It is indeed inspired by timely/differential, but is not exactly comparable to it. One nice property of DBSP is that the theory is very modular and allows adding new incremental operators with strong correctness guarantees, kind of like LEGO bricks for incremental computation. For example, we have a fully incremental implementation of rolling aggregates (https://www.feldera.com/blog/rolling-aggregates), which I don't think any other system can do today.
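
To make "rolling aggregate" concrete, here's a sketch using standard SQL window syntax (table and column names are hypothetical; see the linked post for the exact form Feldera supports):

    -- Sketch: for every trade, the sum of amounts over the preceding
    -- hour for the same symbol, maintained incrementally.
    SELECT
        symbol,
        ts,
        SUM(amount) OVER (
            PARTITION BY symbol
            ORDER BY ts
            RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
        ) AS amount_1h
    FROM trades;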


Fast rolling aggregates are swell. I meet a lot of people who are building trading systems and want this sort of thing, but it usually isn't a great choice: the perfectly rectangular kernel probably isn't the best possible version of the feature, and arbitrary kernels can be well approximated using constant-size state rather than a large buffer storing a sliding window.


Are you aware of any efforts to apply DBSP's theory to a general programming language/environment? From my perspective, DDlog was the most inspiring project in the field of incremental computation, but it seems like all of these projects just lead to implementations of streaming databases or other similar commercial products that fit into Data™ pipelines (no offense). Incremental computation pops up everywhere, from databases to business logic to UI rendering and video game graphics, and I have this hunch that if the problem could be solved at a fundamental level and in an accessible way, we could have revolutionary gains for programmers and programs.


Thanks for the kind words about DDlog :)

The reason DBSP and Differential Dataflow work so well is that they are specialized to relational computations. Relational operators have nice properties that allow evaluating them incrementally. Incremental evaluation for a general-purpose language like Rust is a much, much harder problem.
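
A toy illustration of the kind of property that helps (my framing, not the paper's): a filter is linear over Z-sets, i.e. filter(A + dA) = filter(A) + filter(dA), so the change in its output is just the filter applied to the change in its input:

    -- Maintaining this view only requires looking at inserted or
    -- retracted rows, never re-scanning the whole table.
    SELECT * FROM orders WHERE amount > 100;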

FWIW, DBSP is available as a Rust crate (https://crates.io/crates/dbsp), so you can use it as an embedded incremental compute engine inside your program.


Indeed. I've experimented a bit with abusing DD/DBSP for my purposes by modeling various kinds of data structures in terms of Z-sets, but these efforts have not yielded very impressive results. :)

For how elegant DBSP is, I still found the paper a tough nut to crack, and it really is one of the more accessible theoretical contributions in the space, at least from this grubby programmer's perspective... I hope to devote some time to study and play around more, but in the meantime I'm rooting for you!


Thanks again!

You may want to check out this tutorial for a hands-on introduction to DBSP: https://docs.rs/dbsp/0.28.0/dbsp/tutorial/index.html


Also there's now a DBSP implementation in pure Python! https://github.com/brurucy/pydbsp


(Not who you are replying to) Not sure if it's specifically related to DBSP, but check out incremental Datafun (slide ~55 of https://www.rntz.net/files/stl2017-datafun-slides.pdf) and the paper cited there: A Theory of Changes for Higher-Order Languages: Incrementalizing Lambda-calculi by Static Differentiation (Cai et al., PLDI 2014).


Thank you, I'll add these to my reading list!


Does this offer any non-SQL programmatic interfaces or ways to do Complex Event Processing (e.g. https://www.databricks.com/glossary/complex-event-processing )? A lot of those scenarios would be tough to express in SQL.


Yes, you can write Rust UDFs with Feldera and even use the dbsp crate directly if you'd like.


Sorry, but I don't see much on https://docs.feldera.com/sql/udf/ that suggests how CEP would work. The example is single-valued and doesn't indicate if state or windows for CEP would be supported.


Hi, I've read the DBSP paper and it's a really well-thought-out framework; all the magic seemed so simple with the way the paper laid things out. However, the paper dealt with abelian Z-sets only, and mentioned that in your implementation you also handle the non-abelian aspect of ordering. I was wondering if you guys have published anything about how you did that?


Apologies for the confusion. We indeed only solve incremental computation for Abelian groups, and the paper makes the case that database tables can be modeled as Abelian groups using Z-sets, and that all relational operators (plus aggregation, recursion, and more) can be modeled as operations on Z-sets.


Yes, I might have misworded my question. My question is in relation to this paragraph on page 12:

"Note that the SQL ORDER BY directive can be modelled as a non-linear aggregate function that emits a list. However, such an implementation is not efficiently incrementalizable in DBSP. We leave the efficient handling of ORDER BY to future work."

My understanding is that Feldera does indeed support ORDER BY, which I imagine it does incrementally, thus my question.

The statement in the paper that ordering is not efficiently incrementalisable seems to make sense to me. It is clear that even though Z-sets are not naively able to represent diffs of ordered relations (since Z-sets are unordered), ordering can be modelled as an aggregate that first builds up the first row, then the second row, and so on. Even formulated this way, however, I fail to see how the entire "incrementalised" computation would still be practically incremental, in the sense that the size of the output diff (Z-set) is small as long as the diff of the input is small.

For example, consider the query `select x from y order by x asc`, and suppose the following values occur in the stream of x, in order: 5, 4, 3, 2, 1. Now, consider the incremental diff for the last value, 1. If one models ORDER BY as a list aggregation, then the Z-set diff for the entire computation seems to be

    - [ 2, 3, 4, 5 ]
    + [ 1, 2, 3, 4, 5 ]
which grows with the size of the output set rather than the size of the input diff. If one instead models ORDER BY by, e.g., adding an order column, the diff would be

    - 2 (index: 0)
    - 3 (index: 1)
    - 4 (index: 2)
    - 5 (index: 3)
    + 1 (index: 0)
    + 2 (index: 1)
    + 3 (index: 2)
    + 4 (index: 3)
    + 5 (index: 4)
which once again varies with the size of the output set.


Your explanation of why ORDER BY is not efficiently incrementalizable is spot on. At the moment Feldera ignores the outermost ORDER BY clause, unless it is part of the ORDER BY ... LIMIT pattern, which is SQL's way to express the top-k query.
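
For reference, the supported pattern looks roughly like this (schema made up):

    -- Top-k: only the k best rows need to be tracked, so this can be
    -- maintained incrementally.
    SELECT symbol, price
    FROM quotes
    ORDER BY price DESC
    LIMIT 10;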

So, a better solution is still future work :)


I was with you and thinking Postgres over and over until the second paragraph. Which isn't to say anything bad about your product; it sounds very cool.

But I'd work in "just like Postgres".


Good point. The goal is indeed to be a Postgres of incremental computing: any SQL query should "just work" out of the box with good performance and standard SQL semantics. You shouldn't need a team of experts to use the tool effectively.


A streaming join indeed requires an unbounded buffer in the most general case, when inputs keep growing and any input record on one side of the join can match any record on the other side. However, it does not require inputs to be ordered. An incremental query engine such as Feldera or Materialize can handle out-of-order data and offer strong consistency guarantees (disclaimer: I am a developer of Feldera). In practice, unbounded buffers can often be avoided as well. This may require a specialized join such as an as-of join (https://www.feldera.com/blog/asof-join) and some GC machinery.
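
For readers unfamiliar with as-of joins, here's a rough sketch (my example, in Snowflake-style syntax; Feldera's exact grammar is in the linked post): each transaction is paired with the latest price at or before its timestamp, so only a bounded amount of price history needs to be retained.

    -- Rough sketch of an as-of join, Snowflake-style syntax.
    SELECT t.id, t.ts, p.price
    FROM transactions t
    ASOF JOIN prices p
        MATCH_CONDITION (t.ts >= p.ts)
        ON t.symbol = p.symbol;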


Both reads and writes are O(1) in time complexity. Writes additionally have the log(N) amortized cost of maintaining the LSM tree.


Gotcha! Thanks for the clarification.


That's right, we perform static dataflow analysis to determine what data can get discarded. GC itself is done lazily as part of LSM tree maintenance. For MAX specifically, we don't have this optimization yet. In the general case, incrementally maintaining the MAX aggregate in the presence of insertions and deletions requires tracking the entire contents of the group, which is what we do. If the collection can be proved to be append-only, then it's sufficient to store only the current max element. This optimization is yet to come to Feldera.
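
A concrete way to see why deletions are the problem (hypothetical schema):

    -- Maintaining MAX per group:
    SELECT user_id, MAX(amount) AS max_amount
    FROM purchases
    GROUP BY user_id;
    -- If a group holds {3, 7, 9}, MAX = 9. Deleting the row with value 9
    -- makes the new MAX 7, which cannot be derived from the old MAX alone,
    -- so the engine must retain the whole group. With an append-only input
    -- there are no deletions, and storing just the current MAX is enough.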

