Hacker News | frankmcsherry's comments

> Sounds like basic memoization and topological sort gets you all the way there?

I don't really want to pull rank here, but for the benefit of other readers: 100% nope.

I personally find the "make toxic comments to draw folks out" rhetorical style frustrating, so I'll just leave you with a video from Clojure/conj about how nice it would be to be able to use DD from Clojure, to get a proper reactive Datomic experience.

https://www.youtube.com/watch?v=ZgqFlowyfTA


My comment was basically (paraphrasing here) "given that my understanding of the problem is that it can be pulled off using simple constructs X and Y, it seems like most people wouldn't need to pull in framework Z".

It's puzzling to me why you _wouldn't_ want to "pull rank", as you say. I did not pretend to be an expert in this domain. I'm really just sharing my understanding and speculating about why people apparently aren't using this framework, which is what the damn submission is about. Did you even read it?

It seems like I managed to piss off a bunch of users of the framework, who - rather than simply explain in clear terms why I'm supposedly wrong - instead just downvote away and make passive-aggressive comments that assume I'm some sort of troll.

Remind me to never engage with the Rust community again. Jfc.

Edit: Oh, so you're the creator of the framework? If you go straight to calling people toxic when they have questions about it, I think I understand why no one wants to use it.


I completely understand where you're coming from and I've been downvoted for expressing non-popular views here, and I relate to your frustration.

That being said, rest assured that your experience says absolutely nothing about the wider Rust community. It's one of the most helpful ones I've engaged with.

So please don't judge it by one strangely toxic framework creator.


> That being said, rest assured that your experience says absolutely nothing about the wider Rust community. It's one of the most helpful ones I've engaged with.

It's very common to see people with toxic attitudes in and around the Rust community, even in their internal communication about how to use Rust (`actix-web`, anyone?). I don't think it's helpful to lie to yourself about the Rust community like this.

The only thing Rust users who don't want to have these conversations can do is openly recognize and talk about the extreme fanaticism Rust users commonly display, and the toxic patterns of communication that sometimes come bundled with it, when it comes to priorities in software dev.


What is “toxic” about the comment? That sounds like a legitimate question to me.


From what I see it's the dismissive way it was posed, with little curiosity about the real challenges. Similar to the 'oh I could build that in a weekend' style comments that are pretty exhausting for creators to have to deal with.


This submission is literally about why people aren't using some Rust framework. I add my two cents as to why that might be and then that gets called toxic and dismissive.

Seems like many people here aren't actually willing to engage in a discussion. I guess this submission is basically just native advertising for the framework in question.


Thank you. I find this assumption of bad faith quite frustrating. Every comment I make in this thread seems to be instantly downvoted.


I think maybe they were confused by this text (which I agree has nothing to do with Rust itself breaking):

> With the release of Apache Arrow 3.0.0 there are many breaking changes in the Rust implementation and as a result it has been necessary to comment out much of the code in this repository and gradually get each Rust module working again with the 3.0.0 release.


Ah, yes, that makes sense. I can see how this could have been misread.


aye - it ultimately was my misreading of the commit history. Agree that this wasn't a Rust-specific change.


> If SQL had a way of picking one row from a group, rather than aggregating over it, that would be immensely useful.

You can do this with a LATERAL join, if you want to avoid the jankiness of window functions. Lateral joins are just a programmatic way to introduce a correlated subquery. For example

    SELECT department, first_name, gross_salary FROM
        (SELECT DISTINCT department FROM salary) depts,
        LATERAL (
            SELECT first_name, gross_salary
            FROM salary
            WHERE department = depts.department
            ORDER BY gross_salary DESC
            LIMIT 3
        ) top_salaries

This uses a limit of 3 to show off top-3 instead of just argmax, but you could clearly set that to one if you wanted. This construction can be pretty handy if you need the per-group rows to be something other than what a window function could handle.


There are a few differences, the main one between Spark and timely dataflow is that TD operators can be stateful, and so can respond to new rounds of input data in time proportional to the new input data, rather than that plus accumulated state.

So, streaming one new record in and seeing how this changes the results of a multi-way join with many other large relations can happen in milliseconds in TD, vs batch systems which will re-read the large inputs as well.

This isn't a fundamentally new difference; Flink had this difference from Spark as far back as 2014. There are other differences between Flink and TD that have to do with state sharing and iteration, but I'd crack open the papers and check out the obligatory "related work" sections each should have.

For example, here's the first para of the Related Work section from the Naiad paper:

> Dataflow Recent systems such as CIEL [30], Spark [42], Spark Streaming [43], and Optimus [19] extend acyclic batch dataflow [15, 18] to allow dynamic modification of the dataflow graph, and thus support iteration and incremental computation without adding cycles to the dataflow. By adopting a batch-computation model, these systems inherit powerful existing techniques including fault tolerance with parallel recovery; in exchange each requires centralized modifications to the dataflow graph, which introduce substantial overhead that Naiad avoids. For example, Spark Streaming can process incremental updates in around one second, while in Section 6 we show that Naiad can iterate and perform incremental updates in tens of milliseconds.


That's very helpful, thanks! I think I still have to wrap my head around _why_ being stateful allows TD to respond faster, but maybe I just gotta dig deeper and see for myself.


It just comes down to something as simple as: "if I have shown you 1M different things, and now show you one more thing, what do you have to do to tell me whether that one thing is new or not?"

If you can keep a hash map of the things you've seen, then it is easy to respond quickly to that one new thing. If you are not allowed to maintain any state, then you don't have a lot of options to efficiently respond to the new thing, and most likely need to re-read the 1M things.
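The contrast can be sketched in a few lines of Rust. This is an illustration of the idea only, not timely dataflow's actual API; the names `is_new_stateless` and `DistinctOperator` are made up for the example.

```rust
use std::collections::HashSet;

// Stateless check: with no retained state, deciding whether `item` is new
// means re-scanning everything seen so far -> O(n) work per new item.
fn is_new_stateless(seen_so_far: &[u64], item: u64) -> bool {
    !seen_so_far.contains(&item)
}

// Stateful operator: keep a hash set of seen items across rounds of input,
// so each new item costs O(1) expected time regardless of history size.
struct DistinctOperator {
    seen: HashSet<u64>,
}

impl DistinctOperator {
    fn new() -> Self {
        DistinctOperator { seen: HashSet::new() }
    }

    // Records `item`; returns true if it had not been seen before.
    fn observe(&mut self, item: u64) -> bool {
        self.seen.insert(item)
    }
}
```

After a million `observe` calls, the next call is still constant time; the stateless version has to touch all million prior records again.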

That's the benefit of being stateful. There is a cost too, which is that you need to be able to reconstruct your state in the case of a failure, but fortunately things like differential dataflow (built on TD) are effectively deterministic.

Also, I suspect "Spark" is a moving target. The original paper described something that was very much a batch processor; they've been trying to fix that since, and perhaps they've made some progress in the intervening years.


I see. To my small brain it sounds like TD can intelligently memoize or cache the outputs of each "step" so that it only recalculates when it needs to as the inputs change.

I think Spark does that sometimes these days, but I don't know much about the specifics of how and when Spark does it.

Does TD have to keep _everything_ in memory, or can it be strategic in what it keeps and what it evicts?


TD lets you write whatever logic you want (it is fairly unopinionated on your logic and state).

Differential dataflow plugs in certain logic there, and it does indeed maintain a synopsis of what data have gone past, sufficient to respond to future updates but not necessarily the entirety of data that it has seen.

It would be tricky to implement DD over classic Spark, as DD relies on these synopses for its performance. There are some changes to Spark proposed in recent papers where it can pull in immutable LSM layers w/o reading them (e.g. just mmapping them) that might improve things, but until that happens there will be a gap.


Gotcha. Thanks for answering all my q's!


You should check out Apache Flink. It does a bunch of those things that Spark doesn't, though it's also missing a few things that Spark has. https://flink.apache.org/


I think the main distinction is around "interactivity" and how long it takes from typing a query to getting results out. Once you stand up a Flink dataflow, it should move along at a brisk clip. But standing up a new dataflow is relatively heavy-weight for them; typically you have to re-flow all of the data that feeds the query.

Materialize has a different architecture that allows more state-sharing between operators, and allows many queries to spin up in milliseconds. Primarily, this is when your query depends on existing relational data in pre-indexed form (e.g. joining by primary and foreign keys).

You can read a bit more in an overview blog post [0] and in more detail in a VLDB paper [1]. I'm sure there are a number of other quirks distinguishing Flink and Materialize, probably several in their favor, but this is the high-order bit for me.

[0]: https://materialize.com/materialize-under-the-hood/ [1]: http://www.vldb.org/pvldb/vol13/p1793-mcsherry.pdf


It's a good question, but you'd have to ask them I think. Tamas (from Itemis) and I were in touch for a while, mostly shaking out why DD was out-performing their previous approach, but I haven't heard from him since.

My context at the time was that they were focused on doing single rounds of incremental updates, as in a PL UX, whereas DD aims at high throughput changes across multiple concurrent timestamps. That's old information though, so it could be very different now!


Thanks for the reply!

A while ago (2018), the people behind VIATRA performed a cross-technology benchmark where they compared their performance to 9 other incremental and non-incremental solutions (Neo4j, Drools, OCL, SQLite, MySQL, among others) [1]. Perhaps it could be interesting to rerun that benchmark while including Materialize?

This would give us a direct comparison between Materialize and other existing solutions. Their benchmark is however based on a kind of UX case, so the tests might be a bit biased towards that use case.

[1] The Train Benchmark: cross-technology performance evaluation of continuous model queries


Here's my take on this, from a few months back:

https://materialize.com/lateral-joins-and-demand-driven-quer...


It's easier to describe the things that cannot be materialized.

The only rule at the moment is that you cannot maintain queries that use the functions `current_time()`, `now()`, and `mz_logical_timestamp()`. These are quantities that change automatically without the data changing, and shaking out what maintaining them should mean is still open.

Other than that, any SELECT query you can write can be materialized and incrementally maintained.

https://materialize.com/docs/sql/select/


there are messages like this in the docs:

> "WARNING! LATERAL subqueries can be very expensive to compute. For best results, do not materialize a view containing a LATERAL subquery without first inspecting the plan via the EXPLAIN statement. In many common patterns involving LATERAL joins, Materialize can optimize away the join entirely. "

I take this to mean that Materialize cannot always efficiently maintain a view with lateral joins - that's fine neither can SQL Server, but it would be nice if I could find all these exceptions in one place like I can for SQL Server.

..fwiw I prefer the behavior of failing early rather than letting potential severe performance problems into prod.

[1] https://materialize.com/docs/sql/join/#lateral-subqueries


> I take this to mean that Materialize cannot efficiently maintain a view with lateral joins [...]

Well, no this isn't a correct take. Lateral joins introduce what is essentially a correlated subquery, and that can be surprisingly expensive, or it can be fine. If you aren't sure that it will be fine, check out the plan with the EXPLAIN statement.

Here's some more to read about lateral joins in Materialize:

https://materialize.com/lateral-joins-and-demand-driven-quer...


sorry you missed my ninja-edit - it sounds like SOME lateral join queries CAN be efficiently maintained but not ALL (not the ones that are surprisingly expensive for whatever reason). that's where the promise of "we can materialize any query!" starts to fall apart for me. presumably the surprisingly expensive cases are the ones where some rewrite rules can't guarantee correctness without hiding indexes or predicate pushdowns or whatever. the doc says review the explain plan first, but what precisely about the explain plan would tell me that the materialized view won't be efficiently maintained? ideally these cases can be known ahead of time so I can come up with a conformant query rather than trying variations to see what works.

..and more to the point, there are obviously limits to what can be efficiently maintained. I would love to see that list, as this is what would give me a good idea of how Materialize compares to my daily driver RDBMS, which happens to be SQL Server and with whose limits I'm unfortunately intimately familiar.


I don't think there is anything fundamentally different from an existing database. In all relational databases, some lateral joins can be expensive to compute. In Materialize, those same lateral joins will also be expensive to maintain.

I'd be surprised to hear you beat up postgres or SQL Server because they claim they can evaluate any SQL query, but it turns out that some SQL queries can be expensive. That's all we're talking about here.


I am genuinely interested in Materialize's capability to incrementally maintain views and I understand there are all sorts of limitations as to when that's even possible - I can't find a comprehensive list of them. I don't think it's fair to say you support every possible select statement and then just have some of them be slow. The lateral join case was the first warning I encountered in the docs - is that the ONLY case and every other possible select statement can be incrementally maintained?


All queries are incrementally maintained with the property that we do work proportional to the number of records in difference at each intermediate stage of the query plan. That includes those with lateral joins; they are not an exception.

I'm not clear on your "all sorts of limitations"; you'll have to fill me in on them?


> I'm not clear on your "all sorts of limitations"; you'll have to fill me in on them?

this feels like bait but honestly I'm under the impression that incrementally updating materialized views (where optimal = work proportional to the changed records) just isn't always possible. for example, max and min aggregates aren't supported in SQL Server because updating the current max or min record requires a query to find the new max or min record - that's not considered an incremental update, so it's not supported and trying to materialize the view fails. there are a number of cases like this, and a big part of problem solving with SQL Server is figuring out how to structure a view within these constraints. if you can, then you can rest assured that updates will be incremental and performant - this is important because performance is the feature; if the update is slow then my app is broken. if Materialize has a list of constraints shorter than SQL Server's then you're sitting on technology worth billions - it's hard for me to believe that your list of constraints is "there are none", especially when there are explicit-but-vague performance warnings in the docs.


(Disclaimer: I'm one of the engineers at Materialize)

> for example, max and min aggregates aren't supported in SQL Server because updating the current max or min record requires a query to find the new max or min record

This isn't a requirement in Materialize, because Materialize will store values in a reduction tree (which is basically like a min / max heap) so that when we add or remove a record, we can compute a new min / max in O(log (total_number_of_records)) time in the worst case (when a record is the new min / max). Realistically, that log term is bounded to 16 (it's a 16-ary heap and we don't support more than 2^64 records). Computing the min / max this way is substantially better than having to recompute with a linear scan. This [1] provides a lot more details on how we compute reductions in Materialize.
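The idea can be sketched with a simplified stand-in: an ordered multiset that maintains MIN and MAX under both inserts and deletes in O(log n) per update, instead of rescanning the relation. (This is an illustration only; Materialize's actual structure is the 16-ary reduction tree described above, maintained as a dataflow, and the type `MinMaxMaintainer` here is invented for the example.)

```rust
use std::collections::BTreeMap;

// Ordered multiset: value -> multiplicity. Supports insert, delete, and
// min/max queries, each in O(log n), so removing the current max does NOT
// require a rescan to find the runner-up -- it is the next key down.
struct MinMaxMaintainer {
    counts: BTreeMap<i64, usize>,
}

impl MinMaxMaintainer {
    fn new() -> Self {
        MinMaxMaintainer { counts: BTreeMap::new() }
    }

    fn insert(&mut self, value: i64) {
        *self.counts.entry(value).or_insert(0) += 1;
    }

    fn delete(&mut self, value: i64) {
        if let Some(count) = self.counts.get_mut(&value) {
            *count -= 1;
            if *count == 0 {
                self.counts.remove(&value);
            }
        }
    }

    fn min(&self) -> Option<i64> {
        self.counts.keys().next().copied()
    }

    fn max(&self) -> Option<i64> {
        self.counts.keys().next_back().copied()
    }
}
```

This is exactly the case SQL Server's indexed views reject: deleting the current max here just exposes the next key, with no linear scan.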

> there are obviously limits to what can be efficiently maintained

I think we fundamentally disagree here. In our view, we should be able to maintain every view either in linear time wrt the number of updates or sublinear time with respect to the overall dataset, and every case that doesn't do so is a bug. The underlying computational frameworks [2] we're using are designed for that, so this isn't just like a random fantasy.

> if Materialize has a list of constraints shorter than SQL Server's then you're sitting on technology worth billions

Thank you! I certainly hope so!

[1]: https://materialize.com/robust-reductions-in-materialize/ [2]: https://github.com/timelydataflow/differential-dataflow/blob...


> In our view, we should be able to maintain every view either in linear time wrt the number of updates or sublinear time with respect to the overall dataset, and every case that doesn't do so is a bug.

This is awesome and I believe that should be technically possible for any query given the right data structure. The reduction tree works for min/max but is it a general solution or are there other data structures for other purposes - n per x and top subqueries come to mind. Is it all handled already or are there some limitations and a roadmap?


I'm not entirely sure what you mean by n per x, but if by top you mean something like "get top k records by group" then we support that. See [1] for more details. top-k is actually also rendered with a heap-like dataflow.

When we plan queries we are rendering them into dataflow graphs that consist of one or more dataflow operators transforming data and sending it to other operators. Every single operator is designed to do work proportional to the number of changes in its inputs / outputs. For us, optimizing our performance is a little bit less a matter of the right data structures, and more about expressing things in a dataflow that can handle changes to inputs robustly. But the robustness is more a question of "what are my constant factors when updating results" and not "is this being incrementally maintained or not".

We have a known limitations page in our docs here [2] but it mostly covers things like incompleteness in our SQL support or Postgres compatibility. We published our roadmap in a blog post a few months ago here [3]. Beyond that everything is public on Github [4].

[1]: https://materialize.com/docs/sql/idioms/ [2]: https://materialize.com/docs/known-limitations/ [3]: https://materialize.com/blog-roadmap/ [4]: https://github.com/MaterializeInc/materialize


Min and max work using a hierarchical reduction tree, the dataflow equivalent of a priority queue. They will update, under arbitrary changes to the input relation, in time proportional to the number of those changes.

> [...] it's hard for me to believe that your list of constraints is "there are none" especially when there are explicit-but-vague performance warnings in the docs.

I think we're done here. There's plenty to read if you are sincerely interested, it's all public to both try and read, but you'll need to find someone new to ask, ideally with a less adversarial communication style.


Sorry, I’m definitely overly pessimistic when it comes to new database tech - you’ll find us industry-hardened RDBMS users hard to convince (we’ve been through a lot). Thanks for chatting!


Hi! I work at Materialize.

I think the right starter take is that Materialize is a deterministic compute engine, one that relies on other infrastructure to act as the source of truth for your data. It can pull data out of your RDBMS's binlog, out of Debezium events you've put in to Kafka, out of local files, etc.

On failure and restart, Materialize leans on the ability to return to the assumed source of truth, again a RDBMS + CDC or perhaps Kafka. I don't recommend thinking about Materialize as a place to sink your streaming events at the moment (there is movement in that direction, because the operational overhead of things like Kafka is real).

The main difference is that unlike an OLTP system, Materialize doesn't have to make and persist non-deterministic choices about e.g. which transactions commit and which do not. That makes fault-tolerance a performance feature rather than a correctness feature, at which point there are a few other options as well (e.g. active-active).

Hope this helps!


Hi, I work at Materialize.

You can read about Vertica's "Live Aggregate Projections" here:

https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/An...

In particular, there are important constraints like (among others)

> The projections can reference only one table.

In Materialize you can spin up just about any SQL92 query, join eight relations together, have correlated subqueries, count distinct if you want. It is then all maintained incrementally.

The lack of caveats is the main difference from the existing systems.

