The headline refers to "incrementally updated materialized views". How does a company get funding for a feature that has already existed in other DBs for at least a decade?
E.g., Vertica refers to this as Live Aggregate Projections.
It's a cool concept but comes with huge caveats. Keeping track of non-estimated cardinality for COUNT DISTINCT-type queries, as an example.
(Disclaimer: I'm one of the engineers at Materialize.)
> How does a company get funding for a feature that has already existed in other DBs for at least a decade? ... It's a cool concept but comes with huge caveats.
I think you answered your own question here. Incrementally-maintained views in existing database systems typically come with huge caveats. In Materialize, they largely don't.
Most other systems place severe restrictions on the kind of queries that can be incrementally maintained, limiting the queries to certain functions only, or aggregations only, or only queries without joins—or if they do support maintaining joins, often the joins must occur only on the involved tables' keys. In Materialize, by contrast, there are approximately no such restrictions. Want to incrementally-maintain a five-way join where some of the join keys are expressions, not key columns? No problem.
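For instance, a view along these lines (the schema here is invented purely for illustration) is the kind of thing Materialize will keep up to date incrementally, including the join on lower(u.email), which is an expression rather than a key column:

    -- Five-way join where one join key is an expression, not a column.
    -- All table and column names below are made up for the example.
    CREATE MATERIALIZED VIEW revenue_by_region AS
    SELECT r.name AS region,
           count(*) AS orders,
           sum(li.quantity * li.unit_price) AS revenue
    FROM orders o
    JOIN users u ON o.user_id = u.id
    JOIN regions r ON u.region_id = r.id
    JOIN invites i ON lower(i.invitee_email) = lower(u.email)
    JOIN line_items li ON li.order_id = o.id
    GROUP BY r.name;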
That's not to say there aren't some caveats. We don't yet have a good story for incrementally-maintaining queries that observe the current wall-clock time [0]. And our query optimizer is still young (optimization of streaming queries is a rather open research problem), so for some more complicated queries you may not get the resource utilization you want out of the box.
But, for many queries of impressive complexity, Materialize can incrementally-maintain results far faster than competing products—if those products can incrementally maintain those queries at all.
The technology that makes Materialize special, in our opinion, is a novel incremental-compute framework called differential dataflow. There was an extensive HN discussion on the subject a while back that you might be interested in [1].
This is one of my favorite types of HN comments: admits the bias upfront, offers a meaningful technical answer, and links to relevant documents for a deeper dive. Thank you so much!
Thanks for the explanation. I'm going to look more into this, as I'm working on a new service on top of Vertica. There is a lot I don't like about Vertica, and I don't see alternatives such as Snowflake as much of an improvement.
Hi - I'm enjoying reading the discussion around this, and the previous discussion [1] as well. It's possible that Materialize can help us transition a really complex pipeline to real-time.
To the short discussion here [0] about window functions - any update to that in the last 9 months?
In a lot of cases, our workloads involve ingesting records and keeping track of whether N records of a similar type have been seen within any 15-minute interval. The records do not arrive in chronological order. Is this currently a potential use case for Materialize?
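For concreteness, if fixed 15-minute buckets were close enough (the real requirement is more like a sliding window, hence the window-function question above), the shape of query we'd want maintained is roughly the following, with made-up table and column names and N = 10 as an example:

    -- Bucket each record into the 15-minute (900-second) window its own
    -- timestamp falls into, so late arrivals still count toward the right
    -- bucket, then keep only buckets that reach the threshold.
    SELECT record_type,
           floor(extract(epoch FROM seen_at) / 900) AS bucket_15min,
           count(*) AS records_seen
    FROM events
    GROUP BY record_type, floor(extract(epoch FROM seen_at) / 900)
    HAVING count(*) >= 10;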
What about the other big problem ignored here:
does your streaming platform separate compute and storage?
Because GCP DataFlow does. Flink doesn't.
DataFlow allows you to elastically scale the compute you need (as do Snowflake and Databricks). If you can't do that, materialized views will be a more niche feature for bigger 24x7 deployments with predictable workloads.
As George points out above, we haven't added our native persistence layer yet. Consistency guarantees are something we care a lot about, so for many scenarios we leverage the upstream datastore (often Kafka).
But to answer your question, yes, our intention is to support separate cloud-native storage layers.
My dim and distant recollection is that Beam and/or GCP Data Flow require someone to implement PCollections and PTransforms to get the benefit of that magic. That's not a trivial exercise, compared to writing SQL.
In particular, there are important constraints like (among others)
> The projections can reference only one table.
In Materialize you can spin up just about any SQL92 query, join eight relations together, have correlated subqueries, count distinct if you want. It is then all maintained incrementally.
The lack of caveats is the main difference from the existing systems.
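To tie this back to the COUNT DISTINCT caveat raised above: an exact (non-estimated) distinct count like the one below (schema invented for illustration) is the sort of thing that just gets maintained incrementally:

    -- Exact distinct-user count per day, kept up to date as events arrive.
    CREATE MATERIALIZED VIEW daily_distinct_users AS
    SELECT date_trunc('day', created_at) AS day,
           count(DISTINCT user_id) AS distinct_users
    FROM events
    GROUP BY date_trunc('day', created_at);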
> The headline refers to "incrementally updated materialize views". How does a company get funding for a feature that has already existed in other DBs for at least a decade?
They're getting funding for doing it much more efficiently.
I read into the background papers when it first popped up. This is legitimate, deep computer science that other DBs don't yet have.