More

nchammas · 2025-02-28T02:58:24 1740711504

There is an old project out of Berkeley called BOOM [1] that developed a language for distributed programming called Bloom [2].

I don't know enough about it to map it to the author's distributed programming paradigms, but the Bloom features page [3] is interesting:

> disorderly programming: Traditional languages like Java and C are based on the von Neumann model, where a program counter steps through individual instructions in order. Distributed systems don’t work like that. Much of the pain in traditional distributed programming comes from this mismatch: programmers are expected to bridge from an ordered programming model into a disordered reality that executes their code. Bloom was designed to match–and exploit–the disorderly reality of distributed systems. Bloom programmers write programs made up of unordered collections of statements, and are given constructs to impose order when needed.

[1]: https://boom.cs.berkeley.edu

[2]: http://bloom-lang.net/index.html

[3]: http://bloom-lang.net/features/

jmhucb · 2025-02-28T05:30:54 1740720654

Good pattern matching. Bloom is a predecessor project to the OP's PhD thesis work :-) This area takes time and many good ideas to mature, but as the post hints, progress is being made.

nchammas · on Feb 14, 2024

There is something I don't get about the Polars DataFrame API.

https://docs.pola.rs/user-guide/migration/spark/

Look at the examples on this page of the Spark vs. Polars DataFrame APIs. (Disclaimer: I contributed this documentation. [1])

Having used SQL and Spark DataFrames heavily, but not Polars (or Pandas, for that matter), my impression is that Spark's DataFrame is analogous to SQL tables, whereas Polars's DataFrame is something a bit different, perhaps something closer to a matrix.

I'm not sure how else to explain these kinds of operations you can perform in Polars that just seem really weird coming from relational databases. I assume they are useful for something, but I'm not sure what. Perhaps machine learning?

[1]: https://github.com/pola-rs/polars-book/pull/113

tomtom1337 · on Feb 14, 2024

I have not used spark, but I have written a lot of sql, polars and pandas. I think much more in terms of sql when I write polars than pandas. Do you have any examples of what you are referring to?

nchammas · on Feb 14, 2024

The examples I'm referring to are in that page I linked to in my comment above.

Here's one of them:

  # Polars
  df.select(
    pl.col("foo").sort().head(2),
    pl.col("bar").sort(descending=True).head(2),
  )

In SQL and Spark DataFrames, it doesn't make sense to sort columns of the same table independently like this and then just juxtapose them together. It's in fact very awkward to do something like this with either of those interfaces, which you can see in the equivalent Spark code on that page. SQL will be similarly awkward.

But in Polars (and maybe in Pandas too) you can do this easily, and I'm not sure why. There is something qualitatively different about the Polars DataFrame that makes this possible.

theLiminator · on Feb 17, 2024

Because it's column based vs row based. Definitely can be a bit more of a footgun ("with great power comes great responsibility").

Long story short, the memory model operates on columns of data as opposed to rows, so fields in a conceptual "row" aren't necessarily an atomic unit.

nchammas · on Jan 9, 2024

They _did_ serve the data. From TFA:

> To demonstrate its scale, we operated the instance with 100M bots posting 3,500 times per second at 403 average fanout.

The linked post about their demo [1] has more details about the challenges involved, including rendering home timelines.

[1]: https://blog.redplanetlabs.com/2023/08/15/how-we-reduced-the...

bawolff · on Jan 9, 2024

Meh,its better than nothing, but real world traffic is often very different than simulated traffic. If it isn't actually working with real traffic than i am not impressed.

nchammas · on Nov 21, 2023

This is pretty neat, and it reminds me of my experiment solving the water jug problem from Die Hard 3 using Hypothesis [1] (a Python library for property-based testing).

Though I doubt Hypothesis's stateful testing capabilities can replicate all the capabilities of Datalog (or its elegance for this kind of logic problem), I think you could port the author's expression of the game rules to Hypothesis pretty straightforwardly.

[1]: https://nchammas.com/writing/how-not-to-die-hard-with-hypoth...

dr_kiszonka · on Nov 22, 2023

Neat! For a while, I have been looking for software that could tell whether my reasoning about things other than code is correct. And it looks like Hypothesis might be worth giving it a shot.

nchammas · on Nov 19, 2022

Hypothesis is really a neat library. You can use it to implement stateful testing, which allows you to cover some of the more complex problems that might normally require something like TLA+.

I saw a solution to the water jug problem from the movie Die Hard 3 written in TLA+ and ported it to Hypothesis [0]. It was surprisingly straightforward and easy to read.

[0]: https://nchammas.com/writing/how-not-to-die-hard-with-hypoth...

nchammas · on April 3, 2021

This idea was briefly addressed in PEP 635 [0]:

> _Statement vs. Expression._ Some suggestions centered around the idea of making `match` an expression rather than a statement. However, this would fit poorly with Python's statement-oriented nature and lead to unusually long and complex expressions and the need to invent new syntactic constructs or break well established syntactic rules. An obvious consequence of match as an expression would be that case clauses could no longer have arbitrary blocks of code attached, but only a single expression. Overall, the strong limitations could in no way offset the slight simplification in some special use cases.

[0]: https://www.python.org/dev/peps/pep-0635/

nchammas · on Feb 22, 2021

Author here.

Are you thinking of a specific implementation of materialized views? Most implementations from traditional RDBMSs would indeed be too limiting to use as a general data pipeline building block.

The post doesn't argue that, though. It's more about using materialized views as a conceptual model for understanding data pipelines, and hinting at some recent developments that may finally make them more suitable for more widespread use.

From the conclusion:

> The ideas presented in this post are not new. But materialized views never saw widespread adoption as a primary tool for building data pipelines, likely due to their limitations and ties to relational database technologies. Perhaps with this new wave of tools like dbt and Materialize we’ll see materialized views used more heavily as a primary building block in the typical data pipeline.

nchammas · on Feb 22, 2021

Of the traditional RDBMSs, I believe Oracle has the most comprehensive support for materialized views, including for incremental refreshes [0].

As early as 2004, developers using Oracle were figuring out how to express complex constraints declaratively (i.e. without using application code or database triggers) by enforcing them on materialized views [1].

It's quite impressive, but this level of sophistication in what materialized views can do and how they are used does not seem to have spread far beyond Oracle.

[0]: https://docs.oracle.com/database/121/DWHSG/refresh.htm#DWHSG...

[1]: https://tonyandrews.blogspot.com/2004/10/enforcing-complex-c...

kthejoker2 · on Feb 22, 2021

SQL Server has indexed views, which is "real time" refresh since it's a separate clustered index that's written to at the same time as the base tables.

https://www.sqlshack.com/sql-server-indexed-views/

nchammas · on Feb 22, 2021

Do you know how sophisticated SQL Server is about updating indexed views? How granular are the locks when, for example, an indexed aggregate is updated? That will have a big impact on write performance when there are many concurrent updates.

A long time ago, I tried to use SQL Server indexed views to maintain a bank account balances table based on transaction history [0].

I forget what I ended up doing, but I remember that one of the downsides of using indexed views was that they didn't support any constraints. There are many restrictions on what you can and can't put in a SQL Server indexed view [1].

In this regard, I think Oracle has a more mature materialized view offering, though I personally haven't used Oracle much and don't know how well their materialized views work in practice.

[0]: https://dba.stackexchange.com/q/5608/2660

[1]: https://docs.microsoft.com/en-us/sql/relational-databases/vi...

kthejoker2 · on Feb 22, 2021

Well, indexed views aren't materialized views per se, they are a tradeoff between maintenance and deterministic performance.

A materialized view is nothing more than a snapshot cache, a one time ETL job. So it can abide by any constraints and is completely untethered from the data that created it. So you have to create your own maintenance cycle, including schema validation and any dynamic / non-deterministic aspects of the MV.

An indexed view is modified just like the clustered index of any tablr object upon which it depends, as an affected "partition" of the DML. That's what the SCHEMABINDING keyword is for, binding the view to any DML statements of its underlying base table(s).

So no need to maintain it at all, at the expense of conforming to a fairly rigid set of constraints to ensure that maintenance is ..umm ..maintainable.

In practice most views' logic are perfectly simpatico with the constraints of an indexed view - the tradeoff is write performance vs the "cache hit" of your view.

I do way more OLAP/HTAP engineering in my day job so indexed views are less common vs. Columnstores, but indexed views are a highly underutilized feature of SQL Server.

gfody · on Feb 22, 2021

if sql server can't incrementally maintain your view then it doesn't let you index it. that's why there are so many restrictions but on the flip side if you manage to workaround the restrictions then decent update performance is assured.

trollied · on Feb 22, 2021

As an ex-Oracle DBA, I really do miss native materialized view support in the likes of Postgres. They are /so/ powerful. People love to bash Oracle, but this is a good example of a feature that the open source DBs haven't got close to yet.

nchammas · on Feb 22, 2021

> If you process data one row at a time, that is clearly a streaming pipeline, but most systems that call themselves streaming actually process data in small batches. From a user perspective, it’s an implementational detail, the only thing you care about is the latency target.

Author here. 100% agreed.

As an aside, I just came across your post about how Databricks is an RDBMS [0]. I recently wrote a similar article from a slightly more abstract perspective [1].

Having worked heavily with RDBMSs in the first part of my career, I feel like so many of the concepts and patterns I learned about there are being re-expressed today with modern, distributed data tooling. And that was part of my inspiration for this post about data pipelines.

[0] https://fivetran.com/blog/databricks-is-an-rdbms

[1] https://nchammas.com/writing/modern-data-lake-database

lmm · on Feb 22, 2021

https://www.confluent.io/blog/turning-the-database-inside-ou... might be interesting if you haven't seen it already. The way I see it RDBMSes already had most of the tech, but they wrap it up in an opaque black box that can only be used one way, like those automagical frameworks that generate a full webapp from a couple of class definitions but then fall apart as soon as you want to do something slightly specialised. So the data revolution is really about unbundling all the cool RDBMS tech and moving from a shrinkwrapped database product to something more like a library that gives you full control over what happens when, letting you integrate your business logic wherever it makes sense.

nchammas · on Feb 22, 2021

That talk by Martin Kleppmann is fantastic.

Another, closely related article is "The Log: What every software engineer should know about real-time data's unifying abstraction" by Jay Kreps.

https://engineering.linkedin.com/distributed-systems/log-wha...

jgraettinger1 · on Feb 22, 2021

> I feel like so many of the concepts and patterns I learned about there are being re-expressed today with modern, distributed data tooling

So much this.

I work on Flow [0] at Estuary [1]; you might be interested in what we're doing. It offers millisecond latency continuous datasets that are _also_ plain JSON files in cloud storage -- a real-time data lake -- which can be declaratively transformed and materialized into other systems (RDBMS, key/value, even pub/sub) using precise, continuous updates. It's the materialized view model you articulate, and Flow is in essence composing data lake-ing with continuous map/combine/reduce to make it happen.

I was asked the other day if Flow "is a database" by someone who only wanted a 2-3 sentence answer, and I waffled badly. The very nature of "databases" are so fuzzy today. They're increasingly unboxed across the Cambrian explosion of storage, query, and compute options now available to us. S3 and friends for primary storage; on-demand MPP compute for query and transformation; wide varieties of RDBMSs, key/value stores, OLAP systems, even pub/sub as specialized indexes for materializations. Flow's goal, in this worldview, is to be a hypervisor and orchestrator for this evolving cloud database. Not a punchy elevator pitch, but there it is.

[0] https://estuary.readthedocs.io/en/latest/README.html [1] https://estuary.dev/

kthejoker2 · on Feb 22, 2021

Strong concur, the idea of "database" just doesn't make sense anymore, I've (d?)evolved to "storage, streaming, and compute" (as it seems you do as well) in all my discussions.

Have you read ThoughtWorks material on the "data mesh"? Sounds like your product is looking to be a part of that kind of new data ecology.

https://martinfowler.com/articles/data-mesh-principles.html

It is in definitely anticlimactic after data warehousing, data lakes, data lakehouses, etc. to just throw up your hands and say "data whenever you want it, wherever you want it, in whatever form you want it" (at whatever price you're willing to pay!) So I feel your pain on marketing your product,but I think the next 5 years or so will be heavily focused on automating data quality and standardized pipelines, computational governance and optimizing workloads, and intelligent "just in time" materializaton, caching, HTAP, etc.

Your big play in my mind is helping customers optimize the (literal financial) cost/benefit tradeoff of all those compute and query engines.

snidane · on Feb 22, 2021

Can somebody explain Data Mesh in simple terms? I've watched numerous videos and read the article, but still can't understand what it's supposed to say.

Is it basically to set up one data lake per analytics team, instead of a big centralized data lake?

It seems to me this is already being done from my experience. Each team has their own s3 bucket in which they store their data products and other teams can consume them using their favorite data engines as long as they can read parquet files from s3.

Or am I missing something?

benstopford · on Feb 22, 2021

It's basically a set of 4 principals or best practices: 1. decentralized data ownership. 2. Data as a product. 3. Self-serve data infrastructure. 4. Federated governance. So I don't think of it as some new product thing but rather principals that work well for people with a certain type of use case. It's a bit like Microservices in that regard.

nchammas · on Feb 22, 2021

Interesting project! (The interactive slides are cool btw.)

Could you share a bit about how engineers express data transformations in Flow?

From a quick look at the docs, it doesn't look like you are using SQL to do this, which is interesting since it bucks the general trend in data tooling.

jgraettinger1 · on Feb 22, 2021

You write pure-function mappers using any imperative language. We're targeting TypeScript initially for $reasons, followed by ergonomic OpenAPI / WASM support, but anything that can support JSON => JSON over HTTP will do.

It's only on you to write mapper functions -- combine & reduce happens automatically, using reduction annotations of the collection's JSON-schema. Think of it as a SQL group-by where aggregate functions have been hoisted to the schema itself.

Here's a worked-up narrative example using Citi Bike system data: https://estuary.readthedocs.io/en/latest/examples/citi-bike/...

| it bucks the general trend in data tooling

You can say that again. It's a bet, and I don't know how it will work out. I do think SQL is a poor fit for expressing long-lived workflows that evolve over joined datasets / schemas / transformations, or which are more operational vs analytical in nature, or which want to bring in existing code. Though I _completely_ understand why others have focused on SQL first, and it's definitely not an either / or proposition.

nchammas · on Feb 14, 2021

> providing analytics and a database type experience over a data lake.

It's interesting how the modern data lake is developing in this way, recreating many patterns from the traditional database for distributed systems and massive scale: SQL and query optimization, transactions and time travel, schema evolution and data constraints...

Having started out as a database developer / DBA many years ago, working with data lakes today reminds me in many ways of that early part of my career.

I wrote a post tracing a common interface from the typical relational database to the modern data lake.

https://nchammas.com/writing/modern-data-lake-database