They were talking about the "modern data stack" no doubt.
The trend has been to shift as much work as possible to the current generation of data warehouses, which hide the programming model that Spark on columnar storage provided behind a SQL-only interface, shrinking the space where you'd use Spark.
That makes it very accessible to write data pipelines with dbt (which outcompeted Dataform, though the latter is still kicking), but you lose the richer programming facilities, stricter type systems, tooling, and practices of Python or Scala. You're in the world of SQL, set back a decade or two in testing, checking, and the culture of using them, with few tools to organize your code.
And that's if the team has resisted the siren songs of the myriad cloud low-code platforms for this or that, which have even fewer facilities to keep data pipelines under control, whether by control we mean access, versioning, monitoring, data quality, anything really.
> stricter type systems ... the practices of Python or Scala
I do understand what you are talking about. But I really think you and the OP are both complaining about the wrong problem.
SQL doesn't require bad practices, doesn't inherently harm composability (in the way the OP was referring to), and doesn't inherently harm verification. Instead, it has stronger support for many of those than the languages you want to replace it with.
The problems you are talking about are very real. But they do not come from the language. (SQL does bring a few problems by itself, but they are much more subtle than those.)
At least BigQuery does a fair bit of typechecking, and gives error messages on par with application programming (e.g. not letting you pass a timestamp to a date function, and stating that there's no matching signature).
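As a rough illustration of that kind of check, here's a minimal sketch assuming the google-cloud-bigquery client, configured credentials, and a made-up table with a TIMESTAMP column:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes credentials and a default project are set up

# `created_at` is assumed to be a TIMESTAMP column; DATE_ADD expects a DATE,
# so BigQuery rejects the query at compile time rather than at run time.
query = """
SELECT DATE_ADD(created_at, INTERVAL 1 DAY) AS next_day
FROM `my_project.my_dataset.events`
"""

try:
    client.query(query).result()
except Exception as exc:
    # Surfaces a message along the lines of
    # "No matching signature for function DATE_ADD for argument types: TIMESTAMP, ..."
    print(exc)
```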
But a tool that doesn't "require" bad practices, yet doesn't require good practices either, makes your work harder in the long run.
Tooling is poor. Until recently, the best IDE-like options were the kind that connect to a live environment but don't tie into your codebase, and that encourage you to put your code directly in the database rather than in version control: the problems of developing against a REPL, with little to mitigate them. I'm talking, of course, about view and function definitions living in the database, with no tools to statically navigate the code.
Testing used to be completely hand rolled if anyone bothered with it at all.
That was until now, when data pipeline orchestration tools exist that let you navigate the pipeline as a dependency graph, a marked improvement; but until dbt's Python version is ready for production, we're talking about a graph of Jinja templates and YAML definitions, with modest support for unit testing.
Dataform is a bit better but virtually unknown and was greatly hindered by the Google acquisition.
Functions have always been clunky and still are.
RDDs and then, to a lesser extent, DataFrames offered a much stronger programming model, but they were still subject to a lack of programming discipline from data engineers in many shops. The results of that, however, are on a different scale with undisciplined SQL programming, and it's downright hard to stay disciplined when using SQL.
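To make the contrast concrete, here's a minimal sketch (table and column names are made up) of the kind of transformation DataFrames let you factor out and unit test in isolation:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def daily_revenue(orders: DataFrame) -> DataFrame:
    """A pure, reusable transformation: it can be unit tested against a tiny
    in-memory DataFrame, with no warehouse and no fixtures to load."""
    return (
        orders
        .where(F.col("status") == "completed")
        .groupBy(F.to_date("created_at").alias("day"))
        .agg(F.sum("amount").alias("revenue"))
    )

# A unit test builds its own input and asserts on the output.
spark = SparkSession.builder.master("local[1]").getOrCreate()
sample = spark.createDataFrame(
    [("completed", "2024-01-01 10:00:00", 10.0),
     ("cancelled", "2024-01-01 11:00:00", 5.0)],
    ["status", "created_at", "amount"],
)
assert daily_revenue(sample).count() == 1
```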
The move from ETL to ELT, I feel, shouldn't have meant an unquestioning transition to untyped DataFrames and then to SQL.
The language itself has some issues, like no attention at all paid to modularity and reusability; the old-fashioned distinction between functions and data; the lack of expressivity of the type system (well, not compared with Python and Scala, but just ADTs would already bring a huge gain); and the awkward limitations on literal versus evaluated use of symbols, which force people into metaprogramming every time they need to decide on a table at runtime.
The first one would limit the use of widespread best practices, but in practice it's not the bottleneck, because every piece of SQL-based tooling already creates strictly more constraining issues.
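To make the last of those language issues concrete: in a general-purpose language the table you read from is just a runtime value, where SQL pushes you into templating or dynamic SQL. A minimal PySpark sketch, with made-up table and column names:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def latest_rows(spark: SparkSession, table_name: str, n: int = 100) -> DataFrame:
    # The table to read is an ordinary runtime value; no Jinja templating,
    # string splicing, or dynamic SQL is needed to pick it.
    return spark.table(table_name).orderBy(F.col("loaded_at").desc()).limit(n)

# e.g. choose the source per environment or per tenant at run time:
# latest_rows(spark, "analytics.events_eu" if region == "eu" else "analytics.events_us")
```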
Let me offer the more blunt materialist analysis: Data engineers are being deskilled into data analysts and too blinded by shiny cloud advertisements to notice.
(In this view though, "lack of tests" or whatever is the least concern - until someone figures out how to spin up another expensive cloud tool selling "testable queries".)
The "data engineer" became a distinct role to bring over Software Engineering practices to data processing; such as those practices are, they were a marked improvement over their absence.
Building a bridge from one shore to the other with application programming languages and data processing tools that worked much closer to other forms of programming was a huge part of that push.
Of course, the big data tools were intricate machines that were easy to learn and very hard to master, and data engineers had to be pretty sophisticated.
So, it became cheaper to move much of that apparatus to data warehouses and, as you said, commoditize the building of data pipelines that way.
Software is as widespread as it is today because in every generation the highly skilled priestly classes that were needed to get the job done were displaced by people with less training enabled by new tools or hardware; else it'd be all rocket simulations done by PhD physicists still.
But the technical debt will be hefty from this shift.
At the end of the day, it's about providing value to businesses. If the same value can be provided with less intensive skillsets and more efficiently, this is a good thing.
> write data pipelines then using dbt (which outcompeted Dataform, though the latter is still kicking), but then you don't have the richer programming facilities, stricter type systems, tooling and the practices of Python or Scala programming, you're in the world of SQL...
Recently announced and limited to only a handful of data platforms, but dbt now supports Python models.
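For anyone who hasn't seen one, a dbt Python model is just a function that returns a DataFrame. A minimal sketch (the model names are made up, and the exact DataFrame API depends on the adapter):

```python
# models/orders_enriched.py -- a made-up model name, for illustration only
def model(dbt, session):
    dbt.config(materialized="table")

    # dbt.ref() resolves an upstream model in the DAG, like {{ ref() }} in SQL models;
    # the object it returns is a DataFrame in the adapter's flavour
    # (Snowpark, PySpark, or pandas, depending on the platform).
    orders = dbt.ref("stg_orders")

    # Transformations happen through that DataFrame API; whatever the function
    # returns is what dbt materializes as the model.
    return orders
```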
> The trend has been to shift as much work possible to the current generation of Data Warehouses, that abstract the programming model that Spark on columnar storage provided with only a SQL interface, reducing the space where you'd use Spark.
I feel like there are some data professionals that only want to use SQL. Other data professionals only want to use Python. I feel like the trend is to provide users with interfaces that let them be productive. I could be misreading the trend, of course.
It's very unclear to me that anyone is more productive under these new tooling stacks. I'm certain they're not more productive commensurate with the new costs and long-term risks.