FROM r JOIN s USING (id)
|> WHERE r.c < 15
|> AGGREGATE sum(r.e) AS s GROUP BY r.d
|> WHERE s > 3
|> ORDER BY d
|> SELECT d, s, rank() OVER (order by d)
Can we call this SQL anymore after this? This re-ordering of things has been done by others too, like PRQL, but they didn't call it SQL. I do think it makes things more readable.
The point of SQL pipe syntax is that there is no reordering. You read the query as a sequence of operations, and that's exactly how it's executed. (Semantically. Of course, the query engine is free to optimize the execution plan as long as the semantics are preserved.)
The pipe operator is a semantic execution barrier:everything before the `|>` is assumed to have executed and returned a table before what follows begins:
From the paper:
> Each pipe operator is a unary relational operation that takes one table as input and produces one table as output.
Vanilla SQL is actually more complex in this respect because you have, for example, at least 3 different keywords for filtering (WHERE, HAVING, QUALIFY) and everyone who reads your query needs to understand what each keyword implies regarding execution scheduling. (WHERE is before grouping, HAVING is after aggregates, and QUALIFY is after analytic window functions.)
Golly, QUALIFY, a new SQL operator I didn’t know existed. I tend not to do much with window functions and I would have reached for a CTE instead but it’s always nice to be humbled by finding something new in a language you thought you knew well.
Is not common at all, is a non ANSI SQL clause that afaik was created by Teradata, syntactic sugar for filtering using window functions directly without CTEs or temp tables, especially useful for dedup. In most cases at least, for example you can't do a QUALIFY in an query that is aggregating data just as you can't use a window function when aggregating.
Other engines that implement it are direct competitors in that space: Snowflake, Databricks SQL, BigQuery, Clickhouse, and duckdb (only OSS implementation I now). Point is: if you want to compete with Teradata and be a possible migration target, you want to implement QUALIFY.
Anecdote: I went from a company that had Teradata to another where I had to implement all the data stack in GCP. I shed tears of joy when I knew BQ also had QUALIFY. And the intent was clear, as they also offered various Teradata migration services.
> The point of SQL pipe syntax is that there is no reordering.
But this thing resembles other FROM-clause-first variants of SQL, thus GP's point about this being just a reordering. GP is right: the FROM clause gets re-ordered to be first, so it's a reordering.
> The pipe operator is a semantic execution barrier:everything before the `|>` is assumed to have executed and returned a table before what follows begins
I already think about SQL like this (as operation on lists/sets), however thinking of it like that, and having previous operations feed into the next, which is conceptually nice, seems to make it hard to do, and think about:
> *(the query engine is free to optimize the execution plan as long as the semantics are preserved)
since logically each part between the pipes doesn't know about the others, so global optimizations, such as use of indexes to restrict the result of a join based on the where clause can't be done/is more difficult.
This kind of implies there's better or worse ordering. AFAIK that's pretty subjective. If the idea was to expose how the DB is ordering things, or even make things easier for autocomplete OK, but this just feels like a "I have a personal aesthetic problem with SQL and I think we should spend thousands of engineering hours and bifurcate SQL projects forever to fix it" kind of thing.
The post I was referring to said that this new pipe syntax was a big reordering compared to the vanilla syntax, which it is. But my point is that if you're going to understand the vanilla syntax, you already have to do this reordering in your head because the order in which the the vanilla syntax executes (inside out) is the order in which pipes syntax reads. So it's just easier all around to adopt the pipe syntax so that reading and execution are the same.
This is an extension on top of all existing SQL. The pipe functions more or less as a unix pipe. There is no reordering, but the user selects the order. The core syntax is simply:
query | operator
Which results in a new query that can be piped again. So e.g. this would be valid too:
SELECT id,a,b FROM table WHERE id>1
|WHERE id < 10
The multiple uses of WHERE with different meanings is problematic for me. The second WHERE, filtering an aggregate, would be HAVING in standard SQL.
Not sure if this is an attempt to simplify things or an oversight, but favoring convenience (no need to remember multiple keywords) over explicitness (but the keywords have different meanings) tends to cause problems, in my observation.
I was not there at the original design decisions of the language, but I imagine it was there specifically to help the person writing/editing the query easily recognize and interpret filtering before or after an aggregation. The explicitness makes debugging a query much easier and ensures it fails earlier. I don't see much reason to stop distinguishing one use case from the other, I'm not sure how that helps anything.
I also wasn't there, but I think this actually wasn't to help authors and instead was a workaround for the warts of SQL. It's a pain to write
SELECT * FROM (SELECT * FROM ... GROUP BY ...) t WHERE ...
and they decided this was common enough that they would introduce a HAVING clause for this case
SELECT * FROM ... GROUP BY ... HAVING ...
But the real issue is that in order to make operations in certain orders, SQL requires you to use subselects, which require restating a projection for no reason and a lot of syntactical ceremony. E.g. you must give the FROM item a name (t), but it's not required for disambiguation.
Another common case is projecting before the filter. E.g. you want to reuse a complicated expression in the SELECT and WHERE clauses. Standard SQL requires you to repeat it or use a subselect since the WHERE clause is evaluated first.
I think this stems from the non-linear approach to reading a SQL statement. If it were top-to-bottom linear, like PRQL, then the distinction does not seem merited. It would then always be filtering from what you have collected up to this line.
I think the original sin here is not making aggregation an explicitly separate thing, even though it should be. Adding a count(*) fundamentally changes what the query does, and what it returns, and what restrictions apply.
If you perform an aggregation query in a CTE, then filter on that in a subsequent query, that is different, because you have also added another SELECT and FROM. You would use WHERE in that case whether using a CTE or just an outer query on an inner subquery. HAVING is different from WHERE because it filters after the aggregation, without requiring a separate query with an extra SELECT.
> HAVING is different from WHERE because it filters after the aggregation, without requiring a separate query with an extra SELECT.
Personally I rarely use HAVING and instead use WHERE with subqueries for the following reasons:
1-I don't like repeating/duplicating a bunch of complex calcs, easier to just do WHERE in outer query on result
2-I typically have outer queries anyway for multiple reasons: break logic into reasonable chunks for humans, also for join+performance reasons (to give the optimizer a better chance at not getting confused)
You can always turn a HAVING in SQL into a WHERE by wrapping the SELECT that has the GROUP BY in another SELECT that has the WHERE that would have been the HAVING if you hadn't bothered.
You don't need a |> operator to make this possible. Your point is that there is a reason that SQL didn't just allow two WHERE clauses, one before and one after GROUP BY: to make it clearer syntactically.
Whereas the sort of proposal made by TFA is that if you think of the query as a sequence of steps to execute then you don't need the WHERE vs. HAVING clue because you can see whether a WHERE comes before or after GROUP BY in some query.
But the whole point of SQL is to _not have to_ think of how the query is to be implemented. Which I think brings us back to: it's better to have HAVING. But it's true also that it's better to allow arbitrary ordering of some clauses: there is no reason that FROM/JOIN, SELECT, ORDER BY / LIMIT have to be in the order that they are -- only WHERE vs. GROUP BY ordering matters, and _only_ if you insist on using WHERE for pre- and post-GROUP BY, but if you don't then all clauses can come in any order you like (though all table sources should come together, IMO).
> The second WHERE, filtering an aggregate, would be HAVING in standard SQL.
Only if you aren't using a subquery otherwise you would use WHERE even in plain SQL. Since the pipe operator is effectively creating subqueries the syntax is perfectly consistent with SQL.
Perhaps, however then you eliminate the use of WHERE/HAVING sum(r.e) > 3, so in case you forgot what the alias s means, you have to figure that part out before proceeding. Maybe I'm just used to the existing style but as stated earlier, seems this is reducing explicitness which IMO tends to lead to more bugs.
A lot of SQL engines don't support aliases in the HAVING clause and that can require duplication of potentially complex expressions which I find very bug-inducing. Removing duplication and using proper naming I think would be much better.
I will already use subqueries to avoid issues with HAVING.
> A lot of SQL engines don't support aliases in the HAVING clause
We're moving from SQLAnywhere to MSSQL, and boy, we're adding 2-5 levels of subqueries to most non-trivial queries due to issues like that. Super annoying.
I had one which went from 2 levels deep to 9... not pleasant. CTEs had some issues so couldn't use those either.
I'm surprised you had issues with CTEs -- MS SQL has one of the better CTE implementations. But I could see how it might take more than just trivial transformations to make efficient use of them.
One issue, that I mentioned in a different comment, is that we have a lot of queries which are used transparently as sub-queries at runtime to get count first, in order to limit rows fetched. The code doing the "transparent" wrapping doesn't have a full SQL parser, so can't hoist the CTEs out.
One performance issue I do recall was that a lateral join of a CTE was much, much slower than just doing 5-6 sub-queries of the same table, selecting different columns or aggregates for each. Think selecting sum packages, sum net weight, sum gross weight, sum value for all items on an invoice.
There were other issues using plain joins, but I can't recall them right now.
CTE's (at least in MS SQL land) are a syntax level operation, meaning CTE's get expanded to be as if you wrote the same subquery at each place a CTE was, which frequently impacts the optimizer and performance.
I like the idea of CTE's, but I typically use temp tables instead of CTE's to avoid optimizer issues.
I use them on purpose to "help" the optimizer by reducing the search space for query plan ((knowing that query plan optimization is a combinatorial problem and the optimizer frequently can't evaluate enough plans in a reasonable amount of time).
Should we introduce a SUBSELECT keyword to distinguish between a top-level select and a subquery?
To me that feels as redundant as having WHERE vs HAVING, i.e. they do the same things, but at different points in the execution plan. It feels weird to need two separate keywords for that.
In order for a thing to be considered legacy, there needs to be a widespread successor available.
SQL might have been invented in the 70s but it's still going strong as no real alternative has been widely adopted so far - I'd wager that you will find SQL at most software companies today.
I mean kinda? It's legacy in the "we would never invent this as the solution to the problem domain that's today asked of it."
We would invent the underlying engines for sure but not the language on top of it. It doesn't map at all to how it's actually used by programmers. SQL is the JS to WebAssembly, being able to write the query plan directly via whatever language or mechanism you prefer would be goated.
It has to be my biggest pain point dealing with SQL, having to hint to the optimizer or write meta-SQL to get it to generate the query plan I already know I want dammit! is unbelievably frustrating.
> having to hint to the optimizer or write meta-SQL to get it to generate the query plan I already know I want dammit'
That's not in the domain of SQL. If you're not getting the most optimized query plan, there is something wrong with the DBMS engine or statistics -- SQL, the language, isn't supposed to care about those details.
That's my point, I think we've reached the point where SQL the langage can be more of a hindrance than help because in a lot of cases we're writing directly to the engine but with oven mitts on. If I could build the query from the tree with scan, filter, index scan, cond, merge join as my primitives it would be so nice.
That's the thing though, I still want my data to be relational so NoSQL databases don't fit the bill. I want to interact with a relational database via something other than the SQL language and given that this language already exists (Postgres compiles your SQL into an IR that uses these primitives) I don't think it's a crazy ask.
> It's legacy in the "we would never invent this as the solution to the problem domain that's today asked of it."
I don't think that definition of legacy is useful because so many things which hardly anyone calls "legacy" fit the definition - for example: Javascript as the web standard, cars in cities and bipartisan democracy.
I think many of us would say that that none of these is an ideal solution for the problem being solved, but it's what we are stuck with and I cannot think anyone could call it "legacy systems" until a viable successor is widespread.
Not bad, very similar to dplyr syntax. Personally i’m too used to classic SQL though and this would be more readable as CTEs. In particular how would this syntax fair if it was much more complicated with with 4-5 tables and joins?
Maybe not, just as we don't call "rank() OVER" SQL. We call it SQL:2003. Seems we're calling this GoogleSQL. But perhaps, in both cases, we can use SQL for short?
this is consistent, non-pseudo-english, reusable, and generic. The SQL standard largely defines the aesthetic of the language, and is in complete opposition to these qualities. I think would be fundamentally incorrect to call it SQL
Perhaps if they used a keyword PIPE and used a separate grammar definition for the expressions that follow the pipe, such that it is almost what you’d expect but randomly missing things or changes up some keywords
Honestly SQL screwed things up from the very beginning. "SELECT FROM" makes no sense at all. The projection being before the selection is dumb as hell. This is why we can’t get proper tooling for writing SQL, even autocompletion can’t work sanely. You write "SELECT", what’s it gonna autocomplete?
PRQL gives me hope that we might finally get something nice some day
Doesn’t change anything, you can still have the select at the end, and optional from and joins at the beginning. In your example, the select could be at the end, it’s just that there’s nothing before.
I also hate having SELECT before FROM because I want to think of the query as a transformation that can be read from top to bottom to understand the flow.
But I assume that that’s part of why they didn’t set it up that way — it’s just a little thing to make the query feel more declarative and less imperative
At this point I think that vanilla SQL should just support optionally putting the from before the select. It's useful for enabling autocompletion, among other things.
My initial reaction is that the pipes are redundant (syntactic vinegar). Syntactic order is sufficient.
The changes to my SQL grammar to accomodate this proposal are minor. Move the 'from' rule to the front. Add a star '*' around a new filters rule (eg zero-or-more, in any order), removing the misc dialect specific alts, simplifying my grammar a bit.