SELECT wat FROM sql (scattered-thoughts.net)
150 points by luu on April 26, 2020 | 118 comments



This one

    jamie=# ((select a from nums) union (select a from nums)) order by b;
    ERROR:  column "b" does not exist
This seems to point to a misunderstanding of how SQL works. It's not a table query language but a set query language. Tables are just a way to store sets while they are not in use.

    (select a from nums)
This part creates a new set from the "a" column of nums. The column name is carried over as a convenience, but the nums table is not visible from the union.
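For example, ordering the union by a column that actually survives into its output works fine (a quick check against the article's nums table):

    ((select a from nums) union (select a from nums)) order by a;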


Note the example above:

    (select a from nums) order by b;
     a
    ---
     3
     0
     1
     2
    (4 rows)
If things worked as you described, that would also fail.

So Postgres is, in some cases, cheating with the "new set" thing. The weirdness is that it does it inconsistently.


But that's the thing, it's not

    (select a from nums) order by b;
but

    (select a from nums order by b);


What I've noticed in my years with various databases is that many SQL parsers error on ambiguity rather than on strict compliance. In the case above, the parser is likely dropping the parentheses, whereas in the union there's genuine ambiguity: with two sets involved, it can't tell which set's b you'd like to order by.

I wonder what it would do if you aliased the subquery and ordered by alias_one.b
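(For what it's worth, a subquery in FROM only exposes its select list, so that presumably errors too. A sketch, with alias_one as a hypothetical alias:)

    select * from (select a from nums) alias_one order by alias_one.b;
    ERROR:  column alias_one.b does not exist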


I think it's even simpler. Starting a subquery here doesn't make sense since there is no main query. So the parens get dropped, promoting this query to main. Same with the union.


ORDER BY actually happens last.

    jamie=# (select a, a+1 as c from nums) order by b,c;
     a | c 
    ---+---
     3 | 4
     0 | 1
     1 | 2
     2 | 3
    (4 rows)

It's a separate part of the grammar:

    <query primary> ::=
        <simple table>
      | <left paren> <query expression body> [ <order by clause> ] [ <result offset clause> ] [ <fetch first clause> ] <right paren>
But, if "QE is a <query expression body> that is a <query term> that is a <query primary> that is a <simpletable> that is a <query specification>" then the <order by clause> may select columns from the <table expression> even though those don't really exist any more by the time the <order by clause> runs. It's tricky to implement.


I agree, I should have written that it helps to think in sets rather than tables. SQL implementors do a lot of fantastic shit to get our queries to run in a reasonable time, and SQL is not one of those languages that are like "understand this one core principle, and you get it", like Lisp, Prolog, Tcl, Smalltalk, Forth etc. Have a look at the query plan if your DBMS can output it; it's not always easy to guess what it will spit out. In the case of

    (select a from nums) order by b;
my best guess is that the parser helpfully strips out the parens because the "order by" needs to be part of a table expression to make sense. This expression is a syntax error in SQLite, by the way.


The worst part about SQL is statement order.

IIRC most SQL engines logically evaluate FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT, which is why ORDER BY can see the SELECT aliases in the previous example, but an ORDER BY applied to a union only sees the union's output columns.
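For example, in Postgres (a quick sketch against the article's nums table), a SELECT alias is invisible to WHERE but visible to ORDER BY:

    select a, a+1 as c from nums where c > 2;
    ERROR:  column "c" does not exist
    select a, a+1 as c from nums order by c;   -- fine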

If I were to start SQL from scratch I would allow statements in an arbitrary order and execute them in whichever order I coded them.


But then it will not be SQL. SQL is built like this because all these operations are projections on a set. You cannot really change their order and have meaningful results.

Some operators can be swapped, but this is part of the optimization process and nothing you need to be aware of.


That hellscape doesn't exist, and I am glad it doesn't. The order should just be considered a spec and accepted and worked around -- I think SQL has proven its staying power over the years.


  jamie=# create table users ("user" text, "password" text);
  CREATE TABLE
  jamie=# insert into users values ('bob', 'whatever');
  INSERT 0 1
  jamie=# select user, password from users;
   user  | password
  -------+----------
   jamie | whatever
  (1 row)

This is because the name /user/ is a pseudo-function keyword, and is an alias for /current_user/

  test=# select "user", "password" from users;
   user | password
  ------+----------
   bob  | whatever
  (1 row)
Newer versions of PostgreSQL make this more clear:

  test=# select user, password from users;
   current_user | password
  --------------+----------
   postgres     | whatever
  (1 row)
Here is a full list of keywords: https://www.postgresql.org/docs/current/sql-keywords-appendi...


And here's what the SQLite manual says about keywords:

The SQL standard specifies a large number of keywords which may not be used as the names of tables, indices, columns, databases, user-defined functions, collations, virtual table modules, or any other named object. The list of keywords is so long that few people can remember them all. For most SQL code, your safest bet is to never use any English language word as the name of a user-defined object.

https://www.sqlite.org/lang_keywords.html


This is the reason why ORMs and SQL generators typically quote all column names in statements.


A great one to pile on here is null. Null is quite the unique thing within SQL, this post from some years ago highlights a lot of the unintuitive behavior around it within Postgres - http://thoughts.davisjeff.com/2009/08/02/what-is-the-deal-wi...


Interesting. The SQL concept of null always made perfect sense to me.

Two people with the same middle name would have equality with respect to the string value of middleName. The same applies with two people known to have no middle name—where you store a zero length string.

Two people who didn’t supply a middle name should not have equality with respect to middleName. Hence null. This acknowledged lack of data cannot be equal to a different lack of data.
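Or, in SQL terms (a quick sketch):

    select '' = '';        -- true: two known-empty middle names are equal
    select null = null;    -- null (unknown), which is never true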


> The same applies with two people known to have no middle name—where you store a zero length string.

I think that's the thing. People use NULL to mean empty value, instead of unknown value.

Not necessarily for strings because there's the empty string, but for numeric fields, including foreign keys


Admittedly non-string fields do make me wish for two kinds of NULL so I can differentiate between zero, confirmed-nothing and unknown. But it's never been an issue for me in practice.


Except, Oracle treats zero length string as null.
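A sketch of the difference (dual is Oracle's one-row dummy table):

    select 1 from dual where '' is null;
    -- Oracle: returns a row, because '' is treated as NULL
    -- standard SQL / PostgreSQL: '' is an ordinary value, not NULL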


That's just stupid


Seriously? That's the sort of thing I'd expect to find in MySQL 3-point-something, not Oracle.


Yes. It's horrible. But presumably it's a decades-old "feature" retained for backward compatibility…


  SELECT 'foo' FROM quux WHERE bar = baz
  UNION ALL
  SELECT 'bar' FROM quux WHERE bar <> baz
I would understand returning an error in that case if bar or baz were unknown. The truthiness of bar = baz is unknown, so that's an error.

But that's not what SQL Servers do. They're all pedantic about how NULL means unknown right up until the rubber hits the road of the WHERE clause, when suddenly NULL means false.

Which means that they're not saying that foo = bar is unknown, they're instead making a definitive assertion that both foo = bar is false and foo <> bar is false. And that's disastrously wrong.


Hard disagree. It should be impossible to make any truthful claim about a NULL with respect to any other value, including NULL. Thus it is correct that neither foo = bar nor foo <> bar is true.

Were it any other way, joins wouldn’t work properly.


...except that NULL is still conflating the two concepts of "not known to exist" and "known to not exist" (I know some people who would argue for empty strings to be bad style).

Especially for (composite) unique indexes on NULLable foreign keys, where NULL signifies the non-existence of a different table entry (if you don't want to resort to a special "not existing" entry in that table), this behavior can be surprising and definitely got me a few times.


I disagree, because while I agree there are multiple conflated concepts within NULL (which can be rehabilitated in the schema if necessary) in practice it works correctly either way.

If two different products have the price of $0 and therefore free, their prices are equal.

If two different products both have prices which are "not known", their unknown prices are not equal.

If two different products both have prices which are "known to not exist" their undefined prices are also not equal.


The problem is that "Unknown" is so much less useful than None. Date of Death is "none" is much more of a useful concept when I'm building a database of people than Date of Death is "unknown". I have a lot more alive people in my database than we-don't-know-if-they're-dead people.

If SQL servers had algebraic types and I could roll my own "Maybe", then this bizarre decision would work. But they don't, so it doesn't.


Null can mean whatever you want it to mean. It can mean not dead yet if you want. And the logic of NULL works in that case too—two people who haven’t died yet don’t have equal dates of death.

The problem with having a value called “none” is that it would be tempting to make none = none return true, which is madness.

(I do agree it would be nice if we could have multiple kinds of null instead of just one. But I would want them all to have the same equality properties as NULL.)


Let's check out SQL's "acknowledged lack of data" in a case statement:

    create table test(a bool);
    insert into test(a) values (NULL);
    select case a when a then true else false end from test;
The answer should be NULL, right?


No, the answer must be false.

Since null = null is not true, your case statement will return the else value. You hard-coded it to be false, therefore it will return false.


Yes, I know what it will be, but it isn't really acknowledging the lack of data, is it? It turns NULL input data into a 100% false result.


You’re the one who wrote the logic; YOUR code is what has forcibly collapsed the null to false. It’s absurd for you to write code and then complain when it works exactly as written.

If you want the lack of data to be acknowledged in a case statement, it’s up to you to handle it. SQL isn’t a magical fairy that does what you want without asking.


People generally expect equality to produce either true or false, not a three-valued result.

That interpretation makes sense to me when I read the docs, but is way too easy to miss when I'm actually doing comparisons.


You should always think about NULL state for every field when designing your schema. If you don't have a clear sense of what the NULL state shall mean for a given field, the field should not allow null values.

For example: a shopping cart might contain a tee-shirt in blue colour and expedited shipping in NULL colour. That blue tee-shirt might be the same colour as some blue socks... but that NULL-coloured shipping isn't the same colour as a NULL-coloured charitable donation.


For this you can make any field non nullable and offer a default value for the field.

Counter example: SAP HANA doesn't support NULLs, and as a result if I see 0 in any column in the database I have to wonder was it 0 that user submitted or 0 that system added as a placeholder. I would need to go back to documentation and see if the field is mandatory etc. Better have it explicitly stored in the data itself.


Yes, it's perfectly sensible; the only problem is with the name used. If it was called `UNKNOWN` or something instead, nobody would be complaining.


Yeah the big kick in the pants I got this year was using nullable columns in composite keys in MySQL and finding out that when their value is null, the result key is considered unique, because NULL is literally unknowable. So:

  col A, Col B
  null, 1
  null, 1
Is not constrained, but:

  col A, Col B
  1, 1
  1, 1 <--- NOT ALLOWED
I guess this is one of those things that is completely obvious to those that know it.
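A minimal reproduction (sketch; the exact duplicate-key error text varies by engine):

  create table t (a int, b int, unique (a, b));
  insert into t values (null, 1);
  insert into t values (null, 1);   -- allowed: null never equals null
  insert into t values (1, 1);
  insert into t values (1, 1);      -- rejected: duplicate key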


It's fairly nice but indeed rather implicit. I would kill for a postgres setting where feeding (an evaluated) NULL to the where clause logs a warning. This would force developers to coalesce (at least on top level) and think about the null case when applicable.


Since we are in the context of Postgresql, there is 'IS (NOT) DISTINCT FROM' for those who believe NULL's behaviour is reversed in (in)equality comparisons.
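For example:

    select null = null;                     -- null
    select null is not distinct from null;  -- true
    select 1 is distinct from null;         -- true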


Yes this. I always see people raving about the mathematical beauty of three-value Boolean algebra, but any programming language where testing if foo = foo can return something mechanically equivalent to false is just plain broken.

Yes, I get that it's pure and mathematically beautiful, but over here in the real world I've squished an endless supply of bugs caused by this absurdity, and in languages where there is no null-aware equality test, composing the proper check is absurdly verbose.


Sigh,

In all other programming languages the code has to be along the lines of:

    if ((jane.car.colour == james.car.colour) and
        (jane.car.colour != null            ) and 
        (           null != james.car.colour)):
       print ("James and Jane have same car color")
or attempted with a second variable to indicate if the value is known

    if (( jane.car.colourKnown == True ) and 
        (james.car.colourKnown == True ) and 
        ( jane.car.colour == james.car.colour)):
       print ("James and Jane have same car colour")

I like to think that SQL nulls are the most human way. I think of a scenario: James and Jane each own a car. I don't know either car's colour. Does it mean they are of the same colour? All typical programming languages say "Yes". SQL says "I can't tell you, I don't know", and that is what a human would say.


Plenty of languages have types that let you express those kinds of operations in a nice way. E.g. in Haskell you'd do

    traverse (flip when (putStrLn "James and Jane have same car color")) $
        liftA2 (==) james.car.colour jane.car.colour
The problem isn't that SQL supports tri-state logic, it's that it gives you no way of opting out of it. Over 90% of the time a given value should never be null, but there's no way to put SQL into a mode where a null in a given intermediate stage is an error. Instead it will silently give you a strange answer with no indication that anything went wrong.


Then why on earth is NULL being treated the same as false for the purposes of a WHERE clause.

Simple point:

  SELECT 'foo' FROM quux WHERE bar = baz
  UNION ALL
  SELECT 'bar' FROM quux WHERE bar <> baz
To me, there are two acceptable outcomes of this situation:

1) The result is a single row containing "foo" or "bar".

2) The result is an error message that the truthiness of 'bar = baz' cannot be determined.

However, ANSI-null-compatible SQL will just give you an empty resultset if bar or baz are null. That's not the server saying "I don't know" to "bar = baz", that's the server saying "neither", which is very freaking different. "I don't know" is an acceptable answer. "neither" is a lie.

That is unacceptable, and I'm utterly bewildered that people can insist that this is okay.


> Then why on earth is NULL being treated the same as false for the purposes of a WHERE clause.

Because, by definition, the WHERE clause returns a record when it evaluates to TRUE. FALSE and UNKNOWN are not TRUE.
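You can check this directly (a quick sketch):

  select 1 where true;    -- one row
  select 1 where false;   -- zero rows
  select 1 where null;    -- zero rows: UNKNOWN is not TRUE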

Second, you absolutely can do what you're trying to here, but not the way you've done it:

  SELECT CASE WHEN bar = baz THEN 'foo' ELSE 'bar' END AS corge
  FROM quux
You're asking the wrong question and wondering why you're getting the wrong answer.


It sounds like it is operating according to a closed-world assumption, in which this is reasonable behavior.


Your brain is going to explode like mine did when you learn about Option types[1].

[1] https://en.m.wikipedia.org/wiki/Option_type


This doesn't really solve GP's problem, as the Nothing type is generally still equal to itself. Certainly useful for general purpose programming languages, though.


The problem is really that you're using the wrong equality type here. You get the right semantics if you use `= : A -> A -> Bool` and then apply it to a `Maybe A` and a `Maybe A` by lifting it: you end up with `Nothing` if either input was `Nothing`, and `Just answer` if both were known.


For what it's worth, in your "all other programming languages" example you don't need the second null check

    null != james.car.colour
It's redundant when we know the first two are true.


> any programming language where testing if foo = foo can return something mechanically equivalent to false is just plain broken

Isn't this pretty much any language? You're describing how IEEE float NaNs work. I'm not aware of many languages that don't have IEEE-compliant floats.


IEEE floats with non-signalling NaNs make some very aggressive tradeoffs (in the name of performance) that were maybe a reasonable default in 1985 but are certainly not so today. General-purpose programming languages should not be offering them unless very explicitly opted in to.


Can you actually name a mainstream language that doesn't have NaNs with that behavior? Python, Ruby, Go, Rust, Java, JavaScript, etc all do.

Or maybe what you're saying is really just that they shouldn't have them, and every modern language has gotten this wrong?


> Can you actually name a mainstream language that doesn't have NaNs with that behavior? Python, Ruby, Go, Rust, Java, JavaScript, etc all do.

Rust doesn't have that behaviour: Rust floating-point types simply don't have Eq implementations (they have PartialEq only), passing them somewhere where you need an equality-comparable type is a compilation error. OCaml is the only other example I'm aware of (it treats nan as a normal value with a well-behaved place in the order).

> Or maybe what you're saying is really just that they shouldn't have them, and every modern language has gotten this wrong?

Yes. Similar to how modern high-level languages are starting to default to arbitrary-precision integers rather than fixed-size integers (e.g. Python), we should be defaulting to something like Python's decimal or Java's BigDecimal, and have IEEE floating-point as an opt-in case for when you need it.


> Rust floating-point types simply don't have Eq implementations (they have PartialEq only), passing them somewhere where you need an equality-comparable type is a compilation error.

While this is true, you can still say `NaN == NaN` and get false. I claim that that's still the behavior I was describing.

> OCaml is the only other example I'm aware of (it treats nan as a normal value with a well-behaved place in the order).

Again, kind of true - for physical and structural equality it does a bitwise-comparison. However, most people recommend against those equality operators because they are misleading in many cases, and the float-specific equality operator does what I described.

In what ways is bitwise comparison wrong? For structured data (say, binary-tree backed maps) this is clearly bad because there are potentially multiple representations that we would consider equivalent. But even for floats it is bad - `0 == -0` should be true, but will not be. Further, two different NaNs will compare as false even though two of the same NaN will compare as true (you're not supposed to be able to distinguish between different representations of NaN).


> While this is true, you can still say `NaN == NaN` and get false. I claim that that's still the behavior I was describing.

Point taken; I do think it would be better for the method on PartialEq to not be called "==". But Rust does offer a standard distinction between well-behaved equality and not, and floats are not considered to implement well-behaved equality, which comes close to what I'm asking for. Like, if JavaScript had `NaN === NaN` then I'd consider that morally a counterexample to what you're talking about, because === is (in a sense) the standard equality operator in JavaScript; in the same way, I'd consider Eq::== to be the standard equality operator in Rust.

> Again, kind of true - for physical and structural equality it does a bitwise-comparison. However, most people recommend against those equality operators because they are misleading in many cases, and the float-specific equality operator does what I described.

I think that's the right way to do it. Generic code using the generic == expects it to have the standard properties - in particular x == x for any x. Code that uses a float-specific operator can be prepared for float-specific behaviour.

> In what ways is bitwise comparison wrong? For structured data (say, binary-tree backed maps) this is clearly bad because there are potentially multiple representations that we would consider equivalent. But even for floats it is bad - `0 == -0` should be true, but will not be. Further, two different NaNs will compare as false even though two of the same NaN will compare as true (you're not supposed to be able to distinguish between different representations of NaN).

This is true as far as it goes. Equally, having 0 == -0 is bad (it breaks substitutability, since 1 / 0.0 != 1 / -0.0), and having x != x is super bad. Really there is just no good general-purpose comparison for IEEE 754 floats, and so they should either not implement generic == at all (in languages where that's possible), be avoided by default and hidden away in an "unsafe" area of the standard library, or both.


The fact that null != null (and NaN != NaN) is, to me, a helpful reminder that the mathematical model behind SQL is not as simple and elegant as newcomers would like it to be. It reminds me of the classic "proof" that 1 = 2, based on simple algebra:

https://www.math.toronto.edu/mathnet/falseProofs/first1eq2.h...
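(For those who don't want to click through, the classic version runs roughly like this; the invalid step is dividing both sides by a - b, which is zero:)

    a = b
    a^2 = ab
    a^2 - b^2 = ab - b^2
    (a + b)(a - b) = b(a - b)
    a + b = b        <- divides by (a - b) = 0
    2b = b
    2 = 1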

If even basic algebra has a trap like that, then all of us need to be prepared for similar traps.


There is not a trap in algebra that is shown in that proof: there is a mistake in that proof. They do something that is not allowed, even if it's not obvious what that was at first glance.

I don't think one can compare mistakes with thoughtful and powerful modelling.


By trap I mean a surprising exception to the common rules. Most high school graduates learn algebra, yet they rarely learn how to avoid accidental division by zero. It's a surprising exception.

"null != null" is also a surprising exception to the common rules. I honestly wouldn't expect most developers to know it until their code breaks because of it. That doesn't mean the surprising rule is wrong, it just means people have to learn it. Once they learn it, they may try to find the deeper model where the rule is no longer a surprise but rather an important feature.


> composing the proper check is absurdly verbose.

You're kidding, right? Instead of

  WHERE Field = 1
You have to specify

  WHERE Field = 1 OR Field IS NULL
You might even have to add parentheses! It's so verbose it puts Java to shame!

Yes, I understand that real queries are much longer, but that's not the language's fault. That's the complexity of your data coming through.

NULL doesn't behave the way it does out of some need for mathematical purity. Understanding the model just tells you when you know you need to handle your NULLs. The system behaves the way it does because the system doesn't know what you want to do when a value doesn't exist or is unknown. Logically, there is no answer. That's what you're complaining about here. It's like asking the system to add two numbers and only giving it one of them.

What does a NULL datetime mean? The value wasn't recorded? It wasn't observed? It hasn't happened yet? That field doesn't make sense to record with the other data on the record (i.e., you denormalized your data)? That the value is actually known to be unknown? Or some combination of those? How is the system supposed to know?

When you're telling the system what to do with NULL values, you're telling the system what NULL means for that query. That's all you're doing. You're just telling the system what your data actually means in that query. That's not hard. Complaining that the system that stores structured data actually requires you to understand what your data represent and how your data are being used is... strange.


> You might even have to add parentheses! It's so verbose it puts Java to shame!

It's a serious problem, because in the real world programmers are not willing to double the length of their code for the sake of catching bugs that happen maybe 1 in 100 times. If the language doesn't make it easier to do the right thing than not, you'll end up with bugs. http://www.haskellforall.com/2016/04/worst-practices-should-...

> NULL doesn't behave the way it does out of some need for mathematical purity. Understanding the model just tells you when you know you need to handle your NULLs. The system behaves the way it does because the system doesn't know what you want to do when a value doesn't exist or is unknown. Logically, there is no answer. That's what you're complaining about here. It's like asking the system to add two numbers and only giving it one of them.

The problem is that there's no way to have a model that doesn't include that. At least 80% of the time, probably more, having no value for a given field is just an error, there is no correct way to handle it, so you want to mark it as non-nullable and get on with your life. But in SQL there's no type system that can distinguish nullable things from non-nullable, so you have to either treat every single column and intermediate value as nullable, or make error-prone manual checks, or just get weird results every so often.


> It's a serious problem, because in the real world programmers are not willing to double the length of their code for the sake of catching bugs that happen maybe 1 in 100 times.

No, the problem is that programmers don't like to understand the problem before writing the solution. The problem is worshiping code brevity instead of understanding what their code does. No, that's not a new problem, but it doesn't mean that it becomes the language's problem that you have to actually understand what you're telling something to do. Platitudes like "worst practices should be hard" are all well and good, but they're just that. It should be really difficult to include a buffer overflow, or to dereference a null pointer. In practice, languages that do that have a cost overhead. You still had to write that extra code; the compiler or library just did it for you.

I get it. Developers have a lot to know these days, and stacks and frameworks change rapidly. But the relational model has been largely unchanged for 50 years, and SQL has been basically the same for 30 years. The relational model is very old, and it hasn't changed for very good reasons. There's like 5 to 10 basic patterns for queries that you have to know, and that will cover you in 95% of cases if your system is well designed and you actually understand the data.

Either you understand what your application is doing and you understand what your query is doing, or you don't understand either of those things. I'm sorry, there isn't a middle ground, and there never will be regardless of what data store you choose, what language you use, etc. The guy at the keyboard still has to know what he's doing.

> At least 80% of the time, probably more, having no value for a given field is just an error

Then set the column to have a NOT NULL constraint and now it's impossible to store a null value. You'll get an error just like trying to store a character value into an integer field.

> But in SQL there's no type system that can distinguish nullable things from non-nullable

That's not true. Every column in a table can have a NOT NULL constraint. If your data is actually required and guaranteed to have a value, you just specify that the table's column can't be NULL. You will have to literally break the database engine to get a null value stored in that column.

Yes, you might have situations where you need to use an outer join which will have null values. Yes, you will have to handle NULLs from an Nth normal form database in your application in this case. It's still not difficult. You're still going to have non-nullable fields in your output unless your design is incredibly poor.


> The problem is worshiping code brevity instead of understanding what their code does. No, that's not a new problem, but it doesn't mean that it becomes the language's problem that you have to actually understand what you're telling something to do.

Good language design means accepting that programmers are human and writing for use by programmers as they actually exist, not for some imaginary perfect programmer.

> It should be really difficult to include a buffer overflow, or to dereference a null pointer. In practice, languages that do that have a cost overhead. You still had to write that extra code; the compiler or library just did it for you.

This is a dangerous myth. Most of the problems with C - and SQL - are just deadweight losses. You actually can just use languages that don't have those problems.

> Either you understand what your application is doing and you understand what your query is doing, or you don't understand either of those things. I'm sorry, there isn't a middle ground, and there never will be regardless of what data store you choose, what language you use, etc. The guy at the keyboard still has to know what he's doing.

Languages are tools for understanding. There are good and bad languages, and it's easier to make better languages than make smarter programmers.

> Then set the column to have a NOT NULL constraint and now it's impossible to store a null value. You'll get an error just like trying to store a character value into an integer field.

Right, but there's no way to distinguish that at the point of querying, and no practical way to have NOT NULL constraints on your intermediate variables in the middle of a query.


Do that for two arbitrary variables.

Test that

    t1.f1 = t2.f2
including nulls, for an arbitrary type of f1 and f2.

This is actually a hellscape of gotchas as people try to find a short way to do it and then discover that their short version is treating NULL as equivalent to FALSE, which means it will fail when negated. Or they throw in a COALESCE and then fail because they've now treated a special value as equivalent to NULL.

The correct answer is

    ((t1.f1 IS NULL AND t2.f2 IS NULL) OR (t1.f1 IS NOT NULL AND t2.f2 IS NOT NULL AND t1.f1 = t2.f2))


First of all, you can shorten that to this:

  ((t1.f1 IS NULL AND t2.f2 IS NULL) OR (t1.f1 = t2.f2))
That's logically equivalent because TRUE OR UNKNOWN evaluates to TRUE, while FALSE OR UNKNOWN evaluates to UNKNOWN, which is not TRUE and therefore gets excluded.

Second, the complicated one that shows up in MERGE or UPSERT operations is the other way:

  ((t1.f1 IS NULL AND t2.f2 IS NOT NULL) OR (t1.f1 IS NOT NULL AND t2.f2 IS NULL) OR (t1.f1 <> t2.f2))
Third, this is exactly why many RDBMSs support the IS [NOT] DISTINCT FROM syntax defined in ANSI SQL, which replaces the first:

    t1.f1 IS NOT DISTINCT FROM t2.f2
Or this which replaces the second:

    t1.f1 IS DISTINCT FROM t2.f2
MS SQL Server and MySQL don't, but Oracle and PostgreSQL do. It's been in the standard for over 20 years at this point. So it's not the language that sucks; it's your vendor.


And then somebody negates your "is equal" because they want "not equal" statement and tears their hair out.


Wouldn't one just use coalesce()? e.g. coalesce(Field, 1) = 1?


No, that's a bad pattern that should be avoided.

First of all, what happens if the value of Field is actually 1? Did you want that record affected? Because now it will be. It's not exactly clear from your query that you did, is it? This is explicit:

  WHERE (Field = 1 OR Field IS NULL)
And it's not actually that much shorter:

  WHERE COALESCE(Field, 1) = 1
Second of all, if Field is indexed, well, you probably just ignored that index or turned a seek into a scan (i.e., a slower operation). Putting a field into a function will often mean the query engine can't use the index to help with query performance. It means the query engine will have to do a whole lot more work. The term in the industry is "sargable": https://en.wikipedia.org/wiki/Sargable


That’s all great but it’s not like we haven’t figured out how to solve this in other languages with `Option` or `Maybe` types. Obviously SQL predates this, but that shouldn’t stop us from lamenting the fact that SQL repeated the Billion Dollar Mistake and discussing a future alternative that’s learned from this and doesn’t feature such a sharp edge.


Sum types like Option/Maybe (or C's NULL) don't do the same thing as SQL's NULL, though. By all means, if you don't need NULLs, then don't use them — `NOT NULL` should probably have been the default in table statements — but they serve a purpose.


Super interesting!

Although it sure would be nice to have explanations for amateurs such as me.


I highly recommend puzzling through some of these on your own first (it's a great way to build intuition for SQL's craziness), but here's a few explanations if you're stuck.

    jamie=# select a+1 from nums group by a+1;
     ?column?
    ----------
            2
            4
            1
            3
    (4 rows)
    
    jamie=# select a+1 from nums group by 1+a;
    ERROR:  column "nums.a" must appear in the GROUP BY clause or be used in an aggregate function
    LINE 1: select a+1 from nums group by 1+a;
In general, a query like `SELECT a FROM t GROUP BY a` is valid, while a query `SELECT b FROM t GROUP BY a` is not valid. Because you didn't group by column b, there are multiple values of b within each group, and it's not clear how to combine them. What you usually meant is something like

    SELECT a, sum(b) FROM t GROUP BY a
where you can see the sum of all b's for each distinct value of a.

The error message explains it really nicely: if there is a GROUP BY clause, every column in a SELECT list must either be used in the GROUP BY clause or appear in an aggregate function. This is kind of annoying, though; what if you want to group by a value derived from a column, like `a + 1`? (More realistically, maybe you want to trim whitespace or something.) I think the SQL standard would have you write

    SELECT a1 FROM (SELECT a + 1 AS a1 FROM t) _ GROUP BY a1
but that's annoying. PostgreSQL (and maybe other SQL implementations; not sure!) has some smarts about detecting when an expression is used verbatim in a SQL query, so you can just write

    SELECT a+1 FROM t GROUP BY a+1
and it will notice that the "a+1" in the SELECT list is literally the same as the "a+1" in the GROUP BY clause. But it's not that smart, because it doesn't realize that "a+1" and "1+a" are semantically equivalent, and so this is an error:

    SELECT a+1 FROM t GROUP BY 1+a
----

    jamie=# select b from nums group by a;
     b
    ---
    (0 rows)
    
    jamie=# select a from nums group by b;
    ERROR:  column "nums.a" must appear in the GROUP BY clause or be used in an aggregate function
    LINE 1: select a from nums group by b;
An extension of the previous example, where PostgreSQL can infer that when grouping by a, each group contains exactly one b. It can't infer the converse. The reason is that a is the primary key and so guaranteed to be unique, and therefore `GROUP BY a` is meaningless and can be elided.

(This one really confused me for a while. I only learned this recently.)

----

    jamie=# select a, b from nums order by 2;
     a |  b
    ---+-----
     3 | bar
     0 | foo
     1 | foo
     2 | foo
    (4 rows)
    
    jamie=# select a, b from nums order by 1+1;
     a |  b
    ---+-----
     0 | foo
     1 | foo
     2 | foo
     3 | bar
    (4 rows)
`ORDER BY <n>` means "order by the Nth column", but `ORDER BY <expr>` means "order by the value of <expr>". So you get the goofy behavior that `ORDER BY 2` in the above query means order by column b, but `ORDER BY 1 + 1` means order by the literal value "2", i.e., don't order by anything useful at all. (It just so happens that you get primary key order in the example above, but that's not guaranteed by any means.)
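Postgres does at least reject non-integer constants there, so you can't accidentally order by a quoted column name (IIRC this error message is verbatim, but worth double-checking on your version):

    select a, b from nums order by 'b';
    ERROR:  non-integer constant in ORDER BY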

----

Most of the others have a similar feel. Let me know if you're particularly puzzled by any others.


> PostgreSQL (and maybe other SQL implementations; not sure!) has some smarts about detecting when an expression is used verbatim in a SQL query

I think this one is in the SQL 2016 spec. There's definitely a section about how to put the AST into a normal form to compare expressions for syntactic equality.

EDIT found it - https://news.ycombinator.com/item?id=22991487


I'm not sure the subquery would be necessary if the weird AST-based naming didn't exist. Couldn't you just write:

  SELECT a+1 AS a1 FROM t GROUP BY a1

?


You can do that in PostgreSQL, but not according to the spec, apparently, and the PostgreSQL developers regret the feature. [0] (As someone who's spent a lot of time trying to simulate PostgreSQL's name resolution rules... it's reassuring to know that the rules are painful inside of PostgreSQL too.)

[0]: https://www.postgresql.org/message-id/7608.1259177709%40sss....


I never figured out why AS isn't allowed in the GROUP BY clause. It would be much simpler to implement `SELECT a1 FROM nums GROUP BY a+1 AS a1`.


I was confused by "select b from nums group by a;" indeed.

I am a little bit surprised by "select array[null] = array[null];" too.

The rest did seem like 'undefined behaviour' (or well-defined behaviour involving NULLs, literal-matching in the query planner [used in matching partial indexes too]...).


> select b from nums group by a;

MySQL allows selecting columns that don't appear in the GROUP BY clause. It is not good and allows you to shoot yourself in the foot. It gives you a false sense of correctness because, well, the query works.

It just takes _one_ arbitrary value you happen to have in b and selects it.

As I'm mostly working with MSSQL and sometimes have to shoot queries at MySQL - is there any mode / strictness I can set to error out on these kind of queries rather than silently continue?


I think I’ve been drinking the SQL kool aid for too long, only a couple at the end surprised me. The rest seem perfectly logical.

Was a lot of fun to read through them though.


So I looked at these, and I didn't notice any examples where the problem was with ANSI SQL. Please let me know if there were.

It seems like the problems were with PostgreSQL flavored queries.

Speaking of which - I know Microsoft has Transact-SQL and Oracle has PL/SQL, is PostgreSQL just PostgreSQL?


I think a lot of these are in the ANSI 2016 spec. I haven't looked at the older specs, maybe they were simpler.

Eg here is the `a+1` vs `1+a` thing explained for ORDER BY - https://i.imgur.com/bklpjkP.png. Here are the conditions under which ORDER BY can reach inside SELECT to pick columns from the FROM clause - https://i.imgur.com/aoeqDGn.png.

Chapter 2 of the spec is 1732 pages of that kind of language, so it's hard to be sure that any particular behaviour is not in there somewhere. But I'm fairly sure that all the examples with set-valued functions are postgres-isms.


Wow, I never would have guessed that ORDER BY stuff was part of the standard. Seems like one of the special corners of SQL.

From that brief read of the ANSI Spec it seems like one would need a great deal of time and some advanced mathematics to understand it.


There isn't really any math involved. It's just a lot of edge cases written out in lots of detail.

I think a lot of what makes SQL hard to implement comes down to consistently prioritizing local convenience over global complexity, combined with not being able to break backwards compatibility to simplify the design.


When I'm writing procedural code for PostgreSQL, I'm usually writing PL/pgSQL... which is very, very similar to Oracle PL/SQL (having written more than my share of PL/SQL); My recollection is the similarity is intentional.

However, there are other PL's for PostgreSQL. PL/Python, PL/Perl, and PL/Tcl are also distributed with the database and there are others available (https://www.postgresql.org/docs/12/external-pl.html... though I've only used JavaScript)


PL/SQL isn't Oracle’s SQL, it's Oracle's SQL-based procedural language for triggers, stored procs, etc.; the postgres equivalent is PL/pgSQL.

In both Oracle and Postgres, the SQL dialect, as opposed to the procedural language, is just called “SQL”.


What are the differences between the SQL dialect implementations called then? Extensions?


They are called either pieces of the standard one or the other hasn't implemented or extensions, depending on which they are, sure. This is pretty explicit in the Postgres docs which directly address standards compliance throughout the SQL reference and specifically address bits of the standard they don't support or Postgres extensions to the standard. Oracle, IIRC, doesn't talk about it as directly in the main docs (though it's been longer since I read Oracle docs.)


Many of the cases seemed to be focused on arrays, which in my non-exhaustive experience you'll see more of in postgres than in other platforms. Same for lateral join, hstore, jsonb...


PL/SQL and T-SQL are the names of the stored procedure languages - postgres's equivalent is PL/pgSQL.

All three have variations from ANSI SQL in their non-sproc SQL too, oracle's far worse than the other two, but pg and SQL Server document where they deviate pretty damn well.

There definitely were ANSI examples in there, "ORDER BY 2" is in ANSI (and IIRC plenty of pre-ANSI dialects as well) for a start, I'm too tired (and/or lazy) to go back and look for others, sorry.


Most of these are “play stupid games, win stupid prizes”, not things you’d encounter in normal usage.

Queries like `select a, b from nums order by 2` are nonsense. And sure, you can create tables with weird names if you want, but that’s not really a problem.


Some of those queries are nonsense, but `select a, b from nums order by 2` is not one of them. `order by 2` here means "order by the second field in the select statement."


One note: while using the field number for anything is perfectly valid SQL, it will not pass a code review if I'm the one doing the review. Reason being, it's not clear without looking at the field list what is being ordered. For extra shame do something like:

    select * from foo order by 2
Order of the fields when doing "select *" may be guaranteed (I don't know), but it has been dependent on when the field was added to the database on every system I have worked with, and sometimes patches are applied in different orders on different databases.


If you have a Python script create your tables

    AND 1=1 
    AND this script has the table information stored in dicts in the {column name: column type} form
    AND this script is run by a python version where the dict order is undefined
you will get funny outputs from your queries that break production deployments in subtle and unique ways.

So yeah, just use column names (and name your computed columns!).


Yes, name your computed columns. Makes your life easier.


No, it is using the constant expression 2 as a sort key, resulting in a random order which should be expected

Edit -- this is wrong see below


That is not correct. The parent has it right. The output is in fact sorted by the second field; it is not random.

> Each expression can be the name or ordinal number of an output column (SELECT list item), or it can be an arbitrary expression formed from input-column values.

> The ordinal number refers to the ordinal (left-to-right) position of the output column. This feature makes it possible to define an ordering on the basis of a column that does not have a unique name. This is never absolutely necessary because it is always possible to assign a name to an output column using the AS clause.

(emphasis mine)

https://www.postgresql.org/docs/current/sql-select.html#SQL-...


Some SQL servers, like MS-SQL, have slowly been moving toward banning ORDER BY with a number / constant expression, and flagging column numbers as error-prone constructs.

https://www.sql-server-performance.com/error408/


Interesting -- didn't know it did this


One way I've seen it used is when you need to order or group by all or nearly all of the columns in a select statement. Sometimes

select a.c1, a.c2, a.c3, ... a.c10 from a GROUP BY 1,2,3,4,5,6,7,8,9,10

conveys the intent more clearly than listing out all the column names


It is also very helpful to avoid re-typing expressions.

select a || 'baz', avg(b) from foo group by 1 order by 2;


could '2' be a column name?

... meaning 'order by 2' would order it by the values in the column named '2', but which wasn't selected.


In PostgreSQL, the grammar does not permit unquoted identifiers to start with a digit. So this is a syntax error:

    SELECT 'val' AS 2
Instead you must escape the identifier with double quotes:

    SELECT 'val' AS "2"
So there is no ambiguity on that front in the ORDER BY clause. Either you have `ORDER BY 2`, which means order by the second column, or you have `ORDER BY "2"`, which means order by the column named "2".


It's not nonsense, but it is a bit of a footgun if you put it into production.

Order by 2 is the kind of thing I use all the time in investigations and POCs, but if you put it into prod, you have to account for someone else updating the select and getting things out of order.


Right, but if you're trying to be Postgres compatible then you have to care, and if you are making a compiler of some sort which outputs SQL then you have to care a lot, because compilers often generate things that are not natural.


Exactly. We're often on both ends of this - some BI tool will generate a SQL expression that a human would never have typed, and we have to handle it the same way that postgres would.


All of them seem valid errors. I don't get the idea of the post lol.


The author of this seems to be surprised that the DB doesn't just sort things (which could be billions of rows) automatically. SQL deals with "sets", which are unordered, and the user needs to specify the order by if they need sorted output. I think this is an excellent design.


You're not looking at the examples closely enough, if that's your takeaway. I assure you they are much weirder than forgetting to specify an ORDER BY clause and expecting sorted output.


> SQL deals with "sets"

SQL deals with tables, and all operations are working on tables. However... these tables are bags of rows, not sets, as evidenced by:

1. "distinct" transforms bags into sets

2. "union all" preserves bags

3. the ability to create a table with no primary key, allowing entirely duplicate rows in a table

4. the ability to select columns not including a primary key, allowing duplicate rows in the resulting table
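A quick sketch of points 1 and 2, using an inline VALUES table:

    select x from (values (1), (1)) t(x);            -- 2 rows: a bag
    select distinct x from (values (1), (1)) t(x);   -- 1 row: now a set
    select 1 union select 1;                         -- 1 row
    select 1 union all select 1;                     -- 2 rows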

> ... which are unordered

While (base) tables themselves guarantee no order:

1. every "select" is itself returning a table, which can absolutely have an order.

2. a "view" is a (derived) table, and it can guarantee an order

3. aggregate functions rely on ordering

> I think this is an excellent design.

Your approach would be an excellent design.


    SQL deals with "sets", which are unordered, 
    and the user needs to specify the order by if 
    they need sorted output. I think this is an 
    excellent design.
Well-said, and I completely agree. However, I'm not sure that the author is complaining about SQL, so much as he is kvetching about some of the weird-looking things one needs to deal with while "working on a postgres-compatible query compiler."


I'm not sure that's a point the author is trying to make. To which example are you referring?


Choosing an ineffective tool, or not being familiar with its effective use, says more about the crafter than the tool.


I don’t understand what point you’re trying to make.


The examples in the article are given as things the author learned while "working on a postgres-compatible query compiler."

When writing any tool, particularly something as complex as a compiler, there will of course be many of these weird edge cases. They are not presented as flaws in SQL or anything else.


SQL is indeed pretty crufty. After all, it dates back to the 70's, and is generally regarded as an 'entry level' language for analysts.

It's nowhere near as crazy as Javascript (and, arguably, Ruby):

https://www.destroyallsoftware.com/talks/wat


I don't think SQL is particularly crufty, but I'm willing to admit that's highly debatable -- depends on your definition of "crufty," really.

However, the examples in the linked article are not examples of cruft.

I'm using this example of "cruft":

    Cruft is a jargon word for anything that 
    is left over, redundant and getting in the 
    way. It is used particularly for defective, 
    superseded, useless, superfluous, or dysfunctional 
    elements in computer software. 
Nearly all of the examples in the linked article are based on core SQL sorting and grouping concepts (with a sprinkle of trivalue NULL logic) that have been around since Day One.

I also don't think they really illustrate flaws/gotchas/etc in SQL itself. Nor were they meant to, I don't think... these are the sorts of pathologically ambiguous things a compiler needs to handle, but they are not representative of anything I've seen in 20+ years of real-world SQL experience.


There are parts of SQL that are crufty by that definition. For example, COMMIT; and COMMIT WORK; are two different statements that do the exact same thing according to the SQL standard.

Perhaps trivial cruft but cruft nonetheless.


I personally wouldn't say that example fits the "getting in the way" part of the definition.

As a user, I've been using SQL for many years and it's never been in the way. It's really just a syntactic alias, so I can't imagine it really impedes the work of the folks maintaining RDBMS's under the hood.

But, I wouldn't argue either. Like you say... I suppose that, however tiny, the burden is non-zero! =)


Huh? SQL isn’t meant to be a general purpose programming language like JS/Ruby, it’s a DSL. Many have tried to create “better” DSLs for interfacing with DBs, and they’ve basically all failed, because SQL is just too god damn useful and well designed. Also, it’s every bit as useful for application developers as it is for analysts, the statement that SQL “generally regarded as an ‘entry level’ language for analysts” couldn’t be more wrong.


Comparing SQL to JavaScript or Ruby is like comparing apples to oranges. They have completely different purposes and applications. Rather, it would make more sense to compare SQL to MySQL (or any other derivative). It's like comparing BASIC to QBASIC.

I would argue that SQL itself isn't very crufty. Rather, it's fairly simple and well thought out since it's based on many years of thought and consideration from the creators of the relational database model. E.F. Codd, for one, wrote at length about how these models can work for years before SQL appeared, refining the concept for maybe a decade afterwards. His goal was to make a system that made it easy for developers to store and request information with as little redundancy as needed.

Vendors and developers of database systems have implemented various functions and features that expand on or alter SQL, but those are reflections more on those particular systems than on SQL. For example, MySQL's GROUP BY does not verify whether the extra columns in the select list are dependent on the group by columns, which is against SQL's definition. It wouldn't make sense to count that as an issue with SQL itself.



