
I'm a huge proponent of SQL; for me it's virtually impossible to beat as a query language.

We played with wrapping our DB layer with GraphQL recently and I would say the experience was pretty good. In part that's because you're working a layer up, where you can have a more pluggable model. In a first pass you can model your local DB structure, but it's easy to glue more stuff into it from other sources and generally evolve the structure.




HAVING is not the same thing as WHERE. WHERE filters the query to produce a result set. HAVING filters the result set.

    select id, count(*) - 1 as nduplicates
       from parts where color = 'red'
       group by id
       having count(*) > 1;
WHERE cannot filter on the count(*) because WHERE is applied during the query before the count is known.
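
For instance, trying to move that condition into WHERE is simply rejected (same hypothetical parts table):

    -- invalid: WHERE runs before GROUP BY, so the aggregate doesn't exist yet;
    -- most engines reject this with an error about aggregate functions in WHERE
    select id, count(*) - 1 as nduplicates
       from parts where color = 'red' and count(*) > 1
       group by id;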

I agree SQL is kind of warty, and it would be great if it were easier to compose from programs, but HAVING is not redundant to WHERE.


But that's equivalent to the following query, which doesn't use HAVING, only a plain WHERE on a subquery:

    select * from (select id, count(*) - 1 as nduplicates
       from parts where color = 'red'
       group by id) _inner
       where nduplicates > 0;

Now obviously that doesn't read as easily, but WHERE can filter on the count(*) perfectly well.


I believe this should have been in response to its sibling. The set theorist in me agrees - the two operations are working on 2 different sets at 2 different levels of abstraction.


Yes, replied to the off-by-one post.


I find it very easy to imagine a better SQL.

Two things I'd fix:

* make it explicit when a join is expected to have 1:1 semantics vs 1:n, because it's very easy to end up with duplicated rows otherwise

* increase composability of the syntax; why do we need HAVING, why isn't WHERE enough? We have relational algebra: projections, filters, joins, folds, union, intersect, etc. But the syntax of SQL is so idiosyncratic, and at times antagonistic to the stream- and set-like intuitions behind relational algebra.


Regarding your first point, I’m trying to imagine what that would look like. So you mean something you could put on there to say (join x [1:1] on blah)? In general the mistake that’s easiest to make is to get duplicate rows from another join later, which this wouldn’t help with - but maybe I’m misunderstanding.


I don't really understand your comment about it always being join x+n that causes the problem rather than join x. What is different about a later join that isn't equally true of the current join, once you get around to writing it? If the joins for tables T1::T2 and T2::T3 are 1:1, they don't magically become 1:n when you do T1::T2::T3.

To do it right, you'd need to mark up the relations with expected cardinality (effectively unique constraints), so that the DB would be able to verify ahead of time whether a join is going to be 1:1 or 1:n. That would be a better solution than a join working fine up until the cardinality expectation is violated and it suddenly stops working.

If we had to stick with SQL syntax, it might be something like:

   select a.*, b.*
   from a
   join one b on b.id = a.b_id
(Unique constraint from primary key on b.id. It may still filter down the set of rows in a, but it won't duplicate.)

Or:

   select a.*, b.*
   from a
   join many b on b.some_key = a.some_key
(No unique constraint on b.some_key)

I think the interesting cardinality distinctions are 1:1, 1:n and 1:0.

(The above syntax is ambiguous with aliases, so it wouldn't fly as is. But it gives a flavour.)
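
For what it's worth, here's a minimal sketch of the kind of schema those two examples assume (the table and column names are just made up to match the snippets above):

    create table b (
       id integer primary key,  -- unique, so a join on b.id can be proven at most 1:1
       some_key integer         -- no unique constraint, so a join on b.some_key may be 1:n
    );

    create table a (
       id integer primary key,
       b_id integer references b(id),
       some_key integer
    );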


So the idea is that when you write the queries with these explicit join requirements, you'll want the query to fail if someone changes the schema later in such a way that the requirement is no longer met? If so, it seems like something that comes up so infrequently as to not warrant the noise, but maybe I'm still misunderstanding.

What I was saying was that often people have T1::T2 (1-1) and then throw in a later join T2::T3 (1-many) without considering the implications for T1 (maybe they were doing a count(T1.id)). At least that seems to be the aspect that confuses developers newer to SQL: the many join duplicates other parts of the data.
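
Something like this, with made-up tables where T1::T2 is 1-1 and T2::T3 is 1-many:

    select count(T1.id)             -- inflated: each T1 row is counted once per matching T3 row
    from T1
    join T2 on T2.id = T1.t2_id     -- 1-1
    join T3 on T3.t2_id = T2.id;    -- 1-many, silently duplicates the T1/T2 side

    select count(distinct T1.id)    -- the usual band-aid
    from T1
    join T2 on T2.id = T1.t2_id
    join T3 on T3.t2_id = T2.id;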


It comes up quite a bit if you have repeated data in batches where the batches are a first-class concept. Any relations need to be qualified by both the normal FK id and the batch id, or else you bring back rows across all batches.

Think something like measurement samples that come in an envelope with metadata, or account updates that come in a statement, that kind of thing.
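
A rough sketch of what I mean, with hypothetical table names:

    select s.*, m.*
    from samples s
    join sample_metadata m
      on m.sample_id = s.sample_id
     and m.batch_id  = s.batch_id;  -- drop this line and each sample matches metadata from every batch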



