Let's talk about joins (cghlewis.com)
169 points by r4um on Jan 20, 2024 | 76 comments



> “Here we typically expect that the combined dataset will have the same number of rows as our original left side dataset.”

For a left join this isn't entirely true. If there are more matching rows in the right table, you'll get an additional output row for each match. That is, unless you take steps to ensure that at most one row is matched per row on the left (e.g. using something like DISTINCT ON in Postgres).
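A minimal sketch of the row multiplication (orders/payments are hypothetical tables):

  -- one row in orders, two matching rows in payments
  SELECT o.order_id, p.amount
  FROM orders o
  LEFT JOIN payments p ON p.order_id = o.order_id;
  -- the result has two rows for that order, not one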


This is my pet peeve. Top Google search results and ChatGPT suggest this "same number of rows" mental model, which is incomplete and breaks down.

I wrote about this: https://minimalmodeling.substack.com/p/many-explanations-of-...


The reason this misunderstanding is so pervasive is probably that joins are effectively function application (or function composition, which is much the same thing).

I also wrote about this ;-) https://pragmathics.nl/2023/10/24/putting-the-relational-bac...


Yes, I picked that up and was tempted to comment. But then, halfway down, it addresses this with the section “Many relationships”:

> Until now we have discussed scenarios that are considered one-to-one merges. In these cases, we only expect one participant in a dataset to be joined to one instance of that same participant in the other dataset.

> However, there are scenarios where this will not be the case…


Got it. In my opinion, the structure is likely to mislead people because it doesn't make clear at the outset that it is discussing 'one-to-one' relationships (in fact it deals with the cases of one-to-one and one-to-none); it only says 'left join'.


Yep, the whole article is confusing to those of us who are already very familiar with, e.g., SQL joins. It's not using mainstream terminology. But I guess we are not the audience.


I agree that this article is introductory, but that makes the many non-standard and anti-standard terms (e.g. "vertical joins") less forgivable: misleading learners is worse than irritating experts.


The hidden requirement is that you need to join on a key; in an ideal world this would be the primary key of the right table and a foreign key in the left table.

Of course this requirement isn't ever enforced, because the real world isn't kind enough to give us strictly modelled data. It would simplify the query language a lot, though.
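A sketch of that ideal world, with hypothetical customers/orders tables where the join key is a primary key on the right and a foreign key on the left:

  CREATE TABLE customers (customer_id INT PRIMARY KEY, name TEXT);
  CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT REFERENCES customers (customer_id)  -- foreign key to the right table
  );
  -- each order matches at most one customer, so the row count is preserved
  SELECT o.order_id, c.name
  FROM orders o
  LEFT JOIN customers c ON c.customer_id = o.customer_id;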


ClickHouse has support for this as an "ANY" JOIN modifier: https://clickhouse.com/docs/en/sql-reference/statements/sele...


If I were going to add anything:

"Vertical join" is normally called union.

"Right join" is just (bad) syntactic sugar and should be avoided; a left join with the tables reversed is the usual convention.

The join condition for an inner join is optional; if it's always true, you get a "cross join". This can be useful for showing all the possible combinations of two fields.
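For instance, a minimal sketch (sizes/colors are hypothetical tables):

  -- every (size, color) pair, whether or not any product has it
  SELECT s.size, c.color
  FROM sizes s
  CROSS JOIN colors c;
  -- equivalent to: FROM sizes s INNER JOIN colors c ON TRUE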


Of the tens of thousands of queries I've written, I've needed a right join exactly once. It's a feature that is neat in that it exists, but its prevalence in teaching materials is entirely unjustified. Cross joins are massively more practical and enable some efficient transformations, yet they are usually taught only as all-to-all, without a clear position on why they are useful.


> but its prevalence in teaching materials is entirely unjustified.

I agree that it is seldom good query-writing practice, but I think it makes sense in education because it rounds out the join variations, making them symmetrical and thus easier to remember.


How would you rewrite a left join followed by a right join? I don't think right joins are always sugar.


Make the "middle" table come first and do two left joins from that one to the other two.


Consider a LEFT JOIN b RIGHT JOIN c. You'll get all the records from c.

Now consider b LEFT JOIN a LEFT JOIN c. The result is not the same at all.


c LEFT JOIN (a LEFT JOIN b) is the same.


I didn't know this was syntactically valid, but maybe it is.


It most definitely is. A join is a binary operator and hence has only two operands. Joining three tables requires an associativity rule, and joins are left associative. The ability to omit parentheses could arguably be considered syntactic sugar as well. In other words,

  a LEFT JOIN b RIGHT JOIN c
is equivalent to

  (a LEFT JOIN b) RIGHT JOIN c.
Once you have that, you can flip the RIGHT JOIN by swapping the operands, giving you

  c LEFT JOIN (a LEFT JOIN b).


SQL grammar could easily specify that parentheses are just not allowed; in fact, I thought that was the case. Now that I know they're allowed, I agree that right joins have no reason to exist.


A CTE would work too.


A left join requires an `on` clause, which can feel out of place when doing it like this, but it makes sense when you consider the parenthesized join to be a relation itself:

  c left join (  
    a left join b on a.a_id = b.a_id  
  ) on c.a_id = a.a_id
The parenthesis syntax is also super useful when you want to left join to a group of tables that all need to inner join together:

  a left join (
    b inner join c 
     on b.b_id = c.b_id
    inner join d
      on c.c_id = d.c_id
    inner join e
      on d.e_id = e.e_id
  )
  on a.b_id = b.b_id
That way, the b/c/d/e relation is all or nothing, instead of possibly getting values from b but not from c, d, and e.


All the other non-left joins are just syntactic sugar and can be expressed using only left join…?


> All the other non-left joins are just syntactic sugar and can be expressed using only left join…?

A “vertical join” (SQL UNION) is one of the “non-left joins”. How can you transform a “vertical join” to a “left join”?


This feels like an interesting interview question. I think you could simulate this with a full outer join on the entire select list plus coalesce. For a UNION ALL, you could put a literal column in the selects from both tables, set it to different values, and include it in the join condition; that way the result set has all NULLs in the right table's columns for the rows from the left table, and vice versa. Something like:

  WITH top_t AS (
   SELECT
     a
    ,b
    ,c
    ,'top' as nonexistent_col
   FROM table_1
  ), bottom_t AS (
   SELECT
     a
    ,b
    ,c
    ,'bottom' as nonexistent_col
   FROM table_2
  )
  SELECT
    COALESCE(top_t.a, bottom_t.a) AS a
   ,COALESCE(top_t.b, bottom_t.b) AS b
   ,COALESCE(top_t.c, bottom_t.c) AS c
  FROM top_t FULL OUTER JOIN bottom_t 
    ON top_t.a = bottom_t.a
    AND top_t.b = bottom_t.b
    AND top_t.c = bottom_t.c
    AND top_t.nonexistent_col = bottom_t.nonexistent_col -- remove this for a normal UNION


This is a nice trick using full outer join! My question to wood_spirit still stands, though.


A fun challenge :)

Assuming a null we can use as a sentinel:

      SELECT COALESCE(a.col, b.col) AS col
      FROM a
      LEFT JOIN b ON (TRUE)
      WHERE (a.col IS NULL) IS DISTINCT FROM (b.col IS NULL)

(Getting that sentinel might take effort, depending on eg whether there are useful window functions. Here is a way to do it with only left joins and the assumption the column has no duplicate values:

  WITH crossed AS (
     SELECT * FROM UNNEST([1, 2]) AS sentinel
  )
  SELECT IF(sentinel = 1, col, NULL) AS col
  FROM a
  LEFT JOIN crossed ON TRUE
  WHERE sentinel = 1 OR col = (SELECT * FROM a LIMIT 1)


Bravo!!


Excellent learning material, thanks for sharing. I've noticed an interesting trend: joins are crucial for data analytics, yet many new open-source data analytics or OLAP products offer limited support for them. Examples include ClickHouse, Apache Druid, and Apache Pinot. It seems that currently only Trino and StarRocks provide robust join support; commercial products tend to do better in this area. I hope the various open-source projects will enhance their join capabilities too.


If you want arbitrarily powerful ad-hoc query support, you need to wait for data to land in an offline warehouse or lake environment where you have access to e.g. Presto/Trino and Spark. If you want a near-real-time view, then you’re going to need to design the data layout around your query pattern: do a streaming join or other enrichment prior to OLAP ingestion.


Yep. There are always going to be constraints about how well a system like clickhouse can support arbitrary joins. Queries in clickhouse are fast because the data is laid out in such a way that it can minimize how much it needs to read.

Part of this is the columnar layout that means it can avoid reading columns that are not involved in the query. However it’s also able to push query predicates into the table scan, using metadata (like bloom filters) that tell it what values are in each chunk of data.

But for joins, you typically end up needing to read all of the data and materialize it in memory.

For real-time joins the best option is to do it in a streaming fashion on ingestion, for example in a system like Flink or Arroyo [0], which I work on.

[0] https://github.com/ArroyoSystems/arroyo


Something I have found pretty annoying is that Flink works great for joining a stream against another stream where messages to be joined are expected to arrive within a few minutes of each other, but there is actually ~no platform solution for joining a small, unchanging or slowly changing table against a stream. We end up needing a service to consume the messages, make RPC calls, and re-emit them with new fields.


Our (Estuary; I'm CTO) streaming transformation product handles this quite well, actually:

https://docs.estuary.dev/concepts/derivations/

Fully managed, UI and config driven. Write SQLite programs using lambdas that are driven by your source events, in order to join data, do streaming transaction processing, and no doubt lots of other things we haven't thought of.


There are a few options here, but I agree this is a weakness with existing systems.

One option in Flink is to load the entire fact table into the pipeline (using the filesystem source or a custom operator) and join against that. This provides good performance, but at the cost of managing additional long-running state in the pipeline (and potentially long startup times). This works pretty well for very small fact tables (stuff like currency conversions, B2B customer data, etc.).

The other option is to store the fact table in a database and query it dynamically. Flink SQL has explicit support for this (called "lookup joins") but this requires careful tuning to not overwhelm your database with high-volume streaming traffic (particularly when doing bootstrapping or recovery).
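For reference, a lookup join in Flink SQL is written as a processing-time temporal join. A rough sketch (orders/customers are hypothetical sources, with customers backed by e.g. a JDBC connector):

  -- each streaming order row triggers a lookup of the current customer row
  SELECT o.order_id, o.amount, c.country
  FROM orders AS o
  JOIN customers FOR SYSTEM_TIME AS OF o.proc_time AS c
    ON o.customer_id = c.id;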

Doing these sorts of joins is a huge use case, and definitely something we're trying to improve in Arroyo.


You can join against a static (or slowly changing) table in Flink, including AS OF joins [1], i.e. joining the correct version of the records from the static table, as of the stream's event time for instance. You need to keep an eye on state size and potential growth of course. It's a common use case we see in particular for users of change data capture at Decodable (decodable.co).

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/de...


I think you're talking about doing denormalization before importing data into an OLAP system, to avoid subsequent joins. However, this greatly limits the flexibility of data modeling. Moreover, denormalization can be a headache-inducing process. In fact, I have tested StarRocks (https://github.com/StarRocks/starrocks), and it is capable of performing joins during streaming data import, and the speed is very fast. It's worth giving it a try.


I've also been frustrated when testing out tools that kinda keep you locked into one predetermined view, table, or set of tables at a time. I made a semantic data modeling library that puts together queries (and of course joins) for you using a drill-across querying technique, and it can also join data across different data sources in a secondary execution layer.

https://github.com/totalhack/zillion

Disclaimer: this project is currently a one man show, though I use it in production at my own company.


zillion looks fantastic. I've wanted for a long time to build something similar for warehousing the education and social science research data I work with, but have found it difficult to spend time building infrastructure as someone currently spending most of their time in academia. What does your company do, if you don't mind me asking? Any interest in expanding into research data management? I'd love to chat with you sometime if you're at all interested... (my contact is in my profile)


Thanks! I'm not sure if I'll have the bandwidth to help yet, but interested to hear about the problem you are facing. I'll reach out this week.


The reason those tools have more limited support for joins is mainly that they are making intentional trade-offs in favor of other features, e.g. performance in a particular domain.


On a related note: PostgreSQL is very good at joins, but MySQL, which had a much larger share at the time, was never very good at them (I last explored this in 2016 and before). A lot of web interfaces for data exploration were then based on MySQL and its quirks, and that colours perspectives a lot.


I’m curious about the limitations you’ve encountered with ClickHouse’s joins; I’ve found them sufficiently robust for the typical operations.


The article omits the "natural join". A natural join automatically joins tables on columns having the same name, and those columns appear exactly once in the resulting output. Natural joins often simplify queries, and they automatically adapt to schema changes in the source tables. They also encourage standardized naming conventions, which has longer-term benefits.
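For example (employees/departments are hypothetical tables, sharing only a dept_id column name):

  -- joins on every column name the two tables share; here just dept_id,
  -- which appears once in the output
  SELECT *
  FROM employees
  NATURAL JOIN departments;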


Natural joins also automatically break your queries when two columns happen to share a name but not the same meaning.

Step 1: use natural join. Life is great. Step 2: someone adds a `comment` field on table A. Life is great. Step 3: someone adds a `comment` field on table B. Ruh roh.

I'll use them in short-lived personal projects, but not on something where I'm collaborating with other people on software that evolves over several years.


It seems like the database should be able to figure this out when a foreign key constraint is explicitly declared in the DDL.


Should, but in practice I've rarely seen FK constraints used in analytics data warehouses (mostly for ETL performance reasons).


A defense against collisions like this is CTEs that select a minimal set of columns, with column names suitably chosen and standardized:

  CTE_A AS (SELECT ... comment as comment_a from A...),
  CTE_B AS (SELECT ... comment as comment_b from B...)


Isn't this a lot more work both for the user and the RDBMS than just using a relatively simple left join?


I feel like natural joins simplify writing queries, but not exactly reading them (especially if you are not familiar with the database). IMO a good compromise is the USING clause, which acts like a natural join but requires the columns to be named explicitly.
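A sketch of the difference, assuming the same hypothetical employees/departments tables:

  -- natural join: the join columns are implicit
  SELECT * FROM employees NATURAL JOIN departments;
  -- USING: same de-duplicated output column, but the join column is spelled out
  SELECT * FROM employees JOIN departments USING (dept_id);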


Maybe some SQL server should require a token like “USING_WILD_GUESS” when doing a natural join :)


The problem with USING is that it isn't supported everywhere.


> Natural joins also encourage standardized naming conventions, which has longer-term benefits.

This is a very positive spin on "you have to manage column names very rigorously for this strategy to be sustainable".


That’s not a join type, that’s syntactic sugar.


A right join is also syntactic sugar for a left join


Exactly, that's why they shouldn't be used. If you feel the need to use a right join, swap the direction around. And only use inner joins if that is the only join type in the query; otherwise specify the conditions in the where clause. Both serve to significantly lower the mental overhead of dealing with queries.

Natural joins are naturally more implicit, and while SQL tends to be a little verbose, given the significance of data integrity and the difficulty of testing SQL, the trade-off goes very clearly towards being explicit.


> the columns in the resulting output appear exactly once

No, records with the same join-column value are multiplied.


Good material.

Joins can also be categorized by the join algorithm used.

The simplest is the nested loop join. Here one dataset is iterated and, for each row, the matching rows in the other dataset are found by a simple key lookup.

Then there is the merge join. Here two larger datasets are first sorted (e.g. with merge sort) and then aligned linearly.

Then there is the hash join. A hash table is first built from the smaller dataset; the larger dataset is then iterated, and the join is made by a hash lookup into that table.

The difference between the nested loop join and the hash join might be confusing.

In the case of a nested loop join, the second table is not loaded from the storage medium up front; instead, an index is used to look up the location of the records, which costs O(log n) per lookup. For a hash join, the table is loaded and a hash table is built: each lookup is then O(1), but building the hash table is expensive (O(n)) and only worth the cost when the dataset is relatively large.
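You usually don't choose the algorithm yourself; the query planner does. A quick way to see its choice is EXPLAIN (Postgres-style shown; big_table/small_table are hypothetical):

  EXPLAIN
  SELECT *
  FROM big_table b
  JOIN small_table s ON s.id = b.small_id;
  -- the plan contains a node such as "Hash Join", "Merge Join", or "Nested Loop"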


In distributed systems, there's even a broadcast hash join, in which the hash table is not distributed across shards and assigned to workers but copied in full to every worker. That way, the other side of the join need not be sharded and shuffled to align join keys across workers. This strategy can save a lot of time and network bandwidth when one side of the join is small enough to fit into a worker's RAM and the other side is staggeringly massive (e.g., logs). It lets the massive side be processed in place, as it's streamed from storage.


There are lots of other joins not mentioned there.

A popular one with time series is the asof join.

There are also literal joins, which are generalizations of adjacent difference.


meant lAteral sorry


When I saw the term "horizontal joins", I immediately went: what? What's a "vertical join"? Must be a UNION. Scrolled down, and sure enough.

Is there some reason to use non-standard terminology in posts that are trying to be in-depth, authoritative tutorials?


Never heard a union be called a “vertical join” before.


It seems that lots of people independently coin this expression. I did it too a couple of years ago. Just checked Google, and there are lots of hits for this.


It’s like calling a sum an additive product. Doesn’t really make a lot of sense.


Dplyr actually supports some really cool join functionality that I wish were in SQL implementations, including:

- The ability to specify whether your join should be one-to-one, many-to-one, etc., so that R will throw an error instead of quietly returning 100x as many rows as expected (which I've seen a lot in SQL pipelines).

- A direct anti_join function. Much cleaner than the LEFT JOIN ... WHERE b IS NULL idiom used to replicate an anti join in SQL (sketched below).

- Support for rolling joins, e.g. for each user, get the price of their last transaction. Super common, but a pain in SQL since it requires nested subqueries or CTEs (also sketched below).
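Both SQL workarounds, sketched with hypothetical users/transactions tables:

  -- anti join: users with no transactions
  SELECT u.*
  FROM users u
  LEFT JOIN transactions t ON t.user_id = u.user_id
  WHERE t.user_id IS NULL;
  -- rolling-join substitute: latest transaction price per user via a window function
  SELECT user_id, price
  FROM (
    SELECT user_id, price,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
    FROM transactions
  ) ranked
  WHERE rn = 1;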


Polars does this too!


It's funny that the older I get (I'm 40 now), the more it seems like the people around me don't even understand the basics of SQL. In my 20s everyone knew SQL; it was required for almost every job. And it is still very useful knowledge. The difference between left, right, and inner joins is ultra basic. If you don't know that, I would be very worried.


This article is targeted at people using R to analyse data, which is probably more ad-hoc and not in a SQL database. I don’t think the target audience (which might include e.g. scientists working with experiment results) should be expected to know SQL, and even if they do, it’s still useful to understand the dplyr functions.


I see more and more people coming out of college (or through apprenticeships) with miserable SQL skills. Relational databases are boring technology, and people want to work with shiny new things. People don't want to learn it, and they rely on ORMs so they don't have to. The result is miserable performance and reimplementing parts of SQL badly in the application itself.


> Let’s try this again using R.

This should actually be "Let's try this again using dplyr, one of the most elegant pieces of software ever written".

Hadley Wickham is a treasure to humanity.


asof join:

https://code.kx.com/q/ref/asof/

https://code.kx.com/q/learn/brief-introduction/#time-joins

https://clickhouse.com/docs/en/sql-reference/statements/sele...

https://duckdb.org/docs/guides/sql_features/asof_join.html

> Do you have time series data that you want to join, but the timestamps don’t quite match? Or do you want to look up a value that changes over time using the times in another table?

DuckDB blog post on temporal joins:

https://duckdb.org/2023/09/15/asof-joins-fuzzy-temporal-look...

Note: this idea is very useful for timestamps but can also be used for other types of values.
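In DuckDB syntax, an asof join looks roughly like this (trades/quotes are hypothetical tables):

  -- for each trade, the most recent quote at or before the trade's timestamp
  SELECT t.symbol, t.ts, t.price, q.bid
  FROM trades t
  ASOF LEFT JOIN quotes q
    ON t.symbol = q.symbol AND t.ts >= q.ts;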


The ASOF LEFT JOIN is the main join used to create training data, where the labels are in the left table and the features are in the tables on the RHS. Note: we use DuckDB for its left asof join capabilities, but most vendors use Spark, which is hideously slow for left asof joins.


From the opening of the "Vertical Joins" section:

> Similar to horizontal joins, there are many use cases for joining data horizontally, also called appending data.

Shouldn't this read "joining data vertically"? This seems like a typo.


This is definitely a typo! Thanks for catching this!


Beginners talking about SQL joins. What is this newbienews or hackernews? "But but ChatGPT says"...

Disgusting


I like reading well-written articles about basic things, even as a senior dev. My job involves mentoring people, and it is always good to have some articles at hand to give them if they struggle with a topic. Moreover, it is interesting to see how the concepts are taught. I like to give people several explanations of the same thing in the hope that at least one sticks.


Someone even coined the term vertical join!

Anyway, I declare and coin that VLOOKUP is a LEFT JOIN



