Hacker News

Instead of joining tables and using DISTINCT or similar to filter rows, consider using subquery "columns", i.e. in the SELECT list.

What does this mean? Running

    SELECT
      column1,
      (
        SELECT column2, column3, ...
        FROM table_b
        WHERE table_a.id = table_b.a_id
      )
    FROM table_a
results in "subquery must return only one column", as I expected. Do you mean returning the multiple columns as a record / composite type?
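Here's the error reproduced end to end (a sketch using SQLite via Python's sqlite3 module; the exact error wording varies by engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, column1 TEXT);
    CREATE TABLE table_b (a_id INTEGER, column2 TEXT, column3 TEXT);
    INSERT INTO table_a VALUES (1, 'x');
    INSERT INTO table_b VALUES (1, 'y', 'z');
""")

# A multi-column subquery in the SELECT list is rejected...
err = None
try:
    conn.execute("""
        SELECT column1,
               (SELECT column2, column3 FROM table_b
                WHERE table_a.id = table_b.a_id)
        FROM table_a
    """)
except sqlite3.OperationalError as e:
    err = str(e)
print(err)  # e.g. "sub-select returns 2 columns - expected 1"

# ...while a single-column (scalar) subquery works fine.
rows = conn.execute("""
    SELECT column1,
           (SELECT column2 FROM table_b
            WHERE table_a.id = table_b.a_id)
    FROM table_a
""").fetchall()
print(rows)  # [('x', 'y')]
```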

Keep in mind GROUP BY clause usually dictates index use.

The reason for this wasn't immediately apparent to me. For those who were curious, this blog post walks through it step by step: https://www.brentozar.com/archive/2015/06/indexing-for-group...




> > Keep in mind GROUP BY clause usually dictates index use.

> The reason for this wasn't immediately apparent to me.

The key thing to remember is that grouping is essentially a sorting operation, and it happens before your other sorts (that last part isn't necessarily as obvious).
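A quick way to see this for yourself (a sketch using SQLite from Python; other engines expose the same idea through their own plan output):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")

def plan(sql):
    # the 4th column of EXPLAIN QUERY PLAN output is the human-readable step
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT a, COUNT(*) FROM t GROUP BY a"

plan_before = plan(query)
print(plan_before)  # mentions USE TEMP B-TREE FOR GROUP BY: an explicit sort

conn.execute("CREATE INDEX idx_a ON t (a)")
plan_after = plan(query)
print(plan_after)   # scans idx_a instead; no temp b-tree, no sort
```

Because the index already delivers rows in grouping order, the sort step disappears from the plan.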


Sorry, I was on mobile so didn't have the patience to type examples.

    SELECT
      column1,
      (
        SELECT column2
        FROM table_b
        WHERE table_a.id = table_b.a_id
      ) as b_column2,
      (
        SELECT column3
        FROM table_b
        WHERE table_a.id = table_b.a_id
      ) as b_column3
    FROM table_a
It might look like a lot more work, but in my experience it's usually a lot faster. YMMV but check it.
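As a sanity check that the two shapes agree (a sketch in SQLite via Python, with made-up data): when table_b has at most one row per key, the subquery form returns the same rows as the equivalent LEFT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, column1 TEXT);
    CREATE TABLE table_b (a_id INTEGER, column2 TEXT, column3 TEXT);
    INSERT INTO table_a VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO table_b VALUES (1, 'b2', 'b3');
""")

# The scalar-subquery shape from the comment above:
subquery_rows = conn.execute("""
    SELECT column1,
           (SELECT column2 FROM table_b
            WHERE table_a.id = table_b.a_id) AS b_column2,
           (SELECT column3 FROM table_b
            WHERE table_a.id = table_b.a_id) AS b_column3
    FROM table_a
""").fetchall()

# The equivalent LEFT JOIN (unmatched rows get NULLs either way):
join_rows = conn.execute("""
    SELECT a.column1, b.column2, b.column3
    FROM table_a a LEFT JOIN table_b b ON a.id = b.a_id
""").fetchall()

print(subquery_rows == join_rows)  # True: same result, different plans
```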


How well that performs compared to a JOIN can vary massively depending on the data sizes of table_a and table_b, how table_b is indexed, and what else is going on in the query.

If table_b has an index on (a_id, column2, column3) (or on a_id INCLUDEing column2, column3) I would expect the equivalent JOIN to usually be faster. If you have a clustered index on id (which is the case more often than not in MS SQL Server and MySQL/InnoDB) then that would serve here too, unless the table is much wider than those three columns (so the narrower index would fit many more rows per page than the base data).

Worst (and fairly common) case with sub-selects like that is the query planner deciding to run each subquery once per row from table_a. This is not an issue if you are only returning a few rows, or just one, from table_a, but in more complex examples (perhaps if this fragment is a CTE or view that is joined in a non-sargable manner so filtering predicates can't push down) you might find a lot more rows are processed this way even if few are eventually returned due to other filters.

There are times when the method is definitely faster but be very careful with it (test with realistic data sizes and patterns) because often when it isn't, it really isn't.


> perhaps if this fragment is a CTE or view

Yeah I guess I should have specified that this technique usually works best when done in the outer query, not buried deep inside.

It can be particularly effective if you fetch partial results, ie due to pagination or similar.

That said, these things aren't set in stone. I shared my experience, but my first tip goes first :)


You have to be careful here that a one-to-many relationship doesn't exist and return more than one row -- it'll cause an error and halt your query


Yes, as I noted.

Frequently this is trivial, sometimes it's not.

If there will be multiple hits but it doesn't matter that much, there's the obvious TOP 1 or MIN(col) and such.

It's a tradeoff between accidentally breaking the query and returning unexpected data.

Note that if you used a join you could have bigger issues: the join would succeed, but now you'd get multiple rows where you didn't expect them.
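To make the contrast concrete (a sketch in SQLite via Python; note SQLite is lax here and silently picks one row for a multi-row scalar subquery, whereas PostgreSQL and SQL Server raise an error):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, column1 TEXT);
    CREATE TABLE table_b (a_id INTEGER, column2 TEXT);
    INSERT INTO table_a VALUES (1, 'a1');
    -- two matching rows: a one-to-many relationship
    INSERT INTO table_b VALUES (1, 'first'), (1, 'second');
""")

# The join silently fans out: one output row per match.
join_rows = conn.execute("""
    SELECT a.column1, b.column2
    FROM table_a a JOIN table_b b ON a.id = b.a_id
""").fetchall()
print(len(join_rows))  # 2

# Collapsing the subquery with MIN() makes it deterministic again.
min_rows = conn.execute("""
    SELECT column1,
           (SELECT MIN(column2) FROM table_b
            WHERE table_a.id = table_b.a_id)
    FROM table_a
""").fetchall()
print(min_rows)  # [('a1', 'first')]
```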


Are there any tools or tips to help speed up the "which JOIN is duplicating data" hunt?

Usually my biggest problem is getting all the query parameters lined up to reproduce the issue! (Being able to flip on extended logging or a profiler can make this easy.)

Cutting result columns while disabling JOINs one at a time to narrow it down is straightforward, but tracking down where those JOINs' columns are used in WHERE clauses quickly tends not to be.


Good question. Obviously a profiler or similar that can capture the details when it happens helps, as you note.

If I can reproduce the issue, what I tend to do is include the unique id column from each joined table (we try to avoid natural keys).

If it doesn't have a unique id column, I replace the join with a subquery that includes ROW_NUMBER(), so I can see which one doesn't repeat.

But without being able to replicate, I don't know of any better way than just studying the ON conditions carefully.
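As a sketch of the ROW_NUMBER() trick (SQLite via Python; window functions need SQLite 3.25+, and the orders/shipments tables are made up for illustration): rows whose row number exceeds 1 point straight at the join that is fanning out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, ref TEXT);
    CREATE TABLE shipments (order_id INTEGER, carrier TEXT);
    INSERT INTO orders VALUES (1, 'A'), (2, 'B');
    -- order 1 accidentally has two shipments, so the join duplicates it
    INSERT INTO shipments VALUES (1, 'dhl'), (1, 'ups'), (2, 'ups');
""")

rows = conn.execute("""
    SELECT o.id, o.ref, s.carrier, s.rn
    FROM orders o
    JOIN (
        SELECT order_id, carrier,
               ROW_NUMBER() OVER (PARTITION BY order_id
                                  ORDER BY carrier) AS rn
        FROM shipments
    ) s ON o.id = s.order_id
""").fetchall()

# any rn > 1 identifies the joined table producing the duplicates
dupes = [r for r in rows if r[3] > 1]
print(dupes)  # [(1, 'A', 'ups', 2)]
```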


Would a cross apply accomplish the same result without the risk of multiple rows?

    CROSS APPLY (SELECT TOP 1 ...) x



