> > Keep in mind GROUP BY clause usually dictates index use.
> The reason for this wasn't immediately apparent to me.
The key thing to remember is that grouping is essentially a sorting operation, and it happens before your other sorts (that last part isn't necessarily as obvious).
Sorry, was on mobile so hadn't patience to type examples.
SELECT
column1,
(
SELECT column2
FROM table_b
WHERE table_a.id = table_b.a_id
) as b_column2,
(
SELECT column3
FROM table_b
WHERE table_a.id = table_b.a_id
) as b_column3
FROM table_a
It might look like a lot more work, but in my experience it's usually a lot faster. YMMV but check it.
How well that performs compared to a JOIN can vary massively depending on the data sizes of table_a & tale_b, how table_b is indexed, and what else is going on in the query.
If table_b has an index on id,column2,column3 (or on id INLUDEing column2,column3) I would expect the equivalent JOIN to usually be faster. If you have a clustered index on Id (which is the case more often than not in MS SQL Server and MySQL/InnoDB) then that would count for this unless the table is much wider than those three columns (so the index with its selective data would get many rows per page more than the base data).
Worst (and fairly common) case with sub-selects like that is the query planner deciding to run each subquery one per row from table_a. This is not an issue if you are only returning a few rows, or just one, from table_a, but in more complex examples (perhaps if this fragment is a CTE or view that is joined in a non-sargable manner so filtering predicates can't push down) you might find a lot more rows are processed this way even if few are eventually returned due to other filters.
There are times when the method is definitely faster but be very careful with it (test with realistic data sizes and patterns) because often when it isn't, it really isn't.
Are there any tools or tips to help speed up the "which JOIN is duplicating data" hunt?
Usually my biggest problem is getting all the query parameters lined up to reproduce the issue! (Being able to flip on extended logging or a profiler can make this easy.)
Cutting out the result columns when disabling JOINs to narrow it down is straightforward but tracking columns down in WHERE clauses quickly tends not to be.
What does this mean? Running
Results in "subquery must return only one column" as I expected. You mean returning the multiple columns as a record / composite type?Keep in mind GROUP BY clause usually dictates index use.
The reason for this wasn't immediately apparent to me. For those who were curious, this blog post walks through it step by step: https://www.brentozar.com/archive/2015/06/indexing-for-group...