It would be useful if the data were loaded into different RDBMSs and their execution plans compared. My understanding is that MySQL has the worst query planner and Oracle the best. MSSQL sits somewhere in between, near PgSQL but better in a few areas.
I think MySQL basically gives up optimizing the moment you ask it to do a correlated subquery.
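The classic trouble case (table and column names here are just for illustration) is an IN subquery, which older MySQL versions would re-execute as a dependent subquery for every outer row:

-- historically rewritten by MySQL into a per-row dependent subquery
select name
from dashboards
where id in (select dashboard_id
             from time_on_site_logs
             where user_id = 42);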
I use MSSQL at work and have given it some really dumb schemas and really dumb queries, and it still manages to pick good plans. It wouldn't surprise me if MSSQL generated the "optimal" plan from the first query.
It doesn't choose good plans if you break 3NF by a long shot or avoid the most obvious indexes. Otherwise the only thing to watch out for is stale stats and parameter sniffing.
Oracle 12c has adaptive plans which would be a great addition to MSSQL in the fight against parameter sniffing. Basically it figures out that it's probably got the wrong plan ("this nested loop join stopped making sense about 10,000 rows ago") and adjusts it accordingly.
Again, wouldn't be surprised if Oracle got it "right" from the first query.
For all its faults, this is one thing I like about Oracle. No matter how many levels of nested correlated subqueries or other convoluted junk I throw at it, it (almost) always untangles the query into a nicely optimized plan. Of course, this makes it less likely I'll spend time actually thinking about how the query could be cleaned up and streamlined. Which means if I tried to move over to MSSQL or MySQL my queries would probably take forever or fail completely. Ergo, I'll stick to Oracle... ?
I agree with everything you've said, however I want to add that the general technique of compressing your data through pre-aggregation is a sound one, whether it is performed auto-magically by the optimizer, or manually through whatever hooks your DBMS provides.
At least with Postgres (and MySQL to some degree, but Postgres is probably a better launching point), you could hack the query optimizer to handle situations like this if you really wanted to. I like to think this is how Postgres would like you to approach the problem, instead of creating a less comprehensible query.
I cover my biggest problem with these benchmarks and your queries in this [1] post.
If dashboards.name is unique then you'd have a unique index/constraint on it. Doing this in the environment I set up causes SQL Server to generate identical execution plans for all three queries.
If dashboards.name is not unique then your query's group by name doesn't make sense. I'll just quote what I wrote verbatim:
> No online retailer is looking at their sales reports and thinking that John Smith is their best customer and Amorette McFredson is their worst.
> ... I don’t know whether the query that was chosen for optimization because it’s good for demonstrating the technique on PgSQL or because it represents a query that is actually in use. I don’t like group by name without knowing whether dashboard names are unique.
> With that said, if it’s not unique then the query is arguably “wrong” because it aggregates results for distinct dashboard names rather than distinct dashboards.
> Removing the unique index on dashboards.name does change the execution plan of the first query – it’s slightly worse than the other two (which remain identical). “Fixing” the first query to group by dashboards.id and dashboards.name causes the better plan to be generated again.
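For reference, the "fixed" version of the first query looks something like this (column names follow the queries quoted elsewhere in the thread):

select dashboards.id, dashboards.name, count(distinct time_on_site_logs.user_id) as ct
from time_on_site_logs
join dashboards on time_on_site_logs.dashboard_id = dashboards.id
group by dashboards.id, dashboards.name
order by ct desc;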
Historically the MySQL query planner has been pretty bad, although it's improved significantly in the past couple years. It was particularly bad dealing with subqueries as you mentioned, although several of the performance issues have been solved in MySQL 5.6.
Yup, this. Of course, it hurts that there are probably less than 10 people in the world who fully understand the optimizer (and the majority are employed by Monty); though I'm sure the MySQL folks at Oracle are getting there.
This is why heavy ORMs like Hibernate are only good for the most simplistic (and tiny data) use cases. Every situation I've been in where queries can't be optimized appropriately for specific use cases like this has involved engineers who "don't do SQL" and have made heavy use of something like Hibernate. Then they scratch their heads when dashboards take minutes to load.
The next step is usually to jump into a column-oriented data store, document store, or some other technology they understand even less than SQL.
You seem to be calling out Hibernate as the problem, but really the problem is the developers that either don't understand or refuse to interact with a major component of their application (the SQL database). A good ORM will make CRUD actions quick to implement and will get out of the way when more complex queries are needed.
True. Hibernate isn't really the issue - more that the existence of Hibernate sometimes creates lazy engineering. I guess I'm old school - my preference (even with simple CRUD) is to build the queries myself, if nothing else to avoid object models in code being translated into often-flawed data schemas by the ORM (if you happen to let your ORM create database objects for you).
Even with simple CRUD I've had to unwind faulty implementations where a smallish (50GB) database was being eagerly fetched into memory because of poor ORM config and overuse of FK relationships - when a user did something as simple as logging into the app.
And I'm not talking junior developers - some of these guys were 10+ years at Amazon.
I've been rattling my cage about this problem for a few years now. There's room and need for an "ORM/DBA" position in our industry. Most of the common production packages are both complex and flawed enough to justify a full time position, and yet I've seen so many companies lose hours of productivity floor-wide because the dev group's [N]Hibernate knowledge has been pressed beyond its limits.
Hibernate can work quite well with MyBatis/iBATIS in a CQRS pattern. Let Hibernate eliminate all the crap CRUD you don't want to write, while using tuned SQL via xBatis to do those complex analytic fetches. Works well for us. :)
In postgres, distinct aggregates are a blind spot in the planner.
DISTINCT itself is handled well, e.g. "SELECT DISTINCT x, y ...".
But the distinct aggregates are not really seen by the planner, and they are always implemented with a sort. So this is a peculiar case for postgres.
One reason is that distinct aggregates can be more complicated to plan when you have a case like "SELECT COUNT(distinct x), COUNT(distinct y) ...", because you actually need two different distinct sets as input.
Some databases have solutions for that problem, but postgres does not.
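To make the blind spot concrete, one manual workaround is to pre-aggregate each column yourself so that no distinct aggregate is left for the planner to punt on (t, x and y are placeholders):

select
  (select count(*) from (select x from t group by x) dx) as distinct_x,
  (select count(*) from (select y from t group by y) dy) as distinct_y;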
This particular change, in its complete form, would be challenging in my opinion. Many other kinds of planner changes aren't so bad, but people (including me) tend to avoid them.
Executor changes are much less intimidating and lots of people make executor changes. In my opinion, the executor is the most approachable/easiest major component.
I would encourage people to not get intimidated though. For every project where I was determined enough, I succeeded. Tom Lane does especially amazing things, of course, but I think reasoning like "I'm not Tom Lane, therefore the planner is too hard" is the wrong lesson to take away.
Yeah, in theory the database is supposed to take care of things like this for you.
In practice, every time I've seen a slow DB query, I've always been able to get huge improvements on it, even though the DB in theory had all the knowledge necessary to realize that I don't care about the intermediate results.
FYI, in MSSQL at least, distinct and group by are entirely equivalent to the query planner. I'm not saying you're wrong though, since group by lets you explicitly pick which columns to go by, whereas distinct uses them all.
I don't understand what you mean by "group by lets you explicitly pick which columns to go by, whereas distinct uses them all". Can you give an example?
select userid, name from users group by userid, name
is equivalent to
select distinct userid, name from users
It's easy to add another column to the "distinct" based query and change the behavior accidentally. With the "group by" query, you'd get an error saying that the new column needs to be either aggregated or added to the group by.
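To illustrate (created_at is a made-up column):

-- silently de-duplicates on all three columns now:
select distinct userid, name, created_at from users;

-- errors out instead, forcing you to decide what you actually meant:
select userid, name, created_at from users group by userid, name;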
SELECT COUNT(DISTINCT ...) is a very different thing. It's counting distinct values of a particular column in a result set and totally not a square peg/round hole situation.
I've never heard DISTINCT described as imperative.
It still leaves room to determine how and when to achieve the DISTINCT property, perhaps even doing no work at all.
* It can do a sort+unique or a hash, depending on cardinality, availability of memory, and a few other factors.
* It can push down a distinct or pull up a distinct, resulting in a faster plan.
* If there is another part of the plan that already implies distinct, like a GROUP BY, then it doesn't need to do any work at all. (Or similarly, if you're doing a distinct on a primary key).
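For example (names follow the queries quoted elsewhere in the thread), the DISTINCT here adds nothing because the GROUP BY already guarantees one row per dashboard_id, so a planner is free to skip the de-duplication entirely:

select distinct dashboard_id, count(*)
from time_on_site_logs
group by dashboard_id;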
The declarative language promise was that it would be possible to perform such optimizations. Indeed, by-and-large, query planners do a pretty great job. I don't think it was ever a promise that query planners would be able to always find the fastest query for what you want.
I read the declarative language promise as you will never have to know what's going on under the hood. This is never true in any non-trivial declarative language being used for non-trivial work. I have developed a very negative initial reaction to anything that claims to be "declarative".
Likewise. The more declarative a platform is, the more I'm trusting its abstractions to never leak, because when they do I'm going to have to learn every agonizing minute detail of how it works under the hood.
SQL is a tiny kernel of beautiful relational mathematics wrapped up in a 3-foot ball of ugly implementation hacks.
NoSQL is what happens when you throw out the beautiful kernel and just keep the hacks, so it's not really better I guess.
I don't think it is a matter of intelligence, just skillset. Query planners could go a long way with a few people with strong relational logic skills who could prove logical equivalence of expressions.
There are some settings worth tweaking in postgres' config to improve the query planner. The one that made the biggest difference to us was "cpu_tuple_cost". Setting it to 0.2 (the default is 0.01) made it choose more optimal query plans more often.
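If anyone wants to experiment, it can be tried per-session before touching postgresql.conf (0.2 is just the value that worked for us):

set cpu_tuple_cost = 0.2;  -- default is 0.01
show cpu_tuple_cost;
-- to make it permanent, set cpu_tuple_cost = 0.2 in postgresql.conf and reload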
Huh, interesting! I just looked up the parameter - "the planner's estimate of the cost of processing each row during a query". So increasing this will tend to cause the query planner to forgo complete table scans in favor of indexed queries more often. Increasing the parameter by a factor of 20 seems like you're telling it "never do complete table scans unless it's completely unavoidable, always use an index".
I can see why this would make the query plans look better (in that index scans make more sense to humans) but are you sure you're actually getting better performance? If you are getting better performance, why is the default so wrong? Computers are really fast at sequential scans, less so at random-access stuff.
"Computers are really fast at sequential scans, less so at random-access stuff."
This is true if you're simply talking about raw IO throughput, but in terms of algorithmic searching the truth is almost exactly the opposite. For example, to find a record in a 100 million record table with a unique key using sequential search would require on average a scan of 50 million records. To access the same record when that key is indexed requires accessing/testing a maximum of about 27 entries in the index (2^27 is roughly 130 million) and then direct access of the exact record needed in the table. Indexes are used because they turn a linear scan into a logarithmic lookup, even though it may be true that random access IO is much slower than sequential IO.
People, I suspect, also have much higher throughput when scanning sequentially compared to random access of items. But imagine trying to find the single person in a copy of the Manhattan white pages (is there still such a thing?) with the name 'Horace Greeley'. Would you prefer to scan sequentially from the beginning (no use of an index) or make use of the fact that white page phone books are indexed by lastname, firstname?
Of course, because sequential scanning is so much faster than random access it will always be faster on tiny tables. It will depend on the system and record size, but I'm guessing indexing will take over as the faster method as table size grows beyond 5k records or so, give or take an order of magnitude. Not sure of exact numbers but you get the idea. As the number of records goes up from there, the speed advantage of indexing over scanning keeps growing (roughly n versus log n).
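A minimal Postgres-flavored sketch of the phone book comparison (the table and names are made up):

-- made-up table standing in for the white pages
create table phonebook (id bigserial primary key, lastname text, firstname text);

-- with no index on the name, this lookup has to scan the whole table:
explain select * from phonebook where lastname = 'Greeley' and firstname = 'Horace';

-- a b-tree index turns it into a handful of page accesses instead:
create index phonebook_name_idx on phonebook (lastname, firstname);
explain select * from phonebook where lastname = 'Greeley' and firstname = 'Horace';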
The default is configured for magnetic disks, where random IO is brutally slow (on the order of 30-100 operations per second) and sequential reads are reasonably fast. But what if you're running it on a Micron P420m with 700,000 IOPS?
It's worth noting that SQL Server, presuming you've configured uniques and relationships correctly, actually runs exactly the same query plan for their first query and their last query.
If the author is reading please post the EXPLAIN ANALYZE results for these queries. I'd much prefer the raw text psql output to the GUI representation in pgAdmin. More importantly it has the actual numbers for the intermediate steps to see which is the culprit.
Also, any idea how many times each variation was executed? 10M rows is not that big. At 1K per row (which is probably an overestimate) that comes out to at most 10 GB, and realistically much less. That can easily fit in memory. After the first full scan all of it would come from the cache. It's not a fair comparison unless you flush the cache between tests.
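For reference, the kind of output I mean comes straight from psql with something like this (column names follow the queries quoted elsewhere in the thread):

explain (analyze, buffers)
select dashboards.name, count(distinct time_on_site_logs.user_id) as ct
from time_on_site_logs
join dashboards on time_on_site_logs.dashboard_id = dashboards.id
group by dashboards.name
order by ct desc;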
OP here. It's no EXPLAIN ANALYZE, but here are the EXPLAINs at least. The originals were averages of 10 runs. These EXPLAINs are just quick runs on my laptop, so no promises. :)
My first thought when reading this was that MSSQL (which I'm most experienced with) would automatically optimize this query. I'd think postgres can too. If I saw this happen in MSSQL I'd start by clearing the statistics for the table.
My second thought was that 20 seconds is a long time for a query and there's probably a better solution using a materialized view.
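In Postgres (9.3+) terms, a rough sketch of that materialized view, using the table and column names from the queries quoted elsewhere in the thread:

create materialized view dashboard_user_counts as
select dashboards.name, count(distinct time_on_site_logs.user_id) as ct
from time_on_site_logs
join dashboards on time_on_site_logs.dashboard_id = dashboards.id
group by dashboards.name;

-- refresh on whatever schedule the dashboard can tolerate
refresh materialized view dashboard_user_counts;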
In PostgreSQL, you can use CTEs to force certain subqueries to run before others. That technique would be useful here, and would clean up the code a bit.
But CTEs in SQL Server are not an optimization boundary like they are in Postgres. This makes a big difference in their proper use.
In SQL Server, aside from recursion, CTEs are purely syntactic sugar and thus a tool for organizing code. Whereas in Postgres, due to the optimization boundary, they are a tool for query tuning. If you rewrite a query in Postgres using CTEs just to clean it up, you are doing it wrong, since the CTEs will also affect the planner.
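As a concrete example of using that boundary deliberately in Postgres (names follow the queries quoted elsewhere in the thread), this forces the per-dashboard aggregation to run before the join:

with per_dashboard as (
  select dashboard_id, count(distinct user_id) as ct
  from time_on_site_logs
  group by dashboard_id
)
select dashboards.name, per_dashboard.ct
from per_dashboard
join dashboards on per_dashboard.dashboard_id = dashboards.id
order by per_dashboard.ct desc;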
Right, but that's assuming that performance is the most pressing constraint. Sometimes it's confidence in the query producing the right result, or maintenance further down the line.
Anyone have the table layouts they are using? I was going to attempt to run them through Visual Explain which is the tool I use on the iSeries which employs DB2.
A simple analysis over a very rotten table, doing a count(distinct) over a very varied field resulted in
Table Scan -> Temp Distinct Hash Table -> Hash Scan -> Final Results (330.022 ms) for a table with no keys and slightly over three million records.
I would love to run the same exact statements he is, but I need the table layout to replicate them. I am curious whether our analyzer would see a difference; it is really hard to "trick" or beat it.
Amazon RedShift (based on Postgres) has an approximate function that can be used to return results faster while calculating distinct values from a large table, with a relative error of around 2% (http://docs.aws.amazon.com/redshift/latest/dg/r_COUNT.html). To give some perspective, we ran a test query on a RedShift table with > 900 million records to fetch unique user IDs. With the approximate function applied, the results were returned in 8 seconds, while the usual distinct query took 51 seconds, roughly 84% faster!
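For reference, the syntax is just a prefix on the aggregate (the table and column names here are ours):

select approximate count(distinct user_id)
from events;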
Is there any relational logic reasoning that would prove that the query planner couldn't theoretically optimize the first query into the execution plan of the third?
I wonder how efficiently the ORMs out there are doing. My belief is that once performance profiling is done and the crucial bottleneck is identified, people will replace the original ORM call with a custom SQL query.
But heck, sometimes the DB language drivers (like pymongo or python-mysql - just giving examples of what I mean by language driver) are not optimized either. :) So a careful post-profiling analysis is required.
With SQLAlchemy, I have done similar optimizations against mssql by changing a few lines of ORM code. Without SQLAlchemy, I imagine I'd have to change dozens of hand written SQL queries.
A good ORM helps you to generate the exact SQL you need.
I am really surprised that you are using SA as an ORM on MSSQL (sorry, I don't touch MSSQL :) and most people I deal with are either MySQL or Postgres users).
> A good ORM helps you to generate the exact SQL you need.
Right, but sometimes it's the driver that behaves badly. (One of those days when I claim to have free time I should dig deeper and do a good performance analysis... one day...)
Title Case is still very common in the USA (see http://www.nytimes.com/ for an example, their house style is to use it for all of their headlines). You hardly ever see it in the UK these days.
If you're someone who doesn't know what they're doing, then this wouldn't seem quite obvious, would it?
So this post is probably not meant for you, but rather for people that are not as experienced in SQL.
I love that they're discussing this, and kudos to them, but unless the specific details are described, it's hard to really take much from it. For example, "and there are lots more log lines than dashboards and users" - what does that mean? That dashboard_id is nullable and many rows log entirely different things? That would be a significant problem unto itself. And what is the cardinality of users, dashboards, and sets thereof?
As a general recommendation I advise database developers to avoid exactly this sort of "outsmart the planner" behavior because it paints you into a corner -- data cardinality changes, schemas change, and you have large, clever queries that turn into a significant net detriment. They can absolutely be necessary in some cases, but this does not look like one of them.
"As a general recommendation I advise database developers to avoid exactly this sort of "outsmart the planner""
Probably wise in general. This particular case is peculiar because postgres doesn't optimize distinct aggregates at all (see my other comment: https://news.ycombinator.com/item?id=7115707 ), so it's possible that the rewritten query is actually more optimizable than the original.
The original query won't adapt to changing data sizes or distributions as well as the rewritten one.
This is not fundamental, nor universal. It's a specific limitation of postgres, so I still agree with your general advice. And postgres may fix this, so your advice may prove to be true even in this situation in the long run.
I think this was more an exposition on the general technique than specific recommendation.
The general theme should be, "how do I make my database do less work?" followed by a list of techniques, including "pre-aggregate to reduce volume". In this case, the pre-aggregates are in-line. Whether in-line pre-aggregates are appropriate will depend on context, just as you say.
OP - Can you please run your original query with a minor variation-
select
dashboards.name,
count(distinct time_on_site_logs.user_id)
from time_on_site_logs
join dashboards on time_on_site_logs.dashboard_id = dashboards.id
group by time_on_site_logs.dashboard_id, dashboards.name
order by count desc
Do you have a FK from dashboard_id to dashboards.id by chance? I assume no, as the query is presumably still forced to do an early existence test of the inner join before aggregating.
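If it isn't there, the constraint I'm asking about would look something like this (the constraint name is mine):

alter table time_on_site_logs
  add constraint fk_logs_dashboard
  foreign key (dashboard_id) references dashboards (id);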
I disagree; the article is essentially about manual query optimization. Theoretically, a great optimizer could do this by itself, but the fully optimized query will run the same way pretty much everywhere; query execution engines are not very different, basically grace hash joins and B-trees.
> the fully optimized query will run the same way pretty much everywhere
This is not correct. The query will still be run through an optimizer, and the optimizer will still pick its own path of execution based on indexes and data uniqueness (as identified from the table metadata or its own dives into the data).
There are ways to force an optimizer to take a certain path, but those are dependent on which DB you're using.
Unfortunately, the original dataset doesn't exist online, so we can't verify their claims that it really is database agnostic.
And then there are table statistics or fragmented indexes that cause the optimizer to choose a different path. Even when you've done everything and you think the optimizer should choose a particular index it will surprise you.
As I detailed elsewhere, the problem with this assumption is that you are one-off optimizing by encoding metadata about the data in the query. So while this specific query has been made more optimal, the underlying problem with the database remains, and thus you will invariably face and then have to work around the same problem numerous times.
Or more often one won't, because the legacy of hundreds of slow but not terrible queries accumulates. It becomes death by a thousand cuts, which is the end result of many heavily used databases.
My example in SQL Server sees the simple addition of a single unique index to give that same metadata to the query planner, yielding the identical plan for the first and last. And this same benefit will flow to every query that follows the same path.
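Concretely, that's a one-liner (T-SQL; the index name is mine, the table is from the article):

create unique index ux_dashboards_name on dashboards (name);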