It would be useful if the data were loaded into different RDBMSs and their execution plans compared. My understanding is that MySQL has the worst query planner and Oracle the best. MSSQL sits somewhere in between, near PgSQL but better in a few areas.
I think MySQL basically gives up optimizing the moment you ask it to do a correlated subquery.
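The classic trouble case (table and column names here are just for illustration) is an IN subquery, which older MySQL versions would re-execute as a dependent subquery for every outer row:

-- historically rewritten by MySQL into a per-row dependent subquery
select name
from dashboards
where id in (select dashboard_id
             from time_on_site_logs
             where user_id = 42);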
I use MSSQL at work and have given it some really dumb schemas and really dumb queries, and it still manages to pick good plans. It wouldn't surprise me if MSSQL generated the "optimal" plan from the first query.
It doesn't choose good plans if you break 3NF by a long shot or avoid the most obvious indexes. Otherwise the only thing to watch out for is stale stats and parameter sniffing.
Oracle 12c has adaptive plans which would be a great addition to MSSQL in the fight against parameter sniffing. Basically it figures out that it's probably got the wrong plan ("this nested loop join stopped making sense about 10,000 rows ago") and adjusts it accordingly.
Again, wouldn't be surprised if Oracle got it "right" from the first query.
For all its faults, this is one thing I like about Oracle. No matter how many levels of nested correlated subqueries or other convoluted junk I throw at it, it (almost) always untangles the query into a nicely optimized plan. Of course, this makes it less likely I'll spend time actually thinking about how the query could be cleaned up and streamlined. Which means if I tried to move over to MSSQL or MySQL my queries would probably take forever or fail completely. Ergo, I'll stick to Oracle... ?
I agree with everything you've said, however I want to add that the general technique of compressing your data through pre-aggregation is a sound one, whether it is performed auto-magically by the optimizer, or manually through whatever hooks your DBMS provides.
At least with Postgres (and MySQL to some degree, but Postgres is probably a better launching point), you could hack the query optimizer to handle situations like this if you really wanted to. I like to think this is how Postgres would like you to approach the problem, instead of creating a less comprehensible query.
I cover my biggest problem with these benchmarks and your queries in this [1] post.
If dashboards.name is unique then you'd have a unique index/constraint on it. Doing this in the environment I set up causes SQL Server to generate identical execution plans for all three queries.
If dashboards.name is not unique then your query's group by name doesn't make sense. I'll just quote what I wrote verbatim:
> No online retailer is looking at their sales reports and thinking that John Smith is their best customer and Amorette McFredson is their worst.
> ... I don’t know whether the query that was chosen for optimization because it’s good for demonstrating the technique on PgSQL or because it represents a query that is actually in use. I don’t like group by name without knowing whether dashboard names are unique.
> With that said, if it’s not unique then the query is arguably “wrong” because it aggregates results for distinct dashboard names rather than distinct dashboards.
> Removing the unique index on dashboards.name does change the execution plan of the first query – it’s slightly worse than the other two (which remain identical). “Fixing” the first query to group by dashboards.id and dashboards.name causes the better plan to be generated again.
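For reference, the "fixed" version of the first query looks something like this (column names follow the queries quoted elsewhere in the thread):

select dashboards.id, dashboards.name, count(distinct time_on_site_logs.user_id) as ct
from time_on_site_logs
join dashboards on time_on_site_logs.dashboard_id = dashboards.id
group by dashboards.id, dashboards.name
order by ct desc;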
Historically the MySQL query planner has been pretty bad, although it's improved significantly in the past couple years. It was particularly bad dealing with subqueries as you mentioned, although several of the performance issues have been solved in MySQL 5.6.
Yup, this. Of course, it hurts that there are probably less than 10 people in the world who fully understand the optimizer (and the majority are employed by Monty); though I'm sure the MySQL folks at Oracle are getting there.
This is why heavy ORMs like Hibernate are only good for the most simplistic (and tiny data) use cases. Every situation I've been in where queries can't be optimized appropriately for specific use cases like this has involved engineers who "don't do SQL" and have made heavy use of something like Hibernate. Then they scratch their heads when dashboards take minutes to load.
The next step is usually to jump into a column-oriented data store, document store, or some other technology they understand even less than SQL.
You seem to be calling out Hibernate as the problem, but really the problem is the developers that either don't understand or refuse to interact with a major component of their application (the SQL database). A good ORM will make CRUD actions quick to implement and will get out of the way when more complex queries are needed.
True. Hibernate isn't really the issue - more that the existence of Hibernate sometimes creates lazy engineering. I guess I'm old school - my preference (even with simple CRUD) is to build the queries myself, if nothing else to avoid object models in code being translated into often-flawed data schemas by the ORM (if you happen to let your ORM create database objects for you).
Even with simple CRUD I've had to unwind faulty implementations where a smallish (50GB) database was being eagerly fetched into memory because of poor ORM config and overuse of FK relationships - when a user did something as simple as logging into the app.
And I'm not talking junior developers - some of these guys were 10+ years at Amazon.
I've been rattling my cage about this problem for a few years now. There's room and need for an "ORM/DBA" position in our industry. Most of the common production packages are both complex and flawed enough to justify a full time position, and yet I've seen so many companies lose hours of productivity floor-wide because the dev group's [N]Hibernate knowledge has been pressed beyond its limits.
Hibernate can work quite well with MyBatis/iBATIS in a CQRS pattern. Let Hibernate eliminate all the crap CRUD you don't want to write, while using tuned SQL via xBatis to do those complex analytic fetches. Works well for us. :)
In postgres, distinct aggregates are a blind spot in the planner.
DISTINCT itself is handled well, e.g. "SELECT DISTINCT x, y ...".
But the distinct aggregates are not really seen by the planner, and they are always implemented with a sort. So this is a peculiar case for postgres.
One reason is that distinct aggregates can be more complicated to plan when you have a case like "SELECT COUNT(distinct x), COUNT(distinct y) ...", because you actually need two different distinct sets as input.
Some databases have solutions for that problem, but postgres does not.
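To make the blind spot concrete, one manual workaround is to pre-aggregate each column yourself so that no distinct aggregate is left for the planner to punt on (t, x and y are placeholders):

select
  (select count(*) from (select x from t group by x) dx) as distinct_x,
  (select count(*) from (select y from t group by y) dy) as distinct_y;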
This particular change, in its complete form, would be challenging in my opinion. Many other kinds of planner changes aren't so bad, but people (including me) tend to avoid them.
Executor changes are much less intimidating and lots of people make executor changes. In my opinion, the executor is the most approachable/easiest major component.
I would encourage people to not get intimidated though. For every project where I was determined enough, I succeeded. Tom Lane does especially amazing things, of course, but I think reasoning like "I'm not Tom Lane, therefore the planner is too hard" is the wrong lesson to take away.
Yeah, in theory the database is supposed to take care of things like this for you.
In practice, every time I've seen a slow DB query, I've always been able to get huge improvements on it, even though the DB in theory had all the knowledge necessary to realize that I don't care about the intermediate results.
FYI, in MSSQL at least, distinct and group by are entirely equivalent to the query planner. I'm not saying you're wrong though, since group by lets you explicitly pick which columns to go by, whereas distinct uses them all.
I don't understand what you mean by "group by lets you explicitly pick which columns to go by, whereas distinct uses them all". Can you give an example?
select userid, name from users group by userid, name
is equivalent to
select distinct userid, name from users
It's easy to add another column to the "distinct" based query and change the behavior accidentally. With the "group by" query, you'd get an error saying that the new column needs to be either aggregated or added to the group by.
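To illustrate (created_at is a made-up column):

-- silently de-duplicates on all three columns now:
select distinct userid, name, created_at from users;

-- errors out instead, forcing you to decide what you actually meant:
select userid, name, created_at from users group by userid, name;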
SELECT COUNT(DISTINCT ...) is a very different thing. It's counting distinct values of a particular column in a result set and totally not a square peg/round hole situation.
I've never heard DISTINCT described as imperative.
It still leaves room to determine how and when to achieve the DISTINCT property, perhaps even doing no work at all.
* It can do a sort+unique or a hash, depending on cardinality, availability of memory, and a few other factors.
* It can push down a distinct or pull up a distinct, resulting in a faster plan.
* If there is another part of the plan that already implies distinct, like a GROUP BY, then it doesn't need to do any work at all. (Or similarly, if you're doing a distinct on a primary key).
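For example (names follow the queries quoted elsewhere in the thread), the DISTINCT here adds nothing because the GROUP BY already guarantees one row per dashboard_id, so a planner is free to skip the de-duplication entirely:

select distinct dashboard_id, count(*)
from time_on_site_logs
group by dashboard_id;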
The declarative language promise was that it would be possible to perform such optimizations. Indeed, by-and-large, query planners do a pretty great job. I don't think it was ever a promise that query planners would be able to always find the fastest query for what you want.
I read the declarative language promise as you will never have to know what's going on under the hood. This is never true in any non-trivial declarative language being used for non-trivial work. I have developed a very negative initial reaction to anything that claims to be "declarative".
Likewise. The more declarative a platform is, the more I'm trusting its abstractions to never leak, because when they do I'm going to have to learn every agonizing minute detail of how it works under the hood.
SQL is a tiny kernel of beautiful relational mathematics wrapped up in a 3-foot ball of ugly implementation hacks.
NoSQL is what happens when you throw out the beautiful kernel and just keep the hacks, so it's not really better I guess.
I don't think it is a matter of intelligence, just skillset. Query planners could go a long way with a few people with strong relational logic skills who could prove logical equivalence of expressions.
There are some settings worth tweaking in postgres' config to improve the query planner. The one that made the biggest difference to us was "cpu_tuple_cost". Setting it to 0.2 (the default is 0.01) made it choose more optimal query plans more often.
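If anyone wants to experiment, it can be tried per-session before touching postgresql.conf (0.2 is just the value that worked for us):

set cpu_tuple_cost = 0.2;  -- default is 0.01
show cpu_tuple_cost;
-- to make it permanent, set cpu_tuple_cost = 0.2 in postgresql.conf and reload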
Huh, interesting! I just looked up the parameter - "the planner's estimate of the cost of processing each row during a query". So increasing this will tend to cause the query planner to forgo complete table scans in favor of indexed queries more often. Increasing the parameter by a factor of 20 seems like you're telling it "never do complete table scans unless it's completely unavoidable, always use an index".
I can see why this would make the query plans look better (in that index scans make more sense to humans) but are you sure you're actually getting better performance? If you are getting better performance, why is the default so wrong? Computers are really fast at sequential scans, less so at random-access stuff.
"Computers are really fast at sequential scans, less so at random-access stuff."
This is true if you're simply talking about raw IO throughput, but in terms of algorithmic searching the truth is almost exactly the opposite. For example, to find a record in a 100 million record table with a unique key using sequential search would require on average a scan of 50 million records. To access the same record when that key is indexed requires accessing/testing a maximum of about 27 entries in the index (2^27 is roughly 130 million) and then direct access of the exact record needed in the table. Indexes are used because they turn a linear scan into a logarithmic lookup, even though it may be true that random access IO is much slower than sequential IO.
People, I suspect, also have much higher throughput when scanning sequentially compared to random access of items. But imagine trying to find the single person in a copy of the Manhattan white pages (is there still such a thing?) with the name 'Horace Greeley'. Would you prefer to scan sequentially from the beginning (no use of an index) or make use of the fact that white page phone books are indexed by lastname, firstname?
Of course, because sequential scanning is so much faster than random access it will always be faster on tiny tables. It will depend on the system and record size, but I'm guessing indexing will take over as the faster method as table size grows beyond 5k records or so, give or take an order of magnitude. Not sure of exact numbers but you get the idea. As the number of records goes up from there, the speed advantage of indexing over scanning keeps growing (roughly n versus log n).
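A minimal Postgres-flavored sketch of the phone book comparison (the table and names are made up):

-- made-up table standing in for the white pages
create table phonebook (id bigserial primary key, lastname text, firstname text);

-- with no index on the name, this lookup has to scan the whole table:
explain select * from phonebook where lastname = 'Greeley' and firstname = 'Horace';

-- a b-tree index turns it into a handful of page accesses instead:
create index phonebook_name_idx on phonebook (lastname, firstname);
explain select * from phonebook where lastname = 'Greeley' and firstname = 'Horace';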
The default is configured for magnetic disks, where random IO is brutally slow (on the order of 30-100 operations per second) and sequential reads are reasonably fast. But what if you're running it on a Micron P420m with 700,000 IOPS?
It's worth noting that SQL Server, presuming you've configured uniques and relationships correctly, actually runs exactly the same query plan for their first query and their last query.
If the author is reading please post the EXPLAIN ANALYZE results for these queries. I'd much prefer the raw text psql output to the GUI representation in pgAdmin. More importantly it has the actual numbers for the intermediate steps to see which is the culprit.
Also, any idea how many times each variation was executed? 10M rows is not that big. At 1K per row (which is probably an overestimate) that comes out to at most 10 GB, and realistically much less. That can easily fit in memory. After the first full scan all of it would come from the cache. It's not a fair comparison unless you flush the cache between tests.
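For reference, the kind of output I mean comes straight from psql with something like this (column names follow the queries quoted elsewhere in the thread):

explain (analyze, buffers)
select dashboards.name, count(distinct time_on_site_logs.user_id) as ct
from time_on_site_logs
join dashboards on time_on_site_logs.dashboard_id = dashboards.id
group by dashboards.name
order by ct desc;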
OP here. It's no EXPLAIN ANALYZE, but here are the EXPLAINs at least. The originals were averages of 10 runs. These EXPLAINs are just quick runs on my laptop, so no promises. :)
My first thought when reading this was that MSSQL (which I'm most experienced with) would automatically optimize this query. I'd think postgres can too. If I saw this happen in MSSQL I'd start by clearing the statistics for the table.
My second thought was that 20 seconds is a long time for a query and there's probably a better solution using a materialized view.
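In Postgres (9.3+) terms, a rough sketch of that materialized view, using the table and column names from the queries quoted elsewhere in the thread:

create materialized view dashboard_user_counts as
select dashboards.name, count(distinct time_on_site_logs.user_id) as ct
from time_on_site_logs
join dashboards on time_on_site_logs.dashboard_id = dashboards.id
group by dashboards.name;

-- refresh on whatever schedule the dashboard can tolerate
refresh materialized view dashboard_user_counts;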
In PostgreSQL, you can use CTEs to force certain subqueries to run before others. That technique would be useful here, and would clean up the code a bit.
But CTEs in SQL Server are not an optimization boundary like they are in Postgres. This makes a big difference in their proper use.
In SQL Server, aside from recursion, CTEs are purely syntactic sugar and thus a tool for organizing code. Whereas in Postgres, due to the optimization boundary, they are a tool for query tuning. If you rewrite a query in Postgres using CTEs just to clean it up, you are doing it wrong, since the CTEs will also affect the planner.
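As a concrete example of using that boundary deliberately in Postgres (names follow the queries quoted elsewhere in the thread), this forces the per-dashboard aggregation to run before the join:

with per_dashboard as (
  select dashboard_id, count(distinct user_id) as ct
  from time_on_site_logs
  group by dashboard_id
)
select dashboards.name, per_dashboard.ct
from per_dashboard
join dashboards on per_dashboard.dashboard_id = dashboards.id
order by per_dashboard.ct desc;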
Right, but that's assuming that performance is the most pressing constraint. Sometimes it's confidence in the query producing the right result, or maintenance further down the line.
Anyone have the table layouts they are using? I was going to attempt to run them through Visual Explain which is the tool I use on the iSeries which employs DB2.
A simple analysis over a very rotten table, doing a count(distinct) over a very varied field resulted in
Table Scan -> Temp Distinct Hash Table -> Hash Scan -> Final Results (330.022 ms) for a table with no keys and slightly over three million records.
I would love to run the same exact statements he is, but I need the table layout to replicate them. I am curious whether our analyzer would see a difference; it is really hard to "trick" or beat it.
Amazon RedShift (based on Postgres) has an approximate function that can be used to return results faster while calculating distinct values from a large table, with a relative error of around 2% (http://docs.aws.amazon.com/redshift/latest/dg/r_COUNT.html). To give some perspective, we ran a test query on a RedShift table with > 900 million records to fetch unique user IDs. With the approximate function applied, the results were returned in 8 seconds, while the usual distinct query took 51 seconds, roughly 84% faster!
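For reference, the syntax is just a prefix on the aggregate (the table and column names here are ours):

select approximate count(distinct user_id)
from events;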
Is there any relational logic reasoning that would prove that the query planner couldn't theoretically optimize the first query into the execution plan of the third?
I wonder how efficiently the ORMs out there are doing. My belief is that once performance profiling is done and the crucial bottleneck is identified, people will replace the original ORM call with a custom SQL query.
But heck, sometimes the DB language drivers (like pymongo or python-mysql - just giving examples of what I mean by language driver) are not optimized either. :) So a careful post-profiling analysis is required.
With SQLAlchemy, I have done similar optimizations against mssql by changing a few lines of ORM code. Without SQLAlchemy, I imagine I'd have to change dozens of hand written SQL queries.
A good ORM helps you to generate the exact SQL you need.
I am really surprised that you are using SA as an ORM on MSSQL (sorry, I don't touch MSSQL :) and most people I deal with are either MySQL or Postgres users).
> A good ORM helps you to generate the exact SQL you need.
Right, but sometimes it's the driver that behaves badly. (One of those days when I claim to have free time I should dig deeper and do a good performance analysis... one day...)
Title Case is still very common in the USA (see http://www.nytimes.com/ for an example, their house style is to use it for all of their headlines). You hardly ever see it in the UK these days.
If you're someone who doesn't know what they're doing, then this wouldn't seem quite obvious, would it?
So this post is probably not meant for you, but rather for people that are not as experienced in SQL.
I love that they're discussing this, and kudos to them, but unless the specific details are described, it's hard to really take much from it. For example, "and there are lots more log lines than dashboards and users" - what does that mean? That dashboard_id is nullable and many rows log entirely different things? That would be a significant problem unto itself. And what is the cardinality of users, dashboards, and sets thereof?
As a general recommendation I advise database developers to avoid exactly this sort of "outsmart the planner" behavior because it paints you into a corner -- data cardinality changes, schemas change, and you have large, clever queries that turn into a significant net detriment. They can absolutely be necessary in some cases, but this does not look like one of them.
"As a general recommendation I advise database developers to avoid exactly this sort of "outsmart the planner""
Probably wise in general. This particular case is peculiar because postgres doesn't optimize distinct aggregates at all (see my other comment: https://news.ycombinator.com/item?id=7115707 ), so it's possible that the rewritten query is actually more optimizable than the original.
The original query won't adapt to changing data sizes or distributions as well as the rewritten one.
This is not fundamental, nor universal. It's a specific limitation of postgres, so I still agree with your general advice. And postgres may fix this, so your advice may prove to be true even in this situation in the long run.
I think this was more an exposition on the general technique than specific recommendation.
The general theme should be, "how do I make my database do less work?" followed by a list of techniques, including "pre-aggregate to reduce volume". In this case, the pre-aggregates are in-line. Whether in-line pre-aggregates are appropriate will depend on context, just as you say.
OP - Can you please run your original query with a minor variation-
select
dashboards.name,
count(distinct time_on_site_logs.user_id)
from time_on_site_logs
join dashboards on time_on_site_logs.dashboard_id = dashboards.id
group by time_on_site_logs.dashboard_id, dashboards.name
order by count desc
Do you have a FK from dashboard_id to dashboards.id by chance? I assume no, as the query is presumably still forced to do an early existence test of the inner join before aggregating.
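If it isn't there, the constraint I'm asking about would look something like this (the constraint name is mine):

alter table time_on_site_logs
  add constraint fk_logs_dashboard
  foreign key (dashboard_id) references dashboards (id);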
I disagree; the article is essentially about manual query optimization. Theoretically, a great optimizer could do this by itself, but the fully optimized query will run the same way pretty much everywhere; query execution engines are not very different, basically grace hash joins and B-trees.
> the fully optimized query will run the same way pretty much everywhere
This is not correct. The query will still be run through an optimizer, and the optimizer will still pick its own path of execution based on indexes and data uniqueness (as identified from the table metadata or its own dives into the data).
There are ways to force an optimizer to take a certain path, but those are dependent on which DB you're using.
Unfortunately, the original dataset doesn't exist online, so we can't verify their claims that it really is database agnostic.
And then there are table statistics or fragmented indexes that cause the optimizer to choose a different path. Even when you've done everything and you think the optimizer should choose a particular index it will surprise you.
As I detailed elsewhere, the problem with this assumption is that you are one-off optimizing by encoding metadata about the data in the query. So while this specific query has been made more optimal, the underlying problem with the database remains, and thus you will invariably face and then have to work around the same problem numerous times.
Or more often one won't, because the legacy of hundreds of slow but not terrible queries accumulates. It becomes death by a thousand cuts, which is the end result of many heavily used databases.
My example in SQL Server sees the simple addition of a single unique index to give that same metadata to the query planner, yielding the identical plan for the first and last. And this same benefit will flow to every query that follows the same path.
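Concretely, that's a one-liner (T-SQL; the index name is mine, the table is from the article):

create unique index ux_dashboards_name on dashboards (name);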