> CTEs won’t always be quite as performant as optimizing your SQL to be as concise as possible. In most cases I have seen performance differences smaller than a 2X difference, this tradeoff for readability is a nobrainer as far as I’m concerned. And with time the Postgres optimizer should continue to get better about such performance.
So, my knowledge here might be out of date, but this behavior isn't "the optimizer isn't smart enough": using a CTE in PostgreSQL causes an explicit boundary in the optimizer which prevents some optimizations from being performed. In some cases you really need/want this behavior, and it might increase the performance of your query; in other cases, this is the last thing you would want. People keep asking for optimizer hints, which PostgreSQL refuses to add, and yet they leave this quirk in explicitly as an escape hatch.
Yep, it's a huge improvement for complicated pipelines.
One thing to be aware of is "NOT MATERIALIZED". Here's what the docs say:
A useful property of WITH queries is that they are normally evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries. Thus, expensive calculations that are needed in multiple places can be placed within a WITH query to avoid redundant work. Another possible application is to prevent unwanted multiple evaluations of functions with side-effects. However, the other side of this coin is that the optimizer is not able to push restrictions from the parent query down into a multiply-referenced WITH query, since that might affect all uses of the WITH query's output when it should affect only one. The multiply-referenced WITH query will be evaluated as written, without suppression of rows that the parent query might discard afterwards. (But, as mentioned above, evaluation might stop early if the reference(s) to the query demand only a limited number of rows.)
However, if a WITH query is non-recursive and side-effect-free (that is, it is a SELECT containing no volatile functions) then it can be folded into the parent query, allowing joint optimization of the two query levels. By default, this happens if the parent query references the WITH query just once, but not if it references the WITH query more than once. You can override that decision by specifying MATERIALIZED to force separate calculation of the WITH query, or by specifying NOT MATERIALIZED to force it to be merged into the parent query. The latter choice risks duplicate computation of the WITH query, but it can still give a net savings if each usage of the WITH query needs only a small part of the WITH query's full output.
Any issues with recursive and NOT MATERIALIZED? Recursive is pretty much the reason to use CTEs for me and NOT MATERIALIZED is essential there in most cases.
You are right about recursive CTE - that's a functionality that's not possible without CTE
However, I will argue that an equally great benefit is code readability.
That allows you to give good names to pieces of SQL block is like having small functions in normal programming
Indeed it is, although as cascade of create temp view statements within transaction works almost the same way for readability. besides, not everyone can count the open/closing braces for the CTE parts.
CTEs also cannot have indexes on the intermediate results, while views/mviews can and this benefits great cascaded analysis in SQL.
speaking from experience, where we had to optimize a particular network analysis for urban planning, which was initially impossible in qgis/argis, then first version ran for 3-10mins, and initially with intermediate indices the whole thing went down to 45sec-1.5min. this is not possible with CTE to my knowledge, even though I've searched for ways to tell it to build intermediate indices.
Used judiciously, a macro system in front of your SQL is often a good approach here. You gain the "linguistically common subtable" capabilities of a CTE, the ability to name constants, and the ability to name/parameterize common sub-expressions, even in databases not supporting CTEs. With LSP support being what it is, you can even whip up editor integration for your new language in a day or less.
Compared to CTEs (assuming you're not using recursion and don't _want_ the materialization), that extra syntactic sugar helps with readability, and the "macro" nature of the thing ensures you don't have any slowdowns from running the thing due to a different query plan.
Downsides include the ability for juniors to turn your SQL into an awful mess, the fact that you'll still occasionally _want_ materialized CTEs for performance and have to write them anyway, plus if you implement it wrong you'll have runtime overhead and very few introspection capabilities (the simplest version I've seen that's decent to work with is creating SQL files as artifacts from the templates using the build system, obviously depending on how much of a rube goldberg your particular builds are).
> CTEs won’t always be quite as performant as optimizing your SQL to be as concise as possible. In most cases I have seen performance differences smaller than a 2X difference, this tradeoff for readability is a nobrainer as far as I’m concerned.
This is the key trade-off you need to keep in the back of your head. Pre-mature performance optimization is the root of all evil; prefer readability first. But the fact remains that you should be setting timeouts and other time budgets, tracing calls from API through to the database, seeing what's taking the most time. If you get to the point where you need to optimize a CTE-based query, be prepared to rewrite it.
WITH clauses are a somewhat recent feature, at least on PostgreSQL. If you learned SQL before they became commonly available, it's unsurprising if you tend to think of other SQL techniques first.
Well, it wasn't there when I first learned SQL by reading the PostgreSQL manual. I recall that this was back when the size of TEXT columns was limited, so it was before version 7.1 (which introduced TOAST tables), released 13 April, 2001.
So yeah, if you first learned SQL around 25 years ago, something added only 15 years ago is "somewhat recent". And according to a link posted in another comment, other databases introduced this feature (non-recursive WITH clauses) even later.
I'm all but certain that a CTE and an unindexed view are treated identically in the planning phase.
You're describing predicate pushdown, which can get hung up under certain conditions, like when using window functions or top. If I had to take a shot in the dark, you were dealing with something like that.
I use CTEs all the time, find them way more readable than subqueries. I've refactored a few complex queries to use them and had the reaction from the original author of "wtf is this crap?". It does take some getting used to ;)
This feels to me like a "in other news, water is wet" kind of story. Maybe it's just me, but based on the other comments here CTEs seem to be common knowledge. I went on a SQL trial by fire this year after landing in a role that required massive amounts of weird querying for ad-hoc reporting purposes. Before this my experience and knowledge stopped at different join types.
I naturally discovered CTEs over time after seeking solutions to my problems on the usual online resources. For me the use is two-fold. 1/ I use it to create ad-hoc lookups on data, where the lookups are not available in the DB in the first place. And 2/ In complex queries with lots and lots of conditional joins, I can use a CTE to basically build up my temporary base table, getting all my data in one place, then my main `SELECT` becomes a lot cleaner and easier to read.
I've not looked at recursive CTEs yet, supposedly they can be quite handy. Not sure for what though.
We recently migrated from MySQL to Postgres, and the data reporting went from direct-db querying to Snowflake. And CTEs work on for me on all 3 platforms, so this does not even seem like a Postgres only thing.
So, my knowledge here might be out of date, but this behavior isn't "the optimizer isn't smart enough": using a CTE in PostgreSQL causes an explicit boundary in the optimizer which prevents some optimizations from being performed. In some cases you really need/want this behavior, and it might increase the performance of your query; in other cases, this is the last thing you would want. People keep asking for optimizer hints, which PostgreSQL refuses to add, and yet they leave this quirk in explicitly as an escape hatch.