Learn your DB server. Check the query plans often. You might get surprised. Tweak and recheck.
Usually EXISTS is faster than IN. Beware that NOT EXISTS behaves differently from EXCEPT with regard to NULL values.
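For example, with made-up single-column tables where table_a.val holds 1 and NULL, and table_b.val holds only NULL:
    -- EXCEPT treats two NULLs as "not distinct", so the NULL row is removed:
    SELECT val FROM table_a
    EXCEPT
    SELECT val FROM table_b;   -- returns only 1

    -- NOT EXISTS compares with =, and NULL = NULL is not true, so the NULL row survives:
    SELECT a.val
    FROM table_a a
    WHERE NOT EXISTS (SELECT 1 FROM table_b b WHERE b.val = a.val);   -- returns 1 and NULL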
Instead of joining tables and using DISTINCT or similar to filter rows, consider using subquery "columns", i.e. in the SELECT list. This can be much faster even if you're pulling 10+ values from the same table, and even if your database server supports lateral joins. Just make sure the subqueries return at most one row.
Any query that's not a one-off should not perform any table scans. A table scan today can mean an outage tomorrow. Add indexes. Keep in mind the GROUP BY clause usually dictates index use.
If you need to filter on expressions, say where a substring is equal to something, you can add a computed column and index on that. Alternatively, some DBs support indexing expressions directly.
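A sketch of both approaches with made-up names; the first is SQL Server-style (persisted computed column plus a regular index), the second is a Postgres-style expression index:
    -- SQL Server: computed column, then index it like any other column
    ALTER TABLE accounts
        ADD email_domain AS SUBSTRING(email, CHARINDEX('@', email) + 1, 255) PERSISTED;
    CREATE INDEX ix_accounts_email_domain ON accounts (email_domain);

    -- Postgres: index the expression directly (queries must use the same expression)
    CREATE INDEX ix_accounts_email_domain ON accounts (split_part(email, '@', 2));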
Often using UNION ALL can be much faster than using OR, even for non-trivial queries and/or multiple OR clauses.
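A rough sketch with made-up tables; each branch can use its own index, and the second branch filters out rows the first already returned:
    -- Instead of:
    SELECT * FROM orders WHERE customer_id = 42 OR salesperson_id = 7;

    -- try (assuming customer_id is NOT NULL, so the duplicate-avoidance predicate is safe):
    SELECT * FROM orders WHERE customer_id = 42
    UNION ALL
    SELECT * FROM orders WHERE salesperson_id = 7 AND customer_id <> 42;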
edit: You can JOIN subqueries. This can be useful to force the filtering order if the DB isn't being clever about the order.
The most useful thing is learning your DBMS. There's no escaping the performance and isolation quirks of each one, and there are different bonus features in each.
One interesting thing I found about Postgres that's probably true of others too, often you can manually shard INSERT (SELECT ...) operations to speed them up linearly with the number of CPU cores, even when you have like 10 joins. EXPLAIN first, find the innermost or outermost join, and kick off a separate parallel query operating on each range of rows (id >= start AND id < end). For weird reasons, I relied on this a lot for one job 6 years ago. Postgres has added parallelism in versions 10+, but it's still not this advanced afaik.
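A minimal sketch of the idea with made-up tables; each statement runs on its own connection, one per core, over a disjoint id range of the driving table:
    INSERT INTO order_totals (order_id, total)
    SELECT o.id, SUM(l.amount)
    FROM orders o
    JOIN order_lines l ON l.order_id = o.id
    WHERE o.id >= 0 AND o.id < 1000000
    GROUP BY o.id;

    -- second connection: o.id >= 1000000 AND o.id < 2000000, and so on for each range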
It would be cool to see something automating this conversion (and other similar performance workarounds) available as a first-class feature in a SQL IDE.
I ended up building jank Python-based tooling around it to sorta automate it. You select your key and it decides the ranges for you.
Wonder if it'd be useful at all for live applications or just data processing. For the former, would need to somehow execute all reads at the same MVCC version even though they're separate connections.
> > Keep in mind GROUP BY clause usually dictates index use.
> The reason for this wasn't immediately apparent to me.
The key thing to remember is that grouping is essentially a sorting operation, and it happens before your other sorts (that last part isn't necessarily as obvious).
Sorry, I was on mobile so didn't have the patience to type examples.
SELECT
    column1,
    (
        SELECT column2
        FROM table_b
        WHERE table_a.id = table_b.a_id
    ) as b_column2,
    (
        SELECT column3
        FROM table_b
        WHERE table_a.id = table_b.a_id
    ) as b_column3
FROM table_a
It might look like a lot more work, but in my experience it's usually a lot faster. YMMV but check it.
How well that performs compared to a JOIN can vary massively depending on the data sizes of table_a & table_b, how table_b is indexed, and what else is going on in the query.
If table_b has an index on id,column2,column3 (or on id INCLUDE-ing column2,column3) I would expect the equivalent JOIN to usually be faster. If you have a clustered index on Id (which is the case more often than not in MS SQL Server and MySQL/InnoDB) then that would cover this, unless the table is much wider than those three columns (so a narrower index holding only the selective data would fit many more rows per page than the base data).
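For reference, a sketch of that equivalent JOIN with a covering index (SQL Server / Postgres 11+ INCLUDE syntax; assumes at most one table_b row per table_a.id, like the subqueries above):
    CREATE INDEX ix_table_b_a_id ON table_b (a_id) INCLUDE (column2, column3);

    SELECT a.column1,
           b.column2 AS b_column2,
           b.column3 AS b_column3
    FROM table_a a
    LEFT JOIN table_b b ON b.a_id = a.id;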
Worst (and fairly common) case with sub-selects like that is the query planner deciding to run each subquery once per row from table_a. This is not an issue if you are only returning a few rows, or just one, from table_a, but in more complex examples (perhaps if this fragment is a CTE or view that is joined in a non-sargable manner so filtering predicates can't push down) you might find a lot more rows are processed this way even if few are eventually returned due to other filters.
There are times when the method is definitely faster but be very careful with it (test with realistic data sizes and patterns) because often when it isn't, it really isn't.
Are there any tools or tips to help speed up the "which JOIN is duplicating data" hunt?
Usually my biggest problem is getting all the query parameters lined up to reproduce the issue! (Being able to flip on extended logging or a profiler can make this easy.)
Cutting out the result columns when disabling JOINs to narrow it down is straightforward but tracking columns down in WHERE clauses quickly tends not to be.
Agreed. Learn how to use EXPLAIN and interpret using whatever tools you prefer. Also monitor your queries. Something I did in a previous start up was install PgHero, and man did that tool help me optimise and prioritise performance.
> helpful to identify slow queries in production, remove duplicate indexes, see missing indexes, keep an eye on table size, etc
I haven't recently put in the effort to find a copy of SQL Sentry from back when the full-featured edition was briefly free but even the "always free" version was helpful working with MSSQL query plans.
> Learn your DB server. Check the query plans often. You might get surprised. Tweak and recheck.
Oftentimes even well-designed queries behave unexpectedly, because the column statistics are not up to date or because the data in big tables is fragmented (e.g. random PK insertion).
> Any query that's not a one-off should not perform any table scans. A table scan today can mean an outage tomorrow
I disagree.
There are queries where a table scan is the most efficient access strategy. These are typically analytical/aggregation queries that usually query the whole table. And sometimes getting only 50% of all rows is better done using a table scan as well.
I also don't see how a (read only) "table scan" could lead to an outage. It won't block concurrent access. The only drawback is that it results in a higher I/O load - but if the server can't handle that, I would assume it's massively undersized.
> Just make sure the subqueries return at most one row.
The JSON functions most RDBMSs offer are awesome for that. One subquery can aggregate multiple results for the field into a single JSON value, and then you only have to decode it on the app side.
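For example, something like this (Postgres-flavoured sketch with made-up names; SQL Server has FOR JSON and MySQL has JSON_ARRAYAGG for the same idea):
    SELECT a.id,
           (SELECT json_agg(b.column2)
            FROM table_b b
            WHERE b.a_id = a.id) AS b_column2_json
    FROM table_a a;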
Relevant for applications as well, when a table only has a few thousand entries, a scan is not the end of the world and not even an outage in waiting.
I agree with you that one should seek when possible as part of normal query optimization, but depending on your data, it could also just easily be something you can live with forever.
You can’t control the growth rate of your tables, you can only estimate it.
When we design for reliability we want to remove single cause failures and make a best effort to reduce dual cause failures. We definitely don’t want two failures from a single cause.
For what reason might the lack of indexes suddenly become a critical issue? And what other things might you be scrambling to deal with at the same time? Tables might fill quickly when a favorable review comes in, or some world event results in churn in your system.
Just make the damned index. You Are Going to Need It. And what’s the harm in making it?
Depending on your application, you can very accurately estimate it. And in a case I had yesterday, it involved stringy numbers because of a third party system, so I could indeed add a computed persisted column that converts our number to a VARCHAR, add a 9th index with a lot of fields on that computed column and then save… almost nothing compared to just scanning 6k rows.
We had one such table for years. Then one day I get an emergency call from support: a big customer isn't getting their responses, and it's critical for their workflow.
After some digging I found the service generating the responses got killed due to being unresponsive.
Turns out our customer got a new client which caused them to suddenly generate 100x as much data as others in this module. And that caused a lot more data in a table that joined this non-indexed table.
So everything was working, it was just that the performance went over a cliff in a matter of days due to the missing index.
Added the required index and it's been humming ever since.
I've had similar experiences, and so these days I'm very liberal with indexes.
We have read-heavy workloads; if you mostly insert then sure, be conservative.
As usual, there are well-qualified exceptions. If you are very certain the table scan can't hurt, sure. But in my experience, an index wouldn't hurt any in those cases.
The "readability" section has 3 examples.
The first 2 are literally sacrificing readability so it's easier to write, and the last is an unreadable abomination that the indenting really isn't doing much to help.
> The first 2 are literally sacrificing readability so it's easier to write, (...)
The leading comma format brings benefits beyond readability. For example, in version control systems the single-argument-per-line-with-leading-comma format turns any change to those arguments into a one-line diff.
I think developers spend as much time looking at commit history as they do looking at the actual source code.
If you’re still using a diff tool that can’t do sub-line diffs it’s time to join the 20’s. I haven’t been forced to use one of those in over ten years.
> If you’re still using a diff tool that can’t do sub-line diffs it’s time to join the 20’s.
I think you failed to understand what I wrote.
Leading comma ensures one line diffs, but trailing comma forces two-line diffs when you add a trailing argument. With trailing comma, you need to touch the last line to add a comma, and then add the new argument in the line below.
We are not discussing bundling all arguments in a single line. I don't know where you got that idea from.
It does not. It just moves the edge case to a different position: trailing comma has the "issue" when adding an argument to the end of the list while leading comma has it when adding an argument to the beginning.
Also, as pointed out by the other commenter, any decent modern diff tool will make it obvious that the change to the existing line is just the addition of a comma, which makes the difference basically moot.
What’s the value in doing this unless it makes the diff clearer?
It only makes the diff clearer if you don’t have single character highlighting in your diff tool. Which most have now. Have had for a decade.
Also it’s not going to be a single line anyway. You add a line to the query and one to the caller. At a minimum. So you’re really arguing for three versus four. Which is false economy.
I'm not the biggest fan of how the first two conventions look, but they are real conventions used by real SQL people. And I can understand why they exist.
I've seen them enough to not be bothered by them any more.
Yeah, unfortunately you're right that they are real conventions. Quite common too.
I also _understand_ why they exist. It's simple: It makes code marginally easier to write.
But writing confusing, unintuitive and honestly plain ugly code just so you can save the second it takes for the compiler to point out the mistake after you click run is a bad trade.
A lot of "readability" depends on what you're used to and what you expect. I don't think these conventions are inherently "ugly" or "confusing", but they are different to what I've been doing for a long time, and thus unexpected, and thus "ugly". But that's extremely subjective.
I've done plenty of SQL, and I've regularly run in to the "fuck about with fucking trailing commas until it's valid syntax"-problem. It's a very reasonable convention to have.
What should really happen is that the SQL standard should allow trailing commas:
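something like the following, where the dangling comma after the last item is simply ignored (a hypothetical example; BigQuery already tolerates this, most other dialects currently reject it):
    SELECT
        employee_id,
        employee_name,
        salary,      -- trailing comma: commenting out any line, including this one, stays valid
    FROM employees;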
> A lot of "readability" depends on what you're used to and what you expect.
Yes. Typically shared sense of "readability" in a community for language X translates to "idiomatic patterns when writing X". There's no real thing as readability in a universal sense. It's a placeholder statement for "it's easier for ME to understand", double emphasis on "ME". Within a community, "readability" standards are merely channeling the idiomatic patterns within that community as for most members they'll be easier for the person to understand as it's what they're used to seeing.
Yes pleeease allow trailing commas. The number of times I have stubbed my toe on that...
Especially when removing or commenting out a line.
Doesn't help that many environments where SQL is used have poor syntax highlighting for errors. Looking at you, Grafana...
Alternatively, write a mess of SQL like a three year old child that just discovered MSPaint then push the "beautifier" button and knock off for an early lunch.
Who splits columns one per line in the SELECT block and still leaves 150-character-wide lines? This is a fucked up definition of legibility. I can't even get started on the commas.
NOBODY CHECKS LONG LINES IN CODE REVIEWS. That was the biggest problem with AngularJS. People mishandling merges and breaking everything because the eyes start to glaze over at column 90. I’ve been on more than half a dozen teams with CRs and it’s always the same. I’m exquisitely aware of this and try not to do it, and I still fuck it up half as often as the next person.
Split your shit up. Especially when trying to set an example for others.
This could be a great comment if the tone was different. I'll try to give my perspective.
SQL, unfortunately, is very verbose and has a strange mix of super-high and very low abstraction. There is also no SQL formatter out there that does a decent job, and no real consensus about how good SQL is supposed to look.
If I look at the 'indent' guideline, it contains e.g.:
, IFF(DATEDIFF(DAY, timeslot_date, CURRENT_DATE()) >= 29,
      LAG(overnight_fta_share, 2) OVER (PARTITION BY timeslot_date, timeslot_channel ORDER BY timeslot_activity),
      NULL) AS C28_fta_share
Immediate SQL failures: 1) it has no easy facility to pull that DATEDIFF clause out into a separate variable/field. 2) The LAG line is verbose, especially if your DB doesn't let you pull out the WINDOW clause.
I'll try to give some constructive criticism instead of a drive by pot shot. I'm sorry, it's just that the leading commas make my eyes bleed and I really hope the industry moves away from it.
On point 3:
What I do is use CTEs to create intermediate columns (with good names) and then a final one creating the final column. It's way more readable.
```sql
with intermediate as (
    select
        DATEDIFF(DAY, timeslot_date, CURRENT_DATE()) > 7 as days_7_difference,
        DATEDIFF(DAY, timeslot_date, CURRENT_DATE()) >= 29 as days_29_difference,
        LAG(overnight_fta_share, 1) OVER (PARTITION BY timeslot_date, timeslot_channel ORDER BY timeslot_activity) as overnight_fta_share_1_lag,
        LAG(overnight_fta_share, 2) OVER (PARTITION BY timeslot_date, timeslot_channel ORDER BY timeslot_activity) as overnight_fta_share_2_lag
    from timeslot_data
)
select
    iff(days_7_difference, overnight_fta_share_1_lag, null) as C7_fta_share,
    iff(days_29_difference, overnight_fta_share_2_lag, null) as C28_fta_share
from intermediate
```
I appreciate the feedback, no offence taken. I'm an analyst so I often find the leading comma useful when I'm testing something and want to quickly comment a column out but I take your point.
And I agree, I should have used CTEs for this query, I was just trying to save lines of code which had the unintended consequence of quite an ugly query. However I did want to use it as an example of indentation being useful to make it slightly easier to read. Although perhaps I'm the only one who thinks so.
My tips for working with complex Stored Procedures.
1. At the beginning of the proc, immediately copy any permanent tables into temporary tables and specify/limit/filter only for the rows you need.
2. In the middle of the proc, manipulate the temporary tables as needed.
3. At the end of the proc, update the permanent tables enclosed within a transaction. Immediately roll back the transaction and exit the proc if an error is detected. (By following all three steps, you improve concurrency and can restart the proc without manually cleaning up any data messes.) A minimal sketch of steps 1-3 follows the list.
4. Use extreme caution when working with remote tables. Remote tables do not reside in your RDBMS and most likely will not utilize any statistics/indexes your RDBMS has. In many cases, it is more performant to dump/copy the entire remote table into a temporary table and then work with that. The most you can expect from a remote table is to execute a WHERE clause. If you attempt joins or something complicated, it will likely time out.
5. The Query Plan is easily confused. In some cases, the Query Plan will resort to row-by-row processing, which will bring performance to a halt. In many cases, it is better to break up a complex stored procedure into smaller steps using temporary tables.
6. Always check the Query Plan to see what the RDBMS is actually doing.
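A minimal T-SQL-flavoured sketch of steps 1-3, with made-up tables and a deliberately simple write-back:
    -- 1. Copy only the rows you need into a temp table
    SELECT order_id, status, total
    INTO #orders_work
    FROM dbo.orders
    WHERE status = 'PENDING';

    -- 2. Do the heavy manipulation on the temp table
    UPDATE #orders_work SET total = total * 1.1 WHERE total < 100;

    -- 3. Write back to the permanent table inside a short transaction
    BEGIN TRY
        BEGIN TRANSACTION;
        UPDATE o
        SET o.total = w.total
        FROM dbo.orders AS o
        JOIN #orders_work AS w ON w.order_id = o.order_id;
        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        THROW;
    END CATCH;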
I've significantly improved the performance of queries by undoing the work of someone who did #5 when it wasn't strictly needed. Sometimes breaking a query into many smaller queries is significantly less efficient than giving the query optimizer the entire query and letting it find the best route to the data.
If you've done #5 without doing #6 then you'll likely not see that you're doing something not optimal. My advice is avoid premature optimization and do things the most straight forward way first and then only optimize if needed. Most importantly, don't code in SQL procedurally -- you're describing the data you want not giving the engine instructions on how to get it.
I hate having to use a bunch of temp tables, but I regularly run into queries that would never finish if you let the query planner do its thing. Like compilers, their ability is highly overrated. Meanwhile, Microsoft places constant warnings against even trying to tune their query planner because it supposedly knows best.
And remember that these rules may be completely valid in one vendor's database, but another's may have very different priorities/characteristics/trade offs.
Also, the version of the database can matter, too.
1-3 are nice if you can guarantee your data is reasonably sized, but if it gets too big for your hardware, taking copies of large datasets and then doing updates on them can add a lot of overhead.
I'm not a fan of "just in case" development. Not when it comes to interfaces, and also not regarding this `where 1=1` placeholder. Do things when you need them, not if you think you might need them someday in the future. Also, production code is not the place to keep dev helpers around. Do what you want in dev time, but for prod code, readability and clear intent are much more important.
Do you fully qualify all table + column name references? I've found it often increases readability by at least an order of magnitude but quickly becomes very verbose and incredibly painfully tedious to write.
With autocomplete (like with Jetbrain's Datagrip) I found that verbose table names aren't that much of a problem; they in fact really help readability.
Maybe fully qualifying all entity references is table stakes these days, I'll have to give Datagrip a spin.
I already complained about autocomplete in response to another comment asking to put FROM first; maybe existing tooling is enough to make my life easier.
One more point on the "Anti Join": use EXISTS instead of IN or a LEFT JOIN if you only want to check for the existence of a row in another large table / subquery based on the conditions. EXISTS returns true as soon as it has found a hit. In the case of LEFT JOIN and IN, the engine collects all results before evaluating.
Yeah, I was a bit confused there. In all my testing, (NOT) EXISTS was generating either a better plan or the same one as (LEFT) JOIN/(NOT) IN. In addition, it’s also clearer what the intent is.
Regarding "Comment your code!": At least for MSSQL, it’s often recommended not to use -- for comments but instead /**/, because many features like the query store save queries without line breaks, so if you get the query from there, you need to manually fix everything instead of simply using your IDEs formatter.
Are you able to cast as XML? I use that for OBJECT_DEFINITION, eg.
select name,cast((select OBJECT_DEFINITION(object_id) for xml path('')) as xml) from sys.procedures
This can be easier to straighten out since it preserves the newlines, though other XML characters get mangled, like > to &gt;. One other option is VARBINARY plus something to un-hex it.
Everybody is up in arms about the comma suggestion but everyone thinks the 1=1 is a good idea in the where clause? If I saw that in a code review I don’t know what I’d think of the author.
You can motivate it with the same reasons as trailing commas: making code reviews easier, since changes to WHERE statements don't affect other lines. But if the reason is, as in this case, to be able to add dynamic conditions, you will for sure be fired where I work.
Is anyone willing to share general guidance on where to draw the line when it comes to using DB configuration to speed things up ( almost "buy") vs. basically doing things manually ("build")? In my limited experience it often falls to app developers because competent DB admins are all getting paid much more to work elsewhere (as mentioned above, it is important to know the DB).
My canonical example is large volumes of data that accrue over time with the most recent accessed most often, where the DB admins can partition things or do partial indexes to keep access fast, but the app developers can move records into a separate archive table sometimes behind the scenes while still supporting things like (eventual) search of the whole data set. (A note here that it feels like a tool could do a lot of the initial heavy lifting to automate splitting one table into many when it makes sense -- perhaps when limited by a cloud DB's missing features)
Another management option sometimes accommodated by the DB vs. doing manually is to store all large blobs/files in their own separate database (filesystem?!) for a different storage configuration etc.
I imagine it can go as far as basically implementing an index manually: one massive table with just an auto-incrementing primary key but tons of columns then setting up a table with that ID and a few searchable columns (including up to going full text search/vectors I guess).
Edit: one useful tip manually implementing the Materialized View pattern with MSSQL 2016+: use partition switching as well explained and implemented by https://github.com/cajuncoding/SqlBulkHelpers?tab=readme-ov-... (incidentally the most commercially useful out-SEO'd tiny-star-count library I've ever found, focused on bulk inserts into MSSQL using .NET). I think this is a good example of drawing the buy/build line in the right place with the automation of the partition switching.
Not shown: stop using SELECT *. You almost certainly do not need the entire width of the table, and by doing so, you add more data to filter and transmit, and also prevent semijoins, which are awesome.
This is another area where I wish SQL was more composable. I’d love to be able to specify a bunch of columns using a single reference or a function, without having to resort to dynamic sql. Exclude and rename are a start, but not enough.
Because any subsequent clauses usually start with AND, so if you're just checking or validating data outside of production, it makes sense since you can comment out the lines you aren't interested in.
It also depends on how you write SQL, but not by much.
I wouldn't put a dummy variable into a finalized query.
I like SQL but I think it's time for the big players like MySQL, MSSQL, Postgres etc to start using FROM-first and piping syntax. I've had the pleasure of using Kusto query language and it's a huge leap forward in DX.
I feel like a significantly more context-aware autocomplete could go a long way here, most SQL editors are approaching 72% garbage.
It should be possible to make an educated guess at which tables are in play.
Elsewhere I mentioned always fully qualifying entity references which would narrow the list of possibilities down to a more workable number in most cases.
Unfortunately EXCEPT is almost never what you actually want. It's a set operator, like UNION, and just like UNION it also has the side effect of removing duplicates from the result set unless you explicitly say EXCEPT ALL.
Because it's a set operator and not a join, it's usually very hard for the query planner to optimize it beyond the most trivial cases. It usually ends up being one of the last steps in the query plan. In a good query plan you almost always want to eliminate rows you don't care about as early as possible so you don't have to drag them along in every join operation only to have the data discarded at the end, but if you're using EXCEPT instead of the explicit anti-semi-join operator (that is NOT EXISTS(<subquery>)), you're making that very difficult for the query planner.
EXCEPT and INTERSECT are sometimes a handy shortcut when you're writing some quick and dirty query by hand, but I have literally never used either of them in a production query. You almost always want to use EXISTS() and NOT EXISTS(). They explicitly communicate intent, which is appreciated both by people reading the query and by the query planner, and they lack the footguns some of the other alternatives have.
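As a rough sketch of that rewrite, with made-up tables (remembering the two aren't perfectly equivalent around duplicates and NULLs, as noted above):
    -- Quick-and-dirty set version:
    SELECT customer_id FROM customers
    EXCEPT
    SELECT customer_id FROM orders;

    -- Explicit anti-semi-join the planner can optimize more freely:
    SELECT c.customer_id
    FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);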
A common mistake I see is that people think foreign keys will automatically create indexes. Missing indexes is a general problem in SQL. Missing indexes on columns that are in foreign keys are even worse.
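For instance, in Postgres or SQL Server the constraint below creates no index on the referencing column by itself (MySQL/InnoDB is a notable exception); made-up names:
    ALTER TABLE orders
        ADD CONSTRAINT fk_orders_customer
        FOREIGN KEY (customer_id) REFERENCES customers (id);

    -- The index on the referencing column has to be created separately:
    CREATE INDEX ix_orders_customer_id ON orders (customer_id);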
I noticed a lot of this advice being used by senior devs on my team or in legacy code, but as someone just starting out, a lot of it (like "1=1") was very odd and made the queries less accessible.
Nice to finally find the term for the "anti-query"; learning about it really changed how I write queries. Equally good to see that most of these apply regardless of the RDBMS of choice.
Not at all actually, I just hadn't really planned to add this as a tip.
Additionally I thought an in-line view was fine for the examples included.
But maybe I will!
On readability, I often find aligning things in two columns is more readable. To modify the two examples in TFA:
SELECT e.employee_id
     , e.employee_name
     , e.job
     , e.salary
  FROM employees e
 WHERE 1=1 -- Dummy value.
   AND e.job IN ('Clerk', 'Manager')
   AND e.dept_no != 5
;
and with a JOIN:
SELECT e.employee_id
     , e.employee_name
     , e.job
     , e.salary
     , d.name
     , d.location
  FROM employees e
  JOIN departments d
    ON d.dept_no = e.dept_no
 WHERE 1=1 -- Dummy value.
   AND e.job IN ('Clerk', 'Manager')
   AND e.dept_no != 5
;
In the join example, for a simple ON clause like that I'll usually just have JOIN ... ON in the one line, but if there are multiple conditions they are usually clearer on separate lines IMO.
In more complicated queries I might further indent the joins too, like:
SELECT *
  FROM employees e
       JOIN departments d
         ON d.dept_no = e.dept_no
 WHERE 1=1 -- Dummy value.
   AND e.job IN ('Clerk', 'Manager')
   AND e.dept_no != 5
;
YMMV. Some people strongly agree with me here, others vehemently hate the way I align such code…
WRT “Always specify which column belongs to which table”: this is particularly important for correlated sub-queries, because if you put the wrong column name in and it happens to match a name in an object in the outer query you have a potentially hard to find error. Also, if the table in the inner query is updated to include a column of the same name as the one you are filtering on in the outer, the meaning of your sub-query suddenly changes quite drastically without it having changed itself.
A few other things off the top of my head:
1. Remember that as well as UNION [ALL], EXCEPT and INTERSECT exist. I've seen (and even written myself) some horrendous SQL that badly implements these behaviours. TFA covers EXCEPT, but I find people who know about that don't always know about INTERSECT. It is rarely useful IME, but when it is useful it is really useful.
2. UPDATEs that change nothing still do everything else: create entries in your transaction log (could be an issue if using log-shipping for backups or read-only replicas etc.), fire triggers, create history rows if using system-versioned tables, and so forth. UPDATE a_table SET a_column = 'a value' WHERE a_column <> 'a value' can be a lot faster than without the WHERE.
3. Though of course be very careful with NULLable columns and/or setting a value NULL with point 2. “WHERE a_column IS DISTINCT FROM 'a value'” is much more maintainable if your DB supports that syntax (added in MS SQL Server 2022 and Azure SQL DB a little earlier, supported by Postgres years before, I don't know about other DBs without checking) than the more verbose alternatives.
4. Trying to force the sort order of NULLs with something like “ORDER BY ISNULL(a_column, 0)”, or doing similar with GROUP BY, can be very inefficient in some cases. If you expect few rows to be returned and there are relatively few NULLs in the sort target column it can be more performant to SELECT the non-NULL case and the NULL case, then UNION ALL the two, and then sort. Though if you do expect many rows this can backfire badly and you end up with excess spooling to disk, so test, test, and test again, when hacking around like this.
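A sketch of point 4 with made-up names; the flag column decides where the NULL rows land while each branch keeps a plain, index-friendly predicate:
    SELECT a_column, 0 AS null_sort
    FROM a_table
    WHERE a_column IS NOT NULL
    UNION ALL
    SELECT a_column, 1 AS null_sort
    FROM a_table
    WHERE a_column IS NULL
    ORDER BY null_sort, a_column;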
Leading comma is nice in SELECT statements because you can comment toggle individual lines. Indenting your code to make it more readable is basically what anyone with room temperature IQ does automatically. A lot of these other tips look like they're designed to deal with SQL design flaws, like how handling nulls isn't well defined in the spec so it's up to each implementation to do whatever it wants.
A lot of databases support trailing commas in select clauses.
Which is just as well. I want to scratch my eyes out every time I see someone formatting with comma starting the lines. It's the kind of foolish consistency that is a big part of performative engineering.
> I want to scratch my eyes out every time I see someone formatting with comma starting the lines
Right!? I _physically_ recoil every time I see that. I think that's the clearest example of normalisation of deviance [1] I know.
Seems like anyone that enters the industry straight from data instead of moving from a more (software) engineering background gets used to this.
And the arguments in favour are always so weak!
- It's easier to comment out lines
- Easier to not miss a comma
Those are picked up in seconds by the compiler. And are a tiny help in writing code vs violating a core writing convention from basically every other language.
I'm a data person and despite seeing this for years, still despise that approach to commas. Seriously, it's not that hard to comment out the damn comma.
Just append a random constant at the end of every SELECT column list instead, 42 to the rescue! (I kid, I kid.)
I can't tell yet whether my experiment starting an all-OR WHERE clause with 0=1 so each OR could start the next line would go over like a lead balloon here too or not.
One thing I've actually found useful especially in SQL is always including a single-line comment in front of the closing of every multi-line comment, so that using a single-line comment to toggle off the opening of the multi-line comment toggles the entire block back into action and the end is already set (the closing comment is itself already commented out). This obviously doesn't work if multi-line comments start nesting, but that can confuse parsers enough already.
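Roughly like this, sketched on a WHERE-clause fragment with a made-up predicate:
    -- Block disabled: everything between /* and */ is a comment;
    -- the "--" in front of "*/" is inert because it's inside the block.
    /*
    AND e.dept_no = 5
    --*/

    -- Block enabled: the opener is now a line comment, the predicate is live,
    -- and the already-commented "--*/" closer causes no syntax error.
    --/*
    AND e.dept_no = 5
    --*/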
Trying to get an entire team on the same page with SQL formatting is one of the mothers of all bikesheds. In any case it's useful to be aware of the idioms whether or not they are personally palatable.
SQL has so little room for expression or opinionated formatting, so it's funny to see people bikeshed over comma placement. I'm kind of jealous that they have time to think about whether left comma or right comma offends them greatly.
To be fair, BigQuery SQL is improving at quite a pace. If you follow their RSS, they are often announcing small but solid affordances like trailing commas, the new RANGE datatype, BigLake, some limited grouping and equality for arrays and structs, etc.
It is also probable that they will expose Google's new query pipe syntax. Currently there are some hints from the error messages in the console that it's behind a feature flag or something.
The only ones that matter don't. If you care that much about seeing a comma somewhere you shouldn't, you have no real problems and shouldn't be complaining.
When you comment out the first column, which doesn't have the ",", it will break and you still have to remove the "," from the second line, so your point doesn't hold.
I don't understand the point at all. If you need to add some condition later on, why not just add it then? What benefit is there to just marking out the spot where you might add the condition at some point in the future?
For the one here, it's just the ability to rapidly comment/uncomment conditions in a query editor while exploring the data or debugging the query, without having to worry about the leading AND or OR.
I've also seen it in code with iterative adds, for example:
for crit in criteria:
    sql += " AND " + crit
No need to add a sentinel or other logic to skip the first AND. I saw it a lot before people got used to " AND ".join(criteria).
I personally don't use it either, but I think its origins are not just readability but also developing queries in a REPL-like environment.
As you develop, you're constantly creating and debugging queries, often adding new AND or OR clauses as whole lines, and it becomes much faster to add and remove those lines when they're a single shortcut away in nearly all text editors.
Yeah, so often do I have an EXPLAIN ANALYZE query.txt file I'm repeatedly editing in one window and piping into psql in another to try and make something faster. So I put WHERE true at the top.
If you use it in the same way you use trailing commas, fair. But the site says it's to make it easier to add dynamic conditions, which is a terrible idea in many (maybe not all) SQL engines.
Let's say it's 2001 and you are writing some hot e-commerce stuff in plain PHP. You want to filter data depending on multiple fields in the submitted form. If some field is there, you add one more "AND" clause to the "WHERE", like this: if (isset($_POST['product'])) { $query .= " AND product = " . $_POST['product']; }. So in order not to check every time whether the added clause is the first one, you start with "WHERE 1=1 ", as "WHERE AND ..." would not work.
I get how this isn't good. But how else would you handle multi-field filtering? Keep all the ANDs and use (product_id = $1 OR $1 IS NULL) so the unset filters are a no-op? That's OK as long as the query planner is smart enough.
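Something like this Postgres-style sketch with made-up names, where NULL parameters turn individual filters into no-ops (with the caveat, as above, that the planner has to handle the catch-all predicates sensibly):
    SELECT *
    FROM products
    WHERE (product_id = $1 OR $1 IS NULL)
      AND (category   = $2 OR $2 IS NULL);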
In this case, a query that you build by concatenating different strings: 1=1 is for adding AND statements to the WHERE clause dynamically, in your code. I've never seen it used for anything else. Ad hoc is just the practice of running raw SQL queries.
So you end up with things like this.
"SELECT * FROM Music WHERE 1=1" +
"AND category='rock'"
The risk now is that you allow SQL injection by mistake, but also that every genre will generate a different query plan. Depending on what SQL engine you use, this may hurt performance.
And one would think that this is a thing of the past. But it is not.
I think that it means the reason for doing where 1 = 1 is sometimes to allow for easy insertion of dynamic queries which can be a security and performance issue. The actual usage of where 1 = 1 doesn't cause the security or performance issue.
Which is exactly what the site says. To insert dynamic conditions. I know that you can use 1=1 for the same reasons as trailing commas. But kinda obvious that this is not the case here.
I'll add this as a caveat. I'm an analyst so my SQL isn't really exposed to anyone other than myself and so I wasn't aware of this, thanks for flagging.
A random person claims adding 1=1 is a security risk and you are going to add it as caveat without verifying if the claim is true nor knowing why? That's how misinformation spreads around.
OP doesn't know what they are talking about because adding 1=1 is not a security risk. 1=1 is related to sql injections where a malicious attacker injects 'OR 1=1' into the end of the where clause to disable the where clause completely. OP probably saw '1=1' and threw that into the comment.
Duration of working with SQL doesn't matter. The better SQL programmers don't do it specifically, and have experience in real languages that they bring over to database queries.
Not sure I get this. But I think it does matter, since then you understand why people do it to begin with. I worked on two enterprise solutions over the last couple of years that have this exact problem: people using WHERE 1=1 and then adding random "AND something=something" clauses that completely trash the performance of the DB. Also, it does not matter as much on-prem. But in cloud envs it does, since you can't really spike CPU and mem the same way as on-prem.
The reason I pointed out this specific issue is just that I thought it was the worst of many poor tips. ChatGPT can give better tips.
Jesus. The problem is not WHERE 1=1. It is WHY people do it. People do it because they then in Python, JS or whatever can easily add conditions. Like QUERY + "AND x='blabla'". There is the problem. Every time you create a unique SQL statement the query will need a new query plan in most SQL engines. Some will cache parts of the query. And you could use parameters along with this paradigm but if you are this bad at SQL I doubt it.
It is kinda funny that OP backpedals on this, because to me the whole site is amateurish. I just pointed out what I thought was the worst part. It is likely that it is generated by AI. Either way, the post is terrible.
I'm an analyst so I literally only query for analytical purposes. I don't have any SQL code embedded in an application so the 1=1 is purely for tweaking/testing a query more quickly.
I certainly didn't use AI, all these tips/tricks are from work experience and reading Snowflake documentation and the like, but I guess I can't convince you either way. Regardless I appreciate the feedback!