We built a large ETL and machine learning operation in PG through triggers, from simple algorithms calculating rate of change for updated datasets, to identifying trends and connecting seemingly unrelated data.
Even at our scale [0][1], the performance tradeoff is absolutely worth it. The best part is the consistency of the data: if a trigger dies, the whole transaction is rolled back and you have to re-run it, so we never end up with an inconsistent state that has to be fixed after the fact.
There are some downsides. Triggers are very hard to debug, optimize and monitor. Versioning is kind of a nightmare, and there is no direct performance information you can capture. In our case it's even more involved, as we're running on Citus and have to deal with data distribution, colocation and other issues.
But it's been two years now and we're only adding to it.
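To make the rollback behaviour concrete, here's a minimal sketch of the pattern (table and function names are made up, not our actual schema):

    CREATE TABLE measurements (
        id    bigserial PRIMARY KEY,
        value numeric NOT NULL,
        rate  numeric
    );

    CREATE FUNCTION compute_rate() RETURNS trigger AS $$
    BEGIN
        IF NEW.value < 0 THEN
            -- Raising here aborts the INSERT and rolls back the whole
            -- surrounding transaction; no partial state survives.
            RAISE EXCEPTION 'bad measurement: %', NEW.value;
        END IF;
        -- Illustrative rate-of-change calculation against the latest row.
        NEW.rate := NEW.value - coalesce(
            (SELECT value FROM measurements ORDER BY id DESC LIMIT 1),
            NEW.value);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER measurements_rate
        BEFORE INSERT ON measurements
        FOR EACH ROW EXECUTE PROCEDURE compute_rate();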
> There are some downsides. Triggers are very hard to debug, optimize and monitor.
That's enough for me. My heart always sinks when, in response to a problem, someone suggests "let's just stick a few triggers in".
It's happened too many times in my career: some weird problem that no one can work out turns out to be caused by a trigger that no one realised was there.
I do understand. Triggers are ripe for improvement. We have some internal tools built around them, but nothing significant. Most helpful are protocols for humans to follow when dealing with a problem: always start with the trigger.
At the same time, triggers saved us hundreds of thousands of lines of code and extremely complicated logic that would be required if we wanted to replace them.
The biggest concern engineering teams seem to have with stored procs is maintainability. The consistency guarantees are indeed very attractive, but perhaps not at the expense of keeping business logic in separate places, weak version control/deploy solutions, and adding a new programming language to the stack.
If there were a compiler that could take business logic in the project's programming language and manage mappings to stored procs (reminiscent of LINQ's mapping to queries), maybe we could have the best of both worlds.
The best implementation I've seen of this was actually in a Rails library, where rather than keep the functions and triggers in migrations, it kept them in individual files so they'd be version controlled just like the rest of the code...and then reloaded them with migrations if they had changed.
I don't know if the library is maintained anymore, but I need to find it.
EDIT: Found it and it's not maintained. Solid approach to the problem though.
That was what we did in Java with Flyway and its stateless migrations. You could easily see the history of stored procedures, and every time the app started up, it made sure they were all updated and correct.
Worked great, WAY better than having them buried in other SQL files.
Funny. I was just having a conversation with my partner about this very issue and this was the approach I was thinking of. I’ve found keeping database functions, triggers, etc. painful as part of standard migrations—finding the most recent version is particularly annoying. They’re the kind of thing it’s best to treat like the rest of a project’s code—version controlled outside of migrations.
How would you create triggers if you didn't create an SQL file with the trigger code, and running that at some point? And since it's in a file, wouldn't you check that into version control?
Or are you actually typing in triggers manually in the Postgres command line?
Rails migrations are checked in to version control, and then rails keeps track of which "migration" has been run. For tables this works fine, because you want to define them only once and if you need to change columns/add columns etc. you can create a new migration, which contains only the updates you want.
For triggers, however, you need to replace the trigger wholesale, so you end up with several versions of the trigger on your file system and need to check each migration file to see which contains the latest definition. (For tables, Rails normally creates a separate file with the current state of the database after running a migration, so you have relatively easy access to the latest table definitions.)
You can easily add functions, triggers, etc. as SQL in migrations. It’s just not a good idea when you realize you might want to change them. Putting them in migrations seems nice at first because it omits an extra step when upgrading your db. But then you have the SQL multiple times in your migrations, which makes finding the correct version annoying.
Not the OP, but I did use triggers as suggested in the past. IME the issue is that DDL like triggers is transactional, so when you deploy them, in-flight queries could be using a mix of old and new definitions.
Granted, this is an issue for any system composed of distributed components, though I think the pain is more acute when it's the source of truth.
One way around it is downtime, or going all in on stored procedures for all work, or at least all writes.
The other wrinkle is that the trigger has to be rolled back when reverting code, and some shops are not diligent about maintaining down migrations. There is also the possibility of rolling back migrations out of the order they were run, so you might not get back to a consistent state.
To keep things clean one could keep the trigger code in a file separate from any migrations and then create a migration that references the trigger source so that one doesn't have to go digging through the migration files later to read the trigger or update it.
I think some folks above may literally have meant they put the trigger code in the migration file itself. Or I'm way off base.
Yes, you can technically just drop and replace all your stateless things (views, triggers, stored procedures, user-defined functions).
But, if one of your migration steps happens to depend on a particular version of one of those things (it's not likely to be a trigger, but who knows?), then it can break.
I'd say, assuming you're doing your migration offline, do it both ways. At a previous workplace I used a paid-for tool called SQL Delta [https://www.sqldelta.com/]. It couldn't handle our complicated data migrations (at some point you need a human in the loop), but it was really useful for checking whether anything was off, and for synchronizing all of the views and so on at the end.
> but perhaps not at the expense of keeping business logic in separate places
Perhaps the answer is to consolidate business logic in the database with constraints. This also protects data from manipulation outside of an app, like with scripts or ETL.
Putting more logic in the db often results in terse error messages though, so the app needs to deal with that and make them nicer for the end user. I wrote about one possible approach here - https://begriffs.com/posts/2017-10-21-sql-domain-integrity.h...
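For example, something like this (hypothetical names); naming the constraints gives the app something stable to map to nicer messages:

    CREATE DOMAIN email AS text
        CONSTRAINT email_format CHECK (VALUE ~ '^[^@\s]+@[^@\s]+$');

    CREATE TABLE customers (
        id      bigserial PRIMARY KEY,
        contact email NOT NULL,
        signup  date NOT NULL,
        churned date,
        -- Named so the app can translate violations for end users.
        CONSTRAINT churn_after_signup
            CHECK (churned IS NULL OR churned >= signup)
    );

Scripts and ETL jobs that write to the table directly hit the same rules the app does.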
> weak version control/deploy solutions
There are solutions that allow you to store migrations in version control, apply them to different database targets, and run consistency checks. For instance http://sqitch.org/, though maybe you're familiar with it already and regard it as one of the weak tools. Thought I'd point it out anyway.
> a new programming language to the stack
plpgsql is somewhat gnarly, but it is well adapted for the database. You know what it's going to do, compared with managing mappings from some other language. I never tried LINQ though so maybe the mapping would be more pleasant than I'm imagining.
I agree it doesn't look like Python or whatever, but it does look a lot like SQL. So if you're familiar with SQL, PL/pgSQL is easy to learn. To me it's just SQL with if-statements and loops.
Triggers in interpreted languages like PL/Perl or PL/Python are slower (by orders of magnitude) than ones written in C or C++, or even PL/pgSQL. Calling into the interpreter for millions of trigger events can quickly become a major bottleneck.
When I implemented https://pgxn.org/dist/debversion/ for version numbering, I first wrote it in Perl, then Python. The implementations were clean, but the performance of both was abysmal. After reimplementing it in C++ with a C interface, it runs like greased lightning. While this is a custom datatype with operators implemented as C functions, the same concerns apply to triggers, which are invoked on every affected row.
We have the DB schema in a custom XML file, and have written a tool to compare the actual schema with that of the XML file. The tool then applies the necessary changes if they're non-destructive (extending a column, adding an index, etc.), or fails and lets the user know what needs to be done if run interactively.
By making sure we do as few potentially "destructive" changes to the schema as possible, this tool automatically upgrades our customers' databases with high reliability when we release a new version.
Adding or altering triggers, stored procs and views is considered "non-destructive" in this context, but of course that requires some discipline from us. One aspect of that is that we try to keep the number of triggers to an absolute minimum. Removing triggers and the like is considered "destructive"; however, we have a way in the XML to explicitly delete an object if needed.
The XML file lives in our version control repository alongside the source code, and we have triggers which update test databases automatically when a change is committed, etc.
It's a fairly simple idea but has worked quite well for us.
A lot of languages have migrations available that help manage the state of the database. In addition, Postgres allows you to ALTER or REPLACE functions and triggers, just like you'd ALTER a table.
These in combination make it very easy to manage and deploy.
Liquibase is one example of a tool that integrates with your project and can deploy or roll back version-controlled scripts. Bit similar to Rails migrations too, from memory.
> weak version control/deploy solutions, and a new programming language to the stack
Are people not using the same version control for stored procedures as their "normal" programs? I thought deployment was pretty much a solved issue with DBAs.
In my experience - there are a lot of organizations where old-school DBAs resist any modern software engineering practices, such as proper version control.
I think some of it is stubbornness, but a lot of it is that DBAs in those organizations tend to be more "developers who happen to write DB code" and not really the A in DBA. Letting "their" code go into version control is the first step in breaking the illusion that they are somehow special, and it scares them.
Or it could be that DBAs who let the database go down get fired. I know a lot of DBAs that are old school that live in fear of any change because they get left holding the bag. They aren't special, they get blamed. It also doesn't help that a lot of new developers don't understand why "move fast, break things" is just not something any DBA of any worth is going to accept.
Based on the experience of co-workers (and my own limited dabbling), you are right that version control is (and should be) used for stored procedures just as it is with application code. The problem is that there isn't a lot of tooling that isn't language- or platform-specific to help maintain these things in the long run. So DBAs tend to write and maintain their own collection of homegrown scripts for deploying changes to the DB (or worse, do it by hand).
I'm sure some companies have this figured out, though, so YMMV.
- Scripts are numbered for quick order determination by humans (e.g., 0001-initial-schema.sql, 0002-create-foo-table.sql)
- A scripts.json file containing an array of string filenames declares the exact order of scripts, just in case numbers are shared for some reason (say, two branches being merged at the same time with a new script in them)
- A custom PowerShell module knows how to read scripts.json and the files and execute the changes
- When possible, all scripts are executed transactionally by the CI tool (some commands in some RDBMSes can't be executed in transactions, however)
- Optionally, backups/restores may be performed to handle failures
I suspect it’s the shared infrastructure problem. People who are used to running a microcosm of their app locally don’t usually make statements like that in my experience.
I have a dev box with 16 GB of RAM that sits unused all day because I work with people who think this way. Such a waste of resources.
For a large Node app, we wrote most of the business logic in JS and ran it with PLV8. All the procedures were called from a single PLV8 library in shared memory, and we created function stubs so you could still call them with straight SQL (mostly used for ETL scripts).
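Roughly, a stub looked like this (a simplified sketch with made-up names, not our actual library):

    CREATE EXTENSION IF NOT EXISTS plv8;

    CREATE OR REPLACE FUNCTION price_with_tax(amount numeric, region text)
    RETURNS numeric AS $$
        // In the real setup this would dispatch into the shared JS library;
        // the logic is inlined here to keep the sketch self-contained.
        var rates = { us: 1.08, eu: 1.21 };
        return amount * (rates[region] || 1.0);
    $$ LANGUAGE plv8 IMMUTABLE;

    -- Callable from straight SQL, e.g. from an ETL script:
    SELECT price_with_tax(100, 'eu');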
I've worked on a pretty large, very stored proc-heavy app based on SQL Server for several years, and separation of business logic has never been much of an issue.
Keeping logic in stored procedures is, in practice, not much different from putting it in a separate function. If your app is data-driven, like the one I work with, most of your business logic can be done in stored procedures. Most any sort of data transform is likely better done in SQL than another language, unless your data is a poor fit for a relational database.
We have had issues with version control in the past. I'm currently working on implementing a CI and Git workflow using Jenkins and Red Gate tools, which I expect will make keeping our database in source control trivial. To be fair, though, I'm still working out the last of the kinks, so it may end up being more difficult than I thought.
As far as having to learn another language, I've always found that to be an incredibly weak excuse to use JavaScript everywhere. Learning another programming language isn't difficult, and having the right tool for the job is indispensable. Given that JavaScript these days seems to go by "flavor of the week", I don't think that not being able to learn is a real problem.
I don't find the argument that it's difficult to switch between languages to be convincing, either. Switching between two programming languages that you know isn't any more difficult than switching between a programming language and a spoken one.
One approach that works very well is to keep stored functions in separate .sql files in a directory (I use "fixtures"), and execute them all on each deployment. This should happen after triggering migrations, so that table and column dependencies are guaranteed to be present. The .sql files use CREATE OR REPLACE FUNCTION so that their execution is idempotent.
This keeps the stored functions version-controlled along with the source code, and avoids any need to hunt through migration files to find the latest definition. Adding a stored function or modifying its function body just works.
The less common operations of deleting a function or modifying its argument list do require an explicit line in a migration file, but those situations are rare (and potentially backward-incompatible, requiring extra caution regardless).
One subtlety is that a migration that adds a new table with a trigger should define an empty stub function as the trigger function. This avoids duplicating code; the real function body will be loaded from the fixture immediately afterwards.
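Putting it together, a small sketch of the layout (hypothetical names):

    -- In a migration (runs once): table, empty stub, trigger binding.
    CREATE TABLE orders (
        id         bigserial PRIMARY KEY,
        total      numeric NOT NULL,
        updated_at timestamptz
    );

    CREATE FUNCTION orders_touch() RETURNS trigger AS $$
    BEGIN
        RETURN NEW;  -- stub; real body lives in fixtures/orders_touch.sql
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_touch
        BEFORE INSERT OR UPDATE ON orders
        FOR EACH ROW EXECUTE PROCEDURE orders_touch();

    -- In fixtures/orders_touch.sql (re-run idempotently on every deploy):
    CREATE OR REPLACE FUNCTION orders_touch() RETURNS trigger AS $$
    BEGIN
        NEW.updated_at := now();
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;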
My second biggest concern is performance overhead. Sure, they are individually fast. However the one thing that is hardest to scale is the database. Loading the database up front with overhead means that you'll hit that limit sooner rather than later.
The biggest problem I've had with triggers is that your application tier tells the database to do an update, it says it did it, and then the app shows the user the result.
But in the background, the trigger somehow overrode your changes, so the next time the user comes back they see different data than the app just told them was there.
Granted, this is arguably the wrong way to use triggers, but once it's become an invasive problem in your codebase it's incredibly hard to deal with. You can't remove the triggers for fear of what still relies on the side effects, and you don't want to query back the data for every update either. It ends up being like the polar opposite of functional programming - side effects everywhere.
In Postgres you can handle this using the RETURNING syntax for updates; then, as long as the trigger runs before rather than after the update, you don't have to lie about anything.
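Something like this (hypothetical table); the point is that RETURNING reflects whatever the BEFORE trigger wrote:

    CREATE TABLE accounts (
        id         int PRIMARY KEY,
        balance    numeric NOT NULL,
        updated_at timestamptz
    );

    CREATE FUNCTION accounts_touch() RETURNS trigger AS $$
    BEGIN
        NEW.updated_at := now();  -- server-side change the app didn't make
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER accounts_touch
        BEFORE UPDATE ON accounts
        FOR EACH ROW EXECUTE PROCEDURE accounts_touch();

    -- The app gets back exactly what was stored, trigger changes included:
    UPDATE accounts SET balance = balance - 100
     WHERE id = 42
    RETURNING balance, updated_at;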
I'm not really sure that would help in the case we had.
Most of our code ran through an ORM, and the ORM assumes the DB either did what it asked, or the query fails. I don't want to have to abandon my ORM because I can't trust my "DBA" to not sabotage my queries.
A good ORM allows you to specify that a column may be changed server-side by triggers etc. on an update (or insert), and it will then make sure to retrieve the correct value after the fact, if necessary.
This is just a limitation of ORMs. After a certain complexity you really should just be making direct queries to your database and not relying on ORMs.
I've often wanted to write a compiler that would accept a schema definition like protobufs or graphql as input and generate the required SQL to update the database if a schema change was committed. It would also be read by the application layer so its model definitions were kept up to date.
As you say, there's a tremendous advantage in using triggers to enforce data invariants, but then there's also the issue of the schema's state, and it can be prone to error depending on the complexity of the trigger. It's definitely recommended to use SQL's schema inspection to verify and test it, perhaps even against schema definitions from protobufs or whatever.
I understand the appeal and you're not the only person to want it: I've seen people and companies implement db schema to graphql, db schema to models, and models to db schema.
Based on personal experience, this is a bad idea though. You do not want a 1:1 correspondence between your database schema, your backend models, and your graphql schema, because the way you organise information in each layer should be different.
Database schema needs to be performant for expected queries. That means de/normalisation decisions; sometimes the same data will be stored in multiple locations.
Backend models need to express the domain, because this is where your business logic is. (There's a reason people bitch about ORMs: when you get to complex enough usecases they're not flexible enough in either direction and you need extra models wrapping THAT.)
Graphql schema is a view of your backend; sometimes several fields will be fulfilled using the same model, sometimes your model should not have a reflection in graphql schema (because you do not want to expose this data to frontend/the world), and sometimes your graphql schema will be full of deprecated fields because client apps have not been updated (see Facebook policy of never removing anything.)
Everything you say is so true. In a similar vein, all those ready-made REST libraries that help you shoehorn your business models into 1-to-1 mappings with your REST resources/endpoints have no reason to exist. And yet you still see people battling with them.
REST is still HTTP + more exotic verbs + headers + more serializing/encoding options. You can build a small library specific to your project's needs in a matter of 3-5 days. And on top of that, you don't forfeit any possible future optimizations, which you most certainly will by choosing any ready-made library.
It's feasible to make such a tool emit the DDL needed to match a changed definition, but I don't think it's feasible for it to migrate the actual data.
To do that, you need knowledge of how and why the schema was altered, and schema definitions like protobufs or graphql don't contain that. For example, how would you distinguish between a renamed column (where you need to keep the data) and the deletion of one column plus the addition of a different one?
There's a reason why migrations are the standard level of abstraction, packaging changes to schema with scripts ensuring consistency of data throughout the whole version history.
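Concretely (hypothetical names): a diff tool only ever sees the before and after schemas, so it can't tell these two apart:

    -- Both sequences produce the same final schema, but only the first
    -- keeps the data; the intent lives in the migration, not in the diff.
    ALTER TABLE users RENAME COLUMN fullname TO display_name;

    -- versus:
    ALTER TABLE users DROP COLUMN fullname;
    ALTER TABLE users ADD COLUMN display_name text;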
This was actually one of the interesting directions Drizzle was exploring back in the day. It was a fork of MySQL that focused on removing old cruft, including lots of older platform support, modernizing the source, and adding modularity in a lot more places, including the use of other languages for SQL programming: Perl, Python, Ruby, etc.
Microsoft SQL Server allows one to write server-side code (triggers, sprocs, functions) in any language that compiles to .Net IL. There are some serious restrictions, though, as most of the .Net Framework becomes unavailable[1]. More recent versions also provide for server-side R, but our SQL Server is 2014, so I have not had the option to play with that (plus, I don't know R).
[1] I was looking into writing a stored procedure for use with the Service Broker in C#, but the fact that most of the .Net Framework is unavailable (not to speak of third-party libraries) quickly put an end to my investigation.
It's not the whole answer, but Postgres is rather remarkable for its support of using first-class programming languages in stored procs - you can even write them in Java if you want:
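For example, with the PL/Java extension installed, something along these lines maps an SQL function onto a static Java method (a sketch based on PL/Java's canonical example):

    CREATE FUNCTION getsysprop(varchar) RETURNS varchar
        AS 'java.lang.System.getProperty'
        LANGUAGE java;

    SELECT getsysprop('user.home');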
I once embedded our entire Perl business logic (closed source, previous job) inside Postgres, with its DB routines changed to the Postgres-bridge internals, creating a single interface where you could message it with an XML blob and get back an XML blob in response. It was actually pretty usable, but the speed benefit wasn't enough to justify such a brittle Rube Goldberg machine!
This sort of sentence is uttered by someone who is so far down a hole that all they can see is stars. It’s a big hole that a lot of people are in but it’s the consequence of a nasty trade off.
All of this is trivial if your team decides that a shared database for dev work is bad for repeatability and thus bad for scaling the team.
You should be able to spool up a local database with good sample data in it. To do that your schema, indexes and triggers would be under version control, and a data dump is somewhere people can get it.
Once you have this your CI system runs Postgres locally or in a container, runs the drop create scripts, and then runs your integration and end to end tests on the canned data.
In addition to getting a CI solution for next to free you get rid of the concurrent access Wild West and this particularly painful conversation:
Why did this break and why didn’t you notice it before you pushed? Oh I saw that problem the other day but I thought someone else was changing data (and not my code being broken).
If the data is on your machine and it gets broken, then it is only your machine that could have broken it. You can’t delude yourself into thinking it was someone else mucking around. The problem is either in your code or in your latest pull from master. You are responsible for determining the cause, not me, not the release manager, not QA. You.
Not all bugs are logical. Many times a CI workflow does not catch poorly written but logically correct queries, because the sample db has few records in it; emulating large data sets with data patterns like production's, so you get the same kind of query plans, is hard.
I would say this is the first objection in this thread that is on point: not that figuring out deployment isn't hard, but the blog posts on how to solve it for your platform are multitudinous.
Getting useful data loads to test queries is much more difficult but you have a few options (at least coming from a SQL Server approach):
* Query hints to emulate larger sets of data (so you can see what type of IO you would get, of course multiplication is your friend)
* Building a data masking process so that the data is somewhat representative (in volume, mocking it such that the histogram of values is the same in your database is WAY harder)
* Building an isolated load test environment with something like distributed replay https://docs.microsoft.com/en-us/sql/tools/distributed-repla... (record workload, replay, measure, make your change, replay, measure, yes - it is tedious but that's a performance regression test for you)
We aren’t currently using stored procedures, but we have a fairly substantial pgTAP suite for our schema definitions (where we have a number of complex constraints).
The best way to discuss any speed thing is with anecdotes, right? /sarcasm
I had a system that required the parsing of large json chunks. The system pulled the json from an API, pushed the data into a json-type column, then sorted the data into normal form.
I originally tried using straight Python to pull the data, but decided that I ought to keep the original data for record keeping, plus testing was a lot faster without constantly calling the API.
When all was said and done, the whole operation took about a minute to complete. I decided to try a trigger, which caused the entire process, from the call to printing "done", to take less than a second.
In this case, the trigger was significantly faster.
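The shape of the pipeline was roughly this (a simplified sketch; names and payload layout are illustrative, not the real system):

    CREATE TABLE raw_payloads (
        id      bigserial PRIMARY KEY,
        payload jsonb NOT NULL   -- original API response, kept for the record
    );

    CREATE TABLE readings (
        payload_id bigint REFERENCES raw_payloads(id),
        sensor     text,
        value      numeric
    );

    CREATE FUNCTION normalize_payload() RETURNS trigger AS $$
    BEGIN
        -- Fan the JSON array out into the normalized table.
        INSERT INTO readings (payload_id, sensor, value)
        SELECT NEW.id, r->>'sensor', (r->>'value')::numeric
        FROM jsonb_array_elements(NEW.payload->'readings') AS r;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER normalize_payload
        AFTER INSERT ON raw_payloads
        FOR EACH ROW EXECUTE PROCEDURE normalize_payload();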
The danger of this anecdote, and all stories with databases, is that all things have to be posted with "in this case." Any time you read that triggers, aggregation, CTEs, etc. are fast or slow, consider that this is almost always told in a vacuum. There are so many variables that the term "fast" is wholly useless.
I have no experience using the json column in Postgres (well, sort of, but it was just a json type, before the json column was a real thing).
For me, I try to do as much as possible on the database side, like sorting, which does require good schema and table design. This is part of why folks often criticize MongoDB: one of its draws was the convenience of “schemaless”.
When Mongo was first introduced, I think a lot of developers, including me, saw Mongo as an excuse to move away from relational databases. So we began dumping all kinds of shit into it. Doing fancy stuff on the Mongo side is not possible without a good design either.
What people probably did was just pull data from multiple collections and do the filtering and “joins” on the application (client) side. I would find myself writing a for loop to do a bunch of stuff. Yikes.
Of course there are other criticisms of MongoDB, but ultimately developers like myself did not (and probably still don't) have any decent clue how to use databases well. Learning to use databases right is something I really want to be good at.
Fantastic feature that makes it trivial to write several different types of backwards-compatible DDL changes.
For instance, you want to rename a column? Add the new column, rename the existing table, add a view in its place with both the old column name and the new column name, and an INSTEAD OF trigger to update the base table. Postgres lets you do this all in a transaction, so the entire operation is atomic.
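A sketch of that recipe (hypothetical names; an INSTEAD OF INSERT trigger would be needed for inserts too, omitted here):

    BEGIN;

    ALTER TABLE users RENAME TO users_base;
    ALTER TABLE users_base RENAME COLUMN fullname TO display_name;

    -- The view stands in for the old table, exposing both names.
    CREATE VIEW users AS
    SELECT id,
           display_name,
           display_name AS fullname
    FROM users_base;

    CREATE FUNCTION users_write() RETURNS trigger AS $$
    BEGIN
        UPDATE users_base
           SET display_name = CASE
                   WHEN NEW.display_name IS DISTINCT FROM OLD.display_name
                   THEN NEW.display_name   -- new clients write the new name
                   ELSE NEW.fullname       -- old clients write the old name
               END
         WHERE id = OLD.id;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER users_write
        INSTEAD OF UPDATE ON users
        FOR EACH ROW EXECUTE PROCEDURE users_write();

    COMMIT;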
Writable views also work perfectly with PostgREST: if your view is auto-updatable or has an INSTEAD OF trigger, then you can POST/PATCH/DELETE via the view to mutate.
I can imagine uttering some four-letter words while trying to figure out what was going on if I were to take over maintaining a system that used this (principle of least surprise).
Writing through a view is totally unremarkable and has been for a long time. It’s quite common in Oracle and SQL Server lands. You can update via a CTE too, very easy, very elegant.
Counterpoint: using writable views was one of the best decisions my company ever made and no one regretted it. It is much easier to comprehend than ordinary triggers (specifically, the automated version).
Probably the same four-letter words that were uttered when it was determined that something wild and crazy was needed in order to support the requirements.
Triggers in postgres are amazing - I've got a full automated phrase-extraction system built in postgres triggers, such that salient phrases are automatically detected in marked text fields.
I've been maintaining a large (and quite old) in-play gambling system for the past six months that uses Postgres triggers and pg_notify [1] as the core mechanism for broadcasting websocket frames to SPA clients.
Hilariously, it turned out the system was violating a recommendation in the Postgres official documentation for pg_notify:
> The "payload" string to be communicated along with the notification. This must be specified as a simple string literal. In the default configuration it must be shorter than 8000 bytes. (If binary data or large amounts of information need to be communicated, it's best to put it in a database table and send the key of the record.)
The original triggers were invoking pg_notify with row_to_json(NEW), and the JSON text was often far too large, causing the pg_notify calls to fail (which also caused the underlying insert/update that fired the trigger to fail, meaning we DROPPED new or updated records).
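The documented fix is to send only the key and let listeners fetch the row themselves, roughly (hypothetical names):

    CREATE TABLE bets (
        id     bigserial PRIMARY KEY,
        market text,
        odds   numeric
    );

    CREATE FUNCTION bets_notify() RETURNS trigger AS $$
    BEGIN
        -- Payload stays tiny no matter how wide the row is; listeners
        -- SELECT the row by id and build the websocket frame themselves.
        PERFORM pg_notify('bets_changed', NEW.id::text);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER bets_notify
        AFTER INSERT OR UPDATE ON bets
        FOR EACH ROW EXECUTE PROCEDURE bets_notify();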
1) I don’t know pgbench, but given the NULL checks in the trigger, I wonder how often they actually change the records in this benchmark. If you want to performance test triggers, wouldn’t it be better to unconditionally change those fields?
2) as the blog post indicates, these are extremely simple triggers that work locally. The IMO interesting cases are moving those last modified columns to a separate table, requiring an index lookup, or inserting rows into an audit log.
>2) as the blog post indicates, these are extremely simple triggers that work locally. The IMO interesting cases are moving those last modified columns to a separate table, requiring an index lookup, or inserting rows into an audit log.
Depends on the question you're asking. Benchmarking extremely simple triggers is the only way to find out whether the trigger invocations themselves cause significant overhead compared to what you would otherwise have to do from application code (such as inserting a row into a separate table or populating two additional columns in the same table).
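For comparison, the audit-log variant would look something like this (hypothetical names); each update then costs an extra insert plus whatever indexes the audit table carries:

    CREATE TABLE items (
        id    bigserial PRIMARY KEY,
        price numeric NOT NULL
    );

    CREATE TABLE items_audit (
        item_id    bigint,
        old_price  numeric,
        new_price  numeric,
        changed_at timestamptz DEFAULT now()
    );

    CREATE FUNCTION items_audit_fn() RETURNS trigger AS $$
    BEGIN
        INSERT INTO items_audit (item_id, old_price, new_price)
        VALUES (OLD.id, OLD.price, NEW.price);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER items_audit
        AFTER UPDATE ON items
        FOR EACH ROW EXECUTE PROCEDURE items_audit_fn();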
I think triggers (and stored procedures in general) are more attractive the more roundtrips they help avoid.
[0] https://www.citusdata.com/customers/pex
[1] https://cloud.google.com/customers/pex/