
I don't know why people are stuck on this model of a long chain/log of successive schema changes when you can simply make a diff.

I lay out this approach here: https://djrobstep.com/talks/your-migrations-are-bad-and-you-...



Hibernate has this model built in as well, I believe. https://stackoverflow.com/a/221422

  hbm2ddl.auto=update

But it isn’t recommended for production, because the automatic diff-based approach can lead to unpleasant surprises. With individual migration files, you always know exactly what they are going to do, and you can think ahead about how you would roll back if that became necessary.
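
For illustration, a hand-written migration and its companion rollback might look like this (file, table and column names are made up):

  -- 0042_add_users_email.up.sql
  ALTER TABLE users ADD COLUMN email text;

  -- 0042_add_users_email.down.sql
  ALTER TABLE users DROP COLUMN email;

There's no guessing about what will run against production, and the rollback is written (and reviewable) before you ever need it.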

It’s fair to say that you don’t necessarily need to keep 100 individual files around once they’ve been deployed to every environment for a while and there’s no chance you’d roll back. But I think people typically stick with this approach because it is simpler and easier than exporting the database structure at migration #100 or wherever and setting a new baseline every few months.


I'm certainly not suggesting applying generated diff scripts without reviewing them!

In fact, probably the main benefit of the diff approach is that you can rigorously test the outcome, and explicitly confirm a matching schema will result.


How does this possibly work then, if you have to review the scripts first?

You're just storing a migration script in the repo still, but it's generated and works against only a single schema "revision"... ?


I don't understand the objection here. You want to apply the same migration script against different schemas?


The same schema, from potentially different points in time.

So it's not really applying to different schemas: each migration script applies against the same state that the previous one produced - but collectively they can be applied against a database from any point in time while that migration system has been used.


Again, what exactly are you objecting to? You're just describing how traditional migrations work. There's nothing stopping you saving a sequence of scripts that you use a diff tool to generate, it's just less necessary because you can work against the actual version instead.


... your comments are both confusing and contradictory.

You've simultaneously said that storing scripts is "bad" and that a live "diff" is better, but then also said that those generated "diff" scripts can be reviewed, which means you have to store them.

I'm completely aware of using tools to generate scripts by diffing a model schema against a DB, and using a different tool to apply the finalised, reviewed scripts without invoking the "diff" tooling outside of development environments.

What I cannot grasp is your convoluted claims that storing scripts is bad, and that diffs can be reviewed before they're applied, without storing them, and will work against some cowboy-esque DB modifications.


? Nobody is saying that storing scripts is bad. This is getting silly. I'm skeptical that you even read the original link before attacking it.


I tried to read your "approach" and I got as far as "have to track version numbers" and gave up.

IMO you're inventing problems to justify your own preferences.

For context: I use, and am a proponent of the 'series of patches' approach for SQL migrations. I wrote a tool to apply them, because I wasn't happy having to constantly patch the existing one I'd been using.

I don't really understand what you mean about "managing versions" - you just need a naming scheme that ensures the items can be applied in order. Dates (as a prefix, with a descriptive suffix) usually works well.
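
For example, a migrations directory might look like this (hypothetical file names):

  2019-11-02_add_users_email.sql
  2019-11-20_create_invoices_table.sql
  2019-12-10_add_invoices_paid_index.sql

Lexicographic order is chronological order, and nobody has to coordinate a counter.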

I also don't see how you think having the patches over time in the repo is a problem. If you don't like seeing the full history of changes - just remove the old ones after a while.

The "series of patches" approach also has a benefit that I don't see how your "diff" approach would solve (without some overly complex tooling): upgrading a previous version of the database to the current schema.

With our current system, I could take a database dump from 2 years and dozens of migrations ago, run the migrations on it, and it would come up to the current schema.


If you gain some satisfaction from having to track version numbers, then good for you. I and most other people find it tedious.

They're also impractical to use in environments where your one versioning system isn't the only use of the data. If you or some DBA needs to make an emergency database change - suddenly the real DB doesn't match your versioning.

But if you compare directly against production, it's not a problem at all.

Not directly checking against the production schema is rather like flying a plane with no reference to what is actually outside the windows - rather an unreliable experience.

Nothing is stopping you from keeping a long chain of these scripts if you want - you'll lose nothing and gain more automation and testability.

But in practice nobody wants to keep these files around, and nobody wants to restore 2 year old backups to production.

And if you did, a diff-based approach will do just as well, more automatically.

Where the diff-based approach probably shines the most is when making experimental changes locally during development. Play around with your models, add a column, add a constraint, rename the column, make it not null, change your mind and drop it again. One command syncs the dev database to your models automatically and instantly.
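
To give a feel for it, here's the sort of DDL such a sync might emit across those edits (a sketch, with made-up table and column names):

  -- sync after adding the column to your models
  ALTER TABLE orders ADD COLUMN shipped_at timestamptz;
  -- sync after marking it not null in the models
  ALTER TABLE orders ALTER COLUMN shipped_at SET NOT NULL;
  -- sync after changing your mind and removing it again
  ALTER TABLE orders DROP COLUMN shipped_at;

You never write or name these statements yourself; the tool derives them from the difference each time you sync.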


> If you gain some satisfaction from having to track version numbers, then good for you. I and most other people find it tedious.

I literally said there are no version numbers to track. If you find identifying the current date either manually or in some automated tool tedious, I don't know how to help you.

> If you or some DBA needs to make an emergency database change - suddenly the real DB doesn't match your versioning.

Well by that logic, if I suddenly need to make some "emergency" fix to the code, it won't match the version control system.

The solution there is to have a method in place to rapidly deploy a change, not to make your migration tool also work around cowboy solutions.

Your previous comments also imply that to be usable, the "diff" system needs to be reviewed (i.e. to handle table renames, and to be considered safe for production).

So how does that handle the cowboy approach where the schema doesn't match? Either your "diffs" are generated at the time of execution, so the previous state doesn't matter but you can't review them, OR your diffs are generated ahead of time, so they can be reviewed, but will not necessarily work against the live database.

So which is it?

I'm not against a "make changes against the DB directly" workflow for development. I've written code that does exactly that, as you describe, from models. But it's not practical for production use.

It's usable as a development tool, and to produce static migration scripts.


"They're version dates not version numbers" is hairsplitting. You still have external information that you must rely on to determine what state your production db is in, and that's bad.

If an on-call SRE calls me in the middle of the night asking me if he can add an index to solve a performance problem, I'd rather say "yes, no problem", not present a series of additional steps for them to jump through.

You review the script, and when the time comes to apply, recheck the database to make sure it's still the same state you generated the script against. Generally people tell you if they've made changes on live dbs that you're working on, but it's nice to double-check regardless.


> You still have external information that you must rely on to determine what state your production db is in, and that's bad

What external information? Whether each migration has been applied or not is stored in the database itself. The dates are literally used just to ensure correct ordering - that's literally their only purpose.
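
That bookkeeping is typically just a small table the migration runner maintains; a minimal sketch (table and column names are made up):

  CREATE TABLE schema_migrations (
      filename   text PRIMARY KEY,   -- e.g. '2019-11-02_add_users_email.sql'
      applied_at timestamptz NOT NULL DEFAULT now()
  );

The runner applies, in date order, any file in the repo that isn't listed here yet.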

> not present a series of additional steps for them to jump through.

If someone can't write the change they want to make into a file, write the opposite action into another file, commit and push that change to a VCS repo, I don't think they should be given access to a god damn toaster oven much less your production database.

> You review the script, and when the time comes to apply, recheck the database to make sure it's still the same state you generated the script against.

.. How can that possibly work with automated deployments? And how on earth do you "recheck the database to make sure it's still the same", with any degree of certainty?

Your entire approach smells like a very manual process that doesn't work for teams any larger than 1 person.


> Whether each migration has been applied or not is stored in the database itself.

You're dragging this further into pedantic territory here. A chain of scripts and a version table is external to the structure of the database itself.

> If someone can't write the change they want to make into a file, write the opposite action into another file, commit and push that change to a VCS repo...

The recurring theme here is that you have a preference for mandatory busywork instead of a direct approach. People putting out fires ought to be focused on what will directly solve the problem most quickly and safely. In larger environments with dedicated ops people supporting multiple applications/environments/databases, not every ops person is going to be familiar with your code and preferred workflow.

> How can that possibly work with automated deployments? And how on earth do you "recheck the database to make sure it's still the same", with any degree of certainty?

...with a diff tool.

> Your entire approach smells like a very manual process that doesn't work for teams any larger than 1 person.

The whole point is that it is automatic rather than manual. I've used it before in teams "larger than 1 person" and it has worked fine.


> Your entire approach smells like a very manual process that doesn't work for teams any larger than 1 person.

You may be misunderstanding the concept. Automated declarative schema management (AKA diff-based approach) has been successfully used company-wide by Facebook for nearly a decade, to manage schema changes for one of the largest relational database installations on the planet. It's also a widely used approach for MS SQL Server shops. It's not some newfangled untested crazy thing.

I have a post discussing the benefits here: https://www.skeema.io/blog/2019/01/18/declarative/


I understand the concept of a tool that changes the schema to match some declared state dynamically.

I wrote the same functionality into a library.

What I cannot comprehend is the poster who claims that such an approach can simultaneously:

- be automatically applied

- be reviewed and even edited after generation to handle e.g. renames

- handle previously unknown changes in the DB schema (aka handling cowboy behaviour from other ops).

All three are simply not possible at once.


Here's how you can achieve all 3.

- Develop intended schema (I)

- Inspect production schema (P), save as P0

- Generate migration (M) by comparing to production (P0): I - P0 = M

- Edit M as necessary, test for correctness, commit to master (meets your second criterion)

- Deploy code, with the migration running as a deploy step (meets your first criterion)

- Migration works as follows:

- Inspect P again, save as P1. If P0 != P1, abort the process (this prevents any issues from out-of-band changes as per your third criterion, and means the pending script won't run more than once; a sketch of this check follows the list)

- Apply M.

- Inspect P once more, save as P2. Double-check that P2 == I, as expected.
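
A minimal sketch of the P0 != P1 check, assuming Postgres and a fingerprint over information_schema (names and approach are illustrative; real tooling would typically do a structural comparison instead):

  -- recompute the live schema's fingerprint; if it differs from the one recorded
  -- when M was generated, abort the deploy instead of applying M
  SELECT md5(string_agg(table_name || '.' || column_name || ':' || data_type,
                        ',' ORDER BY table_name, ordinal_position))
    FROM information_schema.columns
   WHERE table_schema = 'public';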


I disagree. This is definitely all possible at once with proper tooling.

Ideally this workflow is wrapped in a CI/CD pipeline. To request a schema change, you create a pull request which modifies the CREATE statement in a .sql file in the schema repo. CI then automatically generates and shows you what DDL it wants to translate this change to.

If that DDL matches your intention, merge the PR and the CD system will apply the change automatically. If that DDL doesn't match your intention, you can either modify the PR by adding more commits, or take some manual out-of-band action, or modify the generated DDL (if the system materializes it to a file prior to execution).
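
Concretely (a sketch with made-up names): the repo stores the desired CREATE, the PR edits it, and CI derives the ALTER it will run.

  -- schema/users.sql after the PR adds a column
  CREATE TABLE users (
      id    bigint PRIMARY KEY,
      name  text NOT NULL,
      email text               -- the new line in the PR
  );

  -- DDL the pipeline generates and shows on the PR
  ALTER TABLE users ADD COLUMN email text;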

In any case, renames are basically the only general situation requiring such manual action. Personally I feel they're a FUD example, for reasons I've mentioned here: https://news.ycombinator.com/item?id=21758143


Because diffs don't work. If you rename a table, for instance, a diff algorithm will see that as dropping the table and creating a new one.


People love to cite this reason, but practically speaking, it's FUD. Renaming in production -- whether entire tables or just a column -- is operationally complex no matter what.

Assuming any non-trivial software deployment (multiple servers), it's impossible to execute the rename DDL at the exact same time as new application code goes live. So either you end up with user-facing errors in the interim, or you can try to write application logic that can interact with both old and new names simultaneously. That's overly complex, typically not supported by ORMs or DAOs, and very risky in terms of bug potential anyway.

I'm a database expert, and among 100% of the companies I've worked at or consulted for, renames were either banned entirely or treated as a very rare special-occasion event requiring extra manual work. Either way, lack of rename support in diff-based schema management isn't really a problem, as long as the tooling has these two properties:

1. Catches unsafe/destructive changes in general and requires special confirmation for them (preventing accidental rename-as-drop in the rare case where a rename truly is desired)

2. Offers a way to "pull" from the db side, so that if a rename is actually needed, it can be done "out of band" / manually, and the schema repo can still be updated anyway


> People love to cite this reason, but practically speaking, it's FUD.

> Either way, lack of rename support in diff-based schema management

You're trying to polish a turd here, this isn't "lack of support" it's "it will drop objects in production."

> Catches unsafe/destructive changes in general and require special confirmation for them

Again, polishing a turd: the only way your "automation" works is through manual intervention.

> Assuming any non-trivial software deployment (multiple servers), it's impossible to execute the rename DDL at the exact same time as new application code goes live.

You can construct a view referencing the old table, and rename the table. Yes, it has to be an updateable view and you need transactional DDL, but within those constraints, it's doable.
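
A minimal sketch, assuming Postgres (transactional DDL, and a simple single-table view is automatically updatable):

  BEGIN;
  ALTER TABLE users RENAME TO app_users;
  -- old code keeps reading and writing "users" through the view
  CREATE VIEW users AS SELECT * FROM app_users;
  COMMIT;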

> I'm a database expert, and among 100% of the companies I've worked at or consulted for, renames were either banned entirely or treated as a very rare special-occasion event requiring extra manual work.

If they're using a DBMS that doesn't support transactional DDL, completely understandable. If their tools are liable to drop production data due to renames, also completely understandable.

But the fact that they ban a trivial operation is a symptom of the problem with all the half solutions and snake oil in SQL schema management. It's so bad that you have large companies investing heavily in ripping out the schema entirely, which itself is just more snake oil.

In the problem space of trying to keep a schema in sync, we know that diffing leads to unacceptable answers, that is an indicator that it's the wrong conceptual basis for a correct solution.


> You're trying to polish a turd here, this isn't "lack of support" it's "it will drop objects in production."

That's a strawman argument. Any reasonable schema management implementation has safeguards against DROPs. If your tooling blindly executes a DROP without extra confirmation, use better tooling.

> Again, polishing a turd: the only way your "automation" works is through manual intervention.

There's absolutely nothing wrong with requiring extra human confirmation for destructive actions. Quite the contrary. I've spent most of the last decade working on database automation and operations at social network scale, and will happily say this is a common practice, and it's a good one at that.

> You can construct a view referencing the old table, and rename the table. Yes, it has to be an updateable view and you need transactional DDL, but within those constraints, it's doable.

So you're assuming that every single table has a view in front; or you're dynamically replacing the table with a view and hoping that has no detrimental impact to other database objects or app performance. Either way, you're talking about something operationally complex enough that it isn't fair to say that production table renames or column renames are a "trivial operation" at the vast majority of companies.

> It's so bad that you have large companies investing heavily in ripping out the schema entirely, which itself is just more snake oil.

This is frequently overstated. For example, although Facebook uses a flexible solution for its largest sharded tables, there are many tens of thousands of other tables at Facebook using traditional schemas.

> In the problem space of trying to keep a schema in sync, we know that diffing leads to unacceptable answers, that is an indicator that it's the wrong conceptual basis for a correct solution.

The only "unacceptable answer" you've cited is the rename scenario, which, even if it incorrectly leads to a DROP, the tooling will catch.

If you need crazy view-swapping magic to support an operation (renames), that is an indicator that it's a conceptually problematic operation that should be strongly reconsidered in production.

As I've already stated elsewhere in this thread, declarative schema management has been successfully used company-wide by Facebook for nearly a decade, and is also a common practice in the MS SQL Server world. If you're unconvinced, that's fine, but many companies have found it to be a great workflow!


Which is straightforwardly solved by editing the generated script accordingly.


Huh, this got me thinking.

How about you just give a UUID to everything in the schema?

The technical challenge with generated scripts (that you could edit by hand, but that just means that you now don't have an automated system) is that they don't understand changes at a deep enough level - they lack the context to see that a table or column has been renamed because they have no understanding of identity.

So - just give them identity.

  Schema at commit 20b1ea23

  03496418-e44c-42a6-a6a4-6563b7ae7bfb users
    25233812-9a95-4bc3-893e-6accb935fa49 name
    2f4c79c3-81b6-42d4-8379-ce5f0ed8ef62 address
    83fc34c8-56c7-49d4-94d0-150cd76204bc password

  Schema at commit c0d07562

  03496418-e44c-42a6-a6a4-6563b7ae7bfb users
    89482484-8205-40ad-a73b-a1bb988dc1d9 firstname
    25233812-9a95-4bc3-893e-6accb935fa49 lastname
    2f4c79c3-81b6-42d4-8379-ce5f0ed8ef62 address
    c9f0d35d-439e-488c-a6d5-7a144c54335c address2
    83fc34c8-56c7-49d4-94d0-150cd76204bc password
Now you could diff these two:

  @@ -1,4 +1,6 @@
   03496418-e44c-42a6-a6a4-6563b7ae7bfb users
  -    25233812-9a95-4bc3-893e-6accb935fa49 name
  +    89482484-8205-40ad-a73b-a1bb988dc1d9 firstname
  +    25233812-9a95-4bc3-893e-6accb935fa49 lastname
       2f4c79c3-81b6-42d4-8379-ce5f0ed8ef62 address
  +    c9f0d35d-439e-488c-a6d5-7a144c54335c address2
       83fc34c8-56c7-49d4-94d0-150cd76204bc password
and every line of the diff has the necessary information to decide what operation you intended.
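
If you wanted those identifiers to live in the database itself, one option (a sketch, assuming Postgres and the UUIDs above) is to attach them as comments that a diff tool could read back:

  COMMENT ON TABLE users IS 'id:03496418-e44c-42a6-a6a4-6563b7ae7bfb';
  COMMENT ON COLUMN users.firstname IS 'id:89482484-8205-40ad-a73b-a1bb988dc1d9';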



