Database versioning best practices (enterprisecraftsmanship.com)
171 points by vkhorikov on Aug 10, 2015 | 70 comments


I agree. Rails database migrations get so many of these things right: every change is a script, the database is versioned (via schema_migrations). I really like that the database is the sum of a list of deltas. I've worked with people who argued it'd be better if you could just write a declarative schema description by hand (like schema.rb), and have Rails automatically adjust your database to meet that description. But if you already have data you need to migrate from the old to new structures (e.g. changing a 1-to-many to a many-to-many), that isn't going to work.

I'm thrilled he mentioned that once you deploy a migration script, you can't ever change it. I wrote the same thing here [1]. But I see people change deployed scripts all the time, and it guarantees that their schema gets out of sync with everyone else's: if I've already run that script, Rails doesn't know it needs to run it again.

One thing he left out is that you should write migrations so they work not just with the current version of your code, but with future versions too. I see this cause problems on projects again and again. For instance, a feature branch had a day-1 migration with `Permission.create!(name: "foo", value: "bar")`, and a week later the developer renamed the Permission class to Setting. By the time the branch was merged into master, the migration was broken. My solution is that a database migration should never depend on the application code, but only use direct SQL.
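For example, the same data migration can be expressed as plain SQL, which keeps working no matter how the model classes are later renamed (a minimal sketch, assuming a hypothetical `permissions` table with `name` and `value` columns):

    -- Data migration written as plain SQL: it does not care whether the
    -- application class is called Permission, Setting, or anything else.
    INSERT INTO permissions (name, value)
    VALUES ('foo', 'bar');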

[1] http://illuminatedcomputing.com/posts/2013/03/rules-for-rail...


I've arrived at a similar conclusion, with a few exceptions:

Never changing past migration scripts should be a guideline, not necessarily a hard rule. With Rails, I generally prefer to use ActiveRecord in migration scripts (rather than raw SQL) for the purpose of keeping things more readable (again, guideline rather than hard rule). As you mentioned, sometimes changes in application code can break past migrations, hence the need to occasionally alter a past migration (e.g., placing parts of the migration in a conditional). With practice, this actually encourages one to write relatively "future-proof" migrations, anticipating possible future code changes and creating each migration with as few assumptions as possible.

At times, as in your example, a migration may alter data (not just structure), and sometimes old migrations alter data in an obsolete way that isn't really helpful anymore. Occasionally (especially if the migration in question is time-consuming), I'll revise an older migration, but I always keep it around (so the version still exists) and add comments to note what is being removed and why.

One thing that really helps is regularly running all migrations in sequence. My test suite routinely runs them all, and I never use schema.rb to load a schema directly. Vagrant dev environments are always built by running all past migrations. Doing that regularly helps ensure that the current database schema is always the result of repeatable deltas.

One more thing: I avoid migration rollbacks whenever possible. I've seen them cause more chaos than relief in already-stressful situations. In practice, this means (1) making sure the codebase before and after a migration can handle the database state before and after the migration, (2) writing migrations that allow you to easily restore the database to a consistent state if they fail midway through, and (3) whenever possible, fixing a migration and redeploying rather than attempting to roll back changes. A rollback is, after all, a migration itself, prone to its own set of potential bugs and failures.


One solution to the problem of ActiveRecord code in migrations getting out of date, which I usually like more than either only using raw SQL (though often that's the right tool for the job) or editing migrations later, is to define a version of the classes you want to use locally inside the migration class. This is usually as simple as, e.g.:

    class DoSomethingWithPermission < ActiveRecord::Migration
      class Permission < ActiveRecord::Base; end

      # ...
    end
It protects you from all sorts of tomfoolery – changes to the class name, its relationships, its validations, etc. – and doesn't require any change in style.

I also love the idea of migrations being forward-only, and deployed on their own (ie. without accompanying code changes). If you can pull it off, those constraints give rise to some nice properties: schema changes must work with existing deployed code, code changes must work with existing deployed schema, deploys can be done with zero downtime, botched schema changes can be fixed without rolling back code changes, etc. It is harder to do than the traditional style, requiring schema changes to be made in phases, and often involving triggers and/or views to keep things consistent, so its cost/benefit isn't necessarily a clear win.
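As a sketch of what "phased" means here (assuming a hypothetical rename of `users.login` to `users.username` on PostgreSQL; all names are illustrative, not from the article), the change can be split so each step works with both the old and the new code:

    -- Phase 1 (deployed on its own): add the new column and backfill it.
    ALTER TABLE users ADD COLUMN username text;
    UPDATE users SET username = login;

    -- A trigger keeps writers that still use the old column consistent.
    CREATE FUNCTION sync_username() RETURNS trigger AS $$
    BEGIN
        NEW.username := COALESCE(NEW.username, NEW.login);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER users_sync_username
        BEFORE INSERT OR UPDATE ON users
        FOR EACH ROW EXECUTE PROCEDURE sync_username();

    -- Phase 2 (once all deployed code uses username): drop the old column.
    DROP TRIGGER users_sync_username ON users;
    DROP FUNCTION sync_username();
    ALTER TABLE users DROP COLUMN login;

Each phase is compatible with the code deployed on either side of it, which is what makes the zero-downtime, forward-only property possible.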


Thanks for this -- defining a local copy of the class is brilliant!


Indeed, you should ensure that the changes work in future versions. In GitLab we do that with the guideline "Please don't depend on GitLab-specific code since it can change in future versions. If needed, copy-paste GitLab code into the migration to make it forward compatible." It's from our migration style guide, which has some other maintainability tips if you're interested: https://gitlab.com/gitlab-org/gitlab-ce/blob/master/doc/deve...


Only using direct SQL works, but I've always relied on redefined model classes IN the migration, a la:

http://railsguides.net/change-data-in-migrations-like-a-boss...

Written about 20 Rails apps of varying levels of complexity, never been bitten by this approach. YMMV


I mostly agree with this, but with one and a half modifications and one addition.

Instead of storing a version as an integer, I strongly prefer naming each migration and storing the applied migrations in a database table.

Rather than relying on a single integer, I can simply write code that applies all the migrations which haven't yet been applied. This makes it much easier for the database to be modified in multiple concurrent branches without any merge pain.

At work we follow this practice and name our migrations something like "1-add-foo-table". If another dev makes a branch with a migration named "1-add-bar-table" then there's no conflict.

We also store our migrations as a directory rather than a single script. This lets us split things up into multiple files if needed. We also allow for both SQL and Perl (our main language) migration scripts in the directory, which is handy.

Finally, all migrations must be idempotent. In theory no migration should ever be run against the same database twice, but making our migrations idempotent is a little extra insurance.
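A minimal sketch of what an idempotent SQL migration can look like (PostgreSQL syntax, hypothetical table and data names):

    -- Safe to run twice: each statement checks whether its work is already done.
    CREATE TABLE IF NOT EXISTS foo (
        id   serial PRIMARY KEY,
        name text NOT NULL
    );

    INSERT INTO foo (name)
    SELECT 'default'
    WHERE NOT EXISTS (SELECT 1 FROM foo WHERE name = 'default');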

For the curious, I wrote some Perl modules that help manage a system of migrations like what I just described:

* https://metacpan.org/release/Database-Migrator

* https://metacpan.org/release/Database-Migrator-mysql

* https://metacpan.org/release/Database-Migrator-Pg

The system is designed to be extensible so you can build on top of it, rather than a fully encapsulated tool.

There's also Sqitch (http://sqitch.org/). It's written in Perl but is entirely language-agnostic to the best of my knowledge.


I've come to the same conclusions.

Our tooling still requires an integer as the first part of the script name - but that's only used to determine overall execution order. The numbers can overlap and there can be gaps which helps with the branching problem.

I also agree that idempotent scripts are a key best practice.

We also persist a hash of the script in the database so that our tools can detect when a script has been modified after it was applied. This has helped catch and prevent a whole class of bugs.
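A sketch of what such a tracking table might look like (hypothetical names; the checksum would be computed by the tooling before applying each script):

    CREATE TABLE applied_migrations (
        name       text PRIMARY KEY,                 -- e.g. '1-add-foo-table'
        checksum   text NOT NULL,                    -- hash of the script as applied
        applied_at timestamptz NOT NULL DEFAULT now()
    );

    -- On each run the tool compares the stored checksum with the current file
    -- and refuses to continue if an already-applied script has been edited.
    SELECT name, checksum FROM applied_migrations;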


As an extension here, you could store the entire script. Storage is cheap (at least for scripts of the size we're talking about here, most of the time).


I agree that versions should be named rather than numbered. Well, I guess that's an implementation detail. What really matters is having the equivalent of branches in a version control system. That way the main branch can use migrations a, b, c, d, and e. And the release branch for the legacy version that backports bugfixes but not new features can easily use migrations a, c, and e and skip b and d because those went with features that weren't backported.

In Python, sqlalchemy-migrate does this wrong and Alembic does this right. I worked on a big project that used sqlalchemy-migrate, and this caused us pain. It's difficult for a large project to change database versioning systems, so it's important to pick a good one from the start.


This is exactly the same conclusion I've come to over the years.

For compiled languages, I strongly recommend embedding the scripts in your jar/assembly/dll. That way you can sign it, and make deployment much much easier.


> Finally, all migrations must be idempotent.

That must be tough. Are you doing it by simply checking an "I've already done this" flag?


You'd have a database table whose contents record what's been done to the database schema and which data migrations have run.

And then you make inserting a row named '3-add_customers_table' part of the transaction for the migration. If that row already exists, the insert will fail, aborting the entire transaction.
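A minimal sketch of that pattern (assuming transactional DDL and a hypothetical tracking table with a unique name column):

    BEGIN;

    -- Fails with a unique violation if this migration was already applied,
    -- which aborts the whole transaction, including the DDL below.
    INSERT INTO applied_migrations (name) VALUES ('3-add_customers_table');

    CREATE TABLE customers (
        id   serial PRIMARY KEY,
        name text NOT NULL
    );

    COMMIT;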

Or something like that...


If your RDBMS supports transactional DDL and conditionals then it is pretty simple.

E.g. adding a column: IF NOT EXISTS ... ALTER TABLE ... ADD ...;
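On PostgreSQL, for instance, that conditional can be written as a small anonymous block (a sketch with hypothetical table and column names):

    DO $$
    BEGIN
        IF NOT EXISTS (SELECT 1 FROM information_schema.columns
                       WHERE table_name = 'customers' AND column_name = 'email') THEN
            ALTER TABLE customers ADD COLUMN email text;
        END IF;
    END
    $$;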


This is how the Play Framework works too.


I had a project from about 6 years ago that used this method of DB versioning. The project was quite large and took 4 years to build; it had a very large database schema and required a lot of default data.

One day recently I pulled that project off the shelf and decided to see if I could get it running for nostalgia's sake:

    SGS Server starting...
    Expected database version: 57.
    Found database version: 0.
    Applying DB change script: 1...success.
    Applying DB change script: 2...success.
    (...)
    Database updated.
    SGS Server running.

After seeing my old project set itself up automatically with no grief, I felt an enormous amount of pride. I only wish my day job would implement a similar system.


This can be a major pain for larger software projects. At work we have to contend with:

* A database so large that even a minimally empty one cannot be created from scratch in less than 10-15 minutes. This creates problems for CI, integration testing, etc.

* The developers spent a good chunk of the late 90s/2000s writing Oracle PL/SQL code; hundreds of packages and thousands of stored procedures with oodles of business logic.

* We store reports, PDF attachments, and other documents all in the DB as well.

* Since we put so much stuff in the database, small problems and schema fudges tend to creep in over the years, which makes every customer database a little bit different.

* Oracle licensing can be very unkind, and upper management mandates that we can't use Oracle XE even in development/testing.

We ended up using a combination of Flyway for schema changes and hand-rolled scripts to apply stored procedures and packages, and we had to roll our own database-provisioning pool as a service for developers; it's still a massively janky and fragile setup. We really need better tooling for this.


> * A database so large that even a minimally empty one cannot be created from scratch in less than 10-15 minutes. This creates problems for CI, integration testing, etc.

When we ran into this issue, we would take periodic snapshots of the schema / data dump so that the database could be recreated rapidly from a certain point. For example, we would create a DB creation and data insert script at version 2.0, and then update scripts would be applied starting at 2.1, 2.2, etc.
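In other words, a baseline script creates the schema and seed data as of 2.0 and records that fact in the version table, so the migration runner only applies later scripts on top (a sketch with hypothetical names; a real baseline would typically be generated from a known-good database, e.g. with pg_dump):

    -- Baseline for release 2.0: full schema plus seed data, applied in one shot.
    CREATE TABLE schema_version (version text PRIMARY KEY);
    CREATE TABLE customers (id serial PRIMARY KEY, name text NOT NULL);

    -- Record the baseline so the runner starts at 2.1 instead of from scratch.
    INSERT INTO schema_version (version) VALUES ('2.0');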


> A database so large that even a minimally empty one cannot be created from scratch in less than 10-15 minutes. This creates problems for CI, integration testing, etc.

You are certainly using the wrong technology then; at a previous job I was easily spinning up 3T Oracle databases in a few minutes using COW clones on the storage array attached to VMs. They were only good for a few hundred M of changes, but that was plenty for testing.


> The developers spent a good chunk of the late 90s/2000s writing Oracle PL/SQL code; hundreds of packages and thousands of stored procedures with oodles of business logic.

I've developed a hypothetical approach for versioning of stored procedures within the model described here as part of Alembic, known as "replaceable objects": http://alembic.readthedocs.org/en/latest/cookbook.html#repla...


Last I looked, Oracle was free for development and test use. Is that no longer the case?


You see, the problem there is that development and test are gateway environments. They frequently lead to other environments such as staging and production, at which point you find yourself dependent on Oracle's latest stuff, and it can be time-consuming and expensive to get yourself clean.

Next time you find yourself Jonesing for the latest features and someone offers you dev & test for free, just say no.


Actually not the case (and I'm not sure it ever was). We were under this misapprehension for a while too.

You're allowed to have a development database to "create one prototype", with a very loose description of what that is. It also "limits the use ... to one person ... and one server". As for testing: "all programs used in a test environment must be licensed".

http://www.oracle.com/us/corporate/pricing/databaselicensing...


Liquibase is pretty good for managing database versions:

http://www.liquibase.org/


I'll second the use of Liquibase. Writing changeSets is simple, and it's very easy to bring a developer's fresh database up to speed by simply letting Liquibase apply all the changeSets in order.

We've found that maintaining a separate "changelog" file for each major release that collects the history of changes for that release and subsequent minor releases is easiest. We also use the Jira ticket # associated with a schema change as the ID of a changeSet, so we can tie changes to feature requests, bug reports, etc.


I've had some good experiences integrating this with Hibernate.


Oh, good old "programming in XML".


"Supports XML, YAML, JSON and SQL formats" - from the front page of http://www.liquibase.org/. What more do you want ?


As if programming in YAML or JSON is any better. And programming in platform-specific SQL, for all non-trivial purposes, ties you to a specific flavor of DBMS.

What I want is a platform-agnostic DSL: terse, typechecked, possibly "compiled" to some IR, with the basics of semantic analysis, and with support for (or at least awareness of) the quirks of the specific underlying platform.


Which in this case, depending upon your editor, can mean autocompletion and validation as you type. In spite of Liquibase's newer YAML and JSON support, XML works very well here.


I love the fact that the change sets translate down to the dialect of the DB they're going to.


I think this post only gives half of the story. The thing is, you need to manage not just your schema (and reference data) but also your DB code - i.e. stored procs, etc.

For schema, the approach recommended by the OP seems sensible - check diff scripts into VC and consider them immutable (see also http://www.depesz.com/2010/08/22/versioning/ for a lightweight Postgres approach).

For stored procs etc., I think you need an approach much more akin to the one for normal code - here you want to be able to leverage your VCS just as you do elsewhere - so you want to consider these mutable.

One of the best presentations I've seen on this topic is:

http://www.slideshare.net/OleksiiKliukin/pgconf-us-2015-alte...


Stored procedures are usually considered part of the schema. They can be diffed. I know the Visual Studio comparison tool can diff databases and also generate code to check affected stored procedures for errors after a table schema change.

I would still recommend changing stored procedures through UPDATE scripts and keeping those change scripts immutable. (Ideally not just immutable but also idempotent, and with forward and backward scripts to undo changes.)
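For example, a change script for a procedure can simply re-create it in full; the script itself stays immutable, and a later change ships as a new script (a sketch with a hypothetical PostgreSQL function):

    -- 042-update-customer-count.sql: replaces the previous definition in full.
    CREATE OR REPLACE FUNCTION customer_count()
    RETURNS bigint AS $$
        SELECT count(*) FROM customers;
    $$ LANGUAGE sql STABLE;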


Curious why you'd recommend that. It seems to me that it costs you something (convenience of standard VC practice on code) and gains you little.

I'd recommend checking out the linked presentation above - for things like stored procs you have the option of installing multiple versions simultaneously (e.g. under different names or 'Schemas') - obviously you can't do that for the main schema itself. This is why I think a hybrid approach makes more sense.


It's something I've seen work in a few places, but I'll take your recommendation and review the slides for a better approach. You're right that this approach loses some of the benefits of version control, but then typically the stored procedure changes are very closely linked to table schema changes anyway.


Back in the day, I wrote my very own database migration toolkit [1] with its special DSL and cool things like automatic rollback script generation and (limited) type inference. It never gained any traction, though. But I still use it in every project with great success.

[1]: https://code.google.com/p/octalforty-wizardby/


> Don’t use complex notations like “x.y.z” for the version number, just use a single integer.

I prefer to use two integers: major & minor version, semver style. Minor version gets incremented if old code can still use new DB (e.g. when adding a column). Major gets incremented when the change is not backwards-compatible (e.g. when renaming a column).

When you have to roll back a release, not having to downgrade the DB can be a major time saver.


Why not timestamps + description? This way, even if you create migrations on different branches, you don't have version number overlap. Rails does that if I remember correctly.

eg:

201509102039-add-this-to-table.sql

201510220840-add-that-to-table.sql


Because version numbers can be handled automatically; descriptions can't.

Say you have app v.20150701 running in production. Today you deployed a new version and it ran a migration script against the production DB. Two days later you discover a critical issue in the new version and are forced to roll back to the old app. Now your 20150701 app opens the database and notices that it has the 20150811-oh-we-did-something-to-the-db.sql migration applied. Can it use the database, or should it fail/downgrade the DB? How do you tell, if all you have is a timestamp and a description?

IMO this is more important than solving branch merges because it helps me when my production system is down and there's pressure to bring it back online ASAP. Merging branches can be done offline in the cozy comfort of my development environment.

PS. Of course, the above scenario might be more or less relevant depending on your development and deployment workflows.


Interesting point, though I've never thought about this situation because I've never worked with software versioning. If I need to roll back, I have to check which migrations are present in the SCM repository at the specific revision I need to roll back to.


For our Scala web applications this is pretty much a solved problem, since the Play Framework keeps track of database versions in an 'evolutions' table and runs evolutions in order to bring the database up to date. We've also modelled our test snapshot runner after the evolutions format, so each test fixture contains queries for both setup and teardown, on top of the empty vanilla database described by the evolutions, of course.

The trouble we encounter is extending outside software to meet our needs, specifically WordPress. WordPress is a better-than-average CMS, but like many CMSes it is very developer-unfriendly. I need to be able to snapshot the database from a single WordPress installation and version it along with an installation of the several WordPress plug-ins I am developing for a specific client. This way I can keep a repository reflecting the current state of a single site (with a specific set of plug-ins activated, and specific configurations for each), an exact copy of which I want to eventually deploy in a production environment. I've seen plug-ins for this, but in at least one specific case (VersionPress) it seems to be available only through an overly expensive 'Early Access Program' - and it's been in Early Access for many months now, making it seem vaporware-ish. Anyone know of a better, automated solution, other than dumping the MySQL database every time I make a change?


I just really wish the Play Framework evolutions feature allowed you to give a logical name to your scripts. It would be nice to be able to tell at a glance what the high-level purpose of each script is.


+1. This bugged me about the excellent Schema Evolution Manager. Ended up porting it and extending it to take an optional name to add after the date.


Sure, I'll throw my hat in the ring here with a shameless self-plug.

I help maintain a database versioning/migration tool that my team has been successfully using for years now: https://github.com/nkiraly/DBSteward

The idea is that instead of managing a schema plus several migrations, you just store the current schema in XML, then generate a single migration script between any two versions of the database (or just build the whole thing from scratch). The ideal use case during deployment is to check out the existing deployed schema into a different directory, then diff against the current one and apply the upgrade script.

I've always found most migration solutions to be wanting, and while this approach has its downsides (things like renames can be hairy, need to explicitly build in support for RDBMS features), I do like it a lot more than the standard sum-of-changes approach.

The tool is still very much a work in progress, written in terrible 10-year-old PHP, and desperately in need of a proper rewrite and modernization, but it is definitely stable, safe, and production ready.


Flyway does exactly this and enjoys widespread adoption http://flywaydb.org/


Another happy user of Flyway here which essentially forces all of the described best practices onto your DB. If they only supported transactionless migrations, we'd be totally happy campers :)


Coming very soon :-)


We use http://www.dbmaintain.org at the Java shop where I work. It enforces all of the best practices in the article.

Something else that we do is to have the schema/mandatory data in a separate git project. This has the added benefit of having a separate deployable artifact from the application itself.


We're using Delphix (off Oracle) to do our "every developer should have a database". For anyone not familiar with Delphix, it provides individual DB instances that are "forked" from the source DB. I'm assuming it's some kind of Copy On Write.

So far it's working OK, although we've not automated snapshot/update/refresh off the master yet. The master is planned to be periodically refreshed from ProdCopy.

Would be interested if anyone else is using Delphix and can comment or link to anything on it. I've been meaning to poke around the REST interface but haven't had time yet.


We're using Delphix too. It complements an approach like this; it doesn't replace it. We use Liquibase to support the schema-as-code concept; it's enforced by allowing no one outside senior DBAs to make such changes directly in anything other than Dev (lower-than-QA) environments.

Delphix is terrific for having production-like data in your test environments ("like," because you'd better be masking sensitive information from prod before it gets written in test).

The issue we've had with Delphix is performance. It absolutely pounds on the storage system and tends to suck up all the bandwidth and CPU made available to it. If you resource it properly, it's amazing.


It's all well and good, but there is an additional problem the article doesn't mention: how do you handle "exceptional" (unplanned) cases in your database version control? Sometimes you discover things don't work as you expected, or there is a last-minute schema change you want to promote quickly to the production DB. I find it hard to cover these situations by modeling changes as forward/backward migrations (database schema diffs, like Rails does).

At my workplace we operate a large-ish web service in our own datacenters (airline reservations). We're very happy with the way we version code: we have one single branch where changes arrive around the clock; every second day this branch is deployed to a "pre-production" platform. Every week we take one of those versions, fork a release branch out of it, and make all necessary adjustments, i.e. a bunch of cherry-picks (changes that arrived late but are needed quickly) and possibly backouts (we found a bug and have no time for a fix):

    a--b--c--d--e   <--- what's in pre-prod
           \
            b*--e'  <--- what I want in prod
    
    b* = backout, e' = cherry-pick
Now, I really, really would like to do the same for our DDL (database schema changes). But backouts and cherry-picks just don't work if you're versioning migrations, because you have to deal with what's already in the prod DB (conversely, when you deploy software you just trash the old binary altogether and replace it with the new one):

Backouts, like b* above: you would hand your DBA the sequence of patches a, b, c, b*, e', which makes no sense because you're asking to apply migration b and then its reverse b* (you should say "just skip b", but version control doesn't work like that).

Cherry-picks, like e' above: good luck remembering, two weeks from now, that you've already applied e' and don't need to apply e.

At the end of the day, we -do- version our changes with forward/backward migrations, but we always end up managing backouts and cherry-picks by sticking notes on the wall (or on the team wiki), which smells bad. I admit I never tried any of the popular solutions for database version control (Liquibase, Flyway) and I don't know how they work. But I feel the ideal solution would be to version the whole schema (not schema diffs), and have tools that at any given moment can compare the schema in my VCS with what is actually on the platform and generate the necessary "migration" on the fly.



Django's South does this. http://south.readthedocs.org/en/latest/tutorial/part1.html And now, the next version of that is built into Django core. https://docs.djangoproject.com/en/1.8/topics/migrations/


What about versioning data? For most projects we have SQL tables of preconfigured data that is essential for the project to work (predefined templates, initial configuration settings). How do you go about producing an SQL difference script for your data, getting it under version control, and then applying it at the push of a button?

Currently we are using the Toad for SQL tool for producing and applying both schema and data diffs between our development environments, but it's a Windows only tool.


Two things that may be relevant that I've done before:

1. The simplest: set up scripts for doing full backups and restores to/from S3. This is helpful for many reasons, but should also help your case. Diffs are an optimisation if your data is large enough, but I think many cases are easily small enough to just push and pull as a whole. Until you're at a significant number of gigs of data, I'd recommend trying this first.

2. Record your data in an append-only format (this doesn't require a different database, just a strict way of doing things). This means you can always get a full history of any bit of your data. I do this for some scraping work (recording lots of noisy/unreliable values, so not losing old versions is important). If your data is stored like this, pushing around diffs should be easy (grab everything with a created_at time after a certain point).
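A sketch of the append-only idea (hypothetical table; rows are only ever inserted, never updated, so a "diff" is just everything newer than the last sync point):

    CREATE TABLE observations (
        id         bigserial PRIMARY KEY,
        source     text        NOT NULL,
        value      text        NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- "Diff" since the last sync: everything recorded after that point.
    SELECT *
    FROM observations
    WHERE created_at > '2015-08-01 00:00:00+00'
    ORDER BY created_at;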


We're using this approach with a large team and DB, but we're struggling with having multiple major version branches in play at the same time; backporting functionality and bugfixes breaks things by changing the order in which changes get applied. At some point, the problem remains that you're migrating the data in a live system, and that means the order of changes matters.


Archiveopteryx (aox.org) handles database versioning in a very clean way. You can actually upgrade and downgrade your application and everything required to change your schema and dataset is contained within Postgres.

http://archiveopteryx.org/aox/upgradeschema


I would like to add that it should be easy to blow away your DB and recreate the structure with enough sample data that the application works in a dev/test environment, without having to pull in a DB dump from some other machine. Depending on dumps drives me crazy.


This touches on some very common issues related to migrations. It's very interesting to see that after all this time, effectively handling database schema change management can still be a challenge. I wrote about my experience with this topic several years back. I think some of these observations might still be useful to folks just digging into the problem: http://robertelwell.info/blog/future-proof-db-changelog/


Wow, I came in here wondering what we would learn. Turns out we already do all of those things in our platform!

http://platform.qbix.com/guide/database

http://platform.qbix.com/guide/scripts

We use the actual database schema as the primary source of truth for the schema, and we generate the code for the models from it. We have scripts that upgrade the database schema for each plugin and app.


I have a similar approach, but with timestamps instead of an incremental int.

I'd use a tool for this, but then I would need to change how I deploy the apps, given that it's usually best to run the migrations first and then update the codebase, to prevent exceptions.

In case there are new columns, the best sequence would be: execute the migrations (making any new columns nullable), deploy the code, fill the new columns via SQL or code, and, if the new columns must be NOT NULL, change the table definition to make them NOT NULL.
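A sketch of that sequence for a single new column (hypothetical names, PostgreSQL syntax):

    -- Step 1, before deploying the new code: add the column as nullable.
    ALTER TABLE orders ADD COLUMN status text;

    -- Step 2, after deploying code that writes the column: backfill old rows.
    UPDATE orders SET status = 'pending' WHERE status IS NULL;

    -- Step 3, once every row has a value: tighten the constraint.
    ALTER TABLE orders ALTER COLUMN status SET NOT NULL;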

Given the complexity, I find it better to do it by hand.


Yeah, timestamps seem to work much better, especially with distributed version control. Otherwise you have to coordinate who gets the next number in the sequence. There's still a race condition between two branches changing the schema in incompatible ways while still merging cleanly, but in my experience code review, tests, and developers just running migrations day to day mitigate that issue.


I am sort of new to this stuff, but I believe Django essentially does this correctly as long as you have your site under version control. Having a DB instance for each developer seems a little tricky to me, but that may just be my inexperience.


What do you think about versioning with an ORM?

Doesn't this imply automatic versioning along with the rest of the code?

For example, I use Sequelize and create my tables via model classes, which I write like any other code in my app.


That works until a method you reference in the migration changes or doesn't exist anymore. For instance, a model you're working with might no longer expose some of the methods you're calling, or a library used in the migration has been modified. Obviously this is something people should try to avoid, but on a past project I once found myself having to check out specific commits in the history just to be able to run migrations. In the end I found myself repeating the process once too often and created a gem for it: https://github.com/hph/farfugl. This is for Rails migrations, but the concept is the same in any library or language; you just have to go to the commit where the migration originates from.

As a side note, I've been using Sequelize for a year now and it can often be pretty annoying. Both the query interface and the migrations are subpar when compared to Rails. For instance:

    Model.find({ id: 1}) // or maybe just Model.find(1)!
vs. (the Sequelize syntax)

    Model.find({ where: { id: 1 } })
Forgetting the "where" above doesn't even raise an error; it just silently fails.


For IDs you can use

    Model.find(1)


"every SQL script file must be immutable after it is deployed to production or staging environment."

Git doesn't support locking files, so what is the best way to meet this immutability requirement?


A baseball bat


Schema Evolution Manager is excellent for managing Postgres databases in a language-, framework-agnostic manner!


SSDT is fantastic for this. I wish it worked for something other than SQL Server though!


Well I mean it has "SQL Server" in the name, so..



