Building a better and scalable system for data migrations (yorickpeterse.com)
68 points by YorickPeterse 3 months ago | 36 comments



I agree with most points but would like to add one about "tunability". This article focuses mostly on correctness, which is of course important (and without which the whole system would be useless), but in itself is often not sufficient for large systems. I've seen it happen more than once that a big table rewrite causes so much load that the database might as well be down completely.

One thing I would love to see implemented more thoroughly in database systems is the ability to throttle and/or pause the load that DDL and bulk DML can impose. Normal SQL only says "what" the ALTER TABLE should do, but control over the "how" is often rather lacking. At most you get CREATE INDEX CONCURRENTLY in Postgres or ALGORITHM=INSTANT in MySQL, but rarely anything fine-grained enough to say "use at most XYZ IOPS for this task", let alone the ability to vary that XYZ value dynamically or to assign priorities to the load caused by different queries.
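To make that concrete, today's "how" knobs are mostly statement-level switches like these (illustrative snippets; table and index names made up):

    -- Postgres: build the index without blocking writes, but with no way to cap its IOPS
    CREATE INDEX CONCURRENTLY idx_users_email ON users (email);

    -- MySQL 8.0+: request a metadata-only change; it errors out if it can't be done instantly
    ALTER TABLE users ADD COLUMN last_seen DATETIME, ALGORITHM=INSTANT;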

AFAIK TiDB and pt-osc provide ways to pause a running migration, and gh-ost can also throttle a migration dynamically. Vitess also has several ways to manage migrations, as it leverages gh-ost. For Postgres I don't think any of the currently popular tools have good ways to manage this, but I would love to be proven wrong.


Vitess maintainer here. While Vitess can still use gh-ost, the recommended way is to use Vitess' internal VReplication mechanism. With this, you can throttle, pause, resume, cancel, retry, or revert a migration, as well as auto-resume from failover (the migration continues from the point of interruption once a new primary is instated).

Vitess throttling is by default based on replication lag, but you can use different metrics, such as load average, or indeed multiple metrics combined, to define what constitutes excessive load.
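To give a flavour, migration management is exposed through SQL and looks roughly like this (from memory, see the Vitess docs for the exact syntax; '<uuid>' is a placeholder for the migration id):

    -- inspect running and queued migrations
    SHOW VITESS_MIGRATIONS;

    -- pause/resume the load of a specific migration
    ALTER VITESS_MIGRATION '<uuid>' THROTTLE;
    ALTER VITESS_MIGRATION '<uuid>' UNTHROTTLE;

    -- cancel, retry, or finalize it
    ALTER VITESS_MIGRATION '<uuid>' CANCEL;
    ALTER VITESS_MIGRATION '<uuid>' RETRY;
    ALTER VITESS_MIGRATION '<uuid>' COMPLETE;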


Good point, I forgot about tunability. At GitLab we did have a mechanism for this for background migrations, IIRC it looked at the vacuuming statistics of PostgreSQL and would back off if it deemed there was too much load.


> As I've hinted at before, migrations should be functions. Functions as in "a function written in a programming language", not functions as in "SQL snippets I just refer to as functions because it makes me sound smart".

Disagree. SQL is a declarative language that is both clear and universal. Feel free to generate the migrations however you’d like, but I want to see the SQL.

Another bonus (and I’m sure I’ll be told I’m wrong) is that you don’t need to write tests for the migration, because it’s declarative. Assuming you know what it means (if you don’t, maybe you shouldn’t be administering a DB) and what its locking methods entail, you will get _precisely_ what it says, and nothing more. If you get a failure from, say, a duplicate entry when creating a UNIQUE constraint, that’s orthogonal to the migration itself – you described the end state, the DB tried to make that happen, but was unable to do so due to issues with your data. All the tests in the world wouldn’t catch that, short of selecting and deduping the column[s], and at that point, you’re just creating work.
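(And to be clear, that pre-check amounts to nothing more than something like the following, with a made-up column:)

    -- find values that would violate the UNIQUE constraint before adding it
    SELECT email, COUNT(*)
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1;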

I am firmly convinced that any and all infrastructure should be declaratively instantiated, and declaratively maintained. I do not want or need to test my Terraform files, nor my DDL.


> Disagree. SQL is a declarative language that is both clear and universal. Feel free to generate the migrations however you’d like, but I want to see the SQL.

This works until you need something beyond the trivial "UPDATE table SET column1 = column2", such as "Update the values in column X with the contents of file Y that exists in Amazon S3", or really anything else you can't express in SQL.

> Another bonus (and I’m sure I’ll be told I’m wrong) is that you don’t need to write tests for the migration, because it’s declarative.

This indeed is wrong, and essentially comes down to "It looks simple so it's correct". A SQL query "DELETE FROM users" might be perfectly valid, but if you meant for it to be "DELETE FROM users WHERE id IN (...)" it's going to cause problems.

In other words, at least for data migrations you absolutely have to write tests or you will run into problems.


These are not migrations in the traditional sense. I was referring to DDL. Nevertheless:

> "Update the values in column X with the contents of file Y that exists in Amazon S3"

You actually can do that in SQL, assuming it’s hosted on AWS (or for Postgres, you’ve installed the correct extension). It would be a bit convoluted (I think to handle the UPDATE, you’d first dump from S3 into a temp table, then update from that), but it would work.
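Roughly like this, using the aws_s3 extension available on RDS/Aurora (from memory; table, column, and bucket names invented):

    -- 1. dump the S3 file into a temp table
    CREATE TEMP TABLE tmp_y (id BIGINT, new_value TEXT);

    SELECT aws_s3.table_import_from_s3(
      'tmp_y', 'id,new_value', '(format csv)',
      aws_commons.create_s3_uri('my-bucket', 'file_y.csv', 'us-east-1')
    );

    -- 2. update column X from it
    UPDATE t
    SET x = tmp_y.new_value
    FROM tmp_y
    WHERE t.id = tmp_y.id;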

> A SQL query "DELETE FROM users" might be correct, but if you meant for it to be "DELETE FROM users WHERE id IN (...)" it's going to cause problems.

If someone doesn’t notice this issue in SQL, they’re not going to notice it in an ORM, either. It’s also possible (and a good idea) to enforce predicates for DELETE at a server configuration level, such that the DB will refuse to execute them. And really, if you actually want to delete everything in a table, you should be using TRUNCATE anyway.
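In MySQL, for example, that guard is a single setting (Postgres would need an extension for something similar):

    -- refuse UPDATE/DELETE statements that don't use a key in the WHERE clause (or a LIMIT)
    SET GLOBAL sql_safe_updates = 1;

    DELETE FROM users;                     -- now rejected with an error
    DELETE FROM users WHERE id IN (1, 2);  -- still fine, uses the primary key
    TRUNCATE TABLE users;                  -- the explicit "yes, really everything" path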


You can express it in SQL; you would need to create a table with "the contents of file Y" from S3, or use something like PostgreSQL foreign data wrappers to that effect: we're using pg_analytics[1] to query CSV/Parquet data, and the same applies to most databases these days.

The limitation you're implying is not one of SQL, but of your data model.

[1] https://github.com/paradedb/pg_analytics


Both you and sgarland are missing the point here: the example I gave is literally just that: an example. There could be many more complex cases, some relying on other business logic provided by your application, some more isolated. Some of those cases could be handled using some random PG extension, but for many that won't be sufficient.

In contrast, a migration written in an actual programming language can handle all such cases. Depending on the amount of abstractions applied, it can also look close enough to a declarative language (in other words, it doesn't have to be verbose).

So yes, the limitation I'm implying very much is a limitation of SQL (or really any declarative query language for that matter). It has nothing to do with the data model as it applies equally well to using e.g. MongoDB, Redis, or really anything else that stores a pile of data you may want to transform in complex ways.


What can I say, use a shitty schema—win shitty prizes. This is why we never allow coders to touch the schema: you _will_ screw it up given the tiniest chance. Suddenly, my data model has to be aware of "all such cases." Thank you please no good bye.


Unfortunately what you wind up with, IME, is devs “owning” their DB (sometimes it’s a schema on a shared instance, sometimes it’s its own instance), until such time that the poor decisions catch up. Then it’s an incident, at which point the DBRE / SRE team asks WTF, nurse the thing back to life, and then are summarily ignored for future advice. Rinse and repeat.

I’m hoping my current job breaks this trend.


True. Although to be fair, I think it's a bit naive to assume this is merely a trend. IMO there's no escaping ownership issues: if your culture is "move fast and break things", they are inevitable. The ownership philosophy is such that some gaps in ownership are inevitable.


I think you’re missing the point that what you’re describing is not a migration as a DB{A,RE} would call it. That is what I’m pushing back on.

If you’re shifting or otherwise manipulating tuples, then yes, you probably want to handle that logic in a more general-purpose language (though it isn’t _required_, annoying though it might be otherwise).

But for DDL? No, I don’t want anything in the way. The state should be stored in VCS, of course, but I don’t want some abstraction trying to be clever about ALTER TABLE foo. Just run it how I specified.


Not exactly an SQL database, but we were migrating data _constantly_ (dead devices, load balancing, ...) for the data/metadata of an exabyte-scale object store...

Once client code can work with the mixed state (while the migration is in progress), it no longer matters how long it takes. Once the migration is robust enough to handle crashes, interrupts, ..., it no longer matters how often you trigger the "continue". The migration is using too many IOPS? Just kill it and schedule a continuation later.

Also, your smallest step needs to be an atomic multi-update (you don't want to bother with partial failures).
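In relational terms, the shape of one such step would be roughly this (a sketch only, with an invented bookkeeping table):

    -- move one fixed-size batch and record progress in the same transaction,
    -- so a crash can only ever lose or redo a whole batch
    BEGIN;

    UPDATE events
    SET shard_id = 42
    WHERE id > (SELECT last_id FROM migration_progress WHERE job = 'rebalance')
      AND id <= (SELECT last_id FROM migration_progress WHERE job = 'rebalance') + 1000;

    UPDATE migration_progress
    SET last_id = last_id + 1000
    WHERE job = 'rebalance';

    COMMIT;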


For people interested in this subject, you might also be interested in my project Reshape: https://github.com/fabianlindfors/reshape. It takes database migrations a step further than most tools, guaranteeing zero-downtime migration for both schema and data changes.


Interesting, thanks! How would it compare to others like bytebase/bytebase, Xataio/pgroll, liquibase/liquibase, flyway/flyway, graphile/migrate?


Also, a separate question: I feel it would be nice to have a few "example" tutorials where you set up a database with fake data, run a sample app in another terminal that queries it in real time and adds or interacts with data, then go through a few updates/reverts and check that the sample app keeps working without having to close and restart it. That would be a quite useful "getting started" tutorial to get the idea of your lib.


And why build a separate "reshapedb" if "postgres + reshape" is good enough? Won't this just bring more issues?


The database I'm working with, Rama, has an interesting way to do "instant migrations" at scale. Your migration function runs on every read until data is durably migrated in the background. So you get migrated data immediately after deploy, while Rama quietly persists the changes during normal processing. No downtime.

https://blog.redplanetlabs.com/2024/09/30/migrating-terabyte...


I’m a bit surprised to hear that is the state of migrations in Rails. Being a Django developer I’m used to having most of the points you mention covered for a really long time already.

It's timeless by working from a captured schema that doesn't change with later code changes. Which means you can't import or use methods of your models and have to copy those over. So a different way than using the VCS for this, but still.

Not sure I’d name the second point “scalable”, but you can easily do data migrations as well.

It is really easy to use!

And you can certainly write tests for it, although it isn’t included in the framework [0]. But you can do that with just SQL too [1].

What I find much harder with (somewhat) larger DBs is a) determining whether it will lock up too much and b) whether it is backwards compatible (so that you can roll back), which is where the split into pre- and post-migration steps the article mentions comes in. We currently use a linter for that but it is still a bit basic [2].

[0] https://www.caktusgroup.com/blog/2016/02/02/writing-unit-tes...

[1] https://tapoueh.org/blog/2017/08/sql-regression-tests/

[2] https://github.com/3YOURMIND/django-migration-linter


I have been eyeing up https://atlasgo.io/ - I am going to give it a whirl soon.


Not directly linked to migrations (although I'd love to know the best way to handle those when you have a kind of data lake with a big number of past tables that have been archived on S3).

I'm trying to move my database to Postgres. There is a part which is "describing all the objects" (object id, properties, etc.), and a huge table which is a log of events that I'm storing in case I want to data-mine it later.

Of course this last table:

- is huge (or should become huge at some point)
- is better suited to columnar storage
- might be archived from time to time on S3

My initial thinking was to store it in Postgres "natively" or via a "duckdb/clickhouse" extension with Postgres-querying capabilities, keep the last 90 days of data in the database, and regularly have a script export the rest as Parquet files on S3.
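Concretely, something along these lines (a sketch; table and column names are just placeholders):

    -- partition the event log by month so old data can be exported and dropped cheaply
    CREATE TABLE events (
        object_id  BIGINT NOT NULL,
        payload    JSONB,
        created_at TIMESTAMPTZ NOT NULL
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2024_01 PARTITION OF events
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

    -- a periodic job would export each partition older than 90 days (e.g. COPY it out
    -- and convert to Parquet, or read it with DuckDB), upload it to S3, then DROP TABLE it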

does this seem reasonable? is there a "best practice" to do this?

I also want to do the same with "audit logs" of everything going on in the system (modifications to fields, actions taken by users on the dashboard, etc.).

what would you recommend?


It would be interesting to consider branchable DBs as part of the migration cycle, especially for forward and backwards time skips.

While not dealing with that kind of scale yet, our application (an AI Data Engineer that has done migration work for users) needs to do before-and-after comparisons and find diffs. We use a branchable DB (DoltGres) to compute those changes efficiently.

Could be an interesting thing to consider since it's worked well for us for that part.

Our build, if you want to check that out too: https://Ardentai.io


Avoiding SQL migrations was my #1 reason for moving to event sourcing.

This approach cuts the "database server" into an event stream (an append-only sequence of events), and a cached view (a read-only database that is kept up-to-date whenever events are added to the stream, and can be queried by the rest of the system).

Migrations are overwhelmingly cached view migrations (that don't touch the event stream), and in very rare cases they are event stream migrations (that don't touch the cached view).

A cached view migration is made trivial by the fact that multiple cached views can co-exist for a single event stream. Migrating consists in deploying the new version of the code to a subset of production machines, waiting for the new cached view to be populated and up-to-date (this can take a while, but the old version of the code, with the old cached view, is still running on most production machines at this point), and then deploying the new version to all other production machines. Rollback follows the same path in reverse (with the advantage that the old cached view is already up-to-date, so there is no need to wait).

An event stream migration requires a running process that transfers events from the old stream to the new stream as they appear (transforming them if necessary). Once the existing events have been migrated, flip a switch so that all writes point to the new stream instead of the old.


I liked the author's line of thought and have run into similar problems at a much smaller scale! Once I planned a migration that moved multiple columns and their associated data from one table to another, but miscalculated the time to complete and the amount of CPU/RAM it would need. If I'd had the right tools, I'd probably have scheduled it better and planned a better approach.


Have you considered doing the migration via Liquibase? I don't know that it supports parallelism in the migration process, but you could do most of the rest in it (pre/post-migration scripts, rollbacks, test cases (within precondition changesets), etc.). Not a real programming language though; it's XML config files or annotated SQL scripts...


I found that Liquibase doesn't really support rollbacks, particularly with MySQL, as you can't do transactions for schema updates: if the migration fails in the middle of an update, it just gets left in a half-updated state.


I'm glad to see I'm not the ~only one bothered by this issue. This has been, and still is, one of the challenges in managing MySQL schemas.

That's one of the reasons we implemented "migrate down" differently than other tools.

I'm not here to promote my blog post, but if you are interested in seeing how we tackled this in Atlas, you can read more here: https://atlasgo.io/blog/2024/04/01/migrate-down


Liquibase does support rollback DDL with MySQL; I used it.

I put each DDL in a Liquibase changeset with a corresponding rollback DDL I constructed by hand. If the Liquibase changeset failed, I could run the rollback for all the steps after the "top" of my wish-I-could-put-them-in-a-MySQL-transaction operations.
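With Liquibase's formatted-SQL changelogs that looked something like this (simplified example from memory):

    --liquibase formatted sql

    --changeset alice:add-orders-status
    ALTER TABLE orders ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'new';
    --rollback ALTER TABLE orders DROP COLUMN status;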

But you are right that MySQL itself doesn't support transactions for DDL, and that is true whatever tool you use.

It is true that if you put multiple SQL operations in a single Liquibase changeset that are not transactional you can't reliably do rollbacks like the above.

It is also true that constructing an inverse rollback SQL statement for each changeset SQL by hand takes time and effort, particularly to ensure sufficient testing, and the business/technical value of actually doing that coding and testing may or may not be worth it depending on your situation/use case.


I hadn't heard of Liquibase until now. Taking a brief look, it doesn't seem like it would be suitable though. For large scale migrations you really need to be able to express arbitrary logic ("migrate A to B but only if file X on the file system contains Y"), which you can't do with something like SQL or XML.


The great thing is, Liquibase even allows you to run arbitrary programs as migrations using `customChange`: https://docs.liquibase.com/change-types/custom-change.html

Though you can get a long way with just specifying the appropriate contexts before you kick off your migrations (and tagging your changesets with those context tags as well): https://docs.liquibase.com/concepts/changelogs/attributes/co...


This post addresses the data migration specifics well. Beyond that, there is also the part about scaling the organization around deploying database migrations. We've built a project, https://github.com/bytebase/bytebase, to address this.


SQL migrations are in-place, hard-to-reverse changes, where production is the only realistic environment.

I wish there were a better way to run blue/green DB deployments, though this feature is rare (e.g. gh-ost) and not that usable at less than big-tech scale.


My project Reshape (https://github.com/fabianlindfors/reshape) might be interesting to you, it's a bit like blue/green DB deployments in that it uses a two-step process where it can support both the old and the new schema at the same time during the deployment phase.


Having worked with migrating petabyte scale sharded SQL databases I get a sinking feeling when I think about automating this. I think it would actually be a pretty good test case for artificial super intelligence.


I work for a nosql db vendor. We don’t have a fixed schema so there are no migrations. If you want fast access to some parts of data you just add an index and reindex async. If you need SQL for Tableau or PowerBI you just add a view which is also like an index.

But the underlying data and its model can be in flux, and we handle exabyte scale, HA, DR, rebalancing, etc.


If you enjoy maintaining 5 different formats of the same data, then by all means use a document storage DB. But at some point you'll have to migrate the data, at which point you'll face the same issues (or possibly worse due to the lack of support for schema migrations) as relational databases.



