An orthogonal migration issue which I'm hitting right now: we need to migrate from heroku postgres to aws rds postgres, and I'm stressed about the risk and potential downtime in doing so. If there were a way to make a replica in rds based on heroku, promote the rds replica to be the primary, and hard switch our apps over to rds, that'd be a lifesaver.
I'm working through this blog post [1] now, but there is still a bit to be defined (including a dependency on heroku's support team) to get this rolling.
Why is the migration required? Heroku postgres doesn't support logical replication, and logical replication is required for any ELT vendor (Fivetran, Stitch, Airbyte) to use Change Data Capture to replicate data from postgres to snowflake (including replicating deleted rows efficiently).
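For anyone who wants to check this on their own instance, the blocker is visible with one query (a minimal sketch in python; the DSN is a placeholder):

    import psycopg2

    # Placeholder DSN; point it at the database you want to test.
    conn = psycopg2.connect("postgresql://user:pass@host:5432/appdb")
    with conn.cursor() as cur:
        # CDC vendors need wal_level=logical; heroku postgres reports
        # a non-logical setting, which is the blocker.
        cur.execute("SHOW wal_level;")
        print(cur.fetchone()[0])
    conn.close()

On a server where this prints 'logical', a CDC tool can create a publication and a replication slot and stream row changes, deletes included.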
Note: I've also read this ebook [2], but this approach requires downtime.
Note 2: I reached out to heroku support and asked if logical replication was on their short term roadmap. They said they've heard this quite a bit, but nothing tangible is on the roadmap.
If anyone has any thoughts on the above migration, I'd be all ears. :)
I did that exact migration. Unfortunately, to my knowledge, there's no way to do it with zero downtime. You need to make your app read only until the RDS instance has ingested your data, then you can cut over. For me, that was roughly one gigabyte of data and took about forty seconds.
My best advice is to automate the whole thing. You can automate it with the Heroku and AWS CLIs. Test on your staging site until you can run through the whole process end to end a few times with no interruptions.
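For reference, the rough shape of what we scripted (a sketch with placeholder names, not our actual script; for a database our size the stock pg:backups flow was enough):

    import subprocess

    APP = "my-heroku-app"  # placeholder app name
    RDS_URL = "postgresql://admin:secret@mydb.abc123.us-east-1.rds.amazonaws.com:5432/appdb"  # placeholder

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Stop writes by putting the app into maintenance mode.
    run(["heroku", "maintenance:on", "-a", APP])
    # 2. Capture a fresh backup and fetch its signed download URL.
    run(["heroku", "pg:backups:capture", "-a", APP])
    url = subprocess.check_output(
        ["heroku", "pg:backups:url", "-a", APP], text=True
    ).strip()
    run(["curl", "-sf", "-o", "latest.dump", url])
    # 3. Load the custom-format dump into RDS.
    run(["pg_restore", "--no-owner", "--no-acl", "--dbname", RDS_URL, "latest.dump"])
    # 4. Point the app at RDS and reopen for traffic.
    run(["heroku", "config:set", f"DATABASE_URL={RDS_URL}", "-a", APP])
    run(["heroku", "maintenance:off", "-a", APP])

Timing each step on staging tells you exactly how long your read-only window will be; the restore step is the one to watch.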
Yep, absolutely garbage that these clouds (Azure is another one) don't allow you to replicate with external systems. Pretty much devalues their entire hosted postgresql offering if you ask me, since it's just designed to keep you locked in (duh).
If you have any significant amount of data where you're worried about a migration, stay far away from hosted postgres offerings. You'll never get your data out without significant downtime.
There are other ways to handle this at the application level, to be clear, using dual read & write and backfill. More relevant when you have TB+++ of data.
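The write path is just a thin wrapper (hand-wavy sketch; old_db and new_db stand in for whatever client objects you actually use):

    class DualWriteStore:
        """Writes go to both stores; reads stay on the old one until
        the backfill has run and the two sides have been verified."""

        def __init__(self, old_db, new_db, read_from_new=False):
            self.old_db = old_db
            self.new_db = new_db
            self.read_from_new = read_from_new  # flipped later via a feature flag

        def write(self, key, value):
            # The old store remains the source of truth, so a failed write
            # to the new store is queued for repair rather than raised.
            self.old_db.put(key, value)
            try:
                self.new_db.put(key, value)
            except Exception as exc:
                print(f"dual-write miss for {key}: {exc}")  # feed a repair queue

        def read(self, key):
            db = self.new_db if self.read_from_new else self.old_db
            return db.get(key)

    def backfill(old_db, new_db, keys):
        # Copies historical rows; the dual writes keep new rows in sync meanwhile.
        for key in keys:
            new_db.put(key, old_db.get(key))

Sequence: turn on dual writes, backfill, compare the two sides (counts/checksums), flip reads to the new store, then drop the old write path.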
Interesting. I've done dual-writes at the application level to migrate the datastore for a smaller feature (branch by abstraction), but never for an entire application. And that code path was quite simple, so it was easy to reason about all of the edge cases at once.
Do you have any resources which talk through the read/write/backfill approach?
So, basically, Postgres would have a replication port which can be used for both replication/clustering and transfer across cloud providers. And sharding. </dreaming>
We've moved a number of customers from Heroku over to Crunchy Bridge with essentially no downtime, and I'm currently helping one customer move 7TB through that process. It's not over to RDS, but I'd be happy to talk through the process if helpful. And we do support logical replication and have many people using wal2json/logical replication with us.
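And if anyone is curious what wal2json gives you, this is all it takes on a server with logical replication enabled (sketch; the DSN and slot name are placeholders, and the role needs the replication privilege):

    import json
    import psycopg2

    conn = psycopg2.connect("postgresql://user:pass@host:5432/appdb")  # placeholder
    conn.autocommit = True
    with conn.cursor() as cur:
        # Throwaway slot decoding WAL through the wal2json output plugin.
        cur.execute("SELECT pg_create_logical_replication_slot('demo_slot', 'wal2json');")
        # Run some INSERT/UPDATE/DELETE traffic in another session, then peek:
        cur.execute("SELECT data FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);")
        for (data,) in cur.fetchall():
            print(json.dumps(json.loads(data), indent=2))  # deletes include the old keys
        cur.execute("SELECT pg_drop_replication_slot('demo_slot');")
    conn.close()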
> Why is the migration required? Heroku postgres doesn't support logical replication
You could possibly hack together some form of higher-layer logical replication via postgres_fdw and database triggers. A comment ITT references this as a known technique.
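Very roughly, per table, something like this on the source database (a sketch with made-up names; error handling, types, and latency costs all glossed over):

    import psycopg2

    # Mirrors writes on the source's "orders" table to the destination
    # via postgres_fdw. All names/credentials below are made up.
    SETUP = """
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER dest_srv FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'dest.example.com', dbname 'appdb', port '5432');
    CREATE USER MAPPING FOR CURRENT_USER SERVER dest_srv
        OPTIONS (user 'appuser', password 'secret');

    -- Foreign table shadowing the destination copy of "orders".
    CREATE FOREIGN TABLE orders_remote (
        id bigint, total numeric, updated_at timestamptz
    ) SERVER dest_srv OPTIONS (table_name 'orders');

    -- Replay each local write onto the foreign table.
    CREATE OR REPLACE FUNCTION mirror_orders() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            DELETE FROM orders_remote WHERE id = OLD.id;
            RETURN OLD;
        ELSIF TG_OP = 'UPDATE' THEN
            UPDATE orders_remote
               SET total = NEW.total, updated_at = NEW.updated_at
             WHERE id = NEW.id;
        ELSE
            INSERT INTO orders_remote VALUES (NEW.id, NEW.total, NEW.updated_at);
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_mirror
        AFTER INSERT OR UPDATE OR DELETE ON orders
        FOR EACH ROW EXECUTE FUNCTION mirror_orders();
    """

    conn = psycopg2.connect("postgresql://user:pass@source-host:5432/appdb")  # placeholder
    with conn, conn.cursor() as cur:
        cur.execute(SETUP)
    conn.close()

Every write now pays a synchronous round trip to the destination, which is part of why this is a hack rather than real replication.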
One possible solution for the ETL side might be to use Heroku Kafka for the Change Data Capture, and then move the data from that Kafka to wherever it needs to go.
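Sketch of the consuming side, assuming Debezium-style JSON envelopes on the topic (topic/broker names are placeholders, and Heroku Kafka's SSL cert config is omitted):

    import json
    from kafka import KafkaConsumer  # kafka-python

    consumer = KafkaConsumer(
        "appdb.public.orders",                          # placeholder CDC topic
        bootstrap_servers=["broker.example.com:9096"],  # placeholder broker
        value_deserializer=lambda b: json.loads(b.decode()) if b else None,
    )

    for msg in consumer:
        if msg.value is None:       # tombstone record, skip
            continue
        payload = msg.value.get("payload", msg.value)
        op = payload.get("op")      # 'c' insert, 'u' update, 'd' delete, 'r' snapshot
        if op == "d":
            print("DELETE", payload["before"])  # deletes carry the old row
        else:
            print(op, payload["after"])         # hand off to the snowflake loader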
Interesting that you bring this up. I looked into heroku's streaming connectors to facilitate an integration with materialize.com, but Heroku's support team wasn't confident we could sync all 187 postgres tables under 1 connection.
Interesting, what was it from that podcast that made you reconsider? Always eager to learn about opportunities for improving the experience of using Debezium.
Oh wow, by "work on" you mean "the core maintainer of". Thank you for replying. :)
The main thing that made me reconsider was the level of effort involved in taking the data from kafka and landing it in snowflake, especially around handling postgres schema changes safely. I also have no experience with kafka, so I'd be out of my depth pretty quickly for a critical part of the architecture. He also expressed the need for building quality checks into the kafka-to-snowflake code, but those details were a bit sparse (if I recall correctly).
Note: all of the above are probably outside the scope of debezium. :)
Note 2: your article [1] on using cdc to build audit logs w/ a "transactions" table blew my mind. Once I listened to your data engineering podcast interview [2], I knew there was some implementation of "event sourcing lite w/ a crud app" possible, so I was excited to see you had already laid it out.
Gotcha, yeah, there are many things to consider indeed when setting up end-to-end pipelines. Thanks for the nice feedback, so happy to hear those resources are useful for folks. As far as event sourcing is concerned, we've got another post [1] which might be interesting to you, discussing how "true ES" compares to CDC, the pros/cons of either approach, etc.
Adding to your list of options that still require _some_ downtime: we used Bucardo [0] in lieu of logical replication. It was a bit of a pain, since Bucardo has some rough edges, but we made it work. Database was ~2 TiB.
Coming from the outside, with zero understanding of the internal details, my hunch is the same: lack of support for logical replication is more of a business decision than a technical decision. (But again, this is a hunch -- partially based on how good heroku is from a technical perspective.)
It's absolutely an evil business decision, and all the clouds are playing this game. Don't ever use a hosted database solution if you're thinking about storing any significant amount of data. You will not be able to get it out without downtime.
[1] https://vericred.com/how-we-migrated-a-1tb-database-from-her...
[2] https://pawelurbanek.com/heroku-migrate-postgres-rds