Logical replication and decoding for Cloud SQL for PostgreSQL (cloud.google.com)
73 points by wdb on June 5, 2021 | 36 comments


Glad to see that they are adding features but I wish the pace was faster. We have been using Cloud SQL for Postgres and overall it is good but there are a number of glaring and frustrating feature holes. The two top of mind for me are:

1) No way to force SSL connections without enforcing two-way SSL (which is a huge pain and not supported by all the clients we use). This is literally just a Postgres config option but they don’t expose it. RDS has had this feature since 2016.

2) No in place upgrade. This is again a feature built in to Postgres and RDS has had it for years. Instead the upgrade story for Cloud SQL is cumbersome and involves setting up a new instance, creating and loading backups, etc.

We switched to Cloud SQL from running our own Postgres and it is a huge improvement, but the feature set is disappointing compared to RDS.


Also, if you use Cloud SQL in HA mode, it may still go down randomly for up to 90s with no warning or entry in the operation log, and this is considered expected behaviour.

Here is a direct quote from google support when we contacted them about our database going down outside of our scheduled maintenance window:

> As I mentioned previously remember that the maintenance window is preferred but there are time-sensitive maintenance events that are considered quite important such as this one which is a Live migration. Most maintenance events should be reflected on the operations logs but there are a few other maintenance events such as this one that are more on the infrastructure side that appear transparent to clients because of the nature of the changes made to the Google managed Compute Engine that host the instances, this is a necessary step for maintaining the managed infrastructure. For this reason this maintenance does not appear visible in your logs or on the platform.

Here "transparent to clients" means that the database is completely inaccessible for up to 90s. Furthermore, because there's no entry in the operation log, there's no way to detect if the database is down because of "expected maintenance", or because of some other issue without talking to a human at google support: so really great if you're woken up in the middle of the night because your database is down, and you're trying to figure out what happened...


That’s troubling. In fairness, when I last used RDS (2018) we had 9 databases running and we averaged about one database failover per month, with about 2-3 minutes of downtime per incident. I never got a satisfactory answer from support other than that this was a thing that sometimes happened.


To be clear: there is no failover happening, even though we do have a failover instance. If there was, we could at least detect that something had happened after the fact!

The Cloud SQL failover only occurs in certain circumstances, and in all our time using Cloud SQL the failover has not once kicked in automatically (despite many outages).

In fact, one of our earliest support issues was that the "manual failover" button was disabled when any sort of operation was occurring on the database, making it almost completely useless! Luckily this issue at least was fixed.


It sounds like the machine that the DB VM was running on was being taken offline or restarted for maintenance. The default behaviour for GCE is to live-migrate the VM to another machine, so I guess Cloud SQL uses the default here. (It may well be the best option, as failover isn't instant either.) Live migration is usually much faster than 90s as well, but if you are making heavy RAM updates that could definitely slow it down.

Either way, I agree that some full-stack integration is needed on GCP's part to at least get that into the maintenance log. It would also be nice to make most of these happen during the maintenance window, but IIUC they don't always have 24h notice of a machine reboot.


Yikes! That is alarming. I also don’t like the part about automated failover not working.


Live migration is when they move you to another system.

They have to do that when hardware fails (if you're lucky), but that it's happening so often suggests it's part of regular software maintenance or something like that. Which is pretty unacceptable to me.


Yes, this is a regular occurrence on GCP.


I could not agree with you more. These are exactly our two complaints about Postgres on Cloud SQL.


Glad to know I am not alone!


How significant is (1) if you are using the Cloud SQL proxy? My understanding is that the proxy tunnels traffic over an encrypted connection, so there is no benefit to adding an extra layer.


cloud sql proxy is secure and that is what all our developers use for local access to the database.

However, we have some third-party data analysis tools (such as Tableau) that also connect to one of our databases. They are hosted in their own clouds, have to connect over the database's public IP address, and can't use cloud_sql_proxy. I of course manually confirmed that these connections use SSL, but I would feel much more comfortable if I could enforce it from our end.
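
For what it's worth, the spot check itself is just a query against pg_stat_ssl (assuming a role that is allowed to see other sessions in pg_stat_activity):

   -- Show which current connections negotiated SSL, and with what protocol version
   SELECT sa.usename, sa.client_addr, ssl.ssl, ssl.version
   FROM pg_stat_ssl AS ssl
   JOIN pg_stat_activity AS sa USING (pid);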


I am wondering why there is no in-place upgrade. Are they concerned that doing "--link" could cause data corruption that is irrecoverable?


What is two-way SSL? Do you mean client certificate authentication?


Yup


This is awesome! For comparison, this is also supported on Amazon RDS, so AFAICT this opens up the possibility of near-zero-downtime streaming migrations between the two cloud providers: https://aws.amazon.com/blogs/database/using-logical-replicat...

Also, it enables a really cool pattern of change data capture, which allows you to capture "normal" changes to your Postgres database as events that can be fed to e.g. Kafka and power an event-driven/CQRS system. https://www.confluent.io/blog/bottled-water-real-time-integr... is a 2015 post describing the pattern well; the modern tool that replaces Bottled Water is https://debezium.io/ . For instance, if you have a "last_updated_by" column in your tables that's respected by all your applications, this becomes a more-or-less-free audit log, or at the very least something that you can use to spot-check that your audit logging system is capturing everything it should be!
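
If you just want to see what logical decoding emits before wiring up Debezium, a minimal sketch using the built-in test_decoding plugin looks roughly like this (slot and table names are made up for illustration):

   -- Create a logical replication slot with the test_decoding output plugin
   SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

   -- Make a change in some table (assumes an existing "orders" table)
   INSERT INTO orders (id, status) VALUES (1, 'created');

   -- Peek at the decoded change stream without consuming it
   SELECT * FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);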

When you're building and debugging systems that combine trusted human inputs, untrusted human inputs, results from machine learning, and results from external databases, all related to the same entity in your business logic (and who isn't doing all of these things, these days!), having this kind of replayable event capture is invaluable. If you value observability of how your distributed system evolves within the context of a single request, tracking a datum as it evolves over time is the logical (heh) evolution of that need.


Where I work we use debezium to enable “transactional publishing” to kinesis.

Not only does it allow publishing events as part of an ordinary database transaction, it also provides a nice buffer if kinesis ever goes down. During the Nov. 2020 kinesis outage, our services using this mechanism kept chugging with no immediate issues.
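
The shape of it (the transactional outbox pattern) is roughly the following; the table and column names here are made up for illustration, not our actual schema:

   -- Hypothetical outbox table; Debezium tails its inserts and forwards them to Kinesis
   CREATE TABLE outbox (
       id         bigserial   PRIMARY KEY,
       aggregate  text        NOT NULL,   -- e.g. 'order:42'
       event_type text        NOT NULL,
       payload    jsonb       NOT NULL,
       created_at timestamptz NOT NULL DEFAULT now()
   );

   -- The business write and the event write commit (or roll back) together
   BEGIN;
   UPDATE orders SET status = 'shipped' WHERE id = 42;
   INSERT INTO outbox (aggregate, event_type, payload)
   VALUES ('order:42', 'order_shipped', '{"order_id": 42}');
   COMMIT;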


Similarly - Supabase uses Elixir to listen to Postgres changes via logical replication. Pretty neat pattern and Elixir/Erlang is especially good at this sort of thing:

https://github.com/supabase/realtime


Hijacking this thread a bit for a related question. Anyone have a solution for replicating data from Cloud SQL Postgres to BigQuery that they like?

Have been shopping around for a good way to do this, ideally with the ability to capture deletions and schema changes.

Have looked at Fivetran but it seems expensive for this use case, and won’t capture deletions until they can support logical replication.


We get data into BigQuery from pg using a custom airflow plugin but recently started doing something similar on AWS with DMS to get data from postgres -> redshift.

DMS is awesome. No code is the best code. BigQuery is great, but not having a "snap-your-fingers and the data is there" connector makes it a PITA for me to maintain.

As a result… we are using more redshift.


One option is to use DBeam (http://github.com/spotify/dbeam) to export Avro records into GCS and then load to BigQuery with a load job.


I tend to utilize BigQuery's external connections and scheduled queries to periodically clone my "hot" PG data to a BQ table for long-term storage and analysis.

It seems so much easier to go PG=>BQ than the other way around.


Is the idea something like a scheduled:

   INSERT INTO bq_table SELECT * FROM EXTERNAL_QUERY('');
I'm guessing you're on the hook for keeping the schema up to date with the Postgres schema.
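
(I suppose a CREATE OR REPLACE TABLE form would sidestep the schema maintenance, since the table gets recreated with whatever schema the query returns; the connection and table names below are placeholders.)

   CREATE OR REPLACE TABLE my_dataset.orders AS
   SELECT *
   FROM EXTERNAL_QUERY(
     'my-project.us.my-cloudsql-connection',
     'SELECT * FROM orders');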


I also use the external data connector; the problem is when you have two projects in different locations. Then I go for psycopg2 and upload it to BigQuery using BigQuery magic lines.


Same here. We previously used Stitch until they dropped the free tier.


I second federated table connector (and scheduled queries).


We went with StitchData over FiveTran. We also had to build a custom way to delete records. Hopefully we can get rid of this soon.


An option, albeit a bit complex would be:

Debezium => Kafka => Parquet in Google Cloud Storage => BigQuery external queries.
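
The last hop can just be an external (federated) table defined over the Parquet files, so BigQuery reads them in place; the bucket and dataset names below are placeholders:

   CREATE EXTERNAL TABLE my_dataset.cdc_events
   OPTIONS (
     format = 'PARQUET',
     uris = ['gs://my-bucket/cdc/events/*.parquet']
   );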


Why the Parquet step? You should be able to do straight Debezium -> Kafka -> BQ, using the BQ sink connector for Kafka Connect (https://github.com/confluentinc/kafka-connect-bigquery); we have users using this with the Debezium MySQL connector, I'd expect this to work equally for Postgres.

Disclaimer: working on Debezium


We use fivetran for Heroku postgres, but also use it for other random sources to Snowflake.

It's decent.


Presumably Alooma would be the one, since Google acquired them, but two-plus years later they still aren't integrated and have stopped accepting new customers.


BigQuery can read directly from Cloud SQL.


Are you referring to federated queries?


You can do something like:

   CREATE TABLE xx AS SELECT * FROM EXTERNAL_QUERY('postgres.db', 'SELECT * FROM my_table');


Google just launched the Datastream product in beta, which looks very cool. I find it curious that it doesn't support Postgres at all, even as they're launching the logical replication feature for Cloud SQL.

In fact, the Datastream documentation has a diagram showing Postgres as a source, as well as custom sources — but, disappointingly, neither is supported. Only Oracle and MySQL are supported.


The blog article says it's planned for later this year. :)



