We currently use the Kafka Connect BigQuery connector, along with Debezium, both to "stream" the changelog/transaction log and to "mirror" our RDBMS instances into BigQuery. While this works, it's been a fair amount of effort to iron out issues over time. We've also had to work around BigQuery limits, including exceeding the concurrent-query quota (we switched to batch mode, which has its own issues) and write frequency (we've had to throttle to flushing every minute, which is good enough, though we did have a use case for faster updates). There are also issues around partitioning and clustering, and more...
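Most of those workarounds live in connector config. A rough sketch of the relevant knobs, for anyone curious (property names as I recall them from the kafka-connect-bigquery 2.x line, and the topic/project/bucket/field values are made up, so verify against your version's docs before copying anything):

    # BigQuery sink connector config (illustrative values)
    connector.class=com.wepay.kafka.connect.bigquery.BigQuerySinkConnector
    topics=dbserver1.public.orders
    project=my-gcp-project
    defaultDataset=replica

    # Batch mode: route these topics through GCS batch loads instead of
    # streaming inserts, to stay under the concurrent-query limits
    enableBatchLoad=dbserver1.public.orders
    gcsBucketName=my-connect-batch-bucket
    batchLoadIntervalSec=60

    # Partitioning + clustering of the destination tables
    timestampPartitionFieldName=created_at
    clusteringPartitionFieldNames=customer_id,region

The once-a-minute flush throttle is a worker-level setting rather than a connector one (offset.flush.interval.ms=60000 in the Connect worker config).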
So the prospect of this replacing the Kafka Connect BigQuery connector looked appealing. However, according to the docs and listed limitations (https://cloud.google.com/datastream/docs/sources-postgresql), it doesn't handle schema changes well, nor Postgres array types. Not that any of these tools handle those well, but because the BigQuery connector is open source, we've been able to work around this with customizations to the code. Hopefully they'll continue to iterate on the product; I'll be keeping an eye out.
Yeah, the Debezium connectors have some issues that really get in the way. I'm less familiar with BQ, but for some other DBs the data typing is really, really basic (roughly "case when string then varchar(4000/max)") and similar. It looks like a relatively easy thing to incrementally improve.
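To make that concrete, here's a hypothetical sketch (made-up table and types, not from any particular connector) of what that kind of naive mapping produces versus one that carries the source schema's precision through:

    -- Source columns (e.g. Postgres): id bigint, status varchar(16), total numeric(10,2)

    -- Naive mapping: "if it's a string, make it as wide as possible"
    CREATE TABLE orders_naive (
      id     NUMERIC,        -- precision dropped
      status VARCHAR(4000),  -- every string becomes varchar(4000)/max
      total  VARCHAR(4000)   -- worst case: non-strings get stringified too
    );

    -- Schema-aware mapping: length and precision read from the source schema
    CREATE TABLE orders_precise (
      id     BIGINT,
      status VARCHAR(16),
      total  NUMERIC(10, 2)
    );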
Hey, Gunnar here from the Debezium team. I would love to learn more about the issues you encountered with Debezium; perhaps they're something we could improve. If you could share your feedback either here or on our mailing list (https://groups.google.com/g/debezium), that would be awesome. Thanks!
For us, Debezium has been working rather well in recent memory (there were a couple of PRs I submitted that I'm proud of, even if tiny!). Most of the issues are on the Kafka Connect sink side, whether it's the BigQuery sink, the JDBC sink, or the S3 sink.
A couple of things that do pop to mind for Debezium itself: better capabilities around backups/restores and disaster recovery, and richer hints around schema changes. Admittedly, I haven't looked at these areas in 6+ months, so they may have improved.
Well, I'm trying to find it now! Maybe it has been updated. It was essentially a case statement on the destination side that went for some wide data types, but looking now, I can see where it inspects the schemas and gets more precision.
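For what it's worth, Debezium can feed sinks the detail they need for that: its column.propagate.source.type option attaches the original type, length, and scale of matched columns as schema parameters (__debezium.source.column.type, __debezium.source.column.length, __debezium.source.column.scale), which a schema-aware sink can read back when sizing destination columns. A minimal sketch (the column list here is made up):

    # In the Debezium source connector config: propagate original column
    # type info for these columns (regexes on fully-qualified names)
    column.propagate.source.type=public\.orders\.(status|total)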