Thanks for the context. In what way would you say Cloudberry lags behind Greenplum technology-wise? I see newer Greenplum versions have a lot of planner improvements.
Greenplum 7 is listed as tracking Postgres 12 in the release announcement [1], and the release notes for later 7.x versions don't mention a newer Postgres base. Is there a newer release with higher compatibility?
When I say ancient, I mean that it's a "classical" shared-nothing design where the database is partitioned and hosted as parallel, self-contained replica servers, where each node runs as a shard that could, in theory, be queried independently of the master database. This is in contrast to newer architectures where data is sharded at the heap level (e.g. Yugabyte, CockroachDB) and/or compute is separated from data (e.g. Aurora, ClickHouse, Neon, TiDB).
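To make the "classical" shared-nothing picture concrete, here's a toy Python sketch. It is not Greenplum's code: the `Segment` and `Coordinator` classes, the CRC32 hash, and the `demo` helper are all made up for illustration. The only point is that each segment owns its rows outright and can answer a query by itself, while the coordinator just routes inserts by distribution key and adds up the partial results.

```python
import zlib


class Segment:
    """A self-contained shard: it owns its rows and can be queried directly."""

    def __init__(self, segment_id):
        self.segment_id = segment_id
        self.rows = []

    def insert(self, row):
        self.rows.append(row)

    def count_where(self, predicate):
        # Purely local work; no other segment is consulted.
        return sum(1 for row in self.rows if predicate(row))


class Coordinator:
    """Hypothetical stand-in for the master: routes rows, gathers results."""

    def __init__(self, num_segments):
        self.segments = [Segment(i) for i in range(num_segments)]

    def segment_for(self, key):
        # A stable hash (CRC32 here, chosen only for the sketch) picks the owner.
        index = zlib.crc32(str(key).encode("utf-8")) % len(self.segments)
        return self.segments[index]

    def insert(self, row, distribution_key):
        self.segment_for(row[distribution_key]).insert(row)

    def count_where(self, predicate):
        # Scatter the query to every segment, then sum the partial counts.
        return sum(seg.count_where(predicate) for seg in self.segments)


def demo():
    coordinator = Coordinator(num_segments=4)
    for i in range(100):
        coordinator.insert({"id": i, "amount": i * 10}, distribution_key="id")
    return coordinator
```

Calling `demo().segments[0].count_where(...)` works without going through the coordinator at all, which is the property I mean by "could, in theory, be queried independently".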
Cloudberry, last I checked, took their snapshot of all the Greenplum utilities well before the repos got archived and development went private. The backup/restore, DR, upgrade, and similar utilities seem to leave a lot on the table. I haven't checked in a while, so it's possible they've picked some of that progress back up.
You're completely right, I had the wrong PG version in my memory. Embarrassing, thanks for catching that.
All the Greenplum utilities you mentioned here are also open-sourced and available for Cloudberry, but some of them are not in the main repo of Apache Cloudberry (this is more a matter of adhering to the Apache Software Foundation's regulations than a technical limitation).
Here is the unofficial roadmap of Cloudberry:
1. Continuously upgrading the PostgreSQL core version, maintaining compatibility with Greenplum Database, and strengthening the product's stability.
2. End-to-end performance optimization to support near real-time analytics, including streaming ingestion, vectorized batch processing, JIT compilation, incremental materialized views, PAX storage format, etc.
3. Supporting lakehouse applications by fully integrating open data lake table formats represented by Apache Iceberg, Hudi, and Delta Lake.
4. Gradually transforming Cloudberry Database into a data foundation supporting AI/ML applications, based on Directory Table, pgvector, and PostgresML.
[1] https://greenplum.org/partition-in-greenplum-7-whats-new/