> Does Citus keep its performance over tables with tens of billions of records?
Citus essentially shards the data across machines, and queries these in parallel. You can thus scale out your cluster and CPU cores as you add more data and maintain performance.
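To make the idea concrete, here is a minimal Python sketch of hash sharding plus a parallel aggregate. It is only an illustration of the general technique, not how Citus is implemented: the shard count, the CRC32 hash, and every function name below are assumptions made for the example, and threads stand in for worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_SHARDS = 4  # hypothetical shard count, chosen for the example


def shard_for(key: str) -> int:
    # Stable hash so a given key always maps to the same shard.
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def build_shards(rows):
    # Distribute (key, value) rows across shards by hashing the key.
    shards = [[] for _ in range(NUM_SHARDS)]
    for key, value in rows:
        shards[shard_for(key)].append((key, value))
    return shards


def partial_sum_count(shard):
    # Work each "node" does independently on its own shard.
    return sum(value for _, value in shard), len(shard)


def distributed_avg(shards):
    # Run the partial aggregate on every shard in parallel,
    # then combine the small partial results in one place.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(partial_sum_count, shards))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


rows = [(f"user{i}", i) for i in range(1000)]
shards = build_shards(rows)
average = distributed_avg(shards)
```

An average works well here because it decomposes into per-shard sums and counts, so only two numbers per shard need to travel back to the coordinator.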
> Also, how fast is it for ad-hoc queries over data coming from streams (Kafka/Kinesis) that has not been cached?
By 'cached', do you mean OS or database caching in-memory? Query performance for on-disk data is as fast as you can get with regular PostgreSQL, since each data node is essentially a PostgreSQL node, and each shard a regular PostgreSQL table. Standard tuning like indexes and Postgres configuration parameters will apply here.
Not every query is parallelizable. Maintaining performance is a lie. An easy-to-grasp example is computing a median. And I mean an exact median, not an approximation.
@Tharkun: You are right that not every query is immediately parallelizable. Distinct counts are another example. In some cases data can be re-partitioned so we can calculate exact values and push down computation in parallel. This may provide better performance than a single large table, so there are still benefits to it. Ultimately, though, there will be tradeoffs in moving to an entirely distributed environment, but depending on the use case the value may offset those.
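A small Python sketch of both points, with made-up shard contents. It shows that combining per-shard medians does not give the exact median, and that per-shard distinct counts overcount until the data is re-partitioned by value. The modulo-based re-partitioning and the function names are assumptions for illustration, not a description of Citus internals.

```python
from statistics import median

# Median: per-shard results cannot simply be combined.
median_shards = [[1, 2, 3, 4, 100], [5, 6, 7], [8, 9]]
all_values = [v for shard in median_shards for v in shard]
exact_median = median(all_values)
median_of_medians = median(median(shard) for shard in median_shards)

# Distinct count: the same value can live on several shards.
distinct_shards = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]


def naive_distinct(shards):
    # Summing per-shard distinct counts double-counts shared values.
    return sum(len(set(shard)) for shard in shards)


def repartition_by_value(shards, num_partitions):
    # Send each value to a partition chosen by the value itself, so
    # all copies of a value end up in exactly one partition.
    partitions = [[] for _ in range(num_partitions)]
    for shard in shards:
        for value in shard:
            partitions[value % num_partitions].append(value)
    return partitions


def repartitioned_distinct(shards, num_partitions=3):
    # After re-partitioning, per-partition distinct counts are disjoint,
    # so they can be computed in parallel and summed exactly.
    partitions = repartition_by_value(shards, num_partitions)
    return sum(len(set(p)) for p in partitions)


true_distinct = len({v for shard in distinct_shards for v in shard})
```

The exact distinct count comes at the price of shuffling data between nodes, which is the tradeoff mentioned above.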
I'm not sure why folks are downvoting you, because most database systems that provide the full array of relational operations (joins, GROUP BY, GROUP BY CUBE, etc.) do not scale linearly (except maybe up to a handful of nodes). Mixing OLTP and OLAP using current technologies is hard.
Sumedh from Citus Data here. I'd love to hear what difficulties you ran into, or any feedback you may have on pg_shard. My email is in my profile if you want to drop me a note.
We are also actively working on making our products easier to use and would love to get more user input along the way.
That is correct. This is primarily a bulk-load system. There are setups (as mentioned in smilliken's comment above) where it can be used for real-time inserts, but that requires more hands-on setup and configuration.