spathak's comments

spathak · on April 20, 2016

Sumedh from Citus Data here.

> Does Citus keeps it's performance over tables with tens of billions of records?

Citus essentially shards the data across machines, and queries these in parallel. You can thus scale out your cluster and CPU cores as you add more data and maintain performance.

> Also, how fast it is for ad-hoc queries over data coming from streams (Kafka/Kinesis) that has not been cached?

By 'cached', do you mean OS or database caching in-memory? Query performance for on-disk data is as fast as you can get with regular PostgreSQL, since each data node is essentially a PostgreSQL node, and each shard a regular PostgreSQL table. Standard tuning like indexes and Postgres configuration parameters will apply here.

Tharkun · on April 20, 2016

Not every query is parallelizable. Maintaining performance is a lie. An easy to grasp is example is computing a median. And I mean an exact median, not an approximation.

spathak · on April 20, 2016

@Tharkun: You are right that not every query is immediately parallelizable. Distinct count's are another example. In some cases data can be re-partitioned so we can calculate exact values and push down computation in parallel. This may provide better performance than a single large table, so there are still benefits to it. Ultimately though there will be tradeoffs to moving to an entirely distributed environment, but depending on the use-case the value may offset those.

mtanski · on April 20, 2016

I'm not sure why folks are downvoting you because most database systems that provide the full array of relational operations (joins, groupby, groupby cube, etc) do not scale linearly (maybe past a handful of nodes). Mixing OLTP / OLAP using current technologies is hard.

spathak · on Oct 10, 2015

Sumedh from Citus Data here. I'd love to hear what difficulties you ran into, or any feedback you may have on pg_shard. My email is in my profile if you want to drop me a note.

We are also actively working on making our products easier to use and would love to get more user input along that way.

neumino · on Oct 10, 2015

It was mostly a lack of docs. Most of the documentation assumes citusDB, and how to get things work without citusDB is left out.

spathak · on June 29, 2012

That is correct. This is primarily a bulk-load system. There are setups (as mentioned in the smilliken's comment above), where it can be used for real-time inserts, but requires more hands-on setup and configuration.