Is ClickHouse good for event data when you want to do rollups? For example, say ...

manigandham · on Jan 6, 2020

Yes. Clickhouse is a column-oriented relational database among many others like MemSQL, Vertica, Redshift, BigQuery, Snowflake, Greenplum, etc. They're all focused on analytical queries over very large datasets using SQL.

An aggregation with several `group by` statements is no challenge and all of these databases also support approximate counting via HyperLogLog for faster results.

Clickhouse has some unique features where each table can have a separate 'engine' including some that automatically apply aggregations. Start with a normal table though since it'll be plenty fast enough for most use cases.

atombender · on Jan 6, 2020

Thanks! Looks like the only downside is that, as it returns rows as results, you end up getting a lot of duplicate column data back and need to "nest" the nested buckets yourself.

For example, a result like:

  topic;time;count
  news;2020-01-01;44
  news;2020-01-02;31

Now you have "news" repeated, and to group this into buckets for rendering summary tables and such (with sub totals at each level), you need to iterate through the flattened results and generate nested structures. This is something Elasticsearch gives you out of the box.

Last I looked at Clickhouse, it had master/slave replication only, and if you want shards of data distributed across a cluster it's something you need to manually manage?

hodgesrm · on Jan 6, 2020

> Now you have "news" repeated, and to group this into buckets for rendering summary tables and such (with sub totals at each level), you need to iterate through the flattened results and generate nested structures. This is something Elasticsearch gives you out of the box.

ClickHouse has a number of optimization for solving 'visitor' problems that you describe. Assuming you just want to group in different ways an idiomatic ClickHouse solution is to construct a materialized view that aggregates counts (e.g., of unique users like uniq(user)). You can then select from the materialized view and further aggregate to have larger buckets. ClickHouse can also compute single-level totals using the WITH TOTALS modifier.

If you need to have cascading sub-totals within the same listing as far as I know you'll have to compute the totals yourself. (That feature actually might be an interesting pull request since ClickHouse generates JSON output.)

> Last I looked at Clickhouse, it had master/slave replication only, and if you want shards of data distributed across a cluster it's something you need to manually manage?

ClickHouse replication is multi-master. The model is eventually consistent. Also, ClickHouse can automatically shard INSERTs across a cluster using distributed tables. That said, many users insert directly to local nodes because it's faster and uses fewer resources.

atombender · on Jan 6, 2020

Sounds great, thank you!

manigandham · on Jan 6, 2020

Right, relational databases only return flat tabular results but that seems minor compared to performance increase you gain.

Clickhouse is fast but not as operationally friendly as the others. It's more much work once you go beyond a single node so I'd suggest looking at those other options if you want something easier to operate, or use one of the cloud data warehouses like Bigquery or Snowflake to eliminate ops entirely.