I want to be able to build aggregations which shows number of "viewedArticle" events grouped by hour, grouped by topic, counting unique user_ids within each bucket.
Or let's say I want the top K articles viewed each day, filtered by a topic.
That's something that's trivial with Elasticsearch, which has a hierarchical aggregation DSL. Is ClickHouse good at this?
Whenever I see time-series databases such as InfluxDB mentioned, they look like they're focused on measurements, not discrete rows. You can attach the event data as "labels", but this isn't efficient when the cardinality of each column is very high (e.g. article IDs or user IDs in the above example).
Yes. Clickhouse is a column-oriented relational database among many others like MemSQL, Vertica, Redshift, BigQuery, Snowflake, Greenplum, etc. They're all focused on analytical queries over very large datasets using SQL.
An aggregation with several `group by` statements is no challenge and all of these databases also support approximate counting via HyperLogLog for faster results.
Clickhouse has some unique features where each table can have a separate 'engine' including some that automatically apply aggregations. Start with a normal table though since it'll be plenty fast enough for most use cases.
Thanks! Looks like the only downside is that, as it returns rows as results, you end up getting a lot of duplicate column data back and need to "nest" the nested buckets yourself.
Now you have "news" repeated, and to group this into buckets for rendering summary tables and such (with sub totals at each level), you need to iterate through the flattened results and generate nested structures. This is something Elasticsearch gives you out of the box.
Last I looked at Clickhouse, it had master/slave replication only, and if you want shards of data distributed across a cluster it's something you need to manually manage?
> Now you have "news" repeated, and to group this into buckets for rendering summary tables and such (with sub totals at each level), you need to iterate through the flattened results and generate nested structures. This is something Elasticsearch gives you out of the box.
ClickHouse has a number of optimization for solving 'visitor' problems that you describe. Assuming you just want to group in different ways an idiomatic ClickHouse solution is to construct a materialized view that aggregates counts (e.g., of unique users like uniq(user)). You can then select from the materialized view and further aggregate to have larger buckets. ClickHouse can also compute single-level totals using the WITH TOTALS modifier.
If you need to have cascading sub-totals within the same listing as far as I know you'll have to compute the totals yourself. (That feature actually might be an interesting pull request since ClickHouse generates JSON output.)
> Last I looked at Clickhouse, it had master/slave replication only, and if you want shards of data distributed across a cluster it's something you need to manually manage?
ClickHouse replication is multi-master. The model is eventually consistent. Also, ClickHouse can automatically shard INSERTs across a cluster using distributed tables. That said, many users insert directly to local nodes because it's faster and uses fewer resources.
Right, relational databases only return flat tabular results but that seems minor compared to performance increase you gain.
Clickhouse is fast but not as operationally friendly as the others. It's more much work once you go beyond a single node so I'd suggest looking at those other options if you want something easier to operate, or use one of the cloud data warehouses like Bigquery or Snowflake to eliminate ops entirely.
{event: "viewedArticle", article_id: 63534, user_id: 42, topic: "news", time: "2020-01-06"}
I want to be able to build aggregations which shows number of "viewedArticle" events grouped by hour, grouped by topic, counting unique user_ids within each bucket.
Or let's say I want the top K articles viewed each day, filtered by a topic.
That's something that's trivial with Elasticsearch, which has a hierarchical aggregation DSL. Is ClickHouse good at this?
Whenever I see time-series databases such as InfluxDB mentioned, they look like they're focused on measurements, not discrete rows. You can attach the event data as "labels", but this isn't efficient when the cardinality of each column is very high (e.g. article IDs or user IDs in the above example).