> You seem to underestimate how heavily ClickHouse is optimized (e.g. compressed storage).
Is it any more compressed than Apache Hive's ORC format (https://orc.apache.org)? Because that's increasingly accepted as a storage format in a lot of these analytical systems.
Yes, looks like it. According to these posts, ORC only uses Snappy or zlib compression, while ClickHouse also offers specialized codecs such as DoubleDelta, Gorilla, and T64.
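For anyone curious what a codec like DoubleDelta buys you on time-series columns, here's a rough Python sketch of the idea (my own illustration, not ClickHouse's actual implementation): regularly spaced values collapse to a stream of zeros, which then bit-packs or compresses to almost nothing.

```python
def double_delta(values):
    """Encode a sequence as: first value, first delta, then deltas-of-deltas."""
    if len(values) < 2:
        return list(values)
    deltas = [b - a for a, b in zip(values, values[1:])]
    dod = [b - a for a, b in zip(deltas, deltas[1:])]
    return [values[0], deltas[0]] + dod

# Timestamps arriving once per second: the encoded stream is almost all zeros,
# which a generic compressor (or bit-packing) shrinks dramatically.
ts = [1700000000 + i for i in range(10)]
print(double_delta(ts))  # [1700000000, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```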
ORC or Parquet are file storage formats so without context their performance can be almost anything. Where is the data stored? S3? HDFS? Local ram disk?
Clickhouse manages the whole distributed storage, ram caching, etc. thing for you.
In my experience, a unified single purpose vertically integrated solution will be faster than a bunch of kitchen sink solutions bolted together.
Of those, it looks like only Presto is open source and free, so maybe it's a Presto versus ClickHouse comparison, which explains why so many choose ClickHouse (it's one of only two options in its class).
Presto is mostly an engine that runs on top of other databases, although it does have its own query execution engine.
The basic idea behind Presto is that it federates other databases, and supports doing joins across them. From what I understand, the problem that it solved at Facebook is bridging the gap between different teams; if a team has MySQL and another has files stored on HDFS, it doesn't really matter because all you do is query Presto and it'll query both under the covers. The alternative is setting up data pipelines, and dealing with the ongoing issues of maintaining those data pipelines.
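A toy illustration of that federation idea (nothing to do with Presto's real connector API): two mock "catalogs" backed by completely different storage, joined in one engine so the caller never touches either system directly.

```python
# Hypothetical mock sources standing in for a MySQL table and files on HDFS.
mysql_users = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
hdfs_events = [{"user_id": 1, "event": "login"},
               {"user_id": 1, "event": "click"},
               {"user_id": 2, "event": "login"}]

def federated_join(left, right, left_key, right_key):
    """Hash join across two 'catalogs', the way a federated engine would."""
    index = {}
    for row in left:
        index.setdefault(row[left_key], []).append(row)
    return [{**l, **r} for r in right for l in index.get(r[right_key], [])]

joined = federated_join(mysql_users, hdfs_events, "id", "user_id")
print(len(joined))  # 3 joined rows, no data pipeline in sight
```

The point is purely architectural: the join happens in the query engine, so neither source system needs to know about the other.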
How well do those work on a single 8GB node? Because ClickHouse works very well at that scale, with a single C++ executable.
There are large complexity and cost overheads to Hadoop solutions, and not everyone has actual big-data problems. ClickHouse hugely outperforms on query patterns that would devolve into table scans in a row store, while working at row-store volumes of data without a bunch of big nodes.
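To make "devolves into a table scan in a row store" concrete, here's a toy comparison (my own illustration, not a benchmark): a row store has to walk every field of every row to aggregate one column, while a column store only reads the one contiguous array it needs.

```python
# Row layout: each record is stored together, so summing one column
# still drags every other field through memory.
rows = [{"user_id": i, "country": "DE", "revenue": i * 0.5, "ua": "x" * 50}
        for i in range(1000)]
row_sum = sum(r["revenue"] for r in rows)  # touches whole rows

# Column layout: the same data as one array per column; the aggregation
# only ever scans the 'revenue' array (and it compresses far better, too).
revenue_col = [r["revenue"] for r in rows]
col_sum = sum(revenue_col)  # touches exactly one column

assert row_sum == col_sum
```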
Snowflake doesn’t really keep up with Clickhouse (in my experience) and it costs money.
Databricks is essentially Spark, and I shouldn’t need a whole Spark cluster just to get database functionality. It also costs money.
Unless I’m mistaken, Presto is just a distributed query tool over the top of a separate storage layer, so that’s two things you have to set up.
I have no experience with BigQuery, but I’ve heard good things about it and Redshift; however, if the rest of your infra isn’t on GCP/AWS then that will probably be a blocker.
ClickHouse is open source and comes with convenient clients in a bunch of languages as well as an HTTP API. It’s outrageously fast, makes the right trade-offs for its use-case, and has some cool features: a large range of supported input/output formats, built-in Kafka support, and replication and sharding that are reasonably straightforward to set up.
I don't think it's fair to say "A is faster than B" like in the above comments based on the order they appear in a list that mixes results from GPU clusters and laptops. The author of the benchmark does nothing wrong deontologically, but the results table seems to be ordered by time, and some people jump to quick conclusions or use it as a way to rank performance when that's not appropriate.
For example, aside from the lack of transactions, ClickHouse is designed for insertion. There's an INSERT statement, but no UPDATE or DELETE statements. You can rewrite tables (there's ALTER TABLE ... UPDATE and ALTER TABLE ... DELETE), but they're intended for large batch operations, and the operations are potentially asynchronous: they complete right away, but you only see the results later.
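A conceptual sketch of what "completes right away, results visible later" means (purely illustrative, not ClickHouse internals): the ALTER only queues a mutation, and rows actually disappear once a background merge rewrites the data parts.

```python
class ToyTable:
    """Toy model of ClickHouse-style asynchronous mutations."""
    def __init__(self, rows):
        self.rows = list(rows)
        self.pending = []          # queued mutations, not yet applied

    def alter_delete(self, predicate):
        # Returns immediately; the delete is only *scheduled*.
        self.pending.append(predicate)

    def background_merge(self):
        # Later, a merge rewrites the parts and applies queued mutations.
        for pred in self.pending:
            self.rows = [r for r in self.rows if not pred(r)]
        self.pending.clear()

t = ToyTable([{"id": i} for i in range(5)])
t.alter_delete(lambda r: r["id"] < 3)
print(len(t.rows))   # still 5: the mutation hasn't run yet
t.background_merge()
print(len(t.rows))   # 2: results only visible after the rewrite
```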
ClickHouse has many other limitations. For example, there's no enforcement of uniqueness: You can insert the same primary key multiple times. You can dedupe the data, but only specific table engines support this.
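The dedupe-on-merge behaviour alluded to here (e.g. the ReplacingMergeTree engine) can be sketched like this: duplicate primary keys are accepted on insert and only collapsed when parts merge, keeping the latest version. This is a simplification of the idea, not the real merge algorithm.

```python
def replacing_merge(rows, key="pk", version="ver"):
    """Keep only the highest-version row per key, roughly what
    ReplacingMergeTree does when it merges data parts."""
    latest = {}
    for r in rows:
        k = r[key]
        if k not in latest or r[version] > latest[k][version]:
            latest[k] = r
    return list(latest.values())

inserted = [{"pk": 1, "ver": 1, "v": "old"},
            {"pk": 1, "ver": 2, "v": "new"},   # same key inserted twice
            {"pk": 2, "ver": 1, "v": "only"}]
merged = replacing_merge(inserted)
print(sorted(r["v"] for r in merged))  # ['new', 'only']
```

Note the caveat this implies: until a merge happens, queries can still see both versions, which is why deduplication there is eventual rather than guaranteed at insert time.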
There's absolutely no way anyone will want to use ClickHouse as a general-purpose database.
I should have phrased that differently: if something is good enough in some key metric, it extends to other uses - even if it makes a poor fit.
So I insist: everyone will WANT to use ClickHouse as a general-purpose database, and will create ways to make it so (e.g. copy the table with the unwanted columns filtered out, drop the original, rename the copy).
It is just too fast and too good for many other things, so it will expand from these strongholds to the rest.
A personal example: I am migrating my cold storage to clickhouse, because I can just copy the files in place and be up and running.
I know about INSERT and the like, and I have a great existing system - but this lets me simplify the design and deprecate many things. Fewer moving parts is generally better.
After that is done, there is a database where I would benefit from things like alter tables or advanced joins, but keeping PostgreSQL and ClickHouse side by side, just for this? No. PostgreSQL will go. Dirty tricks will be deployed. Data will be duplicated if necessary.
There's been a lot of community interest in both topics. Merge join work is largely driven by the ClickHouse team at Yandex. Object storage contributions are from a wider range of teams.
That said I don't see ClickHouse replacing OLTP databases any time soon. It's an analytic store and many of the design choices favor fast, resource efficient scanning and aggregation over large datasets. ClickHouse is not the right choice for high levels of concurrent users working on mutable point data. For this Redis, PostgreSQL, or MySQL are your friends.
Sure - but the comment you're replying to made no mention of NoSQL. It just said Clickhouse lacks OLTP by design, that doesn't mean it won't be widely used, just that it will perhaps be limited to analytics workloads.
If you need deletes and transactions, look elsewhere, but Clickhouse seems to be great for what it's been designed for.
For sure Rust has a better design than Go, but I don't know of any actively developed web framework in Rust. Go may be a better choice because of its ecosystem.
Is it possible to find a contract for a dev who has no permit to work in the UK? For instance, I'm skilled in Scala & Python and can easily visit the UK as a tourist/businessman. What's the way into this market? (Scala User Group meetups, anything else?)
It'd be in violation of your visa/visa waiver. I have to pay UK corporation tax, something you won't be able to do without a National Insurance number (right to work in the UK) or a work visa or something.
I have American friends who have come and worked in the UK, but all for companies; none have set themselves up independently, so I'm not sure if it's possible.
It shouldn't be a violation if I visit the UK only to get new contracts and work on them remotely. I also have my own LLC in the EU, so I can act on its behalf, making taxation easier for the customer.
Work from home, and you'll be sorted, even if you're coming in from week to week. Tax laws target _where the work is done_. Then you're just coming to visit a client.