
The MIT course on Raft (with the lectures on YouTube) was a great way for me to learn it: http://nil.lcs.mit.edu/6.824/2020/labs/lab-raft.html


Why add RedPanda/Kafka over using async insert? https://clickhouse.com/docs/optimize/asynchronous-inserts

It’s recommended in the docs over the Buffer table, and is pretty much invisible to the end user.

At ClickHouse Inc itself, this scaled far beyond millions of rows per second: https://clickhouse.com/blog/building-a-logging-platform-with...
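For reference, a minimal sketch of what that looks like on the client side (the events table and columns here are made up for illustration):

    -- Buffer rows server-side instead of creating a new part per insert;
    -- wait_for_async_insert = 1 only acknowledges once the buffer is flushed.
    INSERT INTO events (ts, service, payload)
    SETTINGS async_insert = 1, wait_for_async_insert = 1
    VALUES (now(), 'api', '{"status":200}');

The same settings can also be applied per user profile or per session rather than per query.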


Hey,

We went with that infrastructure from the get-go for multiple reasons:

* Having a durable buffer in front ensures that big spikes get absorbed by the buffer rather than by the OLAP database, which you want to keep responsive when it is powering your online dashboards. ClickHouse Cloud now has compute/compute separation that addresses this, but open source users don't have it.

* When we shipped this for the first time, ClickHouse did not have async buffering in place, so not doing some kind of buffered inserts was frowned upon.

* As oatsandsugar mentioned, since then we have also shipped direct insert, so you don't need a Kafka buffer if you don't want one.

* From an architecture standpoint, it lets you have multiple consumers of the same stream.

* Finally, having Kafka lets you write streaming functions in your favorite language rather than SQL. The performance-to-task ratio will definitely be lower, but depending on the task it might be faster to set up, and you can do things you couldn't do directly in the database.

Disclaimer: I am the CTO at Fiveonefour


> ClickHouse Cloud now has compute/compute separation that addresses this, but open source users don't have it.

Altinity is addressing this with Project Antalya builds. We have extended open source ClickHouse with stateless swarm clusters to scale queries on shared Iceberg tables.

Disclaimer: CEO of Altinity


The durability and transformation reasons are definitely more compelling, but the article doesn’t mention them.

It’s mainly focused on the insert batching which is why I was drawing attention to async_insert.

I think it’s worth highlighting the incremental transformation that CH can do via materialised views too. That can often replace the need for a full-blown streaming transformation pipeline.
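As a minimal sketch of that pattern (table and column names are hypothetical):

    -- Raw table, plus a per-minute aggregate that ClickHouse maintains incrementally on insert.
    CREATE TABLE events_raw (ts DateTime, service LowCardinality(String), duration_ms UInt32)
    ENGINE = MergeTree ORDER BY (service, ts);

    CREATE TABLE events_per_minute
    (minute DateTime, service LowCardinality(String), p95_state AggregateFunction(quantile(0.95), UInt32))
    ENGINE = AggregatingMergeTree ORDER BY (service, minute);

    CREATE MATERIALIZED VIEW events_per_minute_mv TO events_per_minute AS
    SELECT toStartOfMinute(ts) AS minute, service, quantileState(0.95)(duration_ms) AS p95_state
    FROM events_raw
    GROUP BY minute, service;

    -- Reading back merges the stored aggregate states.
    SELECT service, quantileMerge(0.95)(p95_state) FROM events_per_minute GROUP BY service;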

IMO, I think you can get a surprising distance with “just” a ClickHouse instance these days. I’d definitely be interested in articles that talk about where that threshold is no longer met!


Nothing stopping an OSS user from pointing inserts at one or more write-focused replicas and user-facing queries at read-focused replicas!


The biggest reason is that you may also have other consumers than just Clickhouse.


Sure, but the article doesn’t talk about that; it seemed to be focused on CH alone, in which case async insert is much fewer technical tokens.

If you need to ensure that you have super durable writes, you can consider it, but I really think it’s not something you need to reach for at first glance.


Author here: commented here about how you can use async inserts if that's your preferred ingest method (we recommend that for batch).

https://news.ycombinator.com/item?id=45651098

One of the reasons we use streaming ingest is that we often modify the schema of the data in stream, usually to conform with ClickHouse best practices that aren't adhered to in the source data (restrictive types, denormalization, not-nullable defaults, etc).
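For illustration, a hypothetical destination table shaped along those lines:

    -- Restrictive, non-nullable columns with defaults, denormalised dimensions,
    -- and LowCardinality strings for better compression and filtering.
    CREATE TABLE events_dest
    (
        ts        DateTime64(3)           DEFAULT now64(3),
        tenant_id LowCardinality(String),
        service   LowCardinality(String),
        status    UInt16                  DEFAULT 0,
        message   String                  DEFAULT ''
    )
    ENGINE = MergeTree
    ORDER BY (tenant_id, service, ts);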


Yeah, handles all the OTel signals


Not OP, but to me, this reads fairly similar to how ClickHouse can be set up, with Bloom filters, MinMax indexes, etc.

A way to “handle” partial substrings is to break your input data up into tokens (substrings split on spaces or dashes) and then break your search string up in the same way.
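In ClickHouse terms, a minimal sketch of that approach (the table name and index parameters are just illustrative starting points):

    -- tokenbf_v1 matches whole tokens (query with hasToken);
    -- ngrambf_v1 is the usual alternative for arbitrary substrings.
    CREATE TABLE logs
    (
        ts      DateTime,
        message String,
        INDEX message_tokens message TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4
    )
    ENGINE = MergeTree
    ORDER BY ts;

    SELECT count() FROM logs WHERE hasToken(message, 'timeout');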


Quite like Cloudprober for this tbh: https://cloudprober.org/docs/how-to/alerting/

Easy to configure, easy to extend with Go, and slots in to alerting.


Thanks for sharing — I'm the author of Cloudprober. Always happy to hear thoughts or suggestions (https://github.com/cloudprober/cloudprober discussions or slack linked from the homepage - https://cloudprober.org)


I don’t think it’s fraud on the DR side if you actually take the trains and intend to travel.

If you didn’t actually intend to travel, then claiming DR is fraud.


I wonder how this works if you intend to travel should the train be on time, but become aware of the delay (or likelihood of it) and change your plans - what counts as proof of intent?

It would be better if the law was changed so that any transport company selling a ticket is forced to refund if they couldn't fulfil their obligation, regardless of whether the ticket was used or intended to be used. Can't provide the service? Then don't sell it!


If you choose not to travel, then you are eligible for a refund, rather than Delay Repay: https://www.nationalrail.co.uk/help-and-assistance/compensat...

There are differences in consumer rights, effectively, between a refund and compensation (like DR).


What’s fraudulent about traveling without intending to travel?


Claiming delay compensation if you don’t have intent to travel is the fraud part.

Easiest example is if you have a season ticket, but you have the day off. You weren’t going to take the train to work that day, so no intent to travel. If you claim DR, then that’s fraud for the compensation.


But you end up hundreds of miles away from home, who could possibly argue that you moved halfway across the country without intent?


The DR system doesn’t look at ticket scans alone. It also builds a profile per customer based on a number of data points.

It will flag up quite quickly if you are “sniping” delayed trains at different times.


Thoughts on stuff like ClickHouse with JSON column support? Less upfront knowledge of columns needed.


It is a great step, but in my testing with the new JSON type, if you go beyond 255 unique JSON locations/types (255 max_dynamic_types in their config) you fall back to much worse performance for certain queries and aggregations. This is quite easy to hit with some of the suggestions in this blog post, especially if you are designing for multi-tenant use.

For the ClickHouse wide-event lib I'm working on (not worth anyone's time atm) I am still using this schema https://www.val.town/v/maxm/wideLib#L34-39 (which is from a Boris Tane talk: https://youtu.be/00gW8txIP5g?t=801) for good multi-tenant performance.

I hope clickhouse performance here can still be vastly improved, but I think it is a little awkward to get optimal performance with wide events today.
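For anyone reaching for it, those limits are parameters on the column type itself; a hypothetical table (the numbers are only illustrative, check the docs for your version's defaults):

    -- Beyond these limits, additional paths/types spill into a shared, less efficient representation.
    CREATE TABLE wide_events
    (
        ts   DateTime64(3),
        data JSON(max_dynamic_paths = 1024, max_dynamic_types = 255)
    )
    ENGINE = MergeTree
    ORDER BY ts;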


A small question on the schema: I noticed that you have only “_now” as the ORDER BY (so it will just use that for the primary key). Do you expect any cross-tenant queries?

My feeling is that I’d add the tenant ID before the timestamp, as it should filter the parts more effectively.
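Roughly what I mean (column names borrowed from that schema, so treat them as assumptions):

    -- Tenant first, then time: queries scoped to a single tenant can skip most parts and granules.
    CREATE TABLE wide_events_demo
    (
        _tenantId LowCardinality(String),
        _traceId  String,
        _now      DateTime64(3)
    )
    ENGINE = MergeTree
    ORDER BY (_tenantId, _now);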


Yes, I think you are correct. In the video Boris/Baselime uses (_tenantId, _traceId, _timestamp). Will update that :)


ClickHouse's revised JSON type [1] is still quite new (currently in beta), but I'm hopeful for it. Their first attempt fell apart if the schema changed.

[1] https://clickhouse.com/blog/a-new-powerful-json-data-type-fo...


JSON column type in ClickHouse [1] looks promising, since it allows storing wide events with arbitrary sets of fields. This feature is still in beta. Let's see how it will evolve.

[1] https://clickhouse.com/docs/en/sql-reference/data-types/newj...


ClickHouse is open core too. If you care about that.



I think some of the best technical writing I've enjoyed is the AWS Builders' Library: https://aws.amazon.com/builders-library/

Clear and concise articles that really dig into some of the hard technical problems with working at scale.

Reading them has honestly made me a much better systems programmer.

