I feel like the ecosystem is very, very close to ready for what I would consider to be a really nice medium-to-long-term queryable log storage system. In my mind, it works like this:
1. Logs get processed (by a tool like vector) and stored to a sink that consists of widely-understood files in an object store. Parquet format would be a decent start. (Yscope has what sounds like a nifty compression scheme that could layer in here.)
2. Those log objects are (transactionally!) enrolled into a metadata store so things can find them. Delta Lake or Iceberg seem credible. Sure, these tools are meant for Really Big Data, but I see no reason they couldn't work at any scale. And because the transaction layer exists as a standalone entity, one could run multiple log processing pipelines all committing into the same store.
3. High-performance and friendly tools can read them. Think Clickhouse, DuckDB, Spark, etc. Maybe everything starts to support this as a source for queries.
4. If you want to switch tools, no problem — the formats are standard. You can even run more than one at once.
Has anyone actually put the pieces together to make something like this work?
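For concreteness, here is a minimal sketch of what steps 1-3 could look like, assuming the Python deltalake and duckdb packages; the table path, schema, and field names are placeholders, and a real pipeline would point at an object store rather than a local directory.

    # Append a batch of processed log records to a Delta table, then query the
    # same table from DuckDB. "./logs_delta" stands in for an s3:// URL.
    import duckdb
    import pyarrow as pa
    from deltalake import DeltaTable, write_deltalake

    # Step 1: a batch of processed log records (placeholder schema).
    batch = pa.table({
        "ts": ["2024-01-01T00:00:00Z"],
        "level": ["info"],
        "message": ["service started"],
    })

    # Step 2: commit the batch transactionally via the table's Delta log.
    write_deltalake("./logs_delta", batch, mode="append")

    # Step 3: read the same table from a different engine (DuckDB here).
    logs = DeltaTable("./logs_delta").to_pyarrow_dataset()
    print(duckdb.sql("SELECT level, count(*) AS n FROM logs GROUP BY level"))

Because the commit lives in the Delta log rather than inside any one engine, a second pipeline could append to the same table concurrently, which is the "multiple pipelines committing into the same store" property from step 2.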
I work on something where we use Vector in a way similar to this.
The application writes directly to a local Vector instance running as a DaemonSet, using the TCP protocol. That instance buffers locally in case of upstream downtime. It also augments each payload with some metadata about the origin.
The local one then sends to a remote Vector using Vector's internal Protobuf-based framing protocol. That Vector has two sinks, one which writes the raw data in immutable chunks to an object store for archival, and another that ingests in real time into ClickHouse.
This all works pretty great. The point of having a local Vector is so applications can be thin clients that just "firehose" out their data without needing a lot of complex buffering, retrying, etc. and without a lot of overhead, so we can emit very fine-grained custom telemetry data.
There is a tiny bit of retrying logic with a tiny bit of in-memory buffering (Vector can go down or be restarted and the client must handle that), but it's very simple, and designed to sacrifice messages to preserve availability.
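To make the thin-client idea concrete, here is a rough sketch, assuming a local Vector instance with a TCP socket source on 127.0.0.1:9000 (the address, port, and buffer size are made up): the client keeps a small in-memory buffer and drops the oldest messages instead of blocking when Vector is unavailable.

    import json
    import socket
    from collections import deque

    class FirehoseClient:
        def __init__(self, addr=("127.0.0.1", 9000), max_buffer=1000):
            self.addr = addr
            self.buffer = deque(maxlen=max_buffer)  # when full, the oldest entries are dropped
            self.sock = None

        def emit(self, event: dict):
            self.buffer.append(json.dumps(event) + "\n")
            try:
                if self.sock is None:
                    self.sock = socket.create_connection(self.addr, timeout=1)
                while self.buffer:
                    self.sock.sendall(self.buffer[0].encode())
                    self.buffer.popleft()
            except OSError:
                # Vector is down or restarting: keep whatever fits in the buffer
                # and retry on the next emit() call.
                self.sock = None

    client = FirehoseClient()
    client.emit({"level": "info", "message": "service started", "origin": "node-1"})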
Grafana is a nice way to use ClickHouse. ClickHouse is a bit more low level than I'd like (it often feels more like a "database construction kit" than a database), but the design is fantastic.
Depending on your use case, and if you can afford to miss a few logs, sending log data via UDP is helpful so you don't interrupt the app. I have done this to good effect, though not with Vector. Our stuff was custom and aggregated many things into 1s chunks.
Dropping messages occasionally can be fine, but the problem with UDP is that the packet loss is silent.
I believe UDP can be lossy even on localhost when there's technically no network, so you'd have to track the message count on both the sender and recipient sides. It's also more sensitive to minor glitches, whereas TCP + a very small buffer would allow you to smooth over those cases.
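As a sketch of the "count on both sides" idea (the port and field names are arbitrary), the sender can tag every datagram with a sequence number so the receiver can at least measure the loss, even if it cannot prevent it:

    import itertools
    import json
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    seq = itertools.count()

    def log(message: str):
        payload = json.dumps({"seq": next(seq), "message": message})
        sock.sendto(payload.encode(), ("127.0.0.1", 9514))  # fire-and-forget, never blocks the app

    # Receiver side (elsewhere): any jump between consecutive "seq" values is loss.
    log("request handled in 12ms")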
I use NATS for a similar kind of firehose system (core NATS only offers at-most-once, fire-and-forget delivery), and the amount of loss can sometimes reach 6-7%.
Quickwit is very similar to what is described here.
Unfortunately, the files are not in Parquet, so even though Quickwit is open source, it is difficult to tap into the file format.
We did not pick Parquet because we want to actually be able to search and do analysis efficiently, so we ship an inverted index, a row-oriented store, and a columnar format that allows for random access.
We are planning to eventually add ways to tap into the files and get data out in the Apache Arrow format.
From a quick skim through the docs, it wasn't clear to me: can I run a stateless Quickwit instance or even a library to run queries, such that the only data accessed is in the underlying object store? Or do I need a long-running search instance or cluster?
Would this fit your medium to long term? It's a weekend of work to automate: JSON logs go to Kafka, a Logstash consumer stores batches as Hive-partitioned data in S3 with gzip compression, Athena tables are defined over those S3 prefixes, and the Presto SQL dialect is used to query/cast/aggregate the data.
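For illustration, the Athena end of that pipeline might look roughly like this with boto3; the bucket, database, region, and schema are hypothetical, and the real DDL would have to match whatever Logstash actually writes.

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # External table over gzipped JSON under Hive-style dt=YYYY-MM-DD/ prefixes.
    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS logs (
      ts string,
      level string,
      message string
    )
    PARTITIONED BY (dt string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-log-bucket/logs/'
    """

    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "logging"},
        ResultConfiguration={"OutputLocation": "s3://my-log-bucket/athena-results/"},
    )
    # New dt= prefixes become visible after MSCK REPAIR TABLE logs (or an explicit
    # ALTER TABLE ... ADD PARTITION); after that they can be queried with Presto SQL.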