I feel like the ecosystem is very, very close to ready for what I would consider to be a really nice medium-to-long-term queryable log storage system. In my mind, it works like this:
1. Logs get processed (by a tool like vector) and stored to a sink that consists of widely-understood files in an object store. Parquet format would be a decent start. (Yscope has what sounds like a nifty compression scheme that could layer in here.)
2. Those log objects are (transactionally!) enrolled into a metadata store so things can find them. Delta Lake or Iceberg seem credible. Sure, these tools are meant for Really Big Data, but I see no reason they couldn't work at any scale. And because the transaction layer exists as a standalone entity, one could run multiple log processing pipelines all committing into the same store.
3. High-performance and friendly tools can read them. Think Clickhouse, DuckDB, Spark, etc. Maybe everything starts to support this as a source for queries.
4. If you want to switch tools, no problem — the formats are standard. You can even run more than one at once.
Has anyone actually put the pieces together to make something like this work?
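For concreteness, here is a minimal sketch of what steps 1-3 could look like, assuming the Python deltalake and duckdb packages; the table path, schema, and field names are placeholders, and a real pipeline would point at an object store rather than a local directory.

    # Append a batch of processed log records to a Delta table, then query the
    # same table from DuckDB. "./logs_delta" stands in for an s3:// URL.
    import duckdb
    import pyarrow as pa
    from deltalake import DeltaTable, write_deltalake

    # Step 1: a batch of processed log records (placeholder schema).
    batch = pa.table({
        "ts": ["2024-01-01T00:00:00Z"],
        "level": ["info"],
        "message": ["service started"],
    })

    # Step 2: commit the batch transactionally via the table's Delta log.
    write_deltalake("./logs_delta", batch, mode="append")

    # Step 3: read the same table from a different engine (DuckDB here).
    logs = DeltaTable("./logs_delta").to_pyarrow_dataset()
    print(duckdb.sql("SELECT level, count(*) AS n FROM logs GROUP BY level"))

Because the commit lives in the Delta log rather than inside any one engine, a second pipeline could append to the same table concurrently, which is the "multiple pipelines committing into the same store" property from step 2.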
I work on something where we use Vector in a way similar to this.
The application writes directly to a local Vector instance running as a DaemonSet, using the TCP protocol. That instance buffers locally in case of upstream downtime. It also augments each payload with some metadata about the origin.
The local one then sends to a remote Vector using Vector's internal Protobuf-based framing protocol. That Vector has two sinks, one which writes the raw data in immutable chunks to an object store for archival, and another that ingests in real time into ClickHouse.
This all works pretty great. The point of having a local Vector is so applications can be thin clients that just "firehose" out their data without needing a lot of complex buffering, retrying, etc. and without a lot of overhead, so we can emit very fine-grained custom telemetry data.
There is a tiny bit of retrying logic with a tiny bit of in-memory buffering (Vector can go down or be restarted and the client must handle that), but it's very simple, and designed to sacrifice messages to preserve availability.
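To make the thin-client idea concrete, here is a rough sketch, assuming a local Vector instance with a TCP socket source on 127.0.0.1:9000 (the address, port, and buffer size are made up): the client keeps a small in-memory buffer and drops the oldest messages instead of blocking when Vector is unavailable.

    import json
    import socket
    from collections import deque

    class FirehoseClient:
        def __init__(self, addr=("127.0.0.1", 9000), max_buffer=1000):
            self.addr = addr
            self.buffer = deque(maxlen=max_buffer)  # when full, the oldest entries are dropped
            self.sock = None

        def emit(self, event: dict):
            self.buffer.append(json.dumps(event) + "\n")
            try:
                if self.sock is None:
                    self.sock = socket.create_connection(self.addr, timeout=1)
                while self.buffer:
                    self.sock.sendall(self.buffer[0].encode())
                    self.buffer.popleft()
            except OSError:
                # Vector is down or restarting: keep whatever fits in the buffer
                # and retry on the next emit() call.
                self.sock = None

    client = FirehoseClient()
    client.emit({"level": "info", "message": "service started", "origin": "node-1"})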
Grafana is a nice way to use ClickHouse. ClickHouse is a bit more low level than I'd like (it often feels more like a "database construction kit" than a database), but the design is fantastic.
Depending on your use case, and if you can afford to miss a few logs, sending log data via UDP is helpful so you don't interrupt the app. I have done this to good effect, though not with Vector. Our stuff was custom and aggregated many things into 1s chunks.
Dropping messages occasionally can be fine, but the problem with UDP is that the packet loss is silent.
I believe UDP can be lossy even on localhost when there's technically no network, so you'd have to track the message count on both the sender and recipient sides. It's also more sensitive to minor glitches, whereas TCP + a very small buffer would allow you to smooth over those cases.
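As a sketch of the "count on both sides" idea (the port and field names are arbitrary), the sender can tag every datagram with a sequence number so the receiver can at least measure the loss, even if it cannot prevent it:

    import itertools
    import json
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    seq = itertools.count()

    def log(message: str):
        payload = json.dumps({"seq": next(seq), "message": message})
        sock.sendto(payload.encode(), ("127.0.0.1", 9514))  # fire-and-forget, never blocks the app

    # Receiver side (elsewhere): any jump between consecutive "seq" values is loss.
    log("request handled in 12ms")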
I use NATS for a similar kind of firehose system (core NATS only offers at-most-once, fire-and-forget delivery), and the amount of loss can sometimes reach 6-7%.
Quickwit is very similar to what is described here.
Unfortunately, the files are not in Parquet, so even though Quickwit is open source, it is difficult to tap into the file format.
We did not pick Parquet because we want to actually be able to search and do analysis efficiently, so we ship an inverted index, a row-oriented store, and a columnar format that allows for random access.
We are planning to eventually add ways to tap into the files and get data out in the Apache Arrow format.
From a quick skim through the docs, it wasn't clear to me: can I run a stateless Quickwit instance or even a library to run queries, such that the only data accessed is in the underlying object store? Or do I need a long-running search instance or cluster?
Would this fit your medium to long term? It's a weekend of work to automate: JSON logs go to Kafka, a Logstash consumer stores batches as Hive-partitioned data in S3 with gzip compression, Athena tables are defined over those S3 prefixes, and the Presto SQL dialect is used to query/cast/aggregate the data.
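For illustration, the Athena end of that pipeline might look roughly like this with boto3; the bucket, database, region, and schema are hypothetical, and the real DDL would have to match whatever Logstash actually writes.

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # External table over gzipped JSON under Hive-style dt=YYYY-MM-DD/ prefixes.
    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS logs (
      ts string,
      level string,
      message string
    )
    PARTITIONED BY (dt string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-log-bucket/logs/'
    """

    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "logging"},
        ResultConfiguration={"OutputLocation": "s3://my-log-bucket/athena-results/"},
    )
    # New dt= prefixes become visible after MSCK REPAIR TABLE logs (or an explicit
    # ALTER TABLE ... ADD PARTITION); after that they can be queried with Presto SQL.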