
Graphing metrics and doing transformations/comparisons is (much) faster. Yes, metrics do have to be defined first but in my experience that's a non-issue since the things you want to monitor are usually immediately obvious during development (e.g. request-response times, errors returned, new customers acquired, pages loaded, etc.).

With that being said, it's not a mutually exclusive situation. You can have both. However, some logs used for plotting metrics have near-zero debugging value (e.g. a log line that just includes the timestamp and the message "event x occurred"). Those kinds of logs should be fully converted over to metrics.
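A sketch of what that conversion might look like, using a hypothetical in-process counter in place of a real metrics client (statsd, Prometheus, etc.):

```python
from collections import Counter

# Hypothetical in-process metrics registry; a real setup would push
# increments to a metrics backend instead of keeping them in memory.
METRICS = Counter()

def handle_event():
    # Before: a log line with near-zero debugging value, e.g.
    #   print("2024-01-01T00:00:00 event x occurred")
    # After: the same information, captured as a metric increment.
    METRICS["event_x_total"] += 1

for _ in range(3):
    handle_event()

print(METRICS["event_x_total"])  # 3
```

The counter carries everything the original log line did (that the event happened, and how often), minus the per-occurrence storage cost.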

Some other logs however are genuinely useful (e.g. an exception occurred, the error count should be incremented, and this is the stack trace).




I don't understand why "some logs used for plotting metrics have near-zero debugging value (e.g. a log line that just includes the timestamp and the message "event x occurred"). Those kinds of logs should be fully converted over to metrics."

What is the difference between logs and metrics then?

Logs are pieces of text and metrics are well-defined data points?

I can represent both using the same log format. I can also extract the logs with well-defined structure into a column store or inverted index for fast querying and plotting.
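For instance, extracting metrics out of structured JSON logs takes nothing but the standard library (the field names here are made up for illustration):

```python
import json
from collections import defaultdict

# Hypothetical newline-delimited JSON logs, one event per line.
raw_logs = """\
{"ts": "2024-01-01T00:00:01", "event": "request", "latency_ms": 12}
{"ts": "2024-01-01T00:00:02", "event": "request", "latency_ms": 48}
{"ts": "2024-01-01T00:00:03", "event": "error", "latency_ms": 0}
"""

# "Columnar" extraction: group values by field for fast aggregation.
columns = defaultdict(list)
for line in raw_logs.splitlines():
    record = json.loads(line)
    for key, value in record.items():
        columns[key].append(value)

# A metric (mean request latency) derived straight from the log columns.
request_latencies = [
    lat for ev, lat in zip(columns["event"], columns["latency_ms"])
    if ev == "request"
]
print(sum(request_latencies) / len(request_latencies))  # 30.0
```

A real column store or inverted index does the same grouping, just persisted and indexed for scale.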

The entire distinction between logs and metrics, and keeping one vs. the other, reeks of premature optimization by the software community. Storage is cheap: just dump the raw logs to S3 and run ETL on them to extract meaningful metrics.

Logs, metrics, and traces have the same representation: text, or some kind of JSON if you want it structured. Metrics are just logs with a well-defined schema. Traces are logs with correlation IDs in them to allow joining between logs coming from different services.

It's just a data problem, nothing else. Yet people keep overpaying for complicated observability services they don't need. When they split it into logs, metrics, and traces, they have to connect their programs to these external services via proprietary connectors, adding vendor lock-in and a potential point of failure when the observability service has downtime. Instead they could just dump logs to stderr as JSON objects, as the Unix philosophy intended.
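As a concrete illustration of that claim: a log line, a metric, and a trace span can all be the same kind of JSON object written to stderr (the field names below are illustrative, not any particular standard):

```python
import json
import sys

# A plain log line.
log = {"ts": "...", "level": "info", "msg": "user logged in"}

# A "metric": the same shape, constrained to a well-defined name/value schema.
metric = {"ts": "...", "name": "logins_total", "value": 1}

# A "trace" event: the same shape again, plus correlation IDs so events
# from different services can be joined.
span = {"ts": "...", "msg": "db query", "trace_id": "abc123", "span_id": "7"}

# All three go to stderr the same way; a downstream collector can tell
# them apart by their fields.
for record in (log, metric, span):
    print(json.dumps(record), file=sys.stderr)
```

The only differences are which keys are present, which is the "well-defined schema" point above.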


As you correctly point out, everything you can do with metrics can be achieved via logs if you have enough compute and I/O. But that ignores what happens when you have too many logs even for something fancier like the column stores/inverted indices you mention. I agree that in the vast majority of cases it's fine to just take the approach of using logs and counting. But plenty of developers (particularly here on HN) are in that small slice of the community that does have a genuine need for greater performance than some form of optimized logging affords.

Likewise, traces are indeed (as you point out) functionally just correlated logs from multiple services, which is akin to syntactic sugar. But again, that's precisely their value _at scale_: easing the burden of use. I've personally seen traces that reach across tens of services in an interconnected, cyclical graph of dependencies that would be hellish to query the logs for by hand.
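To make the "syntactic sugar" point concrete, here's what that correlation looks like done by hand for one trace ID (service names and fields are made up):

```python
from collections import defaultdict

# Hypothetical logs collected from three different services.
logs = [
    {"service": "gateway", "trace_id": "t1", "msg": "request in"},
    {"service": "auth",    "trace_id": "t1", "msg": "token ok"},
    {"service": "gateway", "trace_id": "t2", "msg": "request in"},
    {"service": "billing", "trace_id": "t1", "msg": "charge card"},
]

# "Tracing" as a manual join: group every service's logs by trace_id.
traces = defaultdict(list)
for entry in logs:
    traces[entry["trace_id"]].append((entry["service"], entry["msg"]))

print(traces["t1"])
# [('gateway', 'request in'), ('auth', 'token ok'), ('billing', 'charge card')]
```

With four log lines this is trivial; across tens of services and millions of lines, tooling that does this join (and draws the dependency graph) for you earns its keep.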



