Observability is not only for SREs (lightstep.com)
71 points by kiyanwang on May 15, 2022 | 33 comments



Few things are truer in software than this:

If you expect people to use a tool for your benefit, you’re going to be disappointed and/or bamboozled every single time.

Managers are experts in this. They don’t seem to see why they are always having to harp on people to keep visibility tools up to date because “not getting harped on” is not a motivation to do something. Most of my job is keeping track of my responsibilities. If I’m keeping track of yours, then why are you even here?

People use tools for their own benefit (if even then). If that helps out someone else, that’s a bonus feature, not a core one. So we use testing frameworks if it helps us stop our own regressions. We use issue trackers to fight pre-emption and weekend amnesia. If you don’t honor that (by mandating that they be used in a different way that raises the cost/benefit ratio) then you’ll get less compliance, not more.

So if the SREs want something, they better have a self-interest angle, otherwise they’ve just got a platform for lecturing, not for progress. Of course, there are a few people in any org who are satisfied with lecturing.


Not getting harped on is indeed a motivator for a lot of folks, and managers take advantage of this, sometimes consciously, sometimes unconsciously.

Re: SREs and observability, unfortunately the power imbalance sometimes tips the scale in their favor: if, as a service owner, you let observability out of your control, then guess who will report your SLA to your boss?

Of course every org and every SRE org is different. Except yours. The Anna Karenina principle applies.


My phrasing was poor. I agree it's a motivator, but it's a pretty poor one. Not quite bottom rung, but close enough to it.


This is an excellent overall comment. One thing: as a sibling has pointed out, I do enjoy not getting harped on, so usually it's either a small enough task that I'll just do it, or it's difficult enough that I give reasons why I can't do it (or need some dedicated time to get it done).

I've had managers who have understood this dynamic, and others who have ignored it and tried to use force by continuing to harp after I've provided reasons. Generally I haven't lasted that long under the latter.


The "self-interest angle" in my experience has been that the people who are getting paged also hold the keys for launching new things. If they think the thing is too hacky and unpolished to be allowed to page them in the middle of the night, then they won't launch it. Often these are SREs, but not always. Experienced, healthy non-SRE teams usually have similar standards.


One of the only professional achievements I was responsible for is getting the FT to move away from Splunk logging to generating metrics directly.

I installed a Graphite/Grafana cluster[1] and delivered a bunch of tech talks on how to massage raw metrics into usable metrics that related to business goals.

I did this because I hated waiting for Splunk graphs to generate, and I could never really get any kind of actionable graphs out of it.

Over a couple of years it went from a couple of thousand metrics to well over a million active metrics (anything not updated in the last 7 days was deleted).

But the biggest "oh, it's actually useful" moment was seeing product owners, scrum leaders, and business analysts create, update, and use dashboards, and set alerts on service health _without_ software engineering help. It was brilliant, and a testament to how simple Graphite/Grafana is to use.

[1] Well, I installed a Graphite cluster; someone twisted my arm to get Grafana installed.


I am not very educated in this area, but I wonder: why move away from the concept of metrics extracted from logs to discarding logs and storing just the metrics?

Don't metrics have to be defined first? I assumed you'd first do a big data log analysis to understand what is going on and then monitor for metrics learned to be useful. Storing just metrics means it'll be hard to investigate when things go wrong when the currently defined metrics don't capture that new issue.

In big data circles the consensus these days is to store all raw data since storage is cheap. Useful cleaned aggregations and metrics can be extracted later by processing the full history using the now available big data tools.

I suspect this approach is a problem in traditional software engineering, where the focus is not on historical data like it is in data engineering. The reasoning goes: Splunk is too expensive, so let's ditch logs altogether, hope we capture some metrics, and plot them in nice dashboards. Instead of dumping the logs, highly compressed, into cheap S3 and running some Snowflake or Spark on them later. Heck, you could even dump the intra-day, non-historical logs into Kafka or Materialize and extract the same metrics on the fly while preserving all the raw data.
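
As a rough sketch of the S3 route (bucket name, key layout, and batching are all made up here), assuming boto3 and standard-library gzip/json:

    import gzip
    import json
    import time

    import boto3  # assumes AWS credentials are configured in the environment

    s3 = boto3.client("s3")

    def flush_batch(records, bucket="raw-logs-archive"):
        # Compress a batch of raw log records and park them cheaply in S3;
        # the full history can later be queried with Spark, Snowflake, Athena, etc.
        # The bucket name and key layout are hypothetical.
        body = gzip.compress(
            "\n".join(json.dumps(r) for r in records).encode("utf-8")
        )
        key = f"logs/{time.strftime('%Y/%m/%d')}/{int(time.time())}.jsonl.gz"
        s3.put_object(Bucket=bucket, Key=key, Body=body)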


Graphing metrics and doing transformations/comparisons is (much) faster. Yes, metrics do have to be defined first but in my experience that's a non-issue since the things you want to monitor are usually immediately obvious during development (e.g. request-response times, errors returned, new customers acquired, pages loaded, etc.).

With that being said, it's not a mutually exclusive situation. You can have both. However, some logs used for plotting metrics have near-zero debugging value (e.g. a log line that just includes the timestamp and the message "event x occurred"). Those kinds of logs should be fully converted over to metrics.

Some other logs however are genuinely useful (e.g. an exception occurred, the error count should be incremented, and this is the stack trace).
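
As a minimal sketch of that split, assuming a Prometheus-style Python client (metric names here are invented): the pure "event x occurred" line becomes a counter, while the exception path keeps both the count and the stack trace.

    import logging

    from prometheus_client import Counter  # assumes the standard Python client

    # Hypothetical metric names, for illustration only.
    EVENT_X = Counter("event_x_total", "Times event x occurred")
    ERRORS = Counter("errors_total", "Unhandled errors in the handler")

    log = logging.getLogger(__name__)

    def do_work():
        pass  # stand-in for the real work

    def handle():
        EVENT_X.inc()  # replaces a log line that only said "event x occurred"
        try:
            do_work()
        except Exception:
            ERRORS.inc()                     # the count becomes a metric...
            log.exception("do_work failed")  # ...while the stack trace stays in the logs
            raise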


I don't understand why "some logs used for plotting metrics have near-zero debugging value (e.g. a log line that just includes the timestamp and the message "event x occurred"). Those kinds of logs should be fully converted over to metrics."

What is the difference between logs and metrics then?

Logs are pieces of text and metrics are well defined data points?

I can represent both using the same log format. I can also extract the logs with well defined structure into a columns store or inverted index for fast querying and plotting.

The entire distinction of logs and metrics and keeping one vs the other reeks of strong premature optimization by the software community. Storage is cheap, just dump the raw logs to s3 and run etl on them to extract meaningful metrics.

Logs, metrics, and traces have the same representation: text, or some kind of JSON to have it structured. Metrics are just logs with a well defined schema. Traces are logs with correlation IDs in them to allow for joining between logs coming from different services.

It's just a data problem, nothing else. However, people keep overpaying for complicated observability services that they don't need. When they split it into logs, metrics, and traces, they have to connect their programs to these external services using proprietary connectors, adding vendor lock-in and the potential for failure when the observability service has downtime. Instead of just dumping logs to stderr as JSON objects, as intended by the Unix philosophy.
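
For what that view might look like in practice, here is a tiny sketch (field names invented): every record is a JSON object on stderr, a metric is just a record with a known schema, and a trace is a record carrying a correlation ID.

    import json
    import sys
    import time
    import uuid

    def emit(event, **fields):
        # One JSON object per line on stderr; nothing vendor-specific.
        record = {"ts": time.time(), "event": event, **fields}
        print(json.dumps(record), file=sys.stderr)

    trace_id = str(uuid.uuid4())  # the correlation ID that turns related logs into a trace

    # A "metric" is just a record with a well-defined schema...
    emit("http_request", service="checkout", duration_ms=42, status=200, trace_id=trace_id)
    # ...and a plain log line is the same thing with free-form fields.
    emit("cache_miss", service="checkout", key="user:123", trace_id=trace_id)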


As you correctly point out, everything you can do with metrics can be achieved via logs if you have enough compute and I/O. But what happens when you have too many logs even for something fancier like the column stores/inverted indices you mention? I agree that in the vast majority of cases it's likely fine to just take the approach of using logs and counting. But plenty of developers (particularly here on HN) are in that small slice of the overall community that does have a genuine need for greater performance than some form of optimized logging affords.

Likewise, traces are indeed (as you point out) functionally just correlating logs from multiple services, which is akin to syntactic sugar. But again, that's precisely their value _at scale_: easing the burden of use. I've personally seen traces that reach across tens of services in an interconnected and cyclical graph of dependencies that would be hellish to query the logs for by hand.


Splunk being “too expensive” has nothing to do with finances.

Since my second FTE position I've been dealing with the consequences of people logging things that were important to them a year ago, which they no longer care about but can't be bothered to remove, due either to hoarding dynamics, project-scheduling dynamics, or effort/reward dynamics ("that's an old feature/bug, and I get paid to work on new things").

Stats get aggregated with their own kind, while logs get jumbled together with everything else. The median value of old log messages is negative, while that of old stats is closer to zero.

If you put a bunch of data into a log message, its initial value is higher, but so is the negative slope of the line. If you don't put a bunch of data into the log message, what's the difference between logging it and recording a stat?

Part of the unspoken assumption of logs is that I can read cause and effect chronologically in the logs. That ceases being true the moment you have a distributed system. Getting reliable sub-second clock skew is expensive. Getting it down below speed-of-light delays within a single rack is pure fiction.

Anyone implementing stats on a single server application has bigger problems than logs vs stats, so distributed is practically a pre-req for even having a stats system.


Log messages also have to be defined first.

I find it much harder to guess what log messages I will need in the future compared to what metrics.

Metrics are easy: for any resource, service, component, and queue (explicit or implicit), you want to know cumulative arrivals, departures, timeouts, errors.

That takes you most of the way there for most things.

You can add counters for significant events too, but they tend to be automatically included by the above. (Basically departures from the component that trigger the event.)
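
A rough sketch of that recipe with a Prometheus-style Python client (metric and label names are made up): one set of counters per component covers arrivals, departures, timeouts, and errors.

    from prometheus_client import Counter  # assumes the standard Python client

    # Hypothetical metric names; one counter per outcome, labelled by component.
    ARRIVALS = Counter("work_arrivals_total", "Items accepted", ["component"])
    DEPARTURES = Counter("work_departures_total", "Items completed", ["component"])
    TIMEOUTS = Counter("work_timeouts_total", "Items timed out", ["component"])
    ERRORS = Counter("work_errors_total", "Items failed", ["component"])

    def handle(item):
        pass  # stand-in for the real work

    def process(item, component="billing_queue"):
        ARRIVALS.labels(component=component).inc()
        try:
            handle(item)
        except TimeoutError:
            TIMEOUTS.labels(component=component).inc()
            raise
        except Exception:
            ERRORS.labels(component=component).inc()
            raise
        else:
            DEPARTURES.labels(component=component).inc()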


> The solution goes - Splunk is too expensive, let's ditch logs altogether and hope we capture some metrics and plot them in nice dashboards. Instead of dumping the logs highly compressed into cheap s3 and running some Snowflake or Spark on it later.

I don't usually promote on HN but this is exactly why we built https://axiom.co! We've been working on this problem for some time, essentially allowing schema-less/index-free ingest, S3-based storage in a highly-efficient format, and then querying with a Splunk-like (specifically Kusto-inspired) language via serverless functions.

We built it because we also realised we would either avoid logging or think too much about it (cost, scaling, retention, etc.), which led to compromises either in our monitoring or later, when we wanted to dive in and try to draw some insights/analytics from that kind of data.


> but I wonder why move away from the concept of metrics extracted from logs to discarding logs and storing just the metrics?

This is a _very_ good question.

Getting from a log to a meaningful dashboard full of metrics is very hard to do well. It's also very hard for a normal person (i.e. a non-programmer) to ask the right question of a log stream.

For example, if you give me the log stream of a web server and want me to raise an alert when an instance's health check takes longer than n milliseconds, it requires a regex to get the host name, a regex to get the health check, and another regex to get the response time. All of those steps are hard and require testing. Moreover, it's fragile: any kind of format change is really easy to slip in and will cause your monitoring to break.
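
As a sketch of what that regex chain might look like (the log format and threshold are invented), note how every pattern silently breaks if the format drifts:

    import re

    # Hypothetical log line; any change to this format breaks all three patterns.
    line = "2022-05-15T10:01:02 host=web-03 check=health status=200 duration_ms=137"

    host = re.search(r"host=(\S+)", line)
    check = re.search(r"check=(\S+)", line)
    duration = re.search(r"duration_ms=(\d+)", line)

    THRESHOLD_MS = 100  # made-up alert threshold

    if host and check and duration and check.group(1) == "health":
        if int(duration.group(1)) > THRESHOLD_MS:
            print(f"ALERT: slow health check on {host.group(1)}")  # stand-in for a real alert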

So your next question might be: why replace Splunk with a metrics-based solution?

That's the thing: you don't! Logs are really, really useful for finding out what specifically went wrong. But they are not very good at telling you what is _currently_ going on across n services.

When implementing a service/program/thing, I ask the developers to think about the information that is useful to record, be that response time, number of peers connected, cache size, etc. Then, instead of just dumping those to logs, you push them to a metrics library and let it look after them.

This means that you are consciously thinking about a number that best sums up the thing you care about. That then frees up logging to explain _why_ something has happened, rather than being rigidly formatted because changing it will break a dashboard.

Think of it as this:

Dashboard: shows you whether a service is happy, where the service is an aggregate of many smaller programmes.

Something goes wrong (the response time of a microservice is above limits), you track down exactly when the metric went bad, and you fire up Splunk to look at the logs.
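
For contrast, a minimal sketch of the push-to-a-metrics-library approach, assuming a statsd-style client feeding Graphite (client, port, and metric names are illustrative):

    from statsd import StatsClient  # assumes the common 'statsd' Python client package

    stats = StatsClient("localhost", 8125, prefix="webshop.checkout")

    def do_health_check():
        return "ok"  # stand-in for the real check

    def health_check():
        with stats.timer("health_check_ms"):  # response time goes straight to statsd/Graphite
            result = do_health_check()
        stats.incr("health_checks")           # and so does a simple event count
        return result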


I feel like relying only on log-based metrics is a very unusual setup to start from. Metrics and logs generally serve different purposes, and log-based metrics enhance your metrics pipeline rather than replace it.

Kudos to you for introducing them to a metrics pipeline, though!


Seeing Taffy on initial load is pretty much perfect.


Flash of Unstyled Cavy.


> the world of QA

Wait...there are companies where there's still a world of QA?


Yes.

https://en.wikipedia.org/wiki/Mirror_Universe

But you can read "QA" as "what you'd do to investigate the system if you had time and didn't have people constantly pushing new feature tickets at you."


Yes, because as we know, SWEs are shit at proper testing.

A good QA department will know what the software does, but more importantly how it should behave and how it's changed over time. They should be a good resource for working out how a feature should look and behave.

A good QA department is worth its weight in gold.


"Proper testing" is often introduced, in CS courses, as 100% Test Coverage cargo cultism. Which is really a error-prone and unproductive way of ensuring Correctness in the software beyond the guarantees offered by the programming language. Most SW orgs should not waste time on test coverage metrics.

On the other hand, we can look at SW testing as "building a robotic user that will catch errors before real users do". This approach is far more intellectually engaging and productive, and it gets SWEs in touch with the product.

That is to say, SWEs aren't shit at "testing", they are shit at product, because product is virtually omitted in traditional CS curricula.


I do think there's still a lot of value in having close to 100% test coverage. As you point out, it's not sufficient to prevent issues, but having code run somewhere at least once before it runs in production is good.


I'd much prefer 100% specs coverage to 100% code coverage.


Having worked in orgs both with and without a QA function, I've found it can serve as a very important counterbalance to an overly aggressive product org that is determined to ship the features that'll make their growth metrics at any cost.

(This is the more modern version of QA, of course, where it stands for Quality Advocate or Quality Assistance)


Before: software engineers wrote code.

Now: they write code, they test it, they understand customer’s requirements, they monitor their code, they are on-call, they mentor others, they challenge the status quo, they give tech talks, sometimes they attend company hackatons…

I’m tired of software engineers having to do a bit of everything.


You can always go and pick up a jr dev job at a bank somewhere. Either that, or become hyperspecialized in a very complicated technology. Otherwise, yeah, your job is solving business problems, not writing code.


"What is DevOps?"


> because as we know; SWEs are shit at proper testing

Not true. There's nothing magical about testers. They don't have "special" skills or deep voodoo powers. The best ones are just smart people who understand people and software.

> They should be a good resource for working out how a feature should look and behave

You must have weak designers and PMs. Everyone should be welcome in the process, but those people drive that stuff.


I have worked with a wider range of QA folk than you have, apparently. The best are magic, the worst are leeches. And I think that's the problem with QA: those who don't like it cherry-pick their examples.

The thing to remember is that getting rid of QA upsets a triumvirate power dynamic. If it’s the dev team versus project management, the dev team always loses. If it’s sometimes dev and QA against PM, then you can win those fights. But only if you haven’t removed the QA manager from the org chart.

We’ve played ourselves in essentially a union-busting manner.

If you’ve always been at loggerheads with the QA team, I can tell you that you’ve been missing out. Looking back, defusing the default animosity between QA and Dev has gotten me more promotions than anything else in my career, and the lack of a QA team has put a substantial damper on my upward mobility. It sucks and I really wish we could go back.


> They don't have "special" skills or deep voodoo powers.

Some do, seriously.

I'm coloured by video games, but some of the QA people I've seen are magic; without them, entire products wouldn't ship at all.

There's something to be said for having "break shit and report on how you did it" as a job title, regardless of the other gains you get from having someone who holistically understands the product.

Some are even lazy, so they work to automate themselves out of working, which makes their impact much wider, even when it's domain-specific.

It's also the case that the people coming out of QA in gamedev and going into other parts of the company tend to do extremely well compared to people going direct into that area. Our old managing director started in QA, as did the executive producer.


> There's nothing magical about testers

There is a spectrum, of course. But a good QA has a "feeling" for finding where things break; an excellent QA knows how to fix it.

Just as writing software is a skill, so is QA'ing, along with sysadmining.


> There's nothing magical about testers

Yes, there is: they get to do testing full-time. Engineers spend their best "brain time" designing and writing systems. Testers spend their best brain time, well, testing. That can make a huge difference when it comes to finding bugs.


> There's nothing magical about testers

Oh, but there is: they have fault-finding in their job description.



