OpenTelemetry and vendor neutrality: how to build an observability strategy (grafana.com)
148 points by meysamazad 7 months ago | 56 comments



OpenTelemetry is nice overall as there are libraries for multiple platforms. I introduced it this year for a web game platform with servers in Node, Java, PHP and Rust, and it all worked roughly the same way across them, which made it good for consistency.

I like how OpenTelemetry decouples the signal sink from the context, compared to other structured logging libs where you wrap your sink in layers. The main thing that I dislike is the auto-instrumentation of third-party libraries: it works great most of the time, but when it doesn't it's hard to debug. Maintainers of the different OpenTelemetry repos are fairly active and respond quickly.
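To illustrate the decoupling (a minimal Java sketch, class and span names made up): instrumentation code only touches the API and the current context, while the sink is whatever the SDK wires up at startup.

  import io.opentelemetry.api.GlobalOpenTelemetry;
  import io.opentelemetry.api.trace.Span;
  import io.opentelemetry.api.trace.Tracer;
  import io.opentelemetry.context.Scope;

  class CheckoutHandler {
    // The tracer comes from the API; which sink receives the spans
    // (OTLP, Jaeger, stdout, ...) is decided by the SDK configuration,
    // not by this code.
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("checkout-handler");

    void handle() {
      Span span = tracer.spanBuilder("handle").startSpan();
      try (Scope ignored = span.makeCurrent()) {
        // business logic: anything called from here sees the active
        // context via Span.current(), with no knowledge of the sink
      } finally {
        span.end();
      }
    }
  }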

It's still relatively recent, but I would recommend OpenTelemetry if you're looking for an observability framework.


[flagged]


Why are you accusing me of posting an LLM reply?!

I just shared that I enjoyed using and contributing to OpenTelemetry. I never used an LLM. Do I really need to prove that I'm human?

- a couple PRs I posted to the Rust impl: https://github.com/open-telemetry/opentelemetry-rust/pulls?q... I also participate in issue discussions in this repo and others.

- OpenTelemetry tracer config in the web game platform I'm working on: https://gitlab.com/eternaltwin/eternaltwin/-/blob/main/crate...

- A somewhat cool thing: an OpenTelemetry proxy impl handling authentication of traces: https://gitlab.com/eternaltwin/eternaltwin/-/blob/main/crate...

- Node usage: https://gitlab.com/eternaltwin/labrute/labrute-react/-/blob/...

- Java usage: https://gitlab.com/eternaltwin/dinoparc/dinoparc/-/merge_req...

- PHP usage: https://gitlab.com/eternaltwin/mush/mush/-/merge_requests/20...

This is all 100% handcrafted, as are all my messages, ever.

Now that I proved that I actually used and contributed to OpenTelemetry, may I express that I like it overall but regret that the auto-instrumentation is brittle and hard to debug? I can expand on the particular issues I've hit or why I feel that it's still not mature enough.


FWIW, you gave that commenter way more than they deserved for the amount of effort they put into their comment. Also, I wouldn't have suspected your earlier comment was generated by an LLM.


Thanks, I was a bit worried that maybe the initial comment was not a good fit. I've read the article, but had more to say about OpenTelemetry than about Grafana.

We use Grafana at work, with their Tempo product for trace analysis. We generate traces using OpenTelemetry. Tempo helps with debugging and perf work, but for personal projects I prefer open source solutions. I've used Uptrace, OpenObserve and Jaeger as backends, but thanks to this thread I also discovered Perses. In general, I prefer HN discussions that are a bit broader than the article itself.


I think the biggest value I see with OpenTelemetry is the ability to instrument your code and telemetry pipeline once (using the otel collector) and then choose a backend and visualisation framework which meets your requirements.

For example at SigNoz [0], we support OpenTelemetry format natively from Day 1 and use ClickHouse as the datastore layer which makes it very performant for aggregation and analytics queries.

There are alternative approaches, like what Loki and Tempo do with a blob-storage-based design.

If your data is instrumented with Otel, you can easily switch between open source projects like SigNoz or Loki/Tempo/Grafana, which IMO is very powerful.
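As a rough sketch of what that switch looks like if you configure the SDK in code (the hostnames below are placeholders; most setups would just change OTEL_EXPORTER_OTLP_ENDPOINT or the collector's exporter section instead):

  import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
  import io.opentelemetry.sdk.trace.SdkTracerProvider;
  import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

  final class TracingSetup {
    // Instrumentation stays untouched when changing backends; only the
    // OTLP endpoint differs, e.g. http://signoz:4317 vs http://tempo:4317.
    static SdkTracerProvider forBackend(String otlpEndpoint) {
      OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
          .setEndpoint(otlpEndpoint)
          .build();
      return SdkTracerProvider.builder()
          .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
          .build();
    }
  }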

We have seen users switch from another backend to SigNoz within a matter of hours when they are instrumented with Otel. This makes testing and evaluating new products super efficient.

Otherwise, just the effort required to switch instrumentation to another vendor would have been enough to never even think about evaluating another product.

(Full disclosure: I am one of the maintainers at SigNoz)

[0] https://github.com/signoz/signoz


This is definitely the same sort of take we have at StarTree (Apache Pinot, but equivalent to the ClickHouse story you have above). We're now working on making sure StarTree is compliant with PromQL (we just demoed this at Current 2024), as well as TraceQL and LogQL. Because, not surprisingly, query semantics actually matter!

https://thenewstack.io/reimagining-observability-the-case-fo...


> StarTree is compliant with PromQL

do you mean one can write queries on StarTree using PromQL?


Yes. For example you can make queries from Grafana using PromQL to StarTree Cloud. It's in private preview now.


Swapping backends is huge when you consider that it makes it fairly easy to run a local setup.

One other nice thing about OTEL is the standardization. I can plug in 3rd party components and they'll seamlessly integrate with 1st party ones. For instance, you might add some load balancers and a data store. If these support OTEL, you can view the latency they added in your normal traces.
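Roughly what the standard buys you on the wire (Java sketch, assuming the default W3C traceparent propagator): the context travels as plain headers, so any component that speaks the same convention joins the same trace.

  import io.opentelemetry.api.GlobalOpenTelemetry;
  import io.opentelemetry.context.Context;
  import java.util.HashMap;
  import java.util.Map;

  final class PropagationSketch {
    // Inject the current trace context into outgoing headers so a load
    // balancer or data store that understands W3C traceparent can attach
    // its own spans to the same trace.
    static Map<String, String> outgoingHeaders() {
      Map<String, String> headers = new HashMap<>();
      GlobalOpenTelemetry.getPropagators()
          .getTextMapPropagator()
          .inject(Context.current(), headers,
                  (carrier, key, value) -> carrier.put(key, value));
      return headers;
    }
  }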


Related to this, as a library author it's great that the library can now come pre-instrumented and the application can use whatever backend they want. Previously, I would have to expose some kind of event emitter or hooks that the application would have to integrate against.
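e.g. (a sketch, library name made up) the library only depends on the lightweight opentelemetry-api artifact: if the application installs no SDK, everything is a cheap no-op; otherwise the spans flow to whatever backend the app chose.

  import io.opentelemetry.api.GlobalOpenTelemetry;
  import io.opentelemetry.api.trace.Span;
  import io.opentelemetry.api.trace.Tracer;

  public final class MyDbClient {
    // No SDK or exporter dependency here; the application decides
    // (via its own SDK config) where these spans end up.
    private static final Tracer TRACER =
        GlobalOpenTelemetry.getTracer("my-db-client", "1.0.0");

    public void query(String sql) {
      Span span = TRACER.spanBuilder("db.query").startSpan();
      try {
        // ... run the query ...
      } finally {
        span.end();
      }
    }
  }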


Yeah, totally agree this is very helpful


From the very beginning of my tenure at my current "start-up" I wrote a bespoke Rust implementation using the base OpenTelemetry library with lots of opinionated defaults and company specifics. We integrated this early on in our microservice development, and it's been an absolute game changer. All of our services include the library and use a simple boilerplate macro to include metrics and tracing in our Actix and Tonic servers, Tonic client, etc. Logs are slurped off Kubernetes pods using promtail.

It was easy enough that I, as a single SRE (at the time) could write and implement across dozens of services in a few months of part-time work while handling all my other normal duties. OpenTelemetry has proved to be worth the investment, and we have stayed within the Grafana ecosystem, now paying for Grafana Cloud (to save our time on maintaining the stack in our Kubernetes clusters).

I would absolutely recommend it, and hope to use it at any new positions in the future.


I can confirm that this is a pretty good way to go. Building out a basic distributed tracing solution with OTEL, Jaeger and the relevant Spring Boot configuration and dependencies was quite a pleasant experience once you figure out the relevant-for-your-use-cases set of dependencies. It's one of those nice things that Just Works™, at least for Java 17 and 21 and Spring Boot 3.2 (iirc) or greater.

There appeared to be a wide array of library and framework support across various stacks, but I can only attest personally to the quality of the above setup (Java, Boot, etc).


> once you figure out the relevant-for-your-use-cases set of dependencies

> It's one of those nice things that Just Works™

did it Just Work™ or did you have to do work to make it Just Work™?


Java has really good OTel coverage across tons of libraries. It should mostly Just Work™, though you'll still need to consider sampling strategies, what metrics you actually want to collect, etc.
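For instance, when configuring the SDK directly (a sketch; the agent exposes the same thing via otel.traces.sampler properties, if I remember the names right), sampling is one line:

  import io.opentelemetry.sdk.trace.SdkTracerProvider;
  import io.opentelemetry.sdk.trace.samplers.Sampler;

  final class SamplingConfig {
    // Keep roughly 10% of new traces; follow the parent's decision for
    // spans that already belong to a sampled (or dropped) trace.
    static SdkTracerProvider sampledProvider() {
      return SdkTracerProvider.builder()
          .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.10)))
          .build();
    }
  }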

Would say .NET isn't too far behind. Especially since there are built-in observability primitives and Microsoft is big on OTel. ASP.NET Core and other first party libraries already emit OTel compliant metrics and traces out of the box. Instrumenting an application is pretty straightforward.

I have less experience with the other languages. Can say there is plenty of opportunity to contribute upstream in a meaningful way. The OpenTelemetry SIGs are very welcoming and all the meetings are open.

Full disclosure: work at Grafana Labs on OpenTelemetry instrumentation


they say good technology makes "easy things easy, and hard things possible".

Personally, I think a lot of those built-in Java libraries make the easy things easy. Will it Just Work™ Out Of The Box™?

Hey we're engineers right? We know that the right answer to every question, bar none is "it depends on what you're trying to do" right? ;)


In my experience, it just worked -- I was at an org that ran 3rd party Java services alongside our normal array of microservices (that all used our internal instrumentation library that wrapped OTEL), and using the OTEL autoinstrumentation for those 3rd party services was pretty trivial to get set up and running (just wrap the command to run the app with the OTEL wrapper and hand it a collector URL). Granted -- we already had invested in OTEL elsewhere and were familiar with many of the situations in which it didn't just work.


As I recall (this was about 6 months ago now) the Spring project's support for these libraries was somewhat in flux and still evolving, so arriving at the correct set of dependencies for a modern Boot stack was somewhat fraught with reading now-out-of-date information and having to find the newly-correct way. Once that was out of the way I needed to add like 3 or so dependencies to a parent POM all the teams used as a base and then add a small handful of config (which was well documented) to get it working. That config was like the Jaeger sink endpoint, etc.

If you have the ability to use or upgrade to the newest version of Spring Boot, you will not have to go through what I did finding the "correct way" because iirc there was a lot of shifting happening in those deps between v3.1 and 3.2.


I had a very similar experience with a Quarkus REST API, where it's supported very well out of the box; we just had to point it at the appropriate otel collector endpoint and traces were created and propagated automatically.


I tried to introduce OTel in a greenfield system at a large Canadian bank. Corporate IT pushed back hard because they'd heard a lot of "negative feedback" about it at a Dynatrace conference. No surprises there.

Corporate IT were not interested in reducing vendor lock-in; in fact, they asked us to ditch the OTel Collector in favour of Dynatrace OneAgent even though we could have integrated with Dynatrace without it.


> they'd heard a lot of "negative feedback" about it at a Dynatrace conference

It's funny because Dynatrace fully supports OpenTelemetry, even offering its own distribution of the OpenTelemetry Collector.


It's not uncommon to see misalignment in large corporations like that. A year or so ago, Datadog field teams were explicitly creating FUD over OTel ("it's unstable, unusable", all the same crap) while at the same time ramping up some of their efforts to contribute to OTel and make their own product experiences with OTel data better. They (and perhaps also Dynatrace) have an entire field org trained on the idea that their proprietary tech is inherently better and nothing in the open will be able to compete with it.

Also, to say OTel threatens these proprietary agents would be an understatement. The OTel Java agent comes with 100+ OOTB integrations right now. If I were a Dynatrace sales leader and I knew that we'd sunk a ton of cost into creating our own stuff, I'd be casting FUD into the world over OTel too.


Similarly, this article starts with this weird disclaimer:

>We realize, however, that [vendor] neutrality can have its limits when it comes to real-world use cases.


Dynatrace is a leading contributor to the OpenTelemetry project. It supports OTLP traces, metrics, and logs and offers a supported OpenTelemetry Collector distribution with receivers for signals like StatsD and Prometheus. By investing in OpenTelemetry, Dynatrace can focus more on analytics and less on data collection. This is Dynatrace's long-term strategy.

In the short term (at least the next 18 months), data collection decisions remain important, especially for vendors like Dynatrace that provide added value beyond standard trace views, golden signals, alerting, and dashboards. Organizations need to think about these instrumentation choices on a workload-by-workload, and team-by-team basis.

Choosing between OneAgent and OpenTelemetry instrumentation is pretty straightforward. Teams use OpenTelemetry to send observability signals to more than one backend. Teams use OneAgent for its added value, such as deep code-level insights in the context of traces or performance anomalies.

Over time, these decisions will become less critical as data collection becomes more of a commodity.

Full disclosure: I am a Dynatrace Product Manager.


The problem with OpenTelemetry is that it's really only good for tracing. Metrics and logs were kinda bungee-strapped on later: very inefficient and clunky to use.

PS: And devs (Lightstep?) seem to really like the "Open" prefix: OpenTracing + OpenCensus = OpenTelemetry.


Could you elaborate more about metrics and logs feeling bungee strapped together?

For logs it's going to necessarily be a bit messier IMO. Logs in OTel are designed to just be your existing application logs, but an SDK or agent can wrap those with a trace and span ID to correlate them for you. And so the benefit is you can bring your existing logs, but the downside is there's no clean framework to use for logging effectively, meaning there's a great deal of variability in the quality of those logs. It's also still rolling out across the languages, so while you might have excellent support in something like Java, the support in Node isn't as clean right now.
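You can get most of the correlation by hand while that matures; a rough Java sketch of what the SDKs/agents do for you (copying the active IDs into the logging context so every line can be joined back to its trace):

  import io.opentelemetry.api.trace.Span;
  import io.opentelemetry.api.trace.SpanContext;
  import org.slf4j.MDC;

  final class LogCorrelation {
    // Copy the active trace/span IDs into SLF4J's MDC; the log pipeline
    // can then attach them to each record it ships to the backend.
    static void tagCurrentTrace() {
      SpanContext ctx = Span.current().getSpanContext();
      if (ctx.isValid()) {
        MDC.put("trace_id", ctx.getTraceId());
        MDC.put("span_id", ctx.getSpanId());
      }
    }
  }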

Metrics is pretty well-baked, but it's a different model and wire format than Prometheus or other systems.


My two cents about metrics: in my experience, the documentation, examples, ecosystem, etc. are far behind traces. OTel blogs and tutorials usually assume you want tracing only, or were written a few years back when OTel was (almost?) exclusively traces.


Why do all these things use such damnably inefficient wire formats?

For metrics, we're shipping a bunch of numbers over the wire, with some string tags. So why not something like:

  message Measurements {
    uint32 metric_id = 1;
    uint64 t0_seconds = 2;
    uint32 t0_nanoseconds = 3;
    repeated uint64 delta_nanoseconds = 4 [packed = true];
    repeated int64 values = 5 [packed = true];
  }
Where delta_nanoseconds represents a series of deltas from timestamp t0 and values has the same length as delta_nanoseconds. Tags could be sent separately:

  message Tags {
    uint32 metric_id = 1;
    repeated string tags = 2;
  }
That way you only have to send the tags if they change and the values are encoded efficiently. I bet you could have really nice granular monitoring e.g. sub ms precision quite cheaply this way.

Obviously there are further optimizations we can make if e.g. we know the values will respond nicely to delta encoding.



Those are interesting results! I'm not surprised it works a lot better for metrics than logs and traces. Something I'd really love to have for logs/traces processing is the ability to query clp[1][2] with a dataframe interface (e.g. datafusion [3]). While I'm on that subject, I'd also prefer that interface for metrics processing. I don't need real-time streaming metrics graphs, it's perfectly fine to compute one on-demand.

I suspect something like clp is the way to go for logs-like data, that is, low entropy text with a lot of numerical content.

[1] https://www.uber.com/blog/reducing-logging-cost-by-two-order... [2] https://www.uber.com/blog/modernizing-logging-with-clp-ii/ [3] https://github.com/apache/datafusion


Do you know for sure that otel doesn't do this? Most collector pipelines I've seen use the batch processor, which may include exactly what you're describing. Not being obtuse, I've never looked at the source to see what it does.


Not as far as I can tell from the schema definitions[1].

[1] https://github.com/open-telemetry/opentelemetry-proto/tree/v...


Most modern developers (those who got started >2000) never had to worry about hyperefficiency — i.e., bitpacking. To them bandwidth, like diskspace, is near infinite and free. Who uses a single reserved bit to set a zero or a 1 these days when you can use a whole int32 (or int64)?

Yet I applaud your desire to make things more wire-efficient.


In the cloud where data transfer fees dominate it's really important. Although nobody seems to realize this and they just pay the Amazon tax lol.


The board gathers around the monthly cloud vendor bill and wonders why they need to raise a new round just to pay it off.


Generally you only have one point to send per series. You send all the points for ‘now’, then in N seconds you send them all again.


Can you expand upon this? Why would I have more than one point per timestamp? Or am I misunderstanding?

Let's say I'm measuring some quantity, maybe the execution time of an http request handler, and that handler is firing roughly every millisecond and taking a certain amount of time to complete. I'd have about a thousand measurements per second, each with their own timestamp--which to be clear can be aliased if something happens in the same nanosecond! It's totally fine to have a delta of zero. But the point is this value is scalar--it's represented by a single point.

But it seems like you're suggesting vector-valued measurements are a common thing as well--e.g. I should expect to send multiple points per measurement? I'm struggling to think of an application where I'd want this.. I guess it would be easy enough to add more columns.. e.g. values0, values1, ...

EDIT: oh, I see, I think you're saying I should locally aggregate the measurements with some aggregation function and publish the aggregated values.. Yeah that's something I'd really prefer to avoid if possible. By aggressively aggregating to a coarse timestamp we throw away lots of interesting frequency information. But either way, I don't think that really affects this format much. You could totally use it for an aggregated measurement as well. And yeah each of these Measurements objects represents a timeseries of measurements--we'd simultaneously append to a bunch of them, one for each timeseries. I probably should have called it "Timeseries" instead.

EDIT2: It might be worth spelling out a bit why this format is efficient. It has to do with the details of how Google Protocol Buffers are encoded[1]. In particular, the timestamps are very cheap so long as the delta is small, and the values can also be cheap if they're small numbers--e.g. also deltas--which for ~continuously and sufficiently slowly varying phenomena is usually the case. Moreover, packed repeated fields[2] (which is unnecessary to specify in proto3 but I included here to be explicit about what I mean) are further efficient because they omit tags and are just a length followed by a bunch of variable-width encoded values. So this is leaning on packing, varints, and delta-encoding to be as compact as possible.

[1] https://protobuf.dev/programming-guides/encoding/ [2] https://protobuf.dev/programming-guides/encoding/#packed


Take a step back. You wrote:

    repeated int64 values 
I’m saying that in most cases there will only be one value, hence ‘repeated’ is unnecessary.

I didn’t say anything about aggregation, but yes one counts things going at a thousand per second rather than sending all the detail. The Otel signal if you want detail is traces, not metrics.


> Take a step back.

Excuse me? Modify your tone, read what I wrote again, and this time make an effort to understand it. I'd be happy to answer any questions you might have.

I'm sorry if this sounds harsh but I truly cannot tell if you're trolling or what.. I think I made a serious effort to understand what you were talking about, and it seems like you haven't done the same.


For anyone that has built more complex collector pipelines, I'm curious to know the tech stack:

  - otel collector?
  - kafka (or other mq)?
  - cribl?
  - vector?
  - other?


What sort of complexity do you need? I used the otel collector at my previous job and am implementing it at my current one. I have never heard of the last three you mention.

The otel collector is very useful for gathering from multiple different sources, e.g. I am at a big corporation where we have both a department-level Grafana stack (Prometheus, Loki, etc.) and a need to also send the data to Dynatrace. With the otel collector these things are a minor configuration change away.

As for Kafka, if you mean tracing through Kafka messages, we previously did it by propagating the context in message headers. Done in a shared team-level library, the effort was minimal.


The OpenTelemetry collector works very well for us – even for internal pipelines. You can build upon a wide array of supported collector components; extending it with your own is also relatively straightforward. Plus, you can hook up Kafka (and others) for data ingress/egress.


As an aside, can all of y'all competing observability vendors cool it with the pitches in the comments? Literally every time someone posts about observability there's a half dozen or more heavy-handed pitch comments to wade through.


This is hypocritical content marketing from a company that doesn't want you to be vendor neutral. As seen in the laughable use of hyperlinks to their own products, but no links when mentioning Prometheus or Elasticsearch.

OTEL is great, I just wish the CNCF had better alternatives to Grafana labs.


Check out Perses: https://github.com/perses/perses

Less mature than Grafana but recently accepted by the CNCF as a sandbox project, hopefully a positive leading indicator of success.


Perses is an excellent step towards vendor neutrality. We at Dash0 are basing our dashboarding data model entirely on Perses to allow you to bring/take away as much of your configuration as possible.

The team around Perses did a solid job coming up with the data model and making it look very Kubernetes manifest-like. This makes for good consistency, especially when configuring via Kubernetes CRs.


Didn't know about Perses. Looks promising! I've had one foot out the door with Grafana for a couple years -- always felt heavy and not-quite-there (especially the Loki log explorer), and IMHO they made alerts far less usable with the redesign in version 8.


Well, no, they don't _want_ you to be vendor neutral. But they allow and support you to do so - unlike DataDog.


I haven't had too many issues extending Datadog--quite a bit of the client-side stuff is open source.

Main issue is cost management


If you're looking for a target for your OTEL data, check out Coroot too - https://coroot.com/ In addition to OTEL visualization, it can use eBPF to generate traces for applications where OpenTelemetry instrumentation can't be installed.


JMX -> Jolokia -> Telegraf -> the-older-TICK-stack-before-influxdb-rewrote-it-for-the-3rd-time-progressively-making-it-worse-each-time


OpenTelemetry is definitely a good thing that will help reduce vendor lock-in and the exploitative practices of some vendors when they see that the customer is locked in by proprietary code instrumentation. In addition, OpenTelemetry autoinstrumentation is fantastic and allows one to get started with zero-code instrumentation.

Going back to the basics - 12-factor app principles must also be adhered to in scenarios where OpenTelemetry might not be an option for observability. E.g. logging is not very mature in OpenTelemetry for all languages as of now. Sending logs to stdout provides a good way to allow the infrastructure to capture logs in a vendor-neutral way using standard log forwarders of your choice like fluentbit and otel-collector. Refer - https://12factor.net/logs

OTLP is a great leveler in terms of choice that allows people to switch backends seamlessly and will force vendors to be nice to customers and ensure that enough value is provided for the price.

For those who are using Kubernetes, you should check out the OpenTelemetry operator, which allows you to autoinstrument your applications written in Java, NodeJS, Python, PHP and Go by adding a single annotation to your manifest file. Check an example of autoinstrumentation here -

                                                  /-> review (python)
                                                 /
  frontend (go) -> shop (nodejs) -> product (java)
                                                 \
                                                  \-> price (dotnet)

Check here for the complete code - https://github.com/openobserve/hotcommerce

p.s. An OpenObserve maintainer here.


I'd argue that since most observability use cases are "write once, read hardly ever" they aren't really transactional (OLTP-oriented). You're doing inserts at a high rate, quick scans in real time, and a few exploratory reads when you need. Generally you're not doing upserts, overwriting existing data; you're doing time-series, where each "moment" has to be captured atomically.

Given that, it makes sense that with an OLAP datastore like Apache Pinot you can do the same as OLTP, and better. You can do faster aggregations. Quicker large-range or full-table scans.

(Disclosure: I work at StarTree, which is powered by Apache Pinot)


Not sure if you got OTLP (OpenTelemetry Protocol) and OLTP (Online Transaction Processing) mixed up.

Pinot is cool.

OpenObserve is similar to Pinot but built in Rust with Apache Arrow DataFusion as the underlying technology, and for a different, targeted and tailored use case: observability.


Oh Jeez! Yes, apologies. I even just wrote a whole explainer for our internal folks on OTLP vs. OLTP. (Lolfail!)

And that's cool! I'll look into OpenObserve.


I can 100% confirm that OpenTelemetry is a fantastic project to get rid of most observability lock-in.

For context: I am the Head of Product at Dash0, a recently launched observability product 100% based on OpenTelemetry. (And Dash0 is not even the first observability product based on OpenTelemetry I've worked on.)

OTLP as a wire protocol goes a long way in ensuring that your telemetry can be ingested by a variety of vendors, and software like the OpenTelemetry Collector enables you to forward the same data to multiple backends at the same time.

Semantic conventions, when implemented correctly by the instrumentations, put the burden of "making telemetry look right" on the side of the vendor, and that is a fantastic development for the practice of observability.

However, of course, there is more to vendor lock-in than "can it ingest the same data". The two other biggest sources of lock in are:

1) Query languages: Vendors that use proprietary query languages lock your alerting rules and dashboards (and institutional knowledge!) behind them. There is no "official" OpenTelemetry query language, but at Dash0 we found that PromQL suffices to do all types of alerting and dashboards. (Yes, even for logs and traces!)

2) Integrations with your company processes, e.g., reporting or on-call.



