Google and Friends Add Prometheus to Kubernetes Platform

andrewstuart2 · on May 11, 2016

Probably more accurately, Prometheus is now officially being incubated by the Cloud Native Computing Foundation, which is a foundation for apps which are container-packaged, dynamically-managed, and microservices-oriented. Kubernetes is an engine to run cloud-native apps (itself conforming to the above spec). Prometheus is a platform for monitoring cloud-native apps, which also conforms to the spec.

Having watched CNCF and kubernetes very closely, I would not be surprised at all to see it adopt the governance of multiple platforms for a given service (logging, for example).

arturhoo · on May 11, 2016

Awesome news! Prometheus does fill a void in the monitoring, time series, visualization niche. I found it more straightforward to setup and use than the alternatives such as statsd, collectived, influxdb (which seems to be headed the wrong way in monetizing) and grafana et al.

I always thought that Hashicorp would someday fill this void with their polished products and straightforward community and monetization strategies, especially through consul [0]. I hope it can polish the rough edges while being incubated by the Cloud Native Computing Foundation.

PS.: I've prepared an Ansible role for prometheus for those interested, albeit outdated [1]

[0]: https://groups.google.com/forum/?utm_medium=email&utm_source...

[1]: https://github.com/arturhoo/cameos/tree/master/roles/prometh...

SEJeff · on May 11, 2016

FWIW regarding your grafana comment, Prometheus upstream plans to deprecate PromDash and move the ui to grafana as they do a better job.

From: https://prometheus.io/docs/visualization/promdash/

    NOTE: We recommend Grafana for visualization of Prometheus metrics nowadays, as it has native Prometheus support and is widely adopted and powerful. There will be less focus on PromDash development in the future.

jabl · on May 11, 2016

So what does it signify that Prometheus is now under the CNCF? More development resources? I suppose it at least to some extent allays whatever fears there might have been that it will go "open core" in some ways, as e.g. InfluxDB recently..

That being said, what is the killer feature here? Looking at the docs, it seems nice, but collectd+influxdb+grafana seems to do the job for us at the moment..

infinotize · on May 11, 2016

Re: prometheus vs influx: There's a small thread with a few comments by the influxdb ceo and some Prometheus devs here [0], mainly on use case and philosophy. See also [1], although this is quite dated.

[0] http://stackoverflow.com/questions/33350314/usecases-influxd... [1] https://prometheus.io/docs/introduction/comparison/#promethe...

sz4kerto · on May 11, 2016

InfluxDB _seems_ to be less stable, and also their monetization strategy might not appeal to some. I also appreciate the pull-style operation more and more (vs. the push-based operation of InfluxDB).

gtirloni · on May 11, 2016

Could you tell what you find better about the pull-style operation mode? My experience with pull-based monitoring tools comes from Zabbix/Nagios and it wasn't very pleasant (still had to configure clients extensively while keeping everything organized in a central database -- which is hard if your servers are coming and going all the time).

We're currently using collectd+influxdb+grafana but it looks like Prometheus might be a better option in the future, since we plan to use Kubernetes more and more.

sagichmal · on May 11, 2016

You're right in that pull only works well when your puller has a subscription to your service discovery system, so you don't need to manually configure targets when they change. Prometheus has SD plugins for major systems including Kubernetes.

bbrazil · on May 11, 2016

Prometheus developer here.

> More development resources?

Not directly. I expect the gains to be around the legal and organisational stuff you need as a project grows (e.g. trademark registration, formal governance), as well as publicity.

> I suppose it at least to some extent allays whatever fears there might have been that it will go "open core" in some ways, as e.g. InfluxDB recently..

That was never really a risk, as Prometheus has never been a commercial effort controlled by any one company.

> That being said, what is the killer feature here?

Prometheus is perfect for monitoring any sort of operational timeseries, and is designed to do so in a way that scales both technically and organisationally.

For example collectd is an on-host daemon that you have one of per machine. That may work when you're small, but with many services that will become a bottleneck as each need to be added and stragglers may become an issue (Twitter seem to have hit this, see "Lessons Learned" https://blog.twitter.com/2016/observability-at-twitter-techn...). The Prometheus approach of one exporter per target scales far better.

For another example if two different teams (say infra and database) have two different views of the world, with a collectd approach then the infra team likely wins. With Prometheus monitoring can be decentralised so infra can view a machine as mach1{rack="ch",datacenter="ma"} and database as mach1{owner="app2",env="canary"} each in their own Prometheus server. This gives teams more control by decentralising monitoring.

jabl · on May 11, 2016

Thanks for taking the time to answer.

> I expect the gains to be around the legal and organisational stuff you need as a project grows (e.g. trademark registration, formal governance), as well as publicity.

Good point. It's easy to forget the paperwork, but it certainly is important too!

> Prometheus is perfect for monitoring any sort of operational timeseries, and is designed to do so in a way that scales both technically and organisationally.

Umm, yeah, but well, isn't that the punchline of every monitoring project? ;-)

> For example collectd is an on-host daemon that you have one of per machine. That may work when you're small, but with many services that will become a bottleneck as each need to be added and stragglers may become an issue (Twitter seem to have hit this, see "Lessons Learned" https://blog.twitter.com/2016/observability-at-twitter-techn...). The Prometheus approach of one exporter per target scales far better.

I don't understand. Isn't that twitter blog essentially saying that they went from pull to push? Or is the main point here that they changed to have the agent on each host that collects everything on that host before transferring those metrics to the server (whether that's push or pull)?

> For another example if two different teams (say infra and database) have two different views of the world, with a collectd approach then the infra team likely wins. With Prometheus monitoring can be decentralised so infra can view a machine as mach1{rack="ch",datacenter="ma"} and database as mach1{owner="app2",env="canary"} each in their own Prometheus server. This gives teams more control by decentralising monitoring.

Good point. And I think this hits home in the sense that with our current monitoring setup the metrics we collect are really host-level, whereas for many metrics we'd rather have per-job (we're a HPC shop, so a bit different from the usual web 2.0-stuff) statistics. But I'm not sure Prometheus bends to this either, since IIUC each metric is stored in a separate file, and if we'd have a bunch of metrics for each job ID in the system, this wouldn't really work out, would it?

bbrazil · on May 11, 2016

> Umm, yeah, but well, isn't that the punchline of every monitoring project? ;-)

No actually :)

From reading countless websites for the multitude of monitoring solutions out there, the standard pitch is that they'll make your operations more efficient, magically detect problems and give you new insights into your systems.

It's actually really annoying, as you have to dig deep into their docs and squint at screenshots to figure out that actually it's just another Nagios clone :)

> I don't understand. Isn't that twitter blog essentially saying that they went from pull to push? Or is the main point here that they changed to have the agent on each host that collects everything on that host before transferring those metrics to the server (whether that's push or pull)?

What I take from it was had they kept the same design and just changed to push, they'd have continued to have had the exact same problems. I'd view the addition of isolation and health information as what really solved the issue, and whether push or pull is better for that is a wash.

> But I'm not sure Prometheus bends to this either, since IIUC each metric is stored in a separate file, and if we'd have a bunch of metrics for each job ID in the system, this wouldn't really work out, would it?

I'd need a bit more detail to give an exact solution, but Prometheus should work fine for for this.

Metrics are fetched over HTTP by Prometheus, and they usually don't come from files. You'd either have Prometheus scrape each job individually, or scrape your controller daemon that'd attach a labels indicating the job of each metric.

jabl · on May 11, 2016

> Metrics are fetched over HTTP by Prometheus, and they usually don't come from files. You'd either have Prometheus scrape each job individually, or scrape your controller daemon that'd attach a labels indicating the job of each metric.

I was thinking more of the data model (https://prometheus.io/docs/concepts/data_model/). Specifically, from https://prometheus.io/docs/practices/naming/ , "Remember that every unique key-value label pair represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.". So in our case the job ID would be such an unbounded value..?

bbrazil · on May 12, 2016

These would be the exact details I'd need to advise you.

It depends on how many jobs you have, how much churn there is, and what exactly you want to monitor. If jobs are short-lived then tracking individual jobs would be unwise, something like the ELK stack intended for event logging would be better. If jobs are long-lived and there's not many of them then you should be okay. Otherwise you'd just be looking at tracking system rather than per-job stats.

To give a very rough idea, if you can keep it below say 10M metrics across the history a single Prometheus server has that should be okay with the current implementation.