I try to monitor everything because it makes it much easier to debug weird issues when sh*t hits the fan.
> Do you also keep tabs on network performance, processes, services, or other metrics?
Everything :)
> What's your take—would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
I went with collectd [1] and Telegraf [2] simply because they support tons of modules and are very stable. However, I have a couple of bespoke agents for cases where neither collectd nor Telegraf fits.
> Lastly, what's your preference for data collection—do you prefer an agent that pulls data or one that pushes it to the monitoring system?
We can argue this to death, but I'm for push-based agents all the way down. They are much easier to scale, and things are painless to manage when the right tool is used (I'm using Riemann [3] for shaping, routing, and alerting). I used to run a Zabbix setup, and scaling was always the issue (Zabbix is pull-based). I'm still baffled that pull-based monitoring gained traction; probably every new generation needs to repeat the mistakes of the past.
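To make the push model concrete, here is a minimal sketch of what such a bespoke push agent can look like in Python. The collector URL, the JSON payload, and the field names are assumptions made up for illustration; Riemann itself speaks a protobuf-based protocol over TCP/UDP rather than HTTP/JSON. The point is the shape of the architecture: each host decides when to send, and attaches a TTL so the receiver can notice when the host stops reporting.

```python
#!/usr/bin/env python3
"""Sketch of a push-style agent: read a metric locally and push it to a
collector on a fixed interval. Endpoint and payload shape are hypothetical."""
import json
import os
import socket
import time
import urllib.request

COLLECTOR_URL = "http://collector.example.com:8080/events"  # hypothetical endpoint
INTERVAL_SECONDS = 10

def push(event: dict) -> None:
    # Fire-and-forget HTTP POST; a real agent would batch and retry.
    req = urllib.request.Request(
        COLLECTOR_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

while True:
    push({
        "host": socket.gethostname(),
        "service": "load/1min",
        "metric": os.getloadavg()[0],   # 1-minute load average (Linux/macOS)
        "time": int(time.time()),
        "ttl": 3 * INTERVAL_SECONDS,    # lets the receiver flag us as stale
    })
    time.sleep(INTERVAL_SECONDS)
```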
> For us, knowing immediately who should have had data on the last scrape but didn't respond is the value.
Maybe I don't understand your use case well, but with tools like Riemann, you can detect stalled metrics (per host or service), who didn't send the data on time, etc.
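For illustration, the receiving side of that check can be as simple as the following Python sketch (this is just the idea, not actual Riemann configuration, which is written as Clojure streams): remember the last event per (host, service) pair and flag anything whose TTL has lapsed.

```python
import time

# Last event seen per (host, service); each pushed event carries a ttl in seconds.
index: dict[tuple[str, str], dict] = {}

def ingest(event: dict) -> None:
    """Called for every pushed event; remember the latest one per key."""
    index[(event["host"], event["service"])] = event

def stalled(now: float | None = None) -> list[tuple[str, str]]:
    """Return (host, service) pairs whose last event has expired,
    i.e. senders that did not report on time."""
    now = now or time.time()
    return [key for key, event in index.items()
            if now - event["time"] > event["ttl"]]

# Run periodically and alert on whatever comes back:
# for host, service in stalled():
#     alert(f"{host}/{service} stopped sending metrics")
```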
> What mistakes are you referring to?
Besides scaling issues (and push giving you a simpler architecture), in Zabbix's case there were also problems with predictability: you couldn't tell when the server would actually pull a given metric (different metrics can have different cadences), especially when the main Zabbix service was reloaded, had connection issues, or was saturated with stuck threads because some agents took longer to respond than others. This isn't Zabbix-specific; it's a common challenge when a central place has to go around, query things, and wait for responses.
> you can detect stalled metrics (per host or service), who didn't send the data on time, etc
I guess the difference here is that we leverage service discovery in Prometheus for this instead of having to externally build an authoritative list of who/what should have pushed metrics.
> <...> and wait for a response.
As opposed to waiting for $thing to push metrics to you?
I guess I'm not convinced that one architecture is obviously better? There might be downsides to a particular implementation, but generally both work, and external constraints will dictate which you use. E.g., if you're required to ship metrics to multiple places, pushing to Graphite and Datadog becomes easier.
Anything that _should_ be scraped is tagged a certain way and anything that doesn't respond to a scrape gets flagged. After a few flags, an operator is paged. When $thing is destroyed or re-provisioned, different tags lead to a different set of $things to scrape metrics from.
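In Prometheus terms this usually ends up as an alert on the built-in `up` metric (1 when a tagged target answered the scrape, 0 when it didn't). As a language-agnostic sketch of the same flag-then-page logic described above (the `page_operator` hook is hypothetical):

```python
from collections import defaultdict

FLAGS_BEFORE_PAGE = 3  # "after a few flags, an operator is paged"

# Consecutive failed scrapes per tagged target.
consecutive_failures: dict[str, int] = defaultdict(int)

def page_operator(target: str) -> None:
    # Hypothetical paging hook; wire this to your alerting system.
    print(f"PAGE: {target} failed {FLAGS_BEFORE_PAGE} scrapes in a row")

def record_scrape(target: str, responded: bool) -> None:
    """Update the flag count for a tagged scrape target."""
    if responded:
        consecutive_failures[target] = 0
        return
    consecutive_failures[target] += 1
    if consecutive_failures[target] == FLAGS_BEFORE_PAGE:
        page_operator(target)

def forget(target: str) -> None:
    """Drop a target when $thing is destroyed or re-provisioned,
    so stale entries don't keep paging."""
    consecutive_failures.pop(target, None)
```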
[1] https://www.collectd.org/
[2] https://www.influxdata.com/time-series-platform/telegraf/
[3] https://riemann.io/