Very cool, would love something like this. Your video shows a fairly straightforward incident response that traditional tools would handle equally well. Can you describe a situation that Zebrium handles better than legacy tools? Perhaps a hypothetical unknown unknown.
There are a few testimonials on the website, but there are plenty of other proof points we can't attribute. Off the top of my head, here are a few that stand out:
1.) A latent LDAP server issue that would have taken down a mission-critical SaaS app at a Fortune 500 enterprise software company. Zebrium detected it and showed root-cause indicators.
2.) Two production bugs that had been degrading service for a subset of users for weeks in a multi-billion-dollar B2B SaaS company's production deployment. Zebrium detected them and showed root-cause indicators.
3.) Multiple backend bugs degrading service in a $1B e-commerce company's production deployment. Zebrium detected them and showed root-cause indicators.
4.) All OpenEBS issues that had been observed YTD in real customer deployments, replicated using Litmus by MayaData. Zebrium detected them and showed root-cause indicators.
5.) Here's an unsolicited quote that a DevOps consultant from the UK posted in our community four months ago:
"The data has started coming through and has picked up all the incidents I deliberately caused and a couple of other that I didn't know about.
This setup so cuts through the noise of logs to the heart of the matter that it would not be over stating the case to say that this is the future of Observability.
Brilliant!"
Although this doesn't quite answer your question about unknown unknowns, the open-source project below lets you inject failure modes into your own app using a chaos tool (Litmus). We had really good results catching the application incidents created by these chaos tests:
https://github.com/zebrium/zebrium-kubernetes-demo