My company has monitoring for this, but it still seems to be a law of nature that:
1. someone adds new service/server/infra in a submarine manner
2. it goes to prod
3. the cert expires and outage begins
4. my team is asked what to do, because "we're the cert experts"
5. we add it to the monitoring
So it only happens once … per service. Which isn't great. But how do you get people to slow down and do simple shit, like add a monitor to your new service? (Or promulgate a design doc…)
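The monitor itself is the trivial part. As a sketch of the sort of check we end up adding (stdlib Python only; the hostname and the 14-day threshold are placeholders, not what we actually run):

```python
import socket
import ssl
import time

def days_until_cert_expiry(host: str, port: int = 443) -> int:
    """Fetch the cert that host:port actually serves and return days until it expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expires - time.time()) // 86400)

if __name__ == "__main__":
    # "myservice.myorg.example" is a placeholder; point it at the new service.
    days = days_until_cert_expiry("myservice.myorg.example")
    print(f"certificate expires in {days} days")
    if days < 14:
        raise SystemExit(1)  # non-zero exit so whatever runs this can page someone
```

Wiring that into cron/Prometheus/whatever is the easy part; the hard part is knowing the service exists at all.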
I think the real answer is to only issue limited-duration certs and only via automated means (ACME or similar), thus requiring automation be in place from day 1.
This still doesn't protect against the vector where somebody else in the company has managed to prove themselves a responsible party to some other CA/issuer.
Oh I agree! Everything would be ACME if I could. A ridiculous amount of stuff still doesn't support it, though.
And, like I said, usually it's someone who doesn't grok certs doing it without asking for help in the first place, so they're not going to get why ACME matters. (Because I am tired of doing cert renewals. I've had enough for a lifetime…)
Pit of success. Make it so that the Right Thing™ is super easy, whereas the Wrong Thing™ is frustrating and keeps pushing people towards the Right Thing. Humans are lazy, use that to your advantage.
For example, it's one line in the config for a new machine to be built with a certificate for myservice.myorg.example, and there's a Wiki page reminding me what the line is, or I can look at the many services which already work. If I do that, the automation is there from the outset: my service never has a custom cert or lacks monitoring; it has monitoring from day zero and its certificates are automated. I happen to really care about ACME and the Web PKI and so on, and would have gone the extra mile to do this anyway, but I was astonished in Week One at my current employer to realise: oh, this is just how everything works here, the Right Thing™ is just easier.
Does your company have a Wiki page saying how to do it wrong? After writing the page about the right way, update the bad wiki page with a link to your new page, and cross through all the previous text, or even just delete it.
If you have firewall rules or a proxy, block random unauthorised stuff. This is probably a reasonable strategy anyway. Now they come to you to unblock their "submarine" service, and before you do, that's the opportunity to insist on proper certificate behaviour.
People are really good at avoiding the pit of success! We run most of our infra on k8s, and if you want a cert, with ACME & auto-renew all managed automatically for you, you just create a Certificate object.
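For anyone who hasn't seen it: that's the cert-manager Certificate resource. A rough sketch of creating one through the Kubernetes API is below; the names, namespace, and the "letsencrypt" ClusterIssuer are all placeholders, and in practice you'd more likely just kubectl apply the equivalent YAML:

```python
from kubernetes import client, config

# All names here ("myservice-tls", "myservice", "letsencrypt") are placeholders.
certificate = {
    "apiVersion": "cert-manager.io/v1",
    "kind": "Certificate",
    "metadata": {"name": "myservice-tls", "namespace": "myservice"},
    "spec": {
        "secretName": "myservice-tls",  # cert-manager stores the issued cert/key here
        "dnsNames": ["myservice.myorg.example"],
        "issuerRef": {"name": "letsencrypt", "kind": "ClusterIssuer"},
    },
}

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="cert-manager.io",
    version="v1",
    namespace="myservice",
    plural="certificates",
    body=certificate,
)
```

cert-manager then handles the ACME order and renews the cert on its own; nobody has to remember anything.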
But then we get some vendored product that manages to be completely unable to run in k8s, devs avoid the automation for $reasons, etc.
> Does your company have a Wiki page saying how to do it wrong?
Sometimes we do! I've found a few of these after pressing the point of "why are you doing it this way?" hard enough. But you have to (a) actually get an answer to that, and (b) the answer has to reveal that they followed some shadow docs.
> If you have firewall rules or a proxy, block random unauthorised stuff.
Your rogue service implementer just creates their own VPC; they're in control of the firewall.
Should my security team either set appropriate privileges or delegate that to my team? Perhaps. But I'd have to get them to adopt RBAC and ABAC first; they fervently believe our industry's regulations forbid "job function begets access" (i.e., RBAC) style policies. (They want that, even where a job function does beget access, the access be revoked whenever it isn't actively being exercised, and re-granted only when it's needed again. But that means you end up with an "all security requests must flow through the security team" setup, and because the grants are so ephemeral there are a lot of requests, so the process is inherently ignorant of whether any given request is right. Your rogue implementer's request of "I need to implement $high level service" is basically carte blanche.)
The thing about shadow-docs and shadow-services is that they're hard to find out about in a timely manner. A lot of these comments are fighting the very core of human nature.
(We used to be better about this as a company, back when we were very engineer-heavy and filled with good engineers, most better than me. The quality bar definitely fell at some point, and we've hired a lot of non-engineers to do things that really would be better served by an engineer. Y'all are working at rainbow companies, and I don't know how to keep a company in that state, or move it to that state, as a bottom-rung eng.)