My company has monitoring for this, but it still seems to be a law of nature that:
1. someone adds new service/server/infra in a submarine manner
2. it goes to prod
3. the cert expires and outage begins
4. my team is asked what to do, because "we're the cert experts"
5. we add it to the monitoring
So it only happens once … per service. Which isn't great. But how do you get people to slow down and do simple shit, like add a monitor to your new service? (Or promulgate a design doc…)
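The monitor itself is the trivial part. As a sketch of the sort of check we end up adding (stdlib Python only; the hostname and the 14-day threshold are placeholders, not what we actually run):

```python
import socket
import ssl
import time

def days_until_cert_expiry(host: str, port: int = 443) -> int:
    """Fetch the cert that host:port actually serves and return days until it expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expires - time.time()) // 86400)

if __name__ == "__main__":
    # "myservice.myorg.example" is a placeholder; point it at the new service.
    days = days_until_cert_expiry("myservice.myorg.example")
    print(f"certificate expires in {days} days")
    if days < 14:
        raise SystemExit(1)  # non-zero exit so whatever runs this can page someone
```

Wiring that into cron/Prometheus/whatever is the easy part; the hard part is knowing the service exists at all.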
I think the real answer is to only issue limited-duration certs and only via automated means (ACME or similar), thus requiring automation be in place from day 1.
This still doesn't protect against the vector where somebody else in the company has managed to prove themselves a responsible party to some other CA/issuer.
Oh I agree! Everything would be ACME if I could. A ridiculous amount of stuff still doesn't support it, though.
And, like I said, usually it's someone who doesn't grok certs doing it without asking for help in the first place, so they're not going to get why ACME matters. (Because I am tired of doing cert renewals. I've had enough for a lifetime…)
Pit of success. Make it so that the Right Thing™ is super easy, whereas the Wrong Thing™ is frustrating and keeps pushing people towards the Right Thing. Humans are lazy, use that to your advantage.
For example, it's one line in the config for a new machine to be built with a certificate for myservice.myorg.example, and there's a Wiki page reminding me what the line is, or I can look at the many services which already work. If I do that, the automation is there from the outset: my service never has a custom cert or lacks monitoring; it has monitoring from day zero and its certificates are automated. I happen to really care about ACME and the Web PKI and so on, and would have gone the extra mile to do this anyway, but I was astonished in Week One at my current employer to realise: oh, this is just how everything works here, the Right Thing™ is just easier.
Does your company have a Wiki page saying how to do it wrong? After writing the page about the right way, update the bad wiki page with a link to your new page, and cross through all the previous text, or even just delete it.
If you have firewall rules or a proxy, block random unauthorised stuff. This is probably a reasonable strategy anyway. Now they come to you to unblock their "submarine" service, and before you do, that's the opportunity to insist on proper certificate behaviour.
People are really good at avoiding the pit of success! We run most of our infra on k8s, and if you want a cert, with ACME & auto-renew all managed automatically for you, you just create a Certificate object.
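For anyone who hasn't seen it: that's the cert-manager Certificate resource. A rough sketch of creating one through the Kubernetes API is below; the names, namespace, and the "letsencrypt" ClusterIssuer are all placeholders, and in practice you'd more likely just kubectl apply the equivalent YAML:

```python
from kubernetes import client, config

# All names here ("myservice-tls", "myservice", "letsencrypt") are placeholders.
certificate = {
    "apiVersion": "cert-manager.io/v1",
    "kind": "Certificate",
    "metadata": {"name": "myservice-tls", "namespace": "myservice"},
    "spec": {
        "secretName": "myservice-tls",  # cert-manager stores the issued cert/key here
        "dnsNames": ["myservice.myorg.example"],
        "issuerRef": {"name": "letsencrypt", "kind": "ClusterIssuer"},
    },
}

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="cert-manager.io",
    version="v1",
    namespace="myservice",
    plural="certificates",
    body=certificate,
)
```

cert-manager then handles the ACME order and renews the cert on its own; nobody has to remember anything.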
But then we get some vendored product that manages to be completely unable to run in k8s, devs avoid the automation for $reasons, etc.
> Does your company have a Wiki page saying how to do it wrong?
Sometimes we do! I've found a few of these after pressing the point of "why are you doing it this way?" hard enough. But you have to (a) actually get an answer to that, and (b) the answer has to reveal that they followed some shadow docs.
> If you have firewall rules or a proxy, block random unauthorised stuff.
Your rogue service implementer just creates their own VPC; they're in control of the firewall.
Should my security team either set appropriate privileges or delegate that to my team? Perhaps. But I'd have to get them to adopt RBAC and ABAC first; they fervently believe our industry's regulations forbid "job function begets access" (i.e., RBAC) style policies. (They want that, even where a job function does beget access, the access be revoked whenever it isn't actively being exercised, and re-granted only when it's needed again. But that means you end up with an "all security requests must flow through the security team" setup, and because the grants are so ephemeral there are a lot of requests, so the process is inherently ignorant of whether any given request is right. Your rogue implementer's request of "I need to implement $high level service" is basically carte blanche.)
The thing about shadow-docs and shadow-services is that they're hard to find out about in a timely manner. A lot of these comments are fighting the very core of human nature.
(We used to be better about this as a company, back when we were very engineer-heavy and filled with good engineers, most better than me. The quality bar definitely fell at some point, and we've hired a lot of non-engineers to do things that really would be better served by an engineer. Y'all are working at rainbow companies, and I don't know how to keep a company in that state, or move it to that state, as a bottom-rung eng.)