I've gone back and forth on this. Yes - sometimes your worry is completely unfounded, but this article lost me here:

>Untangling messy parts of the codebase, identifying and removing dead code, cleaning up build, deploy and monitoring systems - these tasks often feel overwhelming because they can balloon, sucking up days for no real gain.

Those aren't the risks I'm thinking about when I'm worried about problems like this. Difficulty/time is not the concern. Unknown-unknowns are the concern.

Upgrading our NodeJS version takes about 10 seconds with Elastic Beanstalk on AWS, but we're not on the latest version, even though it seems to run without issue and our tests pass. Why? Because maybe it'll expose some race condition that would only appear in production. Worse still, maybe that race condition will affect a lot of people, some of the time, in an extremely hard-to-reproduce way. Maybe it'll break something that isn't apparent for weeks.

We had a bug where some SMS notifications weren't being sent in production, and it hurt our retention metrics for a couple of months before one of our engineers stumbled across it. It was caused by a maintenance upgrade of one of our very few dependencies. In a small company or on a new product, you don't have the kind of bulletproof reporting that tells you when something like that breaks - your metrics move around quite a bit by default.

Code that has been working for ages without issue is code I'm not interested in changing unless I absolutely have to. Leaving alone code that hasn't changed in forever is sometimes the best way to ensure stability.

This all being said, I agree that this worry can be carried too far. Like I said - I've gone back and forth on this. Fundamentally it's a tricky line to walk.



It sounds like the grungy work you'd need to look into is a canary deploy with a production traffic duplicator. Spin up a version N+1 in AWS, mirror all the traffic N is getting onto it, hook N+1 up only to mocks so it can't cause side effects, and observe its behavior.

If your monitoring, alerting, and quotas are set up right, you'll know whether the version update is OK.
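
A minimal sketch of the traffic-duplicating piece, assuming a plain Node HTTP service (TypeScript here since the parent is on NodeJS; the hostnames and ports are placeholders, not anything Elastic Beanstalk gives you): a tiny proxy that forwards each request to version N and fires an identical copy at the N+1 canary, returning only N's response so the canary can never affect real users.

    // shadow-proxy.ts - request-mirroring proxy sketch, not production-ready.
    // Assumes N and N+1 are plain HTTP services; hostnames/ports are made up.
    import http from "node:http";

    const PRIMARY = { host: "app-n.internal", port: 3000 };  // version N, serves real users
    const CANARY = { host: "app-n1.internal", port: 3000 };  // version N+1, wired to mocks

    function forward(
      target: { host: string; port: number },
      req: http.IncomingMessage,
      body: Buffer,
    ): Promise<http.IncomingMessage> {
      return new Promise((resolve, reject) => {
        const out = http.request(
          { ...target, method: req.method, path: req.url, headers: req.headers },
          resolve,
        );
        out.on("error", reject);
        out.end(body);
      });
    }

    http
      .createServer((req, res) => {
        const chunks: Buffer[] = [];
        req.on("data", (c) => chunks.push(c));
        req.on("end", async () => {
          const body = Buffer.concat(chunks);

          // Fire-and-forget copy to the canary; its response (or failure) never reaches the user.
          forward(CANARY, req, body).catch((err) =>
            console.error("canary error:", err.message),
          );

          // Only the primary's response goes back to the client.
          try {
            const upstream = await forward(PRIMARY, req, body);
            res.writeHead(upstream.statusCode ?? 502, upstream.headers);
            upstream.pipe(res);
          } catch {
            res.writeHead(502).end("primary unavailable");
          }
        });
      })
      .listen(8080, () => console.log("shadow proxy listening on :8080"));

If you're already fronting the app with nginx, its mirror directive does the same job without custom code; the sketch is just to make the shape of the thing concrete.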

To me that is very scary work, though. It's difficult and risky in itself, and this article's point applies all over again (at least to me).


A canary works for deploys that break immediately. If the breakage only manifests, or is only caught, days or a week later, then you need a different plan.


You can also collect code coverage statistics from this canary deploy. I don't know if it's possible on Elastic Beanstalk, but I've done it with JVM apps, where you can connect a debugger to the process remotely. Keep running the canary until 100% of code paths are hit; if a code path isn't hit, find out why and repeat.

It's also a great way to empirically find dead code.
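
As a rough sketch of the same idea for a Node app (since the parent is on NodeJS), using the built-in inspector's precise-coverage API - the port and endpoint here are made up, and you'd wire this into the canary's entrypoint:

    // coverage-endpoint.ts - pull V8 precise coverage from a running canary (sketch).
    // Rough Node analogue of attaching to a JVM remotely; port and path are placeholders.
    import http from "node:http";
    import inspector from "node:inspector";

    const session = new inspector.Session();
    session.connect();

    // Start counting which functions/blocks actually execute, with call counts.
    session.post("Profiler.enable");
    session.post("Profiler.startPreciseCoverage", { callCount: true, detailed: true });

    function takeCoverage(): Promise<unknown> {
      return new Promise((resolve, reject) => {
        session.post("Profiler.takePreciseCoverage", (err, coverage) => {
          if (err) reject(err);
          else resolve(coverage); // per-script, per-function hit counts
        });
      });
    }

    // Poll this while the canary soaks up mirrored traffic; functions still at
    // count 0 after enough real traffic are dead-code (or never-exercised-path) candidates.
    http
      .createServer(async (req, res) => {
        if (req.url === "/coverage") {
          try {
            const coverage = await takeCoverage();
            res.writeHead(200, { "content-type": "application/json" });
            res.end(JSON.stringify(coverage));
          } catch (err) {
            res.writeHead(500).end(String(err));
          }
        } else {
          res.writeHead(404).end();
        }
      })
      .listen(9230, () => console.log("coverage endpoint on :9230"));

Mapping the script URLs in the result back to your source files gives you the empirical dead-code list.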


I'm trying to build something like this with Intel PT, and it's great. I used to do it with a patched libgcov, but now it's even better: counters for every code branch, notifications when a new path is taken once N% coverage has been reached, liveness info about periodic tasks and I/O threads, and execution times too.


I think it's probably the best way to be "sure" about your code, and unfortunately it's an area that's under-tooled.



