> Benchmarking and profiling doesn't help with that.
I've learned that benchmarking and profiling is the _only_ way to write performant code.
In code review I've seen a number of cases where a very fancy algorithm was broken out, and I've had to ask, "You realize N is bounded to at most 100 here?" Or, "You realize the thread overhead for parallel processing here is two orders of magnitude slower than serial data access on one thread?"
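To make that second point concrete, here's the kind of ten-minute check I mean before anyone commits to a parallel version. This is a rough sketch with made-up sizes and a hand-rolled timing loop, not a rigorous benchmark, but at N = 100 the fork/join overhead of the parallel stream typically swamps the actual work:

```java
import java.util.stream.IntStream;

public class SmallNParallelism {
    public static void main(String[] args) {
        int[] data = IntStream.range(0, 100).toArray(); // N is bounded at 100

        // Time the serial version over many iterations.
        long serialStart = System.nanoTime();
        long serialSum = 0;
        for (int i = 0; i < 10_000; i++) {
            serialSum = IntStream.of(data).sum();
        }
        long serialNanos = System.nanoTime() - serialStart;

        // Time the "fancy" parallel version; thread coordination dominates at this size.
        long parallelStart = System.nanoTime();
        long parallelSum = 0;
        for (int i = 0; i < 10_000; i++) {
            parallelSum = IntStream.of(data).parallel().sum();
        }
        long parallelNanos = System.nanoTime() - parallelStart;

        System.out.printf("serial:   %d ns total (sum=%d)%n", serialNanos, serialSum);
        System.out.printf("parallel: %d ns total (sum=%d)%n", parallelNanos, parallelSum);
    }
}
```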
Humans are bad at intuitively understanding where the slow parts of code are. I've seen processing optimized to the point of being impossible to grok, shaving 10ms off a step that takes 50ms, only for the program to then block waiting on network transfers that take 10s.
I'm of the opinion that the biggest performance improvements are usually in design and architecture. What if there were a design that avoided the network IO entirely? Then the 10s + 50ms process becomes a 50ms process, rather than a hard-to-grok "optimized" 10s + 40ms process. Simple code leads to simpler design, which is easier to reason about, and that makes it easier to spot things like "this entire network round trip can be cut out", or "we load this data multiple times throughout the process; we could load it once", or "we load this data and then spend a lot of time on N+1 queries; if we stored it in a different format with some pre-processing we'd avoid the N+1 queries entirely."
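For the N+1 case specifically, the fix is usually a change in data shape rather than a cleverer loop: load the related rows once, index them in memory, and join locally. A minimal sketch in Java; the `Order`/`Customer`/`CustomerRepository` names are invented for illustration:

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical types, just to show the shape of the problem.
record Order(long id, long customerId) {}
record Customer(long id, String name) {}

interface CustomerRepository {
    Customer findById(long id);                     // one query per call -> N+1 pattern
    List<Customer> findByIds(Collection<Long> ids); // one batched query
}

class OrderReport {
    // N+1 version: one network round trip per order.
    static Map<Order, Customer> slow(List<Order> orders, CustomerRepository repo) {
        Map<Order, Customer> result = new HashMap<>();
        for (Order o : orders) {
            result.put(o, repo.findById(o.customerId()));
        }
        return result;
    }

    // Batched version: load once, index in memory, then join locally.
    // Assumes every referenced customer exists.
    static Map<Order, Customer> fast(List<Order> orders, CustomerRepository repo) {
        Set<Long> ids = orders.stream().map(Order::customerId).collect(Collectors.toSet());
        Map<Long, Customer> byId = repo.findByIds(ids).stream()
                .collect(Collectors.toMap(Customer::id, Function.identity()));
        return orders.stream()
                .collect(Collectors.toMap(Function.identity(), o -> byId.get(o.customerId())));
    }
}
```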
To further rant: the emphasis on algorithms in coding interviews, and people enjoying algorithms more than cleaning up crufty architecture, is the root of a lot of bad software. In sum, it's almost always the design that is slow, rarely the algorithm. Profiling is key because it lets you know where things are actually slow.

Recently a colleague was trying to optimize a tight loop that processed 1.5M rows. To "optimize" memory usage, they converted all the local variables to statics to 'save' memory and avoid GC pauses. This did _nothing_: the compiler was going to inline those variables anyway, and the resulting bytecode wouldn't have contained any extra variables at all. Converting the locals to statics actually made memory usage slightly worse. So the 'optimization' did nothing but make the code worse, and a quick benchmark would have shown it had no effect. (To really optimize memory usage, we updated the design to stream results to a file rather than keep everything in memory for one big dump at the very end.)

Another example: I once helped a team with performance work on a DB they had spent a year tuning. They did not keep track of any benchmarks or which changes produced which improvements, and after a year they had nothing to show except a DB that would crash after a few minutes. Taking that over, starting from scratch, and benchmarking everything, the project was done a month later and was stupid fast.
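The streaming change at the end of the first story is easy to picture: write each processed row as you go instead of holding 1.5M results for one final dump. A rough sketch, with the per-row processing stubbed out since the real details aren't the point:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class RowExport {
    // Memory-heavy version: every processed row stays live until the final write.
    static void dumpAtEnd(Iterator<String> rows, Path out) throws IOException {
        List<String> processed = new ArrayList<>();
        while (rows.hasNext()) {
            processed.add(process(rows.next())); // 1.5M entries held in memory
        }
        Files.write(out, processed);
    }

    // Streaming version: each row is written immediately and becomes garbage right away.
    static void streamToFile(Iterator<String> rows, Path out) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(out)) {
            while (rows.hasNext()) {
                writer.write(process(rows.next()));
                writer.newLine();
            }
        }
    }

    private static String process(String row) {
        return row.trim(); // placeholder for the real per-row work
    }
}
```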
Disagree; once you know what you're looking for you can thread the needle pretty easily. I've worked in high-performance areas for most of my career, and it's pretty wild that when I solve a leetcode problem for fun I can routinely get into the 99th percentile on speed and memory usage just from knowing what to avoid.
I think you are the golf ball balanced perfectly on an upside-down bowl: optimal, but an unstable solution. Most engineers don't yet know what to avoid because they don't have enough experience (and most engineers haven't worked in high-performance areas for most of their career), so they need the benchmarks and profiling.
Plus, benchmarks are useful simply to be able to show your manager that yes, spending 3 weeks on that refactor was indeed worth it. Engineers shouldn't have to do that, but it's often useful nonetheless.
It's important to make sure your code base isn't doing anything dumb, like O(n^2) iterations, or 1000 sequential RPCs instead of 1 batch RPC, etc. -- this is, I guess, your point about architecture.

But assuming that bar is cleared, performance optimizations should only be accepted when accompanied by benchmarks/profiles that demonstrate their usefulness at the whole-system level.
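On the JVM that usually means attaching a small JMH benchmark to the change rather than a hand-rolled timing loop. A minimal sketch; the two benchmarked methods are just stand-ins for the before/after code paths of whatever the optimization touches, and you'd run it through the usual JMH runner or build plugin:

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
@Fork(1)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
public class RefactorBenchmark {
    private int[] data;

    @Setup
    public void setup() {
        data = new int[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
    }

    // Stand-in for the code path before the change.
    @Benchmark
    public long before() {
        long sum = 0;
        for (int value : data) sum += value;
        return sum;
    }

    // Stand-in for the code path after the change; compare the reported averages.
    @Benchmark
    public long after() {
        long sum = 0;
        for (int i = data.length - 1; i >= 0; i--) sum += data[i];
        return sum;
    }
}
```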