> And lastly, this isn’t meant to be a prescriptive post,
> but we all know why micro-optimizing is usually a
> mistake: it wastes your time, it’s easy to screw up
> (see question 5), and it *typically* produces no
> measurable speedup.
I emphasized the "typically" because this is a key idea: you are fallible -- deal with it!
I visited a local hacker-space with a friend of mine to talk about a project of ours. A bunch of those guys are from NASA and they know quite a bit about how to engineer for reliability. Since our project involved a lot of soldering, he suggested that we estimate our average solder joint failure rate and use that as the number of expected failures to look for. That's what he does in his personal projects.
I exclaimed that was brilliant, then was a little embarrassed. Of course! Duh! I know I'm not perfect, and I'm smart enough to have deduced that I have something like an average failure rate for solder joints. So why have I been proceeding for years and years as if I could do everything perfectly if I just tried hard enough?
Every new function you write is a risk. Minimize your risks. Maximize the cost/benefit. Every routine you optimize is a risk. Minimize your risks. Maximize cost/benefit.
One's problems as a programmer are more likely the result of excess hubris and not insufficient smarts.
EDIT: Fear is as big a problem as hubris. Walk the middle path. You are your own optimization problem, and the optimal solution is almost never a no-brainer shove of a lever all the way in some direction.
That's awesome. I'm always impressed by the reliability work that comes out of NASA. I don't regret a single penny of my tax dollars which goes to them just for this sort of thing alone.
ridiculousfish is the smartest programmer I've ever met. I hired him at Apple years ago, and while I could regale you with tales of how awesome he is, Apple would probably sue me.
That said, if you've ever played World of Warcraft on the Mac, you should worship him as a god.
Sounds like the kind of guy I want to work with on a daily basis, even if he spent half the time yelling at me for doing something stupid. After a few months I'd be such a better programmer.
This dude is pretty much my dream cofounder. Math background, aesthetically intuitive, intellectually curious, specifically interested in psychology, and clearly a talented programmer.
Shame about the blog chrome. Seriously, all I had above the fold was the fish picture, and until I looked at the scrollbar I had nothing to indicate that the page had any content; I assumed it was some sort of flash game.
Also within the post the code overflows the code boxes for me (I'm 2 clicks up on font size).
I avoid these types of optimizations, say, 97% of the time
This reminded me of Linus' blog post on optimizing SHA1 routines in Git [0]. FTA: "Some people seem to think that C is a real programming language, but they are sadly mistaken. It really is about writing almost-portable assembly language, and it turns out that getting good results from SHA1 really is mostly about trying to fight the compiler's tendency to try to be clever."
There is a self-tuning linear algebra library called ATLAS. When building, it probes various things about your computer: L1/L2 cache sizes, whether you have fused multiply/add, etc. It uses these to output very highly tuned matrix kernels in standard C.
Ironically, when compiling this C code, you have to use -O; higher optimization levels make the compiler try to be too smart for its own good and actually perform worse.
"Improving the Performance of the Secure Hash Algorithm" (by transforming the algorithm mathematically and then using hand-written assembler with SIMD instructions):
A long, long time ago, I recall that someone did a study of code generated by C compilers and found that, on average, 1 C statement = 2.5 lines of assembly (for CISC).
> GCC does this optimization, because strlen is a "built-in function": one that gcc recognizes and optimizes specially. Disabling built-ins with -fno-builtin defeats this optimization.
No, it does it because strlen is declared to be a pure function, and it is smart enough to realize you are not modifying anything in s. -fno-builtin does not change the generated code, it still calls the strlen function (at least on my gcc), but it is hoisted out of the loop.
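Something like this is the shape of the loop in question (my own sketch, not the exact code from the post): s is never written inside the loop and strlen is pure, so gcc can hoist the call out.

#include <string.h>

int sum(const char *s)
{
    int result = 0;
    for (size_t i = 0; i < strlen(s); i++)  /* strlen can be hoisted out of the loop */
        result += s[i];
    return result;
}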
GCC surprised me once. I had put some effort into replacing something like (x << 12) | (x>>20) with the appropriate roll instruction at the assembly level. Only after seeing that the "optimized" code ran no faster than plain GCC did I look at the GCC output, and of course the roll was already there.
I believe it was GCC 3.*, so even older than the one mentioned in the article.
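For reference, the idiom looks something like this (a sketch, not my original code); since 12 + 20 == 32 it is exactly a 32-bit rotate, which gcc pattern-matches into a single roll on x86:

#include <stdint.h>

static inline uint32_t rotl12(uint32_t x)
{
    return (x << 12) | (x >> 20);  /* recognized as rotate-left by 12 */
}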
Ha, I really like this guy. I started reading another one of his blog entries (a description of how compilers often optimise a divide by a constant where, instead of issuing an expensive divide, it multiplies by a "magic" number and then right shifts the result). After hitting lots of complex (for me anyway) mathematical proofs, my brain tuned out slightly until I saw
"Did you skip ahead? It's OK. Here's the summary"
I like to think he deliberately did that for hungover-yet-still-interested devs like myself.
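The trick itself is easy to demo (a small sketch for unsigned division by 3; the magic constant 0xAAAAAAAB is 2^33/3 rounded up, so the multiply-and-shift gives exactly x / 3 for every 32-bit x):

#include <assert.h>
#include <stdint.h>

static uint32_t div3(uint32_t x)
{
    /* multiply by the precomputed reciprocal, then shift, instead of dividing */
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}

int main(void)
{
    for (uint32_t x = 0; x < 1000000u; x++)
        assert(div3(x) == x / 3);
    return 0;
}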
Well, I never read proofs on the first pass. I mean, I read each expression, but I don't bother trying to assimilate them into a coherent thought until I've finished any plain-English associated text.
And even that I'll sometimes wait to properly comprehend for a second pass.
Question number 2 really surprised me. What if the layout of memory is:
(s)(s+1)(s+2)...(result)
If `s` has no null bytes until `result`, then `strlen(s)` will terminate inside `result`. However, if you change `result` while iterating over `s`, the byte it terminated on may no longer be null and the `strlen()` result will change. Or is there some rule in the C standard that says local variables are not in the same memory range as external pointers?
Certain pointer operations may result in undefined behavior:
int main(void)
{
    int arr[4] = {0, 1, 2, 3};
    int *p = arr + 5; // undefined behavior: more than one past the end
    (void)p;          // silence the unused-variable warning
    return 0;
}
In short, you can't assume that result will be placed after the array. What if result is optimized to a register? It actually sounds quite likely to me given the tight loop.
If you're going to be pedantic, it is best to be right. I don't have a copy of the C standard here, but I do have a copy of the C++ standard. From 5.7 Additive Operators (discussing an expression of the form P+N, where P has pointer type and N has integral type):
"If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined."
That is, if either P or P+N don't point to the same array object, or one past the end of the array object, the result is undefined. Thus,
int arr[4] = {0, 1, 2, 3};
... arr + 4 ... // OK, points to one past the end of the array object
... arr + 5 ... // undefined
This is one instance where C++ and C are, indeed, the same. However, the languages do have their differences, so I would caution those developers who don't tend to read standards not to assume that C and C++ will behave identically when given valid C code. Obviously, having referenced the C++ standard yourself, you are aware of such caveats.
6.5.6 paragraph 8 from the ISO/IEC 9899:201x draft of March 1, 2009:
...If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined....
Since "result" is on the stack it will only be allocated at the time of the function call. Also note that the stack grows downwards in memory. To get the memory layout you've shown you'd have to be summing up the bytes in the unallocated memory between the heap and the top of the stack. Probably this is explained somehow in the standard but I don't personally know.
If you've managed to embed an integer into your char array and you are using that integer as an accumulator, you're probably "doing it wrong", very wrong. More to the point, you've got bigger problems than gcc's optimization specifics.
As for this particular example, result is allocated on the stack and is guaranteed to not lie within s, so the gcc optimization is perfectly safe.
I'm not saying this is useful for anything; I'm just asking about the correctness of such an optimisation. You can set the pointer to any address, which includes the future stack. Also, nothing can 'lie within a pointer' really; it might lie inside an array.
It might seem silly in this example, but as soon as I change the program to a very similar one:
> You can set the pointer to any address, which includes future stack.
This is wrong. Attempting to use pointers to memory outside of what your compiler and malloc allocated is undefined behavior.
And indeed, aliasing issues like in your example will prevent gcc from optimizing out excess strlen; it can only do so if it can prove that both global memory and the buffer passed to strlen have not been modified. Local variables and return values cannot alias anything, and part of the guarantee of pure functions like strlen is that they have no side effects.
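For instance (a hypothetical sketch of the kind of aliasing meant, not the parent's actual program), something like this blocks the hoist, because the write through t may shorten the string:

#include <string.h>

int sum_and_clear(const char *s, char *t)
{
    int result = 0;
    for (size_t i = 0; i < strlen(s); i++) {  /* strlen must be re-evaluated each time */
        result += s[i];
        t[i] = 0;  /* t may alias s, so this write can change the string's length */
    }
    return result;
}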
By any chance, does this sort of information exist at all for Java bytecode compilers? Whenever I write Java code I try to do things the Java way but it usually makes me cringe to think that things like accessing a boolean field requires a function call.
I console myself by thinking that this all gets optimized away by the JIT compiler, but I'd really like to know for sure.
I think the fact that I got the first one wrong made me err too far into thinking GCC was smarter than I had first assumed. And it is, but only for certain things.
Does 'const' mean that the function is not allowed to change the contents of the string, or does it mean that the contents of the string will never change? Suppose the 'sum' function was running in one thread, while in another the string that was passed in as 's' is being continuously written to. While this may not make much sense to do, and the output would not have a reliable value, this optimization would be the wrong thing to do.
const only means that the function is not allowed to change memory through that reference. It says nothing about aliases to the same memory.
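For example (a minimal sketch of that point):

#include <stdio.h>

static char buffer[] = "abc";

static void peek(const char *s)
{
    buffer[0] = 'z';       /* legal: buffer itself is not const */
    printf("%c\n", s[0]);  /* prints 'z' -- s is an alias for buffer */
}

int main(void)
{
    peek(buffer);
    return 0;
}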
But the "what if another thread changes the string" issue is a red herring. The question to keep in mind when optimizing in a multi-threaded context is "does the output of the compiler-generated code correspond to at least one possible interleaving of threads?" Since this thread doesn't do any synchronization, it's possible that the whole function executes without being interrupted by another thread, so this is a sound optimization.
In a modern CPU, instruction selection is much more complicated than "oh, addl (or sll) is faster than umul," and depends on the micro-architecture/pipeline, surrounding code, etc. Addition usually takes one cycle, while multiplication might take several cycles (although, by pipelining, you can generally execute one multiply per cycle per multiplier). But if all the adders are busy and the result isn't needed for a few cycles, it might be fastest to use the multiplier. (I'm thinking of a statically scheduled pipeline. Things get harder when the CPU does dynamic instruction reordering.)
Do any compilers attempt to take those sorts of concerns into account? It seems like it'd be an interesting experiment, but it breaks some of the nice abstractions--- afaik, strength-reduction is usually done before scheduling, with the only architecture dependency being a table of costs assigned to each operation. Letting scheduling constraints have input on those decisions seems ideal but tricky, unless I'm missing something.
It would be very hard to accurately account for a microprocessor's out-of-order instruction issue in a compiler.
A modern desktop-class microprocessor can have scores of instructions "in flight" at once. Each new instruction that's decoded gets reordered or stalled differently depending on the status of all of those previous instructions and their hardware requirements.
A compiler could try to model the processor's microarchitecture and simulate how it would execute the program, but that simulation would be inaccurate because of data-dependent branching, interrupts, and other unpredictable behaviors. Plus, the microarchitectural details you'd need to create such a simulation are usually not available.
That said, compilers can still tailor code to the microarchitecture in a general way. For example, if you know a processor can only decode one branch instruction per cycle, you can try to reorder the compiled code to avoid back-to-back branches in the instruction stream.
> It would be very hard to accurately account for a microprocessor's out-of-order instruction issue in a compiler.
This is true. However, a lot of modern processors do in-order execution; they just aren't made by Intel or AMD for desktops: GPUs, network processors, SIMD DSPs, etc. When I was hacking compilers (6 years ago), compiling for these processors presented difficult challenges, and there was a real market for new compiler technology. One project was for a network processor that was not only in-order but also not interlocked: correctness depended on the compiler having an exact model of the processor's pipeline. Some registers were not really registers but latches. They got new values every cycle, so if you didn't grab a value on the right cycle, it was gone forever.
The phase ordering problem in compilers is notoriously hard. You wouldn't worry about the interaction between strength reduction and scheduling, but in other cases you do need to take these interactions into account. In my compiler (aka a compiler I worked on), the software pipeliner estimated register pressure, since you don't want to over-pipeline to expose parallelism if you blow the register set and cause the allocator to spill. Often register allocation is sensitive to the schedule (and there have been attempts to combine them into one pass.)
For x+x vs x*2 vs x<<1, the trick is to represent them in a uniform way and let the scheduler decide which instruction to emit based on the surrounding schedule.
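For example (a small sketch; one would expect gcc to emit identical code for all three, choosing add, shift, or lea late, once the schedule is known):

int twice_add(int x)   { return x + x;  }
int twice_mul(int x)   { return x * 2;  }
int twice_shift(int x) { return x << 1; }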
> Often register allocation is sensitive to the schedule (and there have been attempts to combine them into one pass.)
Yes -- for what it's worth, the compiler I'm working on at the moment has a pre-allocation scheduling pass and a post-allocation scheduling pass, and will also sometimes reschedule during allocation to shorten live ranges and thus ease register pressure.
Addition and bitshifting are easier to implement in hardware than multiplication is, with less complicated circuits that can return results in fewer clock cycles (though the difference is much less than it used to be).
Nice post. This is how the Nullstone compiler optimization benchmark suite is organized: two functionally equivalent versions of a C program, one cleaner/suboptimal, the other optimized by hand. The goal is to have the generated code perform the same in either case.
So would moving the floating point value into an integer register, shifting, incrementing, ORing, etc. ever be faster than a full FP addition? I imagine that modern desktop CPUs have effectively-single-cycle multiply, but what about something like ARM with VFP but no Neon?
Only in architectures where FP and integer values are kept in the same registers. This is true in SSE2 and Altivec, so you can do integer operations there - but since SIMD integer operations are limited too, it's pretty much only useful for flipping the sign bit.
Moving values between different register sets is INCREDIBLY slow since it involves at least two memory operations.
And faking FP operations with specialized integer code on something with soft-float like ARM might be worth it. I've never done it so I can't really say.
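For the sign-bit case, the idea is just this (a portable sketch; real code would keep the value in an XMM register rather than bounce it through memory):

#include <stdint.h>
#include <string.h>

static float negate_via_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);  /* type-pun safely via memcpy */
    bits ^= 0x80000000u;             /* flip the IEEE 754 sign bit */
    memcpy(&x, &bits, sizeof x);
    return x;
}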
I got almost all of it wrong. I'm not very familiar with gcc's workings and usually just figure that the compiler is going to be dumb and fail at everything but the simplest optimizations. But then, when I actually go to do the tedious and source-obfuscating micro stuff I usually check my assumptions first - you kind of have to anyway, once you go that deep.
The point of most of the optimizations mentioned in the article is not to make programs run faster; e.g. any competent programmer can hoist constant functions out of a for loop himself, but knowing that the compiler will do this allows you to write cleaner code.
(This does not apply to e.g. auto-vectorization, which replaces 'x[0] = y[0] * z[0]; x[1] = y[1] * z[1]; ...' with a single (SSE?) instruction which calculates all of that at once. Such instructions are much faster and would require dropping down to assembly without a good optimizer.)
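A loop like the following (a small sketch) is the kind the auto-vectorizer handles; with -O3 or -ftree-vectorize gcc can turn it into SSE multiplies that process several elements per instruction:

void vec_mul(float *x, const float *y, const float *z, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = y[i] * z[i];
}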
> when I actually go to do the tedious and source-obfuscating micro stuff I usually check my assumptions first - you kind of have to anyway, once you go that deep.
Absolutely. Fun to read, made me smarter, but half of it could be different in any other version of gcc.
And frankly, if these are your problems, you're doing pretty well. I worked on a product that had some numerical parts, which were sometimes the limiting factor in performance, though usually not. So one day I was talking to a guy who wrote the numerical code about why regression testing was difficult and subtle, and it was revealed that our numerical results could differ from build to build. I.e., a single build of the product always produced the same results, but if you changed the floating point code, the new build might produce different (though equally accurate) results, even if semantically the program was supposed to perform the same C math operations on the same data.
Which makes perfect sense if you're using the old stack-based x87-style floating point instructions. We couldn't be... not in the 21st century on Opterons... check Makefile and gcc man page... biggest facepalm of my career.
We were able to announce a nice speedup on some workloads in the next release.
He could have added the switch statement optimization; that's my favorite. GCC can optimize switch statements with sequential keys, e.g. (1, 2, 3, 4, ...), into a jump table. AFAIK this was introduced in GCC 4.
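For example (a small sketch of the dense-case pattern): sequential case values like these let gcc lower the switch to a bounds check plus an indirect jump through a table, instead of a chain of compares.

const char *day_name(int d)
{
    switch (d) {
    case 1: return "Mon";
    case 2: return "Tue";
    case 3: return "Wed";
    case 4: return "Thu";
    case 5: return "Fri";
    case 6: return "Sat";
    case 7: return "Sun";
    default: return "?";
    }
}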
#2 was a mistake I made, and a co-worker screamed at me that I was a crappy programmer because of it. Now that I know it's optimized, I feel like he owes me an apology... or maybe not :-)