> And lastly, this isn’t meant to be a prescriptive post,
> but we all know why micro-optimizing is usually a
> mistake: it wastes your time, it’s easy to screw up
> (see question 5), and it *typically* produces no
> measurable speedup.
I emphasized the "typically" because this is a key idea: you are fallible -- deal with it!
I visited a local hacker-space with a friend of mine to talk about a project of ours. A bunch of those guys are from NASA and they know quite a bit about how to engineer for reliability. Since our project involved a lot of soldering, he suggested that we estimate our average solder joint failure rate and use that as the number of expected failures to look for. That's what he does in his personal projects.
I exclaimed that was brilliant, then was a little embarrassed. Of course! Duh! I know I'm not perfect, and I'm smart enough to have deduced that I have something like an average failure rate for solder joints. So why have I been proceeding for years and years as if I could do everything perfectly if I just tried hard enough?
Every new function you write is a risk. Minimize your risks. Maximize the cost/benefit. Every routine you optimize is a risk. Minimize your risks. Maximize cost/benefit.
One's problems as a programmer are more likely the result of excess hubris and not insufficient smarts.
EDIT: Fear is as big a problem as hubris. Walk the middle path. You are your own optimization problem, and the optimal solution is almost never a no-brainer shove of a lever all the way in some direction.
That's awesome. I'm always impressed by the reliability work that comes out of NASA. I don't regret a single penny of my tax dollars which goes to them just for this sort of thing alone.
ridiculousfish is the smartest programmer I've ever met. I hired him at Apple years ago, and while I could regale you with tales of how awesome he is, Apple would probably sue me.
That said, if you've ever played World of Warcraft on the Mac, you should worship him as a god.
Sounds like the kind of guy I want to work with on a daily basis, even if he spent half the time yelling at me for doing something stupid. After a few months I'd be such a better programmer.
This dude is pretty much my dream cofounder. Math background, aesthetically intuitive, intellectually curious, specifically interested in psychology, and clearly a talented programmer.
Shame about the blog chrome. Seriously, all I had above the fold was the fish picture, and until I looked at the scrollbar I had nothing to indicate that the page had any content; I assumed it was some sort of flash game.
Also within the post the code overflows the code boxes for me (I'm 2 clicks up on font size).
I avoid these types of optimizations, say, 97% of the time
This reminded me of Linus' blog post on optimizing SHA1 routines in Git [0]. FTA: "Some people seem to think that C is a real programming language, but they are sadly mistaken. It really is about writing almost-portable assembly language, and it turns out that getting good results from SHA1 really is mostly about trying to fight the compiler's tendency to try to be clever."
There is a self-tuning linear algebra library called ATLAS. When building, it probes various things about your computer: L1/L2 cache sizes, whether you have fused multiply/add, etc. It uses these to output very highly tuned matrix kernels in standard C.
Ironically, when compiling this C code, you have to use -O; higher optimization levels make the compiler try to be too smart for its own good and actually perform worse.
"Improving the Performance of the Secure Hash Algorithm" (by transforming the algorithm mathematically and then using hand-written assembler with SIMD instructions):
A long, long time ago, I recall that someone did a study of code generated by C compilers and found that, on average, 1 C statement = 2.5 lines of assembly (for CISC).
> GCC does this optimization, because strlen is a "built-in function": one that gcc recognizes and optimizes specially. Disabling built-ins with -fno-builtin defeats this optimization.
No, it does it because strlen is declared to be a pure function, and it is smart enough to realize you are not modifying anything in s. -fno-builtin does not change the generated code, it still calls the strlen function (at least on my gcc), but it is hoisted out of the loop.
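Something like this is the shape of the loop in question (my own sketch, not the exact code from the post): s is never written inside the loop and strlen is pure, so gcc can hoist the call out.

#include <string.h>

int sum(const char *s)
{
    int result = 0;
    for (size_t i = 0; i < strlen(s); i++)  /* strlen can be hoisted out of the loop */
        result += s[i];
    return result;
}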
GCC surprised me once. I had put some effort into replacing something like (x << 12) | (x>>20) with the appropriate roll instruction at the assembly level. Only after seeing that the "optimized" code ran no faster than plain GCC did I look at the GCC output, and of course the roll was already there.
I believe it was GCC 3.*, so even older than the one mentioned in the article.
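For reference, the idiom looks something like this (a sketch, not my original code); since 12 + 20 == 32 it is exactly a 32-bit rotate, which gcc pattern-matches into a single roll on x86:

#include <stdint.h>

static inline uint32_t rotl12(uint32_t x)
{
    return (x << 12) | (x >> 20);  /* recognized as rotate-left by 12 */
}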
Ha, I really like this guy. I started reading another one of his blog entries (a description of how compilers often optimise a divide by a constant where, instead of issuing an expensive divide, it multiplies by a "magic" number and then right shifts the result). After hitting lots of complex (for me anyway) mathematical proofs, my brain tuned out slightly until I saw
"Did you skip ahead? It's OK. Here's the summary"
I like to think he deliberately did that for hungover-yet-still-interested devs like myself.
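The trick itself is easy to demo (a small sketch for unsigned division by 3; the magic constant 0xAAAAAAAB is 2^33/3 rounded up, so the multiply-and-shift gives exactly x / 3 for every 32-bit x):

#include <assert.h>
#include <stdint.h>

static uint32_t div3(uint32_t x)
{
    /* multiply by the precomputed reciprocal, then shift, instead of dividing */
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}

int main(void)
{
    for (uint32_t x = 0; x < 1000000u; x++)
        assert(div3(x) == x / 3);
    return 0;
}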
Well, I never read proofs on the first pass. I mean, I read each expression, but I don't bother trying to assimilate them into a coherent thought until I've finished any plain-English associated text.
And even that I'll sometimes wait to properly comprehend for a second pass.
Question number 2 really surprised me. What if the layout of memory is:
(s)(s+1)(s+2)...(result)
If `s` has no null bytes until `result`, then `strlen(s)` will terminate inside `result`. However, if you change `result` while iterating over `s`, the byte it terminated on may no longer be null and the `strlen()` result will change. Or is there some rule in the C standard that says local variables are not in the same memory range as external pointers?
Certain pointer operations may result in undefined behavior:
int main(void)
{
    int arr[4] = {0, 1, 2, 3};
    int *p = arr + 5; // undefined behavior: more than one past the end
    (void)p;          // silence the unused-variable warning
    return 0;
}
In short, you can't assume that result will be placed after the array. What if result is optimized to a register? It actually sounds quite likely to me given the tight loop.
If you're going to be pedantic, it is best to be right. I don't have a copy of the C standard here, but I do have a copy of the C++ standard. From 5.7 Additive Operators (discussing an expression of the form P+N, where P has pointer type and N has integral type):
"If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined."
That is, if either P or P+N don't point to the same array object, or one past the end of the array object, the result is undefined. Thus,
int arr[4] = {0, 1, 2, 3};
... arr + 4 ... // OK, points to one past the end of the array object
... arr + 5 ... // undefined
This is one instance where C++ and C are, indeed, the same. However, the languages do have their differences, so I would caution those developers who don't tend to read standards not to assume that C and C++ will behave identically when given valid C code. Obviously, having referenced the C++ standard yourself, you are aware of such caveats.
6.5.6 paragraph 8 from the ISO/IEC 9899:201x draft of March 1, 2009:
...If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined....
Since "result" is on the stack it will only be allocated at the time of the function call. Also note that the stack grows downwards in memory. To get the memory layout you've shown you'd have to be summing up the bytes in the unallocated memory between the heap and the top of the stack. Probably this is explained somehow in the standard but I don't personally know.
If you've managed to embed an integer into your char array and you are using that integer as an accumulator, you're probably "doing it wrong", very wrong. More to the point, you've got bigger problems than gcc's optimization specifics.
As for this particular example, result is allocated on the stack and is guaranteed to not lie within s, so the gcc optimization is perfectly safe.
I'm not saying this is useful for anything; I'm just asking about the correctness of such an optimisation. You can set the pointer to any address, which includes the future stack. Also, nothing can 'lie within a pointer' really; it might lie inside an array.
It might seem silly in this example, but as soon as I change the program to a very similar one:
> You can set the pointer to any address, which includes future stack.
This is wrong. Attempting to use pointers to memory outside of what your compiler and malloc allocated is undefined behavior.
And indeed, aliasing issues like in your example will prevent gcc from optimizing out excess strlen; it can only do so if it can prove that both global memory and the buffer passed to strlen have not been modified. Local variables and return values cannot alias anything, and part of the guarantee of pure functions like strlen is that they have no side effects.
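For instance (a hypothetical sketch of the kind of aliasing meant, not the parent's actual program), something like this blocks the hoist, because the write through t may shorten the string:

#include <string.h>

int sum_and_clear(const char *s, char *t)
{
    int result = 0;
    for (size_t i = 0; i < strlen(s); i++) {  /* strlen must be re-evaluated each time */
        result += s[i];
        t[i] = 0;  /* t may alias s, so this write can change the string's length */
    }
    return result;
}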
By any chance, does this sort of information exist at all for Java bytecode compilers? Whenever I write Java code I try to do things the Java way but it usually makes me cringe to think that things like accessing a boolean field requires a function call.
I console myself by thinking that this all gets optimized away by the JIT compiler, but I'd really like to know for sure.
I think the fact that I got the first one wrong made me err too far into thinking GCC was smarter than I had first assumed. And it is, but only for certain things.
Does 'const' mean that the function is not allowed to change the contents of the string, or does it mean that the contents of the string will never change? Suppose the 'sum' function was running in one thread, while in another the string that was passed in as 's' is being continuously written to. While this may not make much sense to do, and the output would not have a reliable value, this optimization would be the wrong thing to do.
const only means that the function is not allowed to change memory through that reference. It says nothing about aliases to the same memory.
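For example (a minimal sketch of that point):

#include <stdio.h>

static char buffer[] = "abc";

static void peek(const char *s)
{
    buffer[0] = 'z';       /* legal: buffer itself is not const */
    printf("%c\n", s[0]);  /* prints 'z' -- s is an alias for buffer */
}

int main(void)
{
    peek(buffer);
    return 0;
}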
But the "what if another thread changes the string" issue is a red herring. The question to keep in mind when optimizing in a multi-threaded context is "does the output of the compiler-generated code correspond to at least one possible interleaving of threads?" Since this thread doesn't do any synchronization, it's possible that the whole function executes without being interrupted by another thread, so this is a sound optimization.
In a modern CPU, instruction selection is much more complicated than "oh, addl (or sll) is faster than umul," and depends on the micro-architecture/pipeline, surrounding code, etc. Addition usually takes one cycle, while multiplication might take several cycles (although, by pipelining, you can generally execute one multiply per cycle per multiplier). But if all the adders are busy and the result isn't needed for a few cycles, it might be fastest to use the multiplier. (I'm thinking of a statically scheduled pipeline. Things get harder when the CPU does dynamic instruction reordering.)
Do any compilers attempt to take those sorts of concerns into account? It seems like it'd be an interesting experiment, but it breaks some of the nice abstractions--- afaik, strength-reduction is usually done before scheduling, with the only architecture dependency being a table of costs assigned to each operation. Letting scheduling constraints have input on those decisions seems ideal but tricky, unless I'm missing something.
It would be very hard to accurately account for a microprocessor's out-of-order instruction issue in a compiler.
A modern desktop-class microprocessor can have scores of instructions "in flight" at once. Each new instruction that's decoded gets reordered or stalled differently depending on the status of all of those previous instructions and their hardware requirements.
A compiler could try to model the processor's microarchitecture and simulate how it would execute the program, but that simulation would be inaccurate because of data-dependent branching, interrupts, and other unpredictable behaviors. Plus, the microarchitectural details you'd need to create such a simulation are usually not available.
That said, compilers can still tailor code to the microarchitecture in a general way. For example, if you know a processor can only decode one branch instruction per cycle, you can try to reorder the compiled code to avoid back-to-back branches in the instruction stream.
> It would be very hard to accurately account for a microprocessor's out-of-order instruction issue in a compiler.
This is true. However, a lot of modern processors do in-order execution; they just aren't made by Intel or AMD for desktops: GPUs, network processors, SIMD DSPs, etc. When I was hacking compilers (6 years ago), compiling for these processors presented difficult challenges, and there was a real market for new compiler technology. One project was for a network processor that was not only in-order but also not interlocked: correctness depended on the compiler having an exact model of the processor's pipeline. Some registers were not really registers but latches. They got new values every cycle, so if you didn't grab a value on the right cycle, it was gone forever.
The phase ordering problem in compilers is notoriously hard. You wouldn't worry about the interaction between strength reduction and scheduling, but in other cases you do need to take these interactions into account. In my compiler (aka a compiler I worked on), the software pipeliner estimated register pressure, since you don't want to over-pipeline to expose parallelism if you blow the register set and cause the allocator to spill. Often register allocation is sensitive to the schedule (and there have been attempts to combine them into one pass.)
For x+x vs x*2 vs x<<1, the trick is to represent them in a uniform way and let the scheduler decide which instruction to emit based on the surrounding schedule.
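For example (a small sketch; one would expect gcc to emit identical code for all three, choosing add, shift, or lea late, once the schedule is known):

int twice_add(int x)   { return x + x;  }
int twice_mul(int x)   { return x * 2;  }
int twice_shift(int x) { return x << 1; }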
> Often register allocation is sensitive to the schedule (and there have been attempts to combine them into one pass.)
Yes -- for what it's worth, the compiler I'm working on at the moment has a pre-allocation scheduling pass and a post-allocation scheduling pass, and will also sometimes reschedule during allocation to shorten live ranges and thus ease register pressure.
Addition and bitshifting are easier to implement in hardware than multiplication is, with less complicated circuits that can return results in fewer clock cycles (though the difference is much less than it used to be).
Nice post. This is how the Nullstone compiler optimization benchmark suite is organized: two functionally equivalent versions of a C program, one cleaner/suboptimal, the other optimized by hand. The goal is to have the generated code perform the same in either case.
So would moving the floating point value into an integer register, shifting, incrementing, ORing, etc. ever be faster than a full FP addition? I imagine that modern desktop CPUs have effectively-single-cycle multiply, but what about something like ARM with VFP but no Neon?
Only in architectures where FP and integer values are kept in the same registers. This is true in SSE2 and Altivec, so you can do integer operations there - but since SIMD integer operations are limited too, it's pretty much only useful for flipping the sign bit.
Moving values between different register sets is INCREDIBLY slow since it involves at least two memory operations.
And faking FP operations with specialized integer code on something with soft-float like ARM might be worth it. I've never done it so I can't really say.
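For the sign-bit case, the idea is just this (a portable sketch; real code would keep the value in an XMM register rather than bounce it through memory):

#include <stdint.h>
#include <string.h>

static float negate_via_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);  /* type-pun safely via memcpy */
    bits ^= 0x80000000u;             /* flip the IEEE 754 sign bit */
    memcpy(&x, &bits, sizeof x);
    return x;
}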
I got almost all of it wrong. I'm not very familiar with gcc's workings and usually just figure that the compiler is going to be dumb and fail at everything but the simplest optimizations. But then, when I actually go to do the tedious and source-obfuscating micro stuff I usually check my assumptions first - you kind of have to anyway, once you go that deep.
The point of most of the optimizations mentioned in the article is not to make programs run faster; e.g. any competent programmer can hoist constant functions out of a for loop himself, but knowing that the compiler will do this allows you to write cleaner code.
(This does not apply to e.g. auto-vectorization, which replaces 'x[0] = y[0] * z[0]; x[1] = y[1] * z[1]; ...' with a single (SSE?) instruction which calculates all of that at once. Such instructions are much faster and would require dropping down to assembly without a good optimizer.)
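A loop like the following (a small sketch) is the kind the auto-vectorizer handles; with -O3 or -ftree-vectorize gcc can turn it into SSE multiplies that process several elements per instruction:

void vec_mul(float *x, const float *y, const float *z, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = y[i] * z[i];
}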
> when I actually go to do the tedious and source-obfuscating micro stuff I usually check my assumptions first - you kind of have to anyway, once you go that deep.
Absolutely. Fun to read, made me smarter, but half of it could be different in any other version of gcc.
And frankly, if these are your problems, you're doing pretty well. I worked on a product that had some numerical parts, which were sometimes the limiting factor in performance, though usually not. So one day I was talking to a guy who wrote the numerical code about why regression testing was difficult and subtle, and it was revealed that our numerical results could differ from build to build. I.e., a single build of the product always produced the same results, but if you changed the floating point code, the new build might produce different (though equally accurate) results, even if semantically the program was supposed to perform the same C math operations on the same data.
Which makes perfect sense if you're using the old stack-based x87-style floating point instructions. We couldn't be... not in the 21st century on Opterons... check Makefile and gcc man page... biggest facepalm of my career.
We were able to announce a nice speedup on some workloads in the next release.
He could have added the switch statement optimization; that's my favorite. GCC can optimize switch statements with sequential keys, e.g. (1, 2, 3, 4, ...), into a jump table. AFAIK this was introduced in GCC 4.
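For example (a small sketch of the dense-case pattern): sequential case values like these let gcc lower the switch to a bounds check plus an indirect jump through a table, instead of a chain of compares.

const char *day_name(int d)
{
    switch (d) {
    case 1: return "Mon";
    case 2: return "Tue";
    case 3: return "Wed";
    case 4: return "Thu";
    case 5: return "Fri";
    case 6: return "Sat";
    case 7: return "Sun";
    default: return "?";
    }
}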
#2 was a mistake I made, and a co-worker screamed at me that I was a crappy programmer because of it. Now that I know it's optimized, I feel like he owes me an apology... or maybe not :-)