It's counter-intuitive that so many instructions can be dedicated to something as seemingly simple as strlen(), but it does highlight the complexity of modern processors. On that note, I do not know and am reluctant to comment on which implementation of strlen() is fastest, as it seems very difficult to decouple the non-determinism of caches, instruction scheduling, pipelines, etc. just by looking at the source, especially when other code is also involved. But there probably is a benchmark that could give some useful results -- and justify the code above -- like in the link you posted.
Someone with a better understanding of instruction scheduling, pipelines, and other optimizations can probably give a better answer to this than me.
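For what it's worth, the kind of benchmark I have in mind is roughly this (a hypothetical harness, not from the slides: it just times many calls on the same buffer with POSIX clock_gettime, and you'd swap the other strlen() variants in for the libc one):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Time `iters` calls of a strlen-like function on the same string. */
    static double bench(size_t (*f)(const char *), const char *s, int iters)
    {
        struct timespec t0, t1;
        volatile size_t sink = 0;  /* keep the calls from being optimized away */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++)
            sink += f(s);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        (void)sink;
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        size_t n = 1 << 20;        /* 1 MB string of 'a's */
        char *buf = malloc(n);
        memset(buf, 'a', n - 1);
        buf[n - 1] = '\0';

        printf("libc strlen: %f s\n", bench(strlen, buf, 1000));
        free(buf);
        return 0;
    }

It still won't remove the noise from caches and scheduling, but with a warm cache and a large repetition count it should at least be suggestive.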
Wow, those examples took me a while to go through and understand; it's amazing how far you can optimize a piece of code for the hardware. I'm curious to benchmark the implementations and see the performance difference.
Maybe you or someone else knows the answer to this though: it seems like they are processing 4 bytes of the string at a time. If they read past the end (i.e. the NULL byte is in byte position 1, 2, or 3 of the word), isn't that technically undefined behavior? They are only reading the memory, but I feel like valgrind or another tool would spit out an error if that happened. The reads are aligned, so they won't trigger a page fault, but it seems like an unsafe optimization.
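For concreteness, this is roughly what I understand the 4-bytes-at-a-time loop to be doing (a C sketch of the idea, not the slides' actual code; the real glibc routine uses fancier magic constants):

    #include <stddef.h>
    #include <stdint.h>

    size_t strlen_word(const char *s)
    {
        const char *p = s;

        /* Advance byte-by-byte until p is 4-byte aligned. */
        while ((uintptr_t)p & 3) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }

        /* Now scan a whole 32-bit word per iteration. */
        const uint32_t *w = (const uint32_t *)p;
        for (;;) {
            uint32_t v = *w;
            /* Classic bit trick: nonzero iff some byte of v is zero. */
            if ((v - 0x01010101u) & ~v & 0x80808080u) {
                p = (const char *)w;
                while (*p)          /* find which of the 4 bytes it was */
                    p++;
                return (size_t)(p - s);
            }
            w++;
        }
    }

The alignment prologue is what makes the over-read "safe" in practice: the word load can touch up to 3 bytes past the terminator, but since it's aligned those bytes are always on the same page.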
Yeah, I see your point. Like you said, because the reads are word-aligned and the page size is a multiple of 4, an over-read never crosses into a new page and so won't cause a page fault, so it's technically "ok". I can only assume that at this level the corner is safely cut for the sake of performance.
Another, more trivial example of something like this is in the repnz-based strlen() (slides 7-8), where %ecx is loaded with 0xffff ffff, which technically limits the routine to scanning strings up to 4 gigabytes in length. It's a valid assumption that the string is under 4GB (especially on a strictly 32-bit system), but the point is that it's a semantically different routine from the C-based one.
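For reference, the shape of that repnz routine written as GCC inline assembly (my reconstruction of the idea, not the slides' exact code):

    #include <stddef.h>

    static size_t strlen_scasb(const char *s)
    {
        size_t count = ~(size_t)0;     /* 0xffffffff on a 32-bit build: the 4GB cap */
        __asm__ (
            "repne scasb"              /* scan bytes at [EDI] for AL (0), ECX times */
            : "+c" (count), "+D" (s)   /* ECX and EDI are both read and written     */
            : "a" (0)                  /* AL = 0, the byte being searched for       */
            : "cc", "memory");
        return ~count - 1;             /* bytes scanned, minus the NUL terminator   */
    }

Each scasb iteration decrements %ecx, so after the NUL is found %ecx holds 0xffffffff - (len + 1); that's where the ~%ecx - 1 at the end comes from, and why the initial 0xffffffff acts as a 4GB scan limit.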
"If they read over (i.e. the NULL byte is in byte position 1 2 or 3), isn't that technically undefined behavior?"
Only if they are using C, not implementing it. That is why the C standard has that 'undefined behavior' clause in the first place: the implementation knows the actual platform's behavior and is free to rely on it, which makes optimizations like this possible.
Sorry, it's back up.
Apparently updating the PDF broke speakerdeck, permanently marking the presentation as "unpublished", even though it is public. I had to delete and re-upload. Probably shouldn't have updated the PDF in the first place, though.
http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/i386...
http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/i386...
~vsergeev/frozeneskimo