It's counter-intuitive that so many instructions can be dedicated to something as seemingly simple as strlen(), but it does highlight the complexity of modern processors. On that note, I do not know and am reluctant to comment on which implementation of strlen() is fastest, as it seems very difficult to decouple the non-determinism of caches, instruction scheduling, pipelines, etc. just by looking at the source, especially when other code is also involved. But there probably is a benchmark that could give some useful results -- and justify the code above -- like in the link you posted.
Someone with a better understanding of instruction scheduling, pipelines, and other optimizations can probably give a better answer to this than me.
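For what it's worth, the kind of benchmark I have in mind is roughly this (a hypothetical harness, not from the slides: it just times many calls on the same buffer with POSIX clock_gettime, and you'd swap the other strlen() variants in for the libc one):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Time `iters` calls of a strlen-like function on the same string. */
    static double bench(size_t (*f)(const char *), const char *s, int iters)
    {
        struct timespec t0, t1;
        volatile size_t sink = 0;  /* keep the calls from being optimized away */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++)
            sink += f(s);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        (void)sink;
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        size_t n = 1 << 20;        /* 1 MB string of 'a's */
        char *buf = malloc(n);
        memset(buf, 'a', n - 1);
        buf[n - 1] = '\0';

        printf("libc strlen: %f s\n", bench(strlen, buf, 1000));
        free(buf);
        return 0;
    }

It still won't remove the noise from caches and scheduling, but with a warm cache and a large repetition count it should at least be suggestive.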
Wow, those examples took me a while to go through and understand; it's amazing how far you can optimize a piece of code for the hardware. I'm curious to benchmark the implementations and see the performance difference.
Maybe you or someone else knows the answer to this though: it seems like they are processing 4 bytes of the string at a time. If they read past the end (i.e. the NULL byte is in byte position 1, 2, or 3 of the word), isn't that technically undefined behavior? They are only reading the memory, but I feel like valgrind or another tool would spit out an error if that happened. The reads are aligned, so they won't trigger a page fault, but it seems like an unsafe optimization.
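For concreteness, this is roughly what I understand the 4-bytes-at-a-time loop to be doing (a C sketch of the idea, not the slides' actual code; the real glibc routine uses fancier magic constants):

    #include <stddef.h>
    #include <stdint.h>

    size_t strlen_word(const char *s)
    {
        const char *p = s;

        /* Advance byte-by-byte until p is 4-byte aligned. */
        while ((uintptr_t)p & 3) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }

        /* Now scan a whole 32-bit word per iteration. */
        const uint32_t *w = (const uint32_t *)p;
        for (;;) {
            uint32_t v = *w;
            /* Classic bit trick: nonzero iff some byte of v is zero. */
            if ((v - 0x01010101u) & ~v & 0x80808080u) {
                p = (const char *)w;
                while (*p)          /* find which of the 4 bytes it was */
                    p++;
                return (size_t)(p - s);
            }
            w++;
        }
    }

The alignment prologue is what makes the over-read "safe" in practice: the word load can touch up to 3 bytes past the terminator, but since it's aligned those bytes are always on the same page.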
Yeah, I see your point. Like you said, because the reads are word-aligned and the page size is a multiple of 4, an over-read never crosses into a new page and so won't cause a page fault, so it's technically "ok". I can only assume that at this level the corner is safely cut for the sake of performance.
Another, more trivial example of something like this is in the repnz-based strlen() (slides 7-8), where %ecx is loaded with 0xffff ffff, which technically limits the routine to scanning strings up to 4 gigabytes in length. It's a valid assumption that the string is under 4GB (especially on a strictly 32-bit system), but the point is that it's a semantically different routine from the C-based one.
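For reference, the shape of that repnz routine written as GCC inline assembly (my reconstruction of the idea, not the slides' exact code):

    #include <stddef.h>

    static size_t strlen_scasb(const char *s)
    {
        size_t count = ~(size_t)0;     /* 0xffffffff on a 32-bit build: the 4GB cap */
        __asm__ (
            "repne scasb"              /* scan bytes at [EDI] for AL (0), ECX times */
            : "+c" (count), "+D" (s)   /* ECX and EDI are both read and written     */
            : "a" (0)                  /* AL = 0, the byte being searched for       */
            : "cc", "memory");
        return ~count - 1;             /* bytes scanned, minus the NUL terminator   */
    }

Each scasb iteration decrements %ecx, so after the NUL is found %ecx holds 0xffffffff - (len + 1); that's where the ~%ecx - 1 at the end comes from, and why the initial 0xffffffff acts as a 4GB scan limit.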
"If they read over (i.e. the NULL byte is in byte position 1 2 or 3), isn't that technically undefined behavior?"
Only if they are using C, not implementing it. That is why the C standard has that 'undefined behavior' clause in the first place: the implementation knows the actual platform's behavior and is free to rely on it, which makes optimizations like this possible.
Sorry, it's back up.
Apparently updating the PDF broke speakerdeck, permanently marking the presentation as "unpublished", even though it is public. I had to delete and re-upload. Probably shouldn't have updated the PDF in the first place, though.
http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/i386...
http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/i386...
~vsergeev/frozeneskimo