X86 Oddities (code.google.com)
97 points by thirsteh on Sept 5, 2011 | 39 comments



YMMV on this because different models of chips differ, but I found years ago just playing around that many of the more obscure x86 opcodes are actually very slow to execute.

My hypothesis is that because they are obscure and rarely used by compilers, they are implemented via some kind of alternate "cruft path" using micro-ops that requires extra decode steps, pipeline flush, etc. that slows the processor down.

So if you think about using obscure x86 opcodes, benchmark them first. They might be slower than just implementing the algorithm using standard mainstream opcodes.


Since the Pentium-Pro / P6 era there has no longer been such a beast as a "CISC" processor or even an "x86" processor per se. Modern IA32 processors have a RISC core that executes x86 instructions via opcode translation. So you are exactly right, the more obscure x86 opcodes are likely to be implemented inefficiently.


they are implemented via some kind of alternate "cruft path" using micro-ops that requires extra decode steps, pipeline flush, etc. that slows the processor down.

This is exactly the case.

Just a few minutes ago I was implementing a parse tree data structure my thesis advisor commonly uses. It encodes the number and types of each node's children as pairs of bits, ordered from least-significant to most-significant. If the nth child is a leaf node, its pair is `10`, and if it's an internal node, it's `11`. I implemented the "number of children" function as a loop, shifting this pattern right by 2 until the value is zero.

Naturally, x86 has a useful opcode for this: BSR, bit scan reverse. It tells you the index of the first set (1) bit scanning from most-significant to least-significant, with 0=LSB and 31=MSB. In theory, this takes my O(n) solution and makes it O(1): I just BSR the pattern, shift the result right by 1, and add 1. I know it's a micro-optimization anyway, but I want to have some fun on Labor Day, so I give it a shot.
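
Roughly, the two approaches look like this in C (a sketch of the idea described above, not my actual code; `__builtin_clz` is the GCC/Clang spelling that compiles down to BSR on x86):

    #include <stdint.h>

    /* O(n): shift the 2-bit-per-child pattern right until nothing is left. */
    static unsigned count_children_loop(uint32_t pattern)
    {
        unsigned n = 0;
        while (pattern) {
            n++;
            pattern >>= 2;
        }
        return n;
    }

    /* O(1): the highest set bit is the high bit of the last used pair, so
       its index is 2*(n-1)+1; halve it and add 1 to recover n.  Undefined
       for pattern == 0, hence the guard. */
    static unsigned count_children_bsr(uint32_t pattern)
    {
        if (pattern == 0)
            return 0;
        unsigned highest = 31u - (unsigned)__builtin_clz(pattern);
        return (highest >> 1) + 1;
    }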

In practice, since the number of children is always 8 or fewer (this is for Ruby's grammar), the loop is always very short. Even while running 10,000,000 iterations to get the total time up to a few seconds, and even while using `rand` to try to trick the branch predictor (to make the looping case less predictable), I couldn't get a discernible difference to show up between O(1) and O(n). The O(n) case was definitely executing at least 3-4 times as many instructions in the test I was running, so I can only conclude BSR is deoptimized.

Edit: DarkShikari says that the deoptimization may be because I'm on an Athlon 64 processor and that this is not the case on Intel processors.


Naturally, x86 has a useful opcode for this: BSR, bit scan reverse. It tells you the index of the first set (1) bit scanning from most-significant to least-significant, with 0=LSB and 31=MSB.

BSR/BSF are the equivalents of CLZ/CTZ on ARM and other CPUs; it isn't a "crufty" or "redundant" instruction in the same way as LOOP, etc. Most modern x86 CPUs implement it in 1 or 2 cycles, with the exception of Athlon 64 and Atom. Check Agner's list of instruction latencies; don't guess.

BSF/BSR are extraordinarily useful instructions for a wide variety of uses, the most common being integer log2 (useful for parsing variable-length codes).
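
For example, integer log2 is a one-liner with it (a sketch; `__builtin_clz` maps to BSR on x86 and CLZ on ARM, and is undefined for zero):

    #include <stdint.h>

    /* floor(log2(x)) for x > 0. */
    static inline unsigned ilog2(uint32_t x)
    {
        return 31u - (unsigned)__builtin_clz(x);
    }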


Here's a link to Agner's optimization manuals, for the benefit of those who may not be familiar with them: http://www.agner.org/optimize/#manuals

In short they contain what is probably the best x86 instruction latency reference, based on actual empirical measurements on various Intel/AMD/VIA chips.


I never said it was crufty, just that it didn't improve my use case, despite requiring noticeably fewer instructions (though apparently not fewer cycles).

I am, however, on Athlon 64. That may explain why I found it to be indistinguishable from the short loop in profiling. I didn't guess anything; I even explicitly explained my profiling methodology.


Are you compiling with any optimisations on? If so, did you check the asm output for your looping version? It's possible that the compiler detects this pattern and replaces it with a BSR instruction in the binary. Even if it doesn't do that, it's quite likely that the compiler is vectorising the loop for you. Which might explain why you're not seeing a speedup from BSR: not because BSR is slow, but because the compiler is making your other code fast.


This could just be the general slowness of bitwise operations kicking in. I remember reading in Intel's optimisation documents that shift instructions take around 3 times as long as an add, so for multiplying by 4 you're better off with 2 adds than shifting up by 2. (In reality, the compiler would probably use lea)
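
For what it's worth, here's the lea trick in a sketch (the codegen comments are what GCC typically emits at -O2 on x86-64, not a guarantee):

    /* Multiplying by small constants: compilers usually reach for lea's
       scaled addressing rather than a shift or an imul. */
    int times4(int x) { return x * 4; }  /* typically: lea eax, [0+rdi*4]   */
    int times5(int x) { return x * 5; }  /* typically: lea eax, [rdi+rdi*4] */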


I remember reading in Intel's optimisation documents that shift instructions take around 3 times as long as an add

Shift and add are both one cycle on all modern CPUs. I think this might be different on the Pentium 4, but we x86 programmers pretend that the Pentium 4 didn't exist.

With regard to the other bitwise operations (and, or, xor), I don't know of any CPU in existence where these are slower than add.*

*Except for the Pentium 4, where the add/sub unit is double-pumped and nothing else on the chip is, but that's a bizarre exception.


> Shift and add are both one cycle on all modern CPUs.

It's common for variable-length shifts to be slow on modern CPUs once you move outside of the x86 space. On the PS3's PPU and the Xbox 360's PowerPC cores they are very slow and to be avoided.

> I think this might be different on the Pentium 4

What was particularly slow on the P4 were variable-length shifts and rotates. With an immediate operand they still had 1-cycle throughput. That sounds good until you realize that the P4 had a double-pumped ALU design (as you also mentioned) and had two of those ALUs, whereas shifts and rotates could only execute on the general integer unit.


It's common for variable-length shifts to be slow on modern CPUs once you move outside of the x86 space. On the PS3's PPU and the Xbox 360's PowerPC cores they are very slow and to be avoided.

They're hardly modern CPUs -- even outside the x86 space, slow variable shifts are pretty rare. Those two are horrifically crippled: the PS3's PPU was never intended to do any serious processing (it's supplemented by the SPEs), and the Xbox 360's CPU is just plain bad. Fast variable shifts are critical for bitstream reading -- one end result of this crippling is that despite having 3 highly-clocked cores, and being supplemented by GPU processing, the Xbox 360 still can't play Blu-ray video.


What's your definition of modern? The latest and greatest ARM processor (Cortex-A9) has a penalty for shifting. ARM shifts are a bit strange, because any ALU op can shift one of its operands. For standard ALU ops, shifting by an immediate costs one extra cycle, and shifting by a register costs two extra cycles. MOV, i.e. a pure shift, only costs an extra cycle in the register case, but it's still more expensive than, say, an add.

I haven't done any Itanium assembly since McKinley (Itanium 2), but its shifts had 3 extra cycles of latency. Making a fast shifter has only become a harder problem since then, because the ratio of wire to gate delay has grown; they could build a single-cycle shifter in the current implementation, but it would cost them even more than it did back then.

x86 is unique among modern architectures because it has so much legacy baggage. Shifts are only fast on x86 because when Intel implemented slow shifts on the P4, it killed performance in a lot of apps built by compilers that were accustomed to fast shifts and would substitute shifts for multiplies whenever possible.


The latest and greatest ARM processor (Cortex-A9) has a penalty for shifting. ARM shifts are a bit strange, because any ALU op can shift one of its operands. For standard ALU ops, shifting by an immediate costs one extra cycle, and shifting by a register costs two extra cycles. MOV, i.e. a pure shift, only costs an extra cycle in the register case, but it's still more expensive than, say, an add.

That is not true whatsoever.


How so? The table of cycle timings on ARM's website agrees with what he said:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

On the classic ARM processors like ARM9TDMI, I think post-ALU immediate shifts were free and register shifts were 1 extra cycle.


That table is vastly simplified. As the document explains, the exact timings cannot be listed in a simple table due to the complexities of out-of-order multi-issue execution.

A shifted operand is frequently required to be available as input one cycle earlier than a non-shifted one, simply because the shift happens in an earlier pipeline stage than the main ALU ops, which is presumably what the figure in the table is meant to reflect. If the operand is ready, there is no additional delay. In ARM9 documentation, this behaviour is frequently referred to as the instruction having one or more "early" operands.

I verified this just now on an actual A9 using the cycle counter. A sequence of independent adds with shifted inputs executes at two instructions per cycle. If each add is made to depend on the previous one, two cycles per instruction are needed, as suggested by the table. Short chains of dependent instructions are executed out of order, masking the added latency.

No ARM processor has ever had a "post-ALU" shift.


Thanks, that makes sense. My ARM knowledge has been either second hand or from book reading, so this is good to know.

> No ARM processor has ever had a "post-ALU" shift.

Typo. I meant pre-ALU.


> They're hardly modern CPUs

They're annoying to program, but why aren't they modern? You're just redefining terms to try to win an argument. They were intentionally stripped down to lower power consumption and manufacturing costs. That doesn't make them not modern.


Both the Cell and the Xenon are around 7 years old now and the POWER4-based PowerPC 970, their direct ancestor, is over 9 years old. They're not ancient in the sense of a Z80 being ancient but since the PPC970's inception IBM has itself iterated the Power architecture 3 more times (the current generation is POWER7), not to mention the strides that AMD and Intel have both made in processor architectures since that time.


The compiler did use `lea` - I used inline asm for the `BSR` instruction and then did the shift and increment in C, figuring the compiler would be smarter than me.
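
The wrapper looked something like this (a sketch from memory, GCC inline asm, not the exact code; the result is undefined when the input is zero):

    #include <stdint.h>

    static inline uint32_t bsr32(uint32_t x)
    {
        uint32_t index;
        __asm__("bsr %1, %0" : "=r"(index) : "rm"(x) : "cc");
        return index;
    }

    /* number of children = (bsr32(pattern) >> 1) + 1, for pattern != 0 */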


POPCNT sounds like a more suitable instruction, and it's modern (i.e. not likely to be slow-pathed; although pre-Core iX processors won't support it at all!)


Not for my particular encoding, as I have "10" to indicate a terminal node and "11" to indicate a non-terminal. Essentially it's a 3-valued encoding of 'empty', 'terminal', 'nonterminal' for a variable number of children.


AND your input with 0xAAAAAAAA (i.e. 1010...10 in binary) to turn all the "11" values into "10".
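
In code, that's just (a sketch; `__builtin_popcount` uses the POPCNT instruction when the target has it and a software fallback otherwise):

    #include <stdint.h>

    /* One bit survives per used pair (10 or 11), so the popcount is the
       number of children. */
    static unsigned count_children_popcnt(uint32_t pattern)
    {
        return (unsigned)__builtin_popcount(pattern & 0xAAAAAAAAu);
    }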


How does that help me use popcnt to find how many bit pairs are used and how many are empty? Edit: whoops, I gotcha.

I'll profile this.

Edit 2: no I won't profile it, because I don't have the instruction. Damn.


You don't need to have POPCNT to implement much faster bit counting:

http://graphics.stanford.edu/~seander/bithacks.html#CountBit...
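
The 32-bit "parallel" version from that page, for reference:

    #include <stdint.h>

    /* Count set bits in progressively wider fields, then sum the four
       byte counts with a multiply. Branch-free, no table, no POPCNT. */
    static unsigned popcount32(uint32_t v)
    {
        v = v - ((v >> 1) & 0x55555555u);
        v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
        v = (v + (v >> 4)) & 0x0F0F0F0Fu;
        return (v * 0x01010101u) >> 24;
    }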


1. There's no need to debate what's fast and what's slow. Figure out what architecture you're talking about and get the numbers. Intel:

http://www.intel.com/content/www/us/en/processors/architectu...

(the Intel® 64 and IA-32 Architectures Optimization Reference Manual; read Appendix C).

Similar manuals exist for most platforms, although sometimes the embedded vendors get a bit shy. Agner's stuff is good too, and he isn't prone to leaving gaps in his latency tables out of forgetfulness or embarrassment, unlike Intel.

2. If you are futzing around with Athlon 64s and Pentium Ms and so on, you are retrocomputing. Good for you, but please don't tell us that some micro-operations are 'slow' in general. The facts are available in mind-numbing detail; go acquaint yourself with them.

3. Modern x86 - Core 2 and onwards:

The 'slowness' of individual operations - as long as you stay away from really nasty fiascos like INC and DEC and XLATB and integer divide and so on - is NOT necessarily all that important. Even in the unlikely event that you are l337 enough to avoid branch mispredicts, cache misses, etc., the important thing is to be able to keep your pipeline full. You can issue and retire 4 uops per cycle: 3 ALU ops and 1 load/store.

Frankly, it just doesn't matter whether an instruction is 3 cycles or one cycle if you've got good reciprocal throughput and a full pipeline. The instructions to stay away from are the ones with both large latency and large reciprocal throughput - these will tie up a port for a startling length of time (like the SSE4.2 string match instructions, which are botched and appear to be getting slower, not quicker).

Keeping your pipeline full has far, far more to do with having a lot of independent work to do than it does with instruction selection. Variable shift vs. fixed shift is a second-order (third?) effect compared to the difference between issuing one instruction per cycle and issuing 4 (the latter is unlikely, but doable in some loops).

Aspire to data-parallelism, even in single-threaded CPU code. That long sequential dependency chain is what's killing you. Even a Level 1 cache hit is 4 cycles on a Nehalem or Sandy Bridge; if your algorithm has nothing useful to do for those cycles, you're going to be twiddling your thumbs on 3 ALU ports and 1-2 load/store ports for 4 cycles. (There's a small sketch of this at the end of this comment.)

4. Yes, most of the really obscure instructions suck. Read the aforementioned Optimization Reference Manual and find out which ones, and why.
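
Here's the sketch promised in point 3: the same reduction written with one accumulator (a serial dependency chain) and with four independent accumulators the core can overlap. The exact speedup depends on the chip; the point is the shape of the dependency graph, not the instruction selection.

    #include <stddef.h>
    #include <stdint.h>

    /* Serial: each add waits for the previous one. */
    static uint64_t sum_serial(const uint32_t *a, size_t n)
    {
        uint64_t s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Independent accumulators: the adds can issue in parallel, so the
       limit shifts toward issue width and load bandwidth rather than the
       latency of the add chain.  (Assumes n is a multiple of 4 to keep
       the sketch short.) */
    static uint64_t sum_unrolled(const uint32_t *a, size_t n)
    {
        uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (size_t i = 0; i < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }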


If the OP happens to read this, the CRC32 instruction was designed to calculate the version of CRC32 used in iSCSI so that CRC checks on big iron iSCSI servers could be accelerated.
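
For reference, it computes CRC-32C (the Castagnoli polynomial, not the zlib/Ethernet one) and is exposed through the SSE4.2 intrinsics. A byte-at-a-time sketch (compile with -msse4.2; real code would use the 32/64-bit forms _mm_crc32_u32/_mm_crc32_u64):

    #include <nmmintrin.h>  /* SSE4.2 intrinsics */
    #include <stddef.h>
    #include <stdint.h>

    static uint32_t crc32c_update(uint32_t crc, const uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            crc = _mm_crc32_u8(crc, buf[i]);
        return crc;
    }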


Do you mean just SCSI? I'm sure x86 predates iSCSI.


CRC32 was introduced with SSE 4.2 on the i7.


One clarification:

The lock cmpxchg8b combination didn't crash the CPU; what did was lock followed by an invalid encoding of an instruction that would do two memory accesses (the canonical case was cmpxchg8b between two registers, which is obvious nonsense). More importantly, cmpxchg8b is essentially useless without lock prefix.
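
For context, the usual way to get a lock cmpxchg8b emitted these days is a 64-bit compare-and-swap in 32-bit code. A sketch (assuming GCC and a -march that has cmpxchg8b, i.e. Pentium or later):

    #include <stdint.h>

    /* In a 32-bit build this builtin lowers to lock cmpxchg8b; in a 64-bit
       build it becomes a plain lock cmpxchg on a quadword. */
    static int cas64(volatile uint64_t *p, uint64_t expected, uint64_t desired)
    {
        return __sync_bool_compare_and_swap(p, expected, desired);
    }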


> More importantly, cmpxchg8b is essentially useless without lock prefix.

I haven't looked at this for a while but I'm pretty sure cmpxchg8b nowadays has an implicit lock prefix.


Any reasons NOT to use x86_64 and forget about ia32? Netbooks? They're obsolete. ARM/Android will be everywhere next year. ^_^


x86_64 is also bizarre, with mode switching as part of a jump, register aliasing, etc.


There is nothing bizarre about implementing mode switching as part of a jump instruction. Every CPU I know of with more than one mode implements switching with some kind of jump instruction.


I agree it isn't bizarre. But...

> Every CPU I know of with more than one mode implements switching with some kind of jump instruction.

So you don't know x86? You enter protected mode by manipulating the cr0 register.


But after entering 386-style protected mode or amd64's long mode, you can switch between different modes by far jumps to specially crafted descriptors / selectors (and at least in 32-bit protected mode you can trigger various documented semi-magical behavior that way). So it's consistent.


So what jwatte said is wrong? I don't know such details about x86. I was thinking mainly of ARM and MIPS.


No, he was referring to the 32-bit to 64-bit mode switch. I was talking about entering 32-bit protected mode from real mode for the first time, typically right after boot-up. As another poster mentioned, once you're in protected mode, you can go back to "fake" real mode (e.g. an MS-DOS program running on Windows) with a segment-descriptor-based jump, which is consistent with the manner of x64's mode switching.


My least favorite part when reading and writing x64 code has to be the calling conventions.
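
For anyone who hasn't run into it: the same C declaration maps to different registers depending on the platform (a from-memory summary, not exhaustive; check the ABI documents before relying on it):

    /* long long f(long long a, long long b, long long c, long long d, long long e);
     *
     * System V AMD64 (Linux/BSD/OS X): a=RDI, b=RSI, c=RDX, d=RCX, e=R8;
     *   return value in RAX; FP args go in XMM0-XMM7.
     * Microsoft x64 (Windows):         a=RCX, b=RDX, c=R8,  d=R9,  e on the
     *   stack; the caller also reserves 32 bytes of shadow space; FP args
     *   go in XMM0-XMM3.
     */
    long long f(long long a, long long b, long long c, long long d, long long e);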


Unfortunately, there are tons of people out there still running 32-bit Windows. Microsoft had a chance to cut the cord with Vista and 7 but they blew it.



