Optimizing compilers reload vector constants needlessly (lemire.me)
151 points by ibobev on Dec 6, 2022 | 95 comments



I haven't (yet) read the article, but I will. But the headline...

> Optimizing compilers reload vector constants needlessly

...is absolutely true. I wrote some code that just does bit manipulation (shift, or, and, xor, popcount) at the byte level. The compiler produced vectorized instructions that provided about a 30% speed-up. But when I looked at it... it was definitely not as good as it could be, and one of the big things was frequently reloading/broadcasting constants like 0x0F or 0xCC or similar. Another thing it would do is sometimes drop down to normal (non-SIMD) instructions. This was with both `-O2` and `-O3`, and also with `-march=native`.

I ended up learning how to use SIMD intrinsics and hand-wrote it all... and achieved about a 600% speedup. The code reached about 90% of the bandwidth of the bus to RAM, which was what I theorized "should" be the limiting factor: bitwise operations like this are extremely fast, and the slowest part was popcount, which didn't have a native instruction on the hardware I was targeting (AVX2). This was with GCC 6.3 if I recall, about 5 years ago.
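
For a flavour of the kind of code involved, here is a sketch (not my original code) of the classic AVX2 nibble-lookup popcount, with the 0x0F mask and the lookup table hoisted out of the loop so they are materialized only once:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    // Sketch: byte-wise popcount over a buffer using AVX2 nibble lookup,
    // since AVX2 has no vector popcount instruction. The 0x0F mask and the
    // per-nibble count table are hoisted so they are materialized once.
    uint64_t popcount_avx2(const uint8_t* data, size_t len) {
        const __m256i low_mask = _mm256_set1_epi8(0x0F);
        const __m256i table = _mm256_setr_epi8(
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
        __m256i acc = _mm256_setzero_si256();
        size_t i = 0;
        for (; i + 32 <= len; i += 32) {
            __m256i v  = _mm256_loadu_si256((const __m256i*)(data + i));
            __m256i lo = _mm256_and_si256(v, low_mask);
            __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
            __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(table, lo),
                                          _mm256_shuffle_epi8(table, hi));
            // horizontal byte sums into four 64-bit lanes, so the 8-bit
            // counters cannot overflow across iterations
            acc = _mm256_add_epi64(acc, _mm256_sad_epu8(cnt, _mm256_setzero_si256()));
        }
        uint64_t lanes[4];
        _mm256_storeu_si256((__m256i*)lanes, acc);
        uint64_t total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
        for (; i < len; ++i) total += __builtin_popcountll(data[i]);  // scalar tail
        return total;
    }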


I often hand-write NEON (and other vector architecture) intrinsics/assembly for my job, optimising image and signal processing routines. We have seen many, many three-digit percentage speedups over plain C/C++ code.

I got into the nastiest discussion on reddit where people were swearing up and down it was impossible to beat the compiler, and handwritten assembly was useless/pretentious/dangerous. I was downvoted massively. Sigh.

Anyways, that was a year ago. Thanks for another point of validation for that. It clearly didn’t hurt my feelings. :)

I never come across people in the wild who actually do this either; it's such a niche area of expertise.


It also slightly annoys me what JIT people write in their GitHub READMEs about the incredible theoretical improvements that can happen at runtime, yet it's never anywhere close to AOT compilation. Then you can add 2-3x on top of that for hand-written assembly.

I do wonder what's going on with projects like BOLT though. I saw that it was merged into LLVM, and I have tried to use it, but the improvement was never more than 7%. I feel like it has a lot of potential because it does try to take run-time behaviour into account.

See: https://github.com/llvm/llvm-project/tree/main/bolt


> improvement was never more than 7%.

If your use case isn't straining icache then you won't benefit as much.

BTW 7% is huge, odd that you would describe it as "only".


> BTW 7% is huge, odd that you would describe it as "only".

It depends on what you're doing and how optimized the baseline performance is. In my area (CRDTs) the baseline performance is terrible for a lot of these algorithms. Over about 18 months of work I've managed to improve on automerge's 2021 performance of ~5 minutes / 800MB of RAM for one specific benchmark down to 4ms / 2MB. That's 75000x faster. (Yjs in comparison takes ~1 second.)

Almost all of the performance improvement came from using more appropriate data structures and optimizing the fast-path. I couldn't find an off-the-shelf b-tree or skip list which did what I needed here. I ended up hand coding a b-tree for sequence data which run-length encodes items internally, and knows how to split and merge nodes when inserts happen. CRDTs also have a lot of fiddly computations when concurrent changes edit the same data, but users don't do that much in practice. Coding optimized fast paths for the 99% case got me another 10x performance improvement or so.
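
For anyone wondering what "run-length encodes items internally" means here, a hypothetical sketch of the kind of leaf entry involved (field names are made up; this is not the actual implementation):

    #include <cstdint>

    // Hypothetical leaf entry: a run of characters typed consecutively by one
    // agent is stored as a single item rather than one tree node per character.
    struct RleSpan {
        uint64_t agent;      // which peer inserted this run
        uint64_t seq_start;  // sequence number of the first character in the run
        uint32_t len;        // number of consecutive characters covered
        bool     deleted;    // tombstone flag applying to the whole run
    };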

I'd take another 7% performance improvement on top of where I am, but this code is probably fast enough. I hear you that 7% is huge sometimes, and a smarter compiler is a better compiler. But 7% is a drop in the bucket for my work.


7% is huge in the context of compilers, which optimize general-purpose code.


7% is in the ballpark of the speedup most programs get from changing the allocator to not give almost every allocation the same huge alignment, and around half the speedup most programs get from using explicit huge pages. These changes are both a lot easier, but e.g. Microsoft doesn't think it's worthwhile to allow developers to make the latter change at all, over 26 years after the feature shipped in consumer Intel CPUs.


That's unfortunate. I wrote a VMM that tries to back memory with hugepages (even the guests page tables). It's making a difference!


> about the incredible theoretical improvements that can happen at runtime

Which in the majority of cases can be achieved by profile guided optimization anyways.


It should be part of these discussions to prove what you claim. Always. With code samples, fed directly to the compiler, and the corresponding assembly.

https://godbolt.org/

Statistics alone are worthless; in the end all that counts is the arena of performance: what the code becomes and how it runs against the handcrafted version.


Godbolt doesn’t accurately show runtime speed of algorithms on input data, which is what you need when discussing simd performance. And often these are proprietary industry algorithms that are the core of a business’s model.

I’m all for transparency but I’m also not about to get fired for posting our kernel convolution routines, or least squares fit model.

> It should be part of these discussions to prove what you claim

Further - these aren’t subjective claims that need to be proven on a forum for legitimacy. It’s the literal state of vector based optimisations in the compiler world right now. It is a hard problem and for the time being humans are much better at it. This is quite a large area of academic research at the moment.

If someone is so uninformed of this domain that they don’t know this, the burden is on that person to learn what the industry is talking about. Not the people discussing the objective state of the industry.


Godbolt takes practice to read. Often, people who are incapable of understanding when you can beat a compiler also cannot be shown a Godbolt snippet in good faith.


This is always deeply frustrating. You quickly get the sense that the person you're talking to hasn't experienced anything beyond simple float loops that are trivial for the compiler to autovectorize, or really bad examples of hand vectorization.

In the meantime, I constantly encounter algorithms that compilers fail to vectorize because even single vector instructions are too complex for the compiler to match, such as saturating integer adds. The compiler fails to autovectorize and the difference in performance is >5x. Even for something as simple as adding up unsigned bytes, all three major compilers generate vector code that's much slower than a simple loop leveraging the absolute-difference instructions.
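
To make "saturating integer adds" concrete, a minimal sketch (how well any particular compiler version handles this idiom varies): the scalar loop next to the single SSE2 instruction it should map to.

    #include <emmintrin.h>  // SSE2
    #include <stdint.h>
    #include <stddef.h>

    // Scalar saturating add: compilers often fail to recognise this idiom.
    void sat_add_u8_scalar(uint8_t* dst, const uint8_t* a, const uint8_t* b, size_t n) {
        for (size_t i = 0; i < n; ++i) {
            unsigned s = a[i] + b[i];
            dst[i] = s > 255 ? 255 : (uint8_t)s;
        }
    }

    // Hand-vectorised version: one instruction (paddusb) does 16 lanes at once.
    void sat_add_u8_sse2(uint8_t* dst, const uint8_t* a, const uint8_t* b, size_t n) {
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
            _mm_storeu_si128((__m128i*)(dst + i), _mm_adds_epu8(va, vb));
        }
        for (; i < n; ++i) {  // scalar tail
            unsigned s = a[i] + b[i];
            dst[i] = s > 255 ? 255 : (uint8_t)s;
        }
    }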

That's even before running into the more complex operations that would require the compiler to match half a dozen lines of code, like ARM's Signed Saturating Rounding Doubling Multiply Accumulate returning High Half: https://developer.arm.com/architectures/instruction-sets/int...

Or cases where the compiler is not _allowed_ to apply vector optimizations by itself, because changes to data structures are required.


> it’s such a niche area of expertise.

It is that! As I stated in another comment, this niche ended up saving the company literally millions of dollars in hardware costs. To be fair, it really should only be done in highly performance-critical situations and only after an initial implementation is already written in pure C++ with thorough unit testing for _all_ input/output cases.


What does the company you work for do?


The company I wrote the hand-optimized code for does DNA analysis. I worked there mostly because I needed a job; DNA analysis certainly isn't within my passion domain.

Now I work for a company writing software to control drones. It doesn't pay as much as I want, but it's at least fun and there's a ton of unsolved automation problems at the company. And we're hiring like crazy -- from 6 people to 300+ in the past two years, and there's no sign of slowing down. And I'm now a manager too... that's the really scary part lol


> I got into the nastiest discussion on reddit where people were swearing up and down it was impossible to beat the compiler, and handwritten assembly was useless/pretentious/dangerous.

It _should_ be useless (for some reasonable definition of "should") — it just isn't in practice. And I'm continually amazed at how often people confuse one for the other, across all contexts. E.g. I have family members who refuse to consider that our justice system might have deep flaws, because to their mind if it should be some other way then it already would be.


Isn't compiler optimization NP-complete? I don't think I'd put anything there in "should". Yeah, any single optimization (or permutation thereof) can be applied, but they're order-dependent and the combinatorial explosion means you can't try to apply all of them.


What makes you think that humans are better at solving NP-complete problems?


Not in general. But humans can exploit prior knowledge to select which avenues to pursue first.


Tell them to read the ffmpeg code. All the platform-specific/SIMD stuff is done in asm.

This isn't only because it's faster; it's honestly easier to read than intrinsics anyway. What it does lack is debuggability.


Or any other highly optimised numerical codebase. From a quick glance at OpenBLAS, it looks like they have a lot of microarchitecture-specific assembly code, with dispatching code to pick out the appropriate implementations.

https://github.com/xianyi/OpenBLAS/blob/02ea3db8e720b0ffb3e2...

https://github.com/xianyi/OpenBLAS/blob/02ea3db8e720b0ffb3e2...


For debugging you can actually use gdb in assembly tui mode and step through the instructions! You can even get it hooked up in vs code and remote debug an embedded target using the full IDE. Full register view, watch registers for changes, breakpoints, step instruction to instruction.

Pipelining and optimisations can make the intrinsics a bit fucky though, have to make sure it’s -O0 and a proper debug compilation.

I have line by line debugged raw assembly many times. It’s just a pain to initially set up. Honestly not very different from c/c++ debugging once running.


Sure, but gdb doesn't know what the function parameters are, or on some platforms where functions start and end, crashes don't have source lines, and ASan doesn't work. (though of course valgrind does)


If you are handwriting the function in assembly, you'll know what registers hold the function parameters and what types of values they are supposed to be, and with care you can produce debug information and CFI directives to allow for stack unwinding. It's just annoying to do - but that's the tradeoff you make for the performance improvement, I suppose.


I don’t know if this is frowned upon or not among assembly programmers, but I often just use naked functions in C with asm bodies, which gdb will provide the args for, rather than linking against a separate assembly file.
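
A minimal sketch of that pattern (my own illustration, assuming GCC 8+/Clang on x86-64 System V; the function itself is made up):

    #include <stdint.h>

    // "Naked" function: no compiler-generated prologue/epilogue, the body is
    // pure asm, but it still has a C prototype so the debugger knows the args.
    __attribute__((naked)) uint64_t add3(uint64_t a, uint64_t b, uint64_t c) {
        __asm__(
            "leaq (%rdi,%rsi), %rax\n\t"  // rax = a + b
            "addq %rdx, %rax\n\t"         // rax += c
            "ret");
    }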


If you write your assembly to look like C code GDB is more than happy to provide you with much of that to the extent that it can. In particular, it will identify functions and source mappings from debug symbols.


Pipelining and optimizations…when reading in the debugger? I don't quite understand how this is relevant.


ffmpeg might have amazingly efficient inner loops (i.e. low-level decoding/encoding), but the broader architecture (e.g. memory buffer implementations, etc) is quite inefficient. As with the low-level media code, it's not that each component itself is inefficient; it's that the interfaces and control flow semantics between them obstruct both compiler and architectural optimizations.

When I wrote a transcoding multimedia server I ended up writing my own framework and simply pulling in the low-level decoders/encoders, most of which are maintained as separate libraries. I ended up being able to push at least an order of magnitude more streams through the server than if I had used ffmpeg (more specifically, libavcodec) itself, even though I still effectively ended up with an abstraction layer intermediating encoder and format types. And I never wrote a single line of assembly.

There's no secret sauce to optimization: it's not about using assembly, fancier data structures, etc; it's about learning to identify impedance mismatches, and those exist up and down the stack. Sometimes a "dumber" data structure or algorithm can create opportunities (for the developer, for the compiler) for more harmonious data and code flow. And impedance mismatches sometimes exist beyond the code--e.g. a mismatch between functionality and technical capabilities, where your best bet might be to redefine the problem, which can often be done without significantly changing how users experience the end product.


> most of which are maintained as separate libraries

This is so confusing I can’t tell if you’re actually talking about libavcodec. The whole point is to combine codecs to share common code, “most” decoders certainly aren’t available elsewhere.

If you just want to call libx264 directly go ahead and do that of course. libx264 uses assembly just as much or more than libavcodec though.


I have a lot of sympathy for wanting efficient code. But let's indeed have a look: https://github.com/FFmpeg/FFmpeg/blob/7bbad32d5ab69cb52bc92a... There are so many macros, %if and clutter here that it's difficult (for me?) to keep the big picture in mind.

This reminds me of a retrospective of an OS/window manager written in assembly - they were great about avoiding tiny overheads, but expressed regret that the whole system ended up slow because it was hard to reason about bigger things such as how often to redraw everything, similar to what people are saying here.

To be clear: let's indeed optimize and vectorize, but better to build on intrinsics than go all the way down to assembly.


I prefer intrinsics over assembly.

There are too many different assemblers: inline, MASM, NASM, FASM, YASM. They each come with their own quirks, and they complicate the build.

Intrinsics are more portable. It's trivial to re-compile legacy SSE intrinsics into AVX1. You won't automatically get 32-byte vectors this way, but you will get VEX encoding, broadcasts for _mm_set1_something, and more.

Readability depends on the code style. When you write intrinsics using "assembly with types" style, actual assembly is indeed more readable. OTOH, with C++ it's possible to make intrinsics way better than assembly: arithmetic operators instead of vaddpd/vsubpd/vmulpd/vdivpd, strongly-typed classes wrapping low-level vectors for specific use cases, etc.
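
For example, a minimal sketch of that style (my own illustration, not from any particular library):

    #include <immintrin.h>

    // Thin strongly-typed wrapper: reads like scalar math, compiles to vaddpd/vsubpd/vmulpd.
    struct Vec4d {
        __m256d v;
        friend Vec4d operator+(Vec4d a, Vec4d b) { return {_mm256_add_pd(a.v, b.v)}; }
        friend Vec4d operator-(Vec4d a, Vec4d b) { return {_mm256_sub_pd(a.v, b.v)}; }
        friend Vec4d operator*(Vec4d a, Vec4d b) { return {_mm256_mul_pd(a.v, b.v)}; }
    };

    // y = a*x + b, written like the formula rather than like assembly
    inline Vec4d axpb(Vec4d a, Vec4d x, Vec4d b) { return a * x + b; }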

Update: most real-life functions contain scalar code (like loop bookkeeping) and auto-generated code (stack frame setup, backup/restore of non-volatile registers). When writing non-inline assembly, the developer needs to do all of that manually, which can be hard to get right and may cause bugs like these: https://github.com/openssl/openssl/issues/12328 https://news.ycombinator.com/item?id=33705209


FFmpeg code is god-awful. A lot of it is from 2002 and written without regard to any sort of "sanity". People who write assembly routines these days have a structure to their code, and if they overrun buffers or whatever, they'll document what alignment assumptions they're making. FFmpeg will just start patching its own code at runtime because someone thought it was a good idea on Pentium processors.


Nice!

> I never come across people in the wild who actually do this either; it's such a niche area of expertise.

I'd guesstimate there are several hundred of us working on vectorization :)


I would guess you're off by an order of magnitude or two.


Interesting. Any thoughts on where they can be found?


A lot of them don’t have accounts on the internet. Many do their job part-time as the need arises.


> impossible to beat the compiler

Ludicrous! How could they be taken seriously? Which subreddit was this?


Aren’t all speedups 3 digit percentage? As in 100% or more? Did you mean 100x?


I think when people measure speedups here they deduct the first 100%. E.g. if I used to be able to process 10 items per second and can now process 20 items per second, I run at 200% of the original speed, but it's a 100% speedup.


To avoid the various ambiguities here, I learned to express speedups as a ratio of the optimized throughput, divided by the baseline throughput. Or equivalently: the baseline time divided by optimized time.

4400 MB/s vs 800 MB/s = 5.5x speedup or 5.5 times as fast.


A three digit percentage means at least a quarter of the time, but possibly more.

Something that is 50% faster will take half the time. Something that is 100% faster will take a quarter, and so on.


Something that is 50% faster takes 2/3 as long sir.


You are correct, but I would also point out that I have to think long and hard about this whenever it comes up.


>> This was with both `-O2` and `-O3`, and also with `-march=native`

Until very recently GCC didn't do vectorization at -O2 usless you told it to.


That's true. I definitely omitted a bunch of other flags that were used, including the flags to turn on vectorization.


That's basically the problem the article describes although he's using vector intrinsics too and it still reloads and broadcasts the constant before each loop.


There are three reasons it reloads constants:

1. It thinks it is cheaper than keeping them in a register (this is known as rematerialization). It will reload constants if that lets it keep something else in a register and it's cheaper to do so.

2. It thinks something could affect the constant.

3. It thinks it must move it through memory to use it, and then it thinks the memory was clobbered.

In this case, it definitely knows it is a constant, but it can't prove that both loops always execute, so it places the load on each path where it is executed only once per loop, because it believes that will be cheaper.

I can still make at least gcc do weird things if I prove to it that the loop executes at least once.

In that case, what is happening in gcc is that constant propagation is propagating the vector constant forward into both loops. Something later (that has a machine cost model) is expected to commonize it if it is cheaper, but never does.


You can see it get propagated through as a constant at the high level here: https://godbolt.org/z/jxWKcnTT1

That is normal and what I would expect to happen.

You can see in lower level RTL dumps, nothing chooses to commonize it. That seems a bit weird, since it should have a cost model that says this isn't free.


I believe it's possible to report this as a "missed optimisation" bug.


Yes.

A lot of the intrinsics also look like (at least in gcc) they are marked always_inline but not pure/const.

It's probably worthwhile marking those that take memory as pure/const. That way, it's obvious the value is read-only even when it's inlined.

(otherwise, the compiler will have to prove it in the inlined code, which is not always possible)


When I have used intrinsics, the compiler at least has a hope of getting this right, particularly when you use patterns like:

        __m256i mask = _mm256_set1_epi8(0x0f);

If you just used the intrinsic that sets the register to a constant over and over, it often repeats the instruction.

The compilers just aren't that smart about SIMD yet.


He sets it once like this before the loops.

        __m256i c = _mm256_set1_epi32(10001);
And then the disassembly has

        mov     eax, 10001
        vpbroadcastd    ymm1, eax
before each loop.
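
For reference, roughly the shape of code being discussed (a sketch, not necessarily the article's exact example):

    #include <immintrin.h>
    #include <stddef.h>

    // The constant is created once, yet the compiler may still emit
    //   mov eax, 10001 ; vpbroadcastd ymm1, eax
    // in front of each loop.
    void scale_both(__m256i* a, size_t na, __m256i* b, size_t nb) {
        __m256i c = _mm256_set1_epi32(10001);
        for (size_t i = 0; i < na; i++)
            a[i] = _mm256_mullo_epi32(a[i], c);
        for (size_t i = 0; i < nb; i++)
            b[i] = _mm256_mullo_epi32(b[i], c);
    }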


Yeah, if the compiler runs out of registers it will do this - it's better than spilling the constant to memory. Register allocation is one thing that compilers are still much worse at than humans, and you see it in SIMD code a lot.


Eh. Compilers are only worse because we don't care.

Optimal register allocation + spilling is easily doable now on today's machines/algorithms.

We just don't bother because it's not truly worth it in most cases (i.e. we get most of the performance for a very low cost).


Can you recommend any favorite resources for learning how to use SIMD intrinsics?


Not OP, but I also work with this.

There are some tutorials, but honestly the best thing is to just use them.

Write an image processing routine that does something like apply a Gaussian blur to a black and white image. The C++ code for this is everywhere. You have a fixed kernel (a 2D matrix) and you have to do repeated multiplication and addition on each pixel for each element in the kernel.

Write it in C++ or Rust. Then read the Arm SIMD manual, find the instructions that do the math you want, and switch it over to intrinsics. You are doing the exact same operations with the intrinsics as in the raw C++, just 8 or 16 of them at a time.
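
For a flavour of what that switch looks like, a toy sketch (simpler than a blur kernel, assuming NEON; the function and its weighting are made up):

    #include <arm_neon.h>
    #include <stdint.h>
    #include <stddef.h>

    // Toy example: scale a row of 8-bit pixels by w/256, eight pixels per iteration.
    void scale_row(uint8_t* out, const uint8_t* in, uint8_t w, size_t n) {
        uint8x8_t wv = vdup_n_u8(w);                 // broadcast the weight once
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            uint8x8_t  px   = vld1_u8(in + i);       // load 8 pixels
            uint16x8_t prod = vmull_u8(px, wv);      // widening multiply to 16 bits
            vst1_u8(out + i, vshrn_n_u16(prod, 8));  // shift right, narrow, store
        }
        for (; i < n; ++i) out[i] = (uint8_t)((in[i] * w) >> 8);  // scalar tail
    }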

Run them side by side for parity and to check speed, tweak the SIMD, etc.

Arm has good (well, okay) documentation:

https://developer.arm.com/documentation/den0018/a/?lang=en

https://arm-software.github.io/acle/neon_intrinsics/advsimd....

* Edit: you also have to do this on a supported architecture. Raspberry Pis have a NEON core from at least the 3 onwards. Not sure about the 4, but I believe so too!


Adding on:

Go to https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

Start with SSE, SSE2, SSE3

Write small functions in https://godbolt.org/ . Watch the assembly and the program output.


> https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

Intel's Intrinsics Guide is exactly what I used and it was before I learned about Compiler Explorer.

I already had a large and thorough suite of unit tests for inputs and expected outputs, including happy and sad paths. So it was pretty easy to poke around and learn what works, what doesn't.

It was definitely time intensive (took about three months for about 50 lines of code) but it also saved the company a few million dollars in hardware (DNA analysis software to compare a couple TiB of data requires a _lot_ of performance). I have since moved to a different company, partly because I never saw a bonus for saving all that money.

The intrinsics guide does a good job of showing what's available, but it does not do a good job of documenting how each instruction actually works... many intrinsics are missing pseudocode and some pseudocode can have ambiguous cases. I used GDB in assembly mode to compare that table against the register contents instruction-by-instruction to figure out where I misunderstood something if something went awry.

Frustratingly, some operations are available in 64-bits but not bigger, some in 128-bits but not bigger, etc. So I wrote up a rough draft in LibreOffice Calc with 64, 128, and 256 columns to follow the bits around every intended operation. I then correlated against the intrinsics guide to determine what instructions are available to me in what bit sizes. For a given test run, each row in the spreadsheet was colored by what the original data contained, another row for what I needed the answer to be for that test case, then auto-color another row's cell green or red if the register after a candidate set of instructions did or didn't match the desired output. Any time I had to move columns around (the data was 4-bits wide), I'd color a set of 4 columns to follow where they go during swizzling.


I know both gcc (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html) and clang (https://clang.llvm.org/docs/LanguageExtensions.html#vectors-...) have vector extensions that are intended to make it easier to write SIMD code (you can write c = a + b; instead of having to know the specific vector instruction needed to add two vectors, for example), but don’t know how well these reach that goal.
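
For reference, a minimal example of that style (GCC/Clang vector extensions; the names are my own):

    // GCC/Clang vector extensions: plain operators instead of intrinsics.
    typedef float f32x8 __attribute__((vector_size(32)));

    f32x8 fma8(f32x8 a, f32x8 b, f32x8 c) {
        return a * b + c;   // compiles to vector multiply and add instructions
    }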

Are these helpful, or not good enough to write performant vector code?


I’ll try them!


Thanks. It would be simpler if I were working on only one platform that I knew supported a specific set of instructions. But, even though the code does involve some convolutions and things that are a good fit for SIMD, it needs to be cross-platform, so that the intrinsics should compile to SSE/AVX on Intel and NEON (?) on ARM where possible, but to something slow but workable on older chips. Delineating and illustrating using the most cross-platform intrinsics is what I'm looking for guidance on.


Bad news: for SIMD there are no cross-platform intrinsics. Intel intrinsics map directly to SSE/AVX instructions and ARM intrinsics map directly to NEON instructions.

For cross-platform, your best bet is probably https://github.com/VcDevel/std-simd

There's https://eigen.tuxfamily.org/index.php?title=Main_Page But, it's tremendously complicated for anything other than large-scale linear algebra.

And, there's https://github.com/microsoft/DirectXMath But, it has obvious biases :P


I beg to differ :) std::experimental::simd has a very limited set of operations: mostly just math, very few shuffles/swizzles. Last I checked, it also only worked in a recent version of GCC.

We do indeed have cross-platform intrinsics here: github.com/google/highway. Disclosure: I am the main author.


cool; thanks for pointing out your project!

Do you have any advice on how someone limited to c99/c11 can still leverage the wisdom and techniques inside it?


:) Tricky. Is it an option to build some source files with C++, and use C functions (the usual FFI) as the interface between them?


Not really, unfortunately, and it’s a pre-existing framework for teaching a class, so simplicity of compilation is extra important. Also if I try to isolate the SIMD bits in C++ I’ll lose the opportunity to have them be inlined which will defeat the optimization purpose.

For those that are new to this, can you give an example of a kind of computation or algorithm which is well-served by your project, but not possible with vector extensions like https://clang.llvm.org/docs/LanguageExtensions.html#vectors-... ?


> Also if I try to isolate the SIMD bits in C++ I’ll lose the opportunity to have them be inlined which will defeat the optimization purpose.

Agreed. Usually the interface would be something like RunEntireAlgorithm(), not DotProduct().

> For those that are new to this, can you give an example of a kind of computation or algorithm which is well-served by your project but not possible with vector extensions

Sure. Vector extensions are OKish for simple math but JPEG XL includes nontrivial cross-lane operations such as transpose and boundary handling for convolution. __builtin_shufflevector requires a known vector length, and can be pessimized (fusing two into one general all-to-all permute which is more expensive than two simple shuffles).

Also, vqsort (https://github.com/google/highway/tree/master/hwy/contrib/so...) almost entirely consists of operations not supported by the extensions, and actually works out of the box on variable-length RISC-V and SVE, which compiler extensions cannot.


This is very helpful; thank you.


Intel wrote a header that maps NEON intrinsics onto SSE to help people port to x86 Android: https://github.com/intel/ARM_NEON_2_x86_SSE


Just a heads up, as far as I know that’s more of a porting/learning tool than a production tool.

I remember us looking deeply into this and deciding to hand-write the SSE intrinsics. They usually map 1:1, but we had some unexpected differences in algorithm output between the x86 binary and the ARM binary when compiled with this.

But this was also back in 2019 or so, maybe it’s better now!


If you want SSE/AVX, I wrote this article a couple of years ago:

http://const.me/articles/simd/simd.pdf


Thank you. I just started reading this today. My codebase is just C99 or C11, so I'm working through how to un-C++-ify the article; I'm grateful for all the generally useful background info.


Seems like the compiler puts the test for the first loop before loading the constant the first time, and therefore needs to load it again before the second loop. So the "tradeoff" is that if neither loop runs, it will load the constant zero times. Of course this isn't what a human would do, but at least there is some sliver of logic to it. (Like, if vpbroadcastd were a 2000-cycle instruction this pattern might have made sense.)


In this modified version, where the for-loops get converted to do-while loops (possibly unsafe/incorrect behavior), the compiler still emits multiple unnecessary vpbroadcastd's for ymm1:

https://godbolt.org/z/61jYejsra

With the original, for-loop version, it could be argued that in case no loop iteration gets run at all, the vpbroadcastd's can be totally skipped, and to generate extra code for different cases of whether each loop is empty to avoid unnecessary vpbroadcastd's is not worth it (e.g. the greater icache pressure resulting from longer code).

With the do-loop variant, both loops will get at least one iteration so the compiler really could just unconditionally do the vpbroadcastd once before the first loop. It somehow fails to realize that ymm1 did not get clobbered between the two vpbroadcastd's.


My experience with optimizing compilers is that the generated code is often frustratingly close to optimal (given that the source is well written, and taking into account the constraints of the target arch).

It is perfectly reasonable to take a look at the output on Godbolt, tweak it a bit and call it a day.

Maintaining a full assembly language version of the same code is rarely justifiable.

And yet, I understand the itch, especially because there are quite often some low-hanging fruits to grab.


This may be true for scalar code but it seems like the compilers still aren’t quite there with vector code.


For anyone who, like me, needs/wants a refresher primer on vectorization:

https://stackoverflow.com/questions/1422149/what-is-vectoriz...


Moving the constant to file or anonymous namespace scope solves the issue. It's too bad that intrinsics are not `constexpr` because I have a powerful urge to hang a `constinit` in front of this line.
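
Presumably something like this (my reconstruction of the workaround being described, not the exact code):

    #include <immintrin.h>

    namespace {
    // Initialised once at startup (dynamic initialisation, since the intrinsic
    // is not constexpr); the loops below then just reuse this one value.
    const __m256i kC = _mm256_set1_epi32(10001);
    }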


Disturbing that this works, as it shouldn't do the reload even if the constant is passed in as a parameter.


In this particular case the broadcasting instruction can be replaced with builtin operations, allowing constexpr.

https://godbolt.org/z/Td6vG9cqG

edit: uh, the constant requires some hand adjustment

edit2: fixed version https://godbolt.org/z/4Px5Mbsx4, and I just don't get this. gcc really just wants to load that constant twice.
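
(For readers following along, the rewrite presumably looks roughly like this sketch using the compiler's built-in vector type; this is not the exact Godbolt code:)

    // Built-in vector type: the broadcast becomes a plain constant expression.
    typedef int v8si __attribute__((vector_size(32)));
    constexpr v8si kC = {10001, 10001, 10001, 10001, 10001, 10001, 10001, 10001};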


Are you sure? I see this twice in the disassembly on the right:

  mov     eax, 10001
  vpbroadcastd    ymm1, eax


Somewhat related:

Can anyone recommend a (re)introduction to modern vector programming on CPUs? I was last fluent in the SSE2 days, but an awful lot has happened since - and while I did go over the list of modern vector primitives (AVX2, not yet AVX512), what I'm missing is use cases - every such primitive has 4-5 common use cases that are the reason it was included, and I would really like to know what they are...


The optimization here would be CSE or hoisting, or both? I'm guessing the problem is those are performed prior to vectorization.

In other words, I suspect an invariant calculation inside consecutive loops but that is not vectorized will be pulled out of the loops and also moved prior to them and executed just once.


At a guess, constant rematerialisation failing to cross basic block boundaries. Feels like a plausible thing for a heuristic to miss. E.g. sink the constant into the loop so it's available when optimising that block, then fail to hoist it back out afterwards because constant materialisation is cheap.


Intel had an optimizing compiler that was amazing. But unless you were Intel-only, it made life harder, since you had to switch compilers just for that platform.


I periodically wonder whether Itanium or even the Pentium 4 would have been more successful had Intel released icc under an open source license. I'm assuming that they were trying to keep AMD from using it, but I'm still not sure it was worth it, given the number of times I saw scientists avoid it because it was hard to share.


Yeah, I haven't used ICC for 7 years now, but at the time it was much better than clang/gcc at keeping SSE/AVX intrinsic types in registers through function calls (i.e. clang/gcc used to spill out onto the stack and re-load), and things like this in the article.


Were you testing on the same platform? The Microsoft ABI has callee-save XMM registers, whereas the Linux/macOS ABI does not. Regardless, it would be nice if more compilers could do interprocedural register allocation in cases where all callers are known.


I've heard similar opinions - that people could just recompile their software and receive a significant speed boost.


Side note: I really really like the blog theme and the complete lack of bulk on this blog. No react.js, 4500 NPM modules for a SPA or other crap and it loads instantly.

Guess what? It's jQuery.


Maybe it's trying to avoid using SSE in the case where there's no loop? SSE on some older platforms had a cost just from using it, so it might be possible.


This is AVX though, not SSE.


I have a theory that in another decade ML models will do this better than any optimizing compiler -- similar to a hand written chess engine vs alpha go.


Probably a holdover from older assumptions that using/loading a constant is free or cheap enough to be considered free.



