The author admits this is a contrived demo, but I'm still trying to understand its ostensible logic. The FTZ/DAZ problem is on line 20:
    if (x >= nextafterf(0.0, INFINITY)) {
I think the idea here is: test whether x is large enough that the subsequent ceilf() will give you an integer i_x >= 1 (because the actual array index will be PIXBUF_WIDTH - i_x, and the max array index is PIXBUF_WIDTH-1). With FTZ/DAZ, nextafterf(0.0, INFINITY) == 0, which breaks the test.
But in this scenario, wouldn't you just test "if (x > 0)", since that implies that ceilf(x) must be >= 1?
Thank you for creating and sharing this example. I think the fact that merely loading a shared library modifies the FP behavior is the real shocker, more than any particular FP computation that hinges on denormals.
Still, I think one of the great wins with denormals is that they make "x == y iff x-y == 0" true; absent denormals you can have two normalized x and y, x != y, such that y-x evaluates to 0. So for an example, maybe something that involves y > x and yet ceil(y-x) == 0, which is surprising. The only setting I can think of is something like making a histogram (with flipped indices, like your pixel row) of differences of values, so that ceil(y-x) == 0 leads to an invalid histogram index.
Some values are x=1e-32 and y=1.0000011e-32, which parse (via sscanf) to:
    x = 0:00010100:10011111011000100011111 = (-1)^0 * 2^(20-127) * (1+5222687/2^23)
    y = 0:00010100:10011111011000100101110 = (-1)^0 * 2^(20-127) * (1+5222702/2^23)

and then the denormalized difference d = y-x should be:

    d = 0:00000000:11110000000000000000000 = (-1)^0 * 2^(1-127) * (0+7864320/2^23)
Having some specific values sidesteps concerns about nextafterf(). In this case y is many ulps above x, not just 1 (15, going by the mantissas above).
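A sketch of that histogram scenario with these exact values (NBINS and the indexing are made up for illustration): built normally, d stays denormal and the index is in range; built with -ffast-math (whose startup object sets FTZ/DAZ), d flushes to zero, ceilf(d) is 0, and the index lands one past the end.

    #include <math.h>
    #include <stdio.h>

    #define NBINS 16   /* hypothetical histogram size */

    int main(void)
    {
        volatile float x = 1e-32f, y = 1.0000011e-32f;  /* x != y, 15 ulps apart */
        float d = y - x;                /* denormal; flushed to 0.0f under FTZ */
        int i = NBINS - (int)ceilf(d);  /* flipped index; needs ceilf(d) >= 1 */
        printf("d = %g, index = %d (valid range 0..%d)\n", d, i, NBINS - 1);
        return 0;
    }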
If I understand correctly, the nextafterf implementation in crtfastmath.o linked in by the empty file compiled with -ffast-math does not handle infinities, so nextafterf(0.0, INFINITY) returns 0.0 (?) and thus i_x = 0 (whereas it should be 1 at minimum since we access PIXBUF_WIDTH - i_x).
Not quite. nextafterf() is trying to return the smallest float after 0.0 in the direction of infinity (infinity is still a valid value even when the FPU is configured to flush denormal numbers to zero; infinity isn't a denormal). But the smallest float after zero is denormal, so when a shared library built with -ffast-math is loaded (which sets the FPU's FTZ and DAZ bits), denormal numbers get forced to zero: nextafterf returns 0.0, the comparison becomes (x >= 0.0), x=0.0 passes the check, and ceilf returns 0.0 rather than 1.0.
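For reference, the constructor in crtfastmath.o amounts to setting those two MXCSR bits; a minimal sketch of the same thing with the SSE intrinsics (FTZ is bit 15, DAZ is bit 6):

    #include <xmmintrin.h>

    /* Roughly what crtfastmath.o's startup hook does: set the
     * Flush-To-Zero and Denormals-Are-Zero bits in MXCSR. */
    __attribute__((constructor))
    static void set_fast_math(void)
    {
        _mm_setcsr(_mm_getcsr() | 0x8040);  /* 0x8000 = FTZ, 0x0040 = DAZ */
    }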
`nextafterf` does not in fact return 0. FTZ flushes denormals to zero when they are the result of an FP calculation, but `nextafter` is implemented using integer arithmetic, because it is much easier: simply add 1 as an integer, with special handling for Inf/NaN/-0.
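A minimal sketch of that integer trick, ignoring the Inf/NaN/-0 special cases a real implementation needs and only stepping a non-negative finite float toward +infinity:

    #include <stdint.h>
    #include <string.h>

    static float next_up(float x)   /* assumes x >= +0.0f and finite */
    {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);
        bits += 1;                  /* 0x00000000 -> 0x00000001: smallest denormal */
        memcpy(&x, &bits, sizeof x);
        return x;                   /* no FP arithmetic happens anywhere here */
    }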
It's only later that it gets treated as zero by DAZ: in the FP comparison, and then the call to ceil(), both of which use the FPU.
edit: confirmed, you can see it in action on this gist. When compiled with -ffast-math, the denormal does not compare greater than zero (it is flushed to zero by the comparison), but its bit pattern is the integer value 1, which is nextafter(0).
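Roughly, the experiment looks like this (a sketch along the same lines, not the gist itself); compile once plainly and once with -ffast-math:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        volatile float zero = 0.0f;            /* volatile blocks constant folding */
        float d = nextafterf(zero, INFINITY);
        uint32_t bits;
        memcpy(&bits, &d, sizeof bits);
        printf("bits  = 0x%08x\n", (unsigned)bits);  /* 0x00000001 either way */
        printf("d > 0 = %d\n", d > zero);            /* 1 normally, 0 under FTZ/DAZ */
        return 0;
    }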
I don't think this is correct. The return value of nextafterf is a float, and the ABI requires those to be returned in xmm0. So although nextafterf uses integers internally, the flush to zero happens inside nextafterf when it moves eax into xmm0; even though eax is 1, xmm0 becomes 0.0. You can see that by stepping through it in gdb.
Denormals are not flushed to zero merely by `mov xmm,rax` - they can't be, since xmm is also used for integer arithmetic, and the CPU doesn't "know" whether it's an FP or integer value at the point of the mov.
Your gist demonstrates this: notice in the bottom right that xmm0 contains `v2_int64 = {0x1, 0x0}` so the bottom half containing the result is nonzero. (The top half is zero as it's unused).
The "info" command rounds the value to 0 for compactness, but if you run `p $xmm0.v2_double` you will also see it is nonzero when interpreted as a double (I get 4.94e-324).
Ideally, IEEE 754 would define behavior when subnormals are flushed, and would define nextUp(0) to be the smallest representable value (i.e. min normal for FTZ/DAZ modes).
Unfortunately, the committee has never been willing to define behavior for non-standard arithmetic, despite multiple people (Ian Ollmann has suggested it a few times, and others have as well) requesting it over the years.
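For illustration, a hypothetical mode-aware nextUp(0) along the lines suggested above, keying off the DAZ bit in MXCSR (nothing standard specifies this behavior):

    #include <float.h>
    #include <xmmintrin.h>

    static float next_up_from_zero(void)
    {
        if (_mm_getcsr() & 0x0040)   /* DAZ set: denormals read as zero */
            return FLT_MIN;          /* 2^-126, smallest normal */
        return FLT_TRUE_MIN;         /* 2^-149, smallest denormal (C11) */
    }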
I agree that IEEE 754 should define how FTZ/DAZ should work, given that it's a pretty common extension to floating-point processors, and language committees are unlikely to lift a finger to do anything floating-point related that isn't from IEEE 754.
Would it be fair to call this a bug in nextafterf? If this FPU flag is enabled, it would seem like returning a denormal here is incorrect, as it isn't representable on this CPU in this configuration.
None of the relevant standards say anything about how math functions are supposed to behave in the non-standard DAZ/FTZ modes (because they're non-standard). The second you set the bit, you step off the path of being able to complain about any behavior not documented by the library.
I do find it a bit odd that the floating-point environment control flags are usually pretty well hidden away in most programming languages; I feel those are something that could be more clearly exposed, maybe even as a first-class language feature. But even just having some RAII-style scoped environments could be neat.
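Plain C can at least approximate the scoped version with the standard <fenv.h> interface; a minimal sketch of what an RAII wrapper would automate:

    #include <fenv.h>

    /* Run fn() under the default FP environment (no FTZ/DAZ, default
     * rounding), restoring the caller's environment afterwards. On
     * x86-64 glibc, fenv_t covers the x87 control word and MXCSR. */
    static void with_default_fp_env(void (*fn)(void))
    {
        fenv_t saved;
        fegetenv(&saved);
        fesetenv(FE_DFL_ENV);
        fn();
        fesetenv(&saved);
    }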
IMO if it were properly exposed in C, or even as a GNU extension, then "fast math" would be part of the function type.
1. A non-fast-math function calling a fast-math function (or the other way around) should result in the floating-point environment being changed on entering the function and reset on leaving it (sketched below).
2. A function calling another function of the same kind shouldn't touch the floating-point environment.
3. Function pointers to different kinds of functions shouldn't be convertible to each other (that is, `void ()() __attribute__((fast_math))` and `void ()()` are unrelated types).
As for a compiler flag altering the default "fast_math" or "no_fast_math" attribute, I'm not sure; it's probably a bad idea. Libraries that want to make this configurable can always do their own macro.
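For item 1, the compiler would essentially lower a cross-kind call into something like this (hand-written sketch; the attribute itself doesn't exist, and the 0x8040 MXCSR trick is x86-specific):

    #include <fenv.h>
    #include <xmmintrin.h>

    void render_fast(void) { /* imagine this is a fast-math function */ }

    void caller(void)        /* an ordinary, non-fast-math function */
    {
        /* what the compiler would insert around the cross-kind call: */
        fenv_t saved;
        fegetenv(&saved);                    /* save caller's environment */
        _mm_setcsr(_mm_getcsr() | 0x8040);   /* enter FTZ/DAZ mode */
        render_fast();
        fesetenv(&saved);                    /* restore on return */
    }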
Altering the floating point environment for everybody in a global constructor within the library is just horrible.
You can change floating point modes dynamically, and it's commonly done, and then it doesn't matter if the library was dynamically or statically linked.
Depending on the implementation, some green-thread runtimes will save the MXCSR register during context switches (IIRC ucontext does this on Linux), but some "fast" context-switching libraries intentionally skip it to save cycles, since a full ucontext switch involves a syscall.
The problem is a constructor in the .so being run as it's loaded. Without expensive checks, there is no way to guarantee state stays safe between calls to functions loaded from shared libraries, or at the point they're linked. The point is that state gets mutated simply because we rely on shared libraries and allow them to mutate that state when they're loaded.
This is not a problem with statically linked libraries. And also note, I made a point of talking about shared libraries - not dynamic linking. They're not equivalent. The issue here is that the libraries are shared and do bad things, not that they're loaded dynamically. It would be fine if the library wasn't shared!
Eric Holder ruined the phrase, "Fast and Furious," for me. Now, whenever I see that phrase, I am reminded of gun-running US attorneys held in Contempt of Congress.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522