Taming floating-point sums (orlp.net)
120 points by todsacerdoti on May 25, 2024 | 57 comments


A method which this article doesn't mention is stochastic rounding or probabilistic rounding.

Rather than striving for exact values for every computation, we make it so that the expected value is exact. For example, suppose our number system were restricted to the integers and we have an incoming value of 1.1. With stochastic rounding we'd round down to 1 90% of the time and up to 2 10% of the time, giving an expected value of 1.1!
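
A minimal sketch of that in Rust, assuming the rand crate for a uniform sample in [0, 1) (the names here are mine):

  fn stochastic_round(x: f64) -> i64 {
      let floor = x.floor();
      let frac = x - floor;            // e.g. 0.1 for x = 1.1
      if rand::random::<f64>() < frac {
          floor as i64 + 1             // round up with probability frac
      } else {
          floor as i64                 // round down otherwise
      }
  }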

Further reading:

https://nhigham.com/2020/07/07/what-is-stochastic-rounding/


It's not really related, but that reminds me of how some games handled slow moving objects on the Mattel Intellivision console.

Your code that runs every frame to update the graphics logically wants to do something like this:

  X += Vx
  Y += Vy
where (X, Y) is the location of the object at the start of the frame, and (Vx, Vy) is x and y velocities of the object in pixels/frame.

To allow for velocities that aren't an integer number of pixels per frame and that are slower than 1 pixel per frame you'd want to actually store X in a fixed point format, say (Xi, Xf) where Xi is the integer part of the X position, and Xf is the fractional part times 256, so X = Xi + Xf/256. Similarly for Y. The object only actually moves on the screen when Xi or Yi changes.

Similarly velocity would also be in that format: Vx = Vxi + Vxf/256, and similar for Vy.

With that, the position update in your loop would be something like this:

  Xf += Vxf
  if that wrapped
    Xi += 1
  Xi += Vxi
and similar for Y.

For each object you end up needing 8 bytes (1 byte for each of Xi, Xf, Yi, Yf, Vxi, Vxf, Vyi, Vyf). That doesn't sound like much but the Intellivision only had something like 240 bytes in the console available. (If you couldn't get your RAM requirements down to that it was possible to have extra RAM in the cartridge but that would raise the cost).

So someone figured out that you didn't actually need to store Xf and Yf. Just generate them at random as needed! The loop then becomes something like this:

  rb = random_unsigned_byte()
  if Vxf + rb wraps
    Xi += 1
  Xi += Vxi
and similar for Y.

That turns out to work quite reasonably. Essentially it is interpreting Vxf as meaning that the object has a Vxf/256 chance of crossing a pixel boundary on a given frame. Thinking of it that way then suggests getting rid of the addition of the random byte with a wrap check and just doing a compare instead:

  if Vxf > random_unsigned_byte()
    Xi += 1
  Xi += Vxi
and similar for Y.

Net result: we've cut the RAM for storing position and velocity from 8 bytes per object to 6, at the cost of needing to generate 2 random bytes per object per frame.
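
As a compact Rust sketch of that comparison version (the names and types here are mine, purely illustrative):

  // Update the integer X position from an integer velocity part and a
  // fractional velocity part in 1/256ths; no fractional position is stored.
  fn update_x(xi: &mut i16, vx_int: u8, vx_frac: u8, random_byte: u8) {
      if vx_frac > random_byte {
          *xi += 1; // a vx_frac/256 chance of crossing a pixel boundary this frame
      }
      *xi += vx_int as i16;
  }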


Interesting. I know this is a purely academic question as these constraints are a thing of the past, but was it really necessary to have 16 bits for the velocity? Was the ratio between the quickest and the slowest objects more than 256 times?


As kevin_thibedeau noted, the bit twiddling instructions available were limited. The Intellivision used a GI CP1600. Here's a copy of the manual [1] and a Wikipedia article about it [2]. It can shift or rotate by 1 or 2.

But you are right--a game might not need 16 bits for velocity. Suppose a game needed a maximum X velocity of just under 4 pixels/frame and a minimum of 1/64 pixel/frame.

It could have stored velocity in one byte like this: FFFFFFII, where the Fs are the bits of the fractional part of the velocity in units of 1/64 pixels/frame and the Is are the integer part in pixels/frame. By putting the fractional part in the upper bits and the integer part in the lower bits, no shifting is required--just AND and OR.

Then updating X would be something like this:

  X += Vx % 0x0003
  if Vx > (random_unsigned_byte() | 0x0003)
    X += 1
You probably could even skip the OR. It makes no difference if the integer portion of the velocity is 0. If the integer part is 1, skipping the OR in effect increases the velocity by 1/256 pixels/frame. Similarly, if the integer velocity is 2 or 3, skipping the OR increases it by 2/256 or 3/256 pixels/frame, respectively.

I don't think anyone would notice. For two objects in a race all the way across the screen with the same Vx, with one including the OR and one not, my back of the envelope calculation says that the speed boost would only get that one to the finish line less than half a millisecond earlier.

[1] http://www.bitsavers.org/components/gi/CP1600/CP-1600_Microp...

[2] https://en.wikipedia.org/wiki/General_Instrument_CP1600#Inst...


> X += Vx % 0x0003

Oops! That should be & not %.


On 8-bit micros without a barrel shifter you'd naturally want to constrain number formats to byte boundaries. Much better to have excess fractional precision than deal with extra bit twiddling. Even with 16-bit Intellivision, the scarcity of RAM would drive the use of the smallest practical representation.


Interesting, I haven't seen that before in the context of numerics.

It is used all the time, however, in audio and image processing, where it is called dithering.



Doesn't that require a call to generate a random number for every floating point number you encounter? That seems expensive...


A lot of hardware has built in RNGs, but even using a software algorithm (eg Xorshift) is extremely inexpensive.

Also, sometimes you’re not limited by processing speed, but by the destination data structure (eg quantized to integers).

https://en.wikipedia.org/wiki/Xorshift
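
For a sense of scale, Marsaglia's xorshift64 is just three shifts and three XORs per 64-bit output (the state must be seeded with any nonzero value):

  fn xorshift64(state: &mut u64) -> u64 {
      *state ^= *state << 13;
      *state ^= *state >> 7;
      *state ^= *state << 17;
      *state
  }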


I'm aware of fast RNGs, but even compared to HW floating-point operations, I believe they're still more expensive than Kahan summation. HW circuits could maybe do well. I see that most of the interest is around LLMs and quantized sums (gathering from the fact that Intel has shipped this in their accelerator), but this came up in the WiFi positioning code I was working on 10 years ago (we were using "classical" f64).


Exact addition using accumulators has progressed since the work of Zhu and Hayes in 2010.

See my paper at https://arxiv.org/abs/1505.05571 (repository at https://gitlab.com/radfordneal/xsum), which is used in the Julia xsum package (https://docs.juliahub.com/General/Xsum/stable/). There's also recent work by Marko Lange at https://web.archive.org/web/20220616031105id_/https://dl.acm...

I think these methods are probably faster than the exact methods you benchmark.


I recommend the "Handbook of Floating-point Arithmetic" if this piques your interest. I think it's freely available on HAL.


I found the link to it on HAL, but clicking "Consult the full text" only seems to download the table of contents and preface: https://hal.science/hal-01766584

A new copy costs $112 on Amazon: https://www.amazon.com/Handbook-Floating-Point-Arithmetic-Je...


Another alternative that the author omitted is just casting everything to doubles and summing naively (or vectorized) in double precision, then casting the final result back to a float. Would be curious to see how this compares to the other methods and whether it’s on the Pareto frontier.
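
Something like this sketch, presumably (scalar here; the vectorized version would chunk the loop):

  // Widen each f32 to f64, sum in double precision, round back to f32 once.
  fn sum_via_f64(xs: &[f32]) -> f32 {
      xs.iter().map(|&x| x as f64).sum::<f64>() as f32
  }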


I omitted this because you only have this option for f32, not for f64. I only really chose f32 as the focus point of my article because it makes the numbers a bit more readable.

That said, I should have included it, because it is on the Pareto frontier. On my machine it is ~28.9 GB/s, with 0 error (note that this doesn't mean it always produces a correctly-rounded sum, just that it did for this benchmark's test input).

I'll add it to the article tomorrow or the day after.


On x86 you have the option of using 80-bit long doubles for the accumulator. Performance is still quite decent. Not sure if Rust supports them though.


You don't get SIMD with this approach though, so it's about a quarter of the speed of using vectorized arithmetic (assuming an AVX vector of doubles).


Does rust have float128 support? This is probably memory bound anyway, so software (rather than hardware) support might be fine(?).


No, it doesn't. Regardless, without hardware support, adding together intermediate f128s would basically be the same as Kahan summation, except performing even more work to convert the representations around.


There is some support behind a feature flag: https://doc.rust-lang.org/nightly/std/primitive.f128.html


Ah, thank you, my understanding was out of date. It looks like these were implemented only these past few months.


On some archs, e.g. non-A100/H100 (and non-A30) NVIDIA GPUs, this is a 1:64 slowdown. Avoiding double precision is crucial there...


I think TFA is too quick to dismiss the fadd_fast intrinsic as "incredibly dangerous". In many cases you do know that your numbers are not infinities nor NaNs.

I'd be curious to see the performance of hypothetical block_pairwise_addfast and block_kahan_addfast.


It doesn't really make sense for Kahan summation, as the compiler would just reduce it to a naive summation because the error terms would be zero under the assumptions of fadd_fast. 2Sum would also break.

This is exactly what you are losing when using fadd_fast: fine control over the error terms of floating-point operations, which do matter in a lot of cases.

Another thing you are losing is reproducibility: depending on the machine, compiler version, etc., your program will compute differently, and may for example switch from linear to quadratic error growth when you recompile. That could be the difference between a numerical algorithm converging or not, that kind of thing.
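
For reference, the error term in question is the c in plain Kahan summation, sketched below; fadd_fast-style reassociation lets the compiler fold (t - sum) - y to zero, which degenerates this into the naive sum.

  fn kahan_sum(xs: &[f32]) -> f32 {
      let (mut sum, mut c) = (0.0f32, 0.0f32); // c carries the rounding error
      for &x in xs {
          let y = x - c;
          let t = sum + y;
          c = (t - sum) - y; // exactly the term a "fast-math" compiler may zero out
          sum = t;
      }
      sum
  }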


Why is pairwise summation 1/5th as fast as naive? I would expect these to be essentially the same speed; there is only one more addition and a logic operation, which is surely negligible.


It's a lot more asm instructions than the tight loop of the naive one, as it's got to track more state (working out the middle of the slice, etc)...

https://godbolt.org/z/917o7oT8r


Aha. So it could be optimized into two tight loops. The linked Wikipedia article on pairwise summation says the NumPy implementation is the same speed as naive.


Numpy also uses blocked pairwise summation, with a block size of 128: https://github.com/numpy/numpy/blob/a6e9dc7152098182b45ecd6e... .
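
Roughly along these lines (a sketch of the idea, not NumPy's actual code):

  // Blocked pairwise summation: blocks of up to 128 elements are summed
  // naively (these inner loops vectorize well), and halves are combined
  // recursively, so the error grows like O(log n) instead of O(n).
  fn pairwise_sum(xs: &[f32]) -> f32 {
      const BLOCK: usize = 128;
      if xs.len() <= BLOCK {
          xs.iter().sum()
      } else {
          let mid = xs.len() / 2;
          pairwise_sum(&xs[..mid]) + pairwise_sum(&xs[mid..])
      }
  }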


How does it compare to converting each number to a large fixed-point integer (implemented as N 64-bit integers), summing them with exact integer math (order-invariant), and converting back to floating point at the end?

The sum part can be done with AVX-512 [1] if N=8, and N=8 is probably enough for a lot of real world summations. The conversion back to floating point at the end is only done once so not very costly.

The conversion to fixed point is probably the worst part. Is there a faster way to do it than a loop that extracts one 64-bit integer per iteration? If not we could convert several input numbers at a time with SIMD. But it would be nifty to be able to convert one double to N integers with SIMD instead of a loop.

[1] http://www.numberworld.org/y-cruncher/internals/addition.htm...
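
As a toy illustration of the shape of the idea only (a single i128 limb with an assumed fixed scale, so it truncates low bits rather than being exact like a real N-limb superaccumulator):

  // Only meaningful for inputs that fit the chosen 2^-32 resolution and whose
  // total stays well inside i128; the real technique sizes N so the conversion
  // and the sum are exact over the entire floating-point range.
  fn fixed_point_sum_toy(xs: &[f64]) -> f64 {
      let scale = (1u64 << 32) as f64;
      let total: i128 = xs.iter().map(|&x| (x * scale) as i128).sum();
      total as f64 / scale
  }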



Thanks! Yes, this is exactly it. Interesting about the choice of 32 bits for carries. N=67 is pretty high. I suspect N could be tuned down significantly depending on the problem, e.g. global sums in geophysics models -- the range of values being summed does not span the full double range. But that would require analysis to decide, while the full-sized superaccumulator works directly. I wonder how it stacks up in terms of performance w/r/t the methods in the blog post.


N=5 should easily cover the single-precision float range. Doubles are trickier; you'd need N=33.


Agreed, if we need exactness for any range of inputs. It should be possible to get by with fewer for real-life problems, for example: https://www.sciencedirect.com/science/article/abs/pii/S01678...


It seems like there is a Rust priority queue (https://doc.rust-lang.org/std/collections/binary_heap/index....). It would be interesting to see a version where you pop off the two smallest values and push the sum back, until there is only one element left.
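
A sketch of that, with a wrapper to make f32 orderable and Reverse to turn the max-heap into a min-heap:

  use std::cmp::{Ordering, Reverse};
  use std::collections::BinaryHeap;

  struct Total(f32); // order f32 by IEEE total ordering so it can live in a heap
  impl PartialEq for Total {
      fn eq(&self, other: &Self) -> bool { self.0.total_cmp(&other.0) == Ordering::Equal }
  }
  impl Eq for Total {}
  impl PartialOrd for Total {
      fn partial_cmp(&self, other: &Self) -> Option<Ordering> { Some(self.cmp(other)) }
  }
  impl Ord for Total {
      fn cmp(&self, other: &Self) -> Ordering { self.0.total_cmp(&other.0) }
  }

  // Repeatedly pop the two smallest values and push their sum back.
  fn heap_sum(xs: &[f32]) -> f32 {
      let mut heap: BinaryHeap<Reverse<Total>> =
          xs.iter().map(|&x| Reverse(Total(x))).collect();
      while heap.len() > 1 {
          let (Reverse(Total(a)), Reverse(Total(b))) =
              (heap.pop().unwrap(), heap.pop().unwrap());
          heap.push(Reverse(Total(a + b)));
      }
      heap.pop().map_or(0.0, |Reverse(Total(x))| x)
  }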


That works just great if your numbers are all positive, but if they are both positive and negative, in almost-but-not-quite-equal pairings, it will fail to give the best answer even if you prioritise by magnitude.


If it works for positives, then you can add all the positives in one bin, add all the negatives in one bin, and finish off with a single subtraction.


That isn't a great method, see here: https://en.wikipedia.org/wiki/Catastrophic_cancellation

I wonder what methods give good results for this sort of input though.


An extension of the article showing the impact of pathological input for each algorithm would be interesting.


The motivating example is a mess. 15_000_000 would have been a less distracting example, as this one has more-visible problems unrelated to the problem they're trying to solve. (Further, with default options, the opening example won't have the result shown: it will crash your program.)


I don't follow. Why is 15_000_000 less distracting? What problems unrelated to what we're trying to solve? And what 'default options' are you referring to?


The motivating examples read like nonsense to me. (I don't really speak Rust, but I think I'm reading them correctly? I don't know.) They seem to be saying that 1 + 1,000,000 = 1,000,000; or 1 + 100,000,000 is 16,777,216? With no remark? That's not right even for 32-bit floats.


vec![1.0; 1_000_000_000] is Rust notation for an array that contains 1.0 one billion times. I can understand it's a bit confusing/frustrating if you're unfamiliar with Rust syntax, sorry.


That makes things make a lot more sense, thanks!

Kind of unfortunate that that syntax is so trivial to misread, but it is what it is.


Sorry, I misread the ; as a ,.


You could also use SIMD to compute N independent Kahan sums in parallel and reduce them at the end.
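
A scalar sketch of that partitioning (8 lanes is an arbitrary choice here; real code would use explicit SIMD types):

  fn lanes_kahan_sum(xs: &[f32]) -> f32 {
      const LANES: usize = 8;
      let mut sums = [0.0f32; LANES];  // one Kahan accumulator per lane
      let mut comps = [0.0f32; LANES]; // per-lane compensation terms
      for (i, &x) in xs.iter().enumerate() {
          let l = i % LANES;
          let y = x - comps[l];
          let t = sums[l] + y;
          comps[l] = (t - sums[l]) - y;
          sums[l] = t;
      }
      // Final reduction of the partial sums, also done with Kahan.
      let (mut s, mut c) = (0.0f32, 0.0f32);
      for &v in &sums {
          let y = v - c;
          let t = s + y;
          c = (t - s) - y;
          s = t;
      }
      s
  }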


I tried this, it was the same speed as orlp_sum with worse accuracy.


How was the final reduction performed? Also Kahan?


Yes, although to be fair I did discard the c values for the final reduction, which I perhaps should've incorporated somehow.

The code from the blog post is all on Github, feel free to try and add/benchmark it yourself: https://github.com/orlp/sum-bench/.


Use doubles, use Kahan summation, use 2Sum algo, use larger precision floating point library?


+1 for TwoSum [1]. To expand on that: Kahan summation is an approximation which is a bit cheaper, but not great if some inputs can also be negative. That's because it is basically FastTwoSum(a,b) in a loop, which only works if the exponent of a >= that of b. TwoSum removes this requirement and is not that much more expensive.

[1] https://en.wikipedia.org/wiki/2Sum
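
For reference, the error-free transform itself is tiny (Knuth's version, as on the linked page):

  // 2Sum: s is the rounded sum and t the exact rounding error, so s + t
  // equals a + b exactly, with no precondition on the magnitudes of a and b.
  fn two_sum(a: f64, b: f64) -> (f64, f64) {
      let s = a + b;
      let a_approx = s - b;
      let b_approx = s - a_approx;
      (s, (a - a_approx) + (b - b_approx))
  }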


For what it's worth, with clang you can avoid the no-NaNs-or-infinities problem by using '-fassociative-math -fno-signed-zeros' instead of '-ffast-math'.


I am quite confused by all the different benchmarks. If I want to add 1000 float64s accurately and as quickly as possible, which method should I use?


Is the code for aaaSum available anywhere?


Isn't sorting (smallest first) then summing another method, distinct from the ones given?


If you're going to do something like this, it seems like TFA's suggestion of bucketizing by exponent and having one accumulator per exponent is cheaper (it doesn't require you to sort "all the way") and more correct (e.g. the input of a billion 1s was already sorted, but naively computing the sum gave an incorrect result).
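
A rough sketch of that bucketing idea (my own rendering, not TFA's actual implementation; the 256 buckets and f64 accumulators are assumptions):

  // One accumulator per f32 exponent value, combined from the smallest
  // exponent upward at the end.
  fn exponent_bucket_sum(xs: &[f32]) -> f32 {
      let mut buckets = [0.0f64; 256];
      for &x in xs {
          let exp = ((x.to_bits() >> 23) & 0xff) as usize;
          buckets[exp] += x as f64;
      }
      buckets.iter().sum::<f64>() as f32
  }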


Sometimes I ponder the concept of a CPU built for rational numbers:

  a/b + c/d -> (ad + cb) / bd

Usually in computing someone has already "tried it" (example: trinary computers), but I haven't found stories of rational-CPU experiments. Surely some mad scientist tried it at some point. Hoping HN commenters can point me to some history.
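
The textbook add itself is trivial to sketch; the hard parts (overflow and reduction to lowest terms) are presumably where such a design gets painful:

  // Toy rational addition per the formula above, with no GCD reduction and
  // no overflow handling.
  fn rat_add((a, b): (i64, i64), (c, d): (i64, i64)) -> (i64, i64) {
      (a * d + c * b, b * d)
  }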



