A method which this article doesn't mention is stochastic rounding or probabilistic rounding.
Rather than striving for exact values for every computation, we make it so that the expected value is exact. For example, suppose our number system were restricted to the integers and we have an incoming value of 1.1: with stochastic rounding we'd round down to 1 90% of the time and up to 2 10% of the time, giving an expected value of 1.1!
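As a quick illustration (my own Rust sketch, assuming the rand crate; stochastic_round is just a made-up helper name):

    fn stochastic_round(x: f64) -> i64 {
        // Round up with probability equal to the fractional part, so the
        // expected value of the result equals x.
        let floor = x.floor();
        let frac = x - floor;
        // rand::random::<f64>() is uniform in [0, 1)
        if rand::random::<f64>() < frac {
            floor as i64 + 1
        } else {
            floor as i64
        }
    }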
It's not really related, but that reminds me of how some games handled slow moving objects on the Mattel Intellivision console.
Your code that runs every frame to update the graphics logically wants to do something like this:
X += Vx
Y += Vy
where (X, Y) is the location of the object at the start of the frame, and (Vx, Vy) is x and y velocities of the object in pixels/frame.
To allow for velocities that aren't an integer number of pixels per frame and that are slower than 1 pixel per frame you'd want to actually store X in a fixed point format, say (Xi, Xf) where Xi is the integer part of the X position, and Xf is the fractional part times 256, so X = Xi + Xf/256. Similarly for Y. The object only actually moves on the screen when Xi or Yi changes.
Similarly velocity would also be in that format: Vx = Vxi + Vxf/256, and similar for Vy.
With that, the position update in your loop would be something like this:
Xf += Vxf
if that wrapped
Xi += 1
Xi += Vxi
and similar for Y.
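In Rust-ish terms the 8.8 fixed-point update described above might look like this (my own sketch, not actual Intellivision code):

    struct Object {
        xi: u8, xf: u8,   // X position: integer part, fraction in 1/256ths
        vxi: u8, vxf: u8, // X velocity: integer part, fraction in 1/256ths
        // ... plus the same four bytes for Y
    }

    fn update_x(o: &mut Object) {
        // Add the fractional velocity; a wrap of the fraction byte means we
        // crossed a pixel boundary, so carry into the integer part.
        let (xf, wrapped) = o.xf.overflowing_add(o.vxf);
        o.xf = xf;
        if wrapped {
            o.xi = o.xi.wrapping_add(1);
        }
        o.xi = o.xi.wrapping_add(o.vxi);
    }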
For each object you end up needing 8 bytes (1 byte for each of Xi, Xf, Yi, Yf, Vxi, Vxf, Vyi, Vyf). That doesn't sound like much, but the Intellivision only had something like 240 bytes of RAM available in the console. (If you couldn't get your RAM requirements down to that, it was possible to put extra RAM in the cartridge, but that would raise the cost.)
So someone figured out that you didn't actually need to store Xf and Yf. Just generate them at random as needed! The loop then becomes something like this:
rb = random_unsigned_byte()
if Vxf + rb wraps
Xi += 1
Xi += Vxi
and similar for Y.
That turns out to work quite reasonably. Essentially it interprets Vxf as meaning that the object has a Vxf/256 chance of crossing a pixel boundary on a given frame. Thinking of it that way then suggests getting rid of the add-and-check-for-wrap and just doing a compare instead:
if Vxf > random_unsigned_byte()
Xi += 1
Xi += Vxi
and similar for Y.
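In code form, the whole X update then reduces to something like this (again just my sketch, with random_u8 standing in for whatever cheap RNG the game used):

    fn update_x_stochastic(xi: &mut u8, vxi: u8, vxf: u8, random_u8: impl Fn() -> u8) {
        // vxf/256 is the probability of crossing a pixel boundary this frame,
        // so no fractional position needs to be stored at all.
        if vxf > random_u8() {
            *xi = xi.wrapping_add(1);
        }
        *xi = xi.wrapping_add(vxi);
    }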
Net result: we've cut the RAM for storing position and velocity from 8 bytes per object to 6, at the cost of needing to generate 2 random bytes per object per frame.
Interesting. I know this is a purely academic question as these constraints are a thing of the past, but was it really necessary to have 16 bits for the velocity? Was the ratio between the quickest and the slowest objects really more than 256:1?
As kevin_thibedeau noted, the bit twiddling instructions available were limited. The Intellivision used a GI 1600. Here's a copy of the manual [1] and a Wikipedia article about it [2]. It can shift or rotate by 1 or 2.
But you are right--a game might not need 16 bits for velocity. Suppose a game needed a maximum X velocity of just under 4 pixels/frame and a minimum of 1/64 pixel/frame.
It could have stored velocity in one byte like this: FFFFFFII, where the Fs are the bits of the fractional part of the velocity in units of 1/64 pixels/frame and the Is are the integer part in pixels/frame. By putting the fractional part in the upper bits and the integer part in the lower bits, no shifting is required--just AND and OR.
Then updating X would be something like this:
X += Vx & 0x03
if Vx > (random_unsigned_byte() | 0x03)
X += 1
You probably could even skip the OR. It makes no difference if the integer portion of the velocity is 0. If the integer part is 1, skipping the OR in effect increases the velocity by 1/256 pixels/frame. Similarly, if the integer velocity is 2 or 3, skipping the OR increases it by 2/256 or 3/256 pixels/frame, respectively.
I don't think anyone would notice. For two objects in a race all the way across the screen with the same Vx, with one including the OR and one not, my back of the envelope calculation says that the speed boost would only get that one to the finish line less than half a millisecond earlier.
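For concreteness, here's my reading of that packed one-byte scheme as code (random_u8 again a stand-in; not anything that actually shipped):

    fn update_x_packed(x: &mut u8, vx: u8, random_u8: impl Fn() -> u8) {
        // vx = FFFFFFII: fraction (in 1/64ths) in the top 6 bits, integer
        // pixels in the bottom 2 bits, so only AND, OR and a compare are needed.
        *x = x.wrapping_add(vx & 0x03);   // whole pixels this frame
        if vx > (random_u8() | 0x03) {    // fraction vs. 6 random bits
            *x = x.wrapping_add(1);
        }
    }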
On 8-bit micros without a barrel shifter you'd naturally want to constrain number formats to byte boundaries. Much better to have excess fractional precision than to deal with extra bit twiddling. Even on the 16-bit Intellivision, the scarcity of RAM would drive the use of the smallest practical representation.
I'm aware of fast RNGs, but even compared to HW floating point operations, I believe they're still more expensive than Kahan summation. HW circuits maybe could do well. I see that most of the interest is around LLMs and doing quantized sums (judging from the fact that Intel has shipped this in their accelerator), but this came up in the WiFi positioning code I was working on 10 years ago (we were using "classical" f64).
I found the link to it on HAL, but clicking "Consult the full text" only seems to download the table of contents and preface: https://hal.science/hal-01766584
Another alternative that the author omitted is just casting everything to doubles and summing naively (or vectorized) in double precision, then casting the final result back to a float. Would be curious to see how this compares to the other methods and whether it’s on the Pareto frontier.
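Something like this, presumably (a one-liner sketch on my part, not code from the article):

    fn sum_via_f64(xs: &[f32]) -> f32 {
        // Widen each element to f64, sum in double precision, narrow once at the end.
        xs.iter().map(|&x| x as f64).sum::<f64>() as f32
    }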
I omitted this because you only have this option for f32, not for f64. I only really chose f32 as the focus point of my article because it makes the numbers a bit more readable.
That said, I should have included it, because it is on the Pareto frontier. On my machine it is ~28.9 GB/s, with 0 error (note that this doesn't mean it always produces a correctly-rounded sum, just that it did on this benchmark's test input).
I'll add it to the article tomorrow or the day after.
No, it doesn't. Regardless, without hardware support, adding together intermediate f128s would basically be the same as Kahan summation, except performing even more work to convert the representations around.
I think TFA is too quick to dismiss the fadd_fast intrinsic as "incredibly dangerous". In many cases you do know that your numbers are neither infinities nor NaNs.
I'd be curious to see the performance of hypothetical block_pairwise_addfast and block_kahan_addfast.
It doesn't really make sense for Kahan summation, as the compiler would just reduce it to something like a naive summation, because the error terms would be zero under the assumptions of fadd_fast. 2Sum would also break.
This is exactly what you are losing when using fadd_fast: fine control over the error terms of floating-point operations, which do matter in a lot of cases.
Another thing you are losing is reproducibility: depending on the machine, compiler version, etc., your program will compute differently and may, for example, switch from linear to quadratic error growth when you recompile. That could be the difference between a numerical algorithm converging or not, that kind of thing.
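For reference, the Kahan loop in question looks like this (the standard algorithm, not necessarily the article's exact code). Under fadd_fast the compiler is allowed to treat (t - sum) - y as algebraically zero, which deletes the compensation and leaves a plain naive sum:

    fn kahan_sum(xs: &[f32]) -> f32 {
        let mut sum = 0.0f32;
        let mut c = 0.0f32; // running compensation for lost low-order bits
        for &x in xs {
            let y = x - c;
            let t = sum + y;
            c = (t - sum) - y; // exactly the term fast-math may fold to zero
            sum = t;
        }
        sum
    }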
Why is pairwise summation only 1/5th as fast as naive? I would expect these to be essentially the same speed; there is only one more addition and a logic operation, which is surely negligible...
How does it compare to converting each number to a large fixed-point integer (implemented as N 64-bit integers), summing them with exact integer math (order-invariant), and converting back to floating point at the end?
The sum part can be done with AVX-512 [1] if N=8, and N=8 is probably enough for a lot of real world summations. The conversion back to floating point at the end is only done once so not very costly.
The conversion to fixed point is probably the worst part. Is there a faster way to do it than a loop that extracts one 64-bit integer per iteration? If not we could convert several input numbers at a time with SIMD. But it would be nifty to be able to convert one double to N integers with SIMD instead of a loop.
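A toy version of the idea, collapsing the N limbs into a single i128 for brevity (my own sketch; it assumes every input satisfies |x| < 2^31 with no significant bits below 2^-64 so the conversion is exact, and at most a few billion inputs so the accumulator can't overflow):

    fn exact_sum_limited_range(xs: &[f64]) -> f64 {
        let scale = 2f64.powi(64); // 64 fractional bits
        let mut acc: i128 = 0;
        for &x in xs {
            // Scaling by a power of two only shifts the exponent, so x * scale
            // is exact; the cast to i128 is exact under the range assumption.
            acc += (x * scale) as i128;
        }
        // Integer addition is order-invariant; the only rounding happens here.
        (acc as f64) / scale
    }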
Thanks! Yes, this is exactly it. Interesting about the choice of 32 bits for carries. N=67 is pretty high; I suspect N could be tuned down significantly depending on the problem, e.g. global sums in geophysics models, where the range of values being summed does not span the full double range. But that would require analysis to decide, whereas the full-sized superaccumulator works directly. I wonder how it stacks up in performance against the methods in the blog post.
That works just fine if your numbers are all positive, but if they are a mix of positive and negative values that come in almost-but-not-quite-cancelling pairs, it will fail to give the best answer even if you prioritise by magnitude.
The motivating example is a mess. 15_000_000 would have been a less distracting example, as this one has more-visible problems unrelated to the problem they're trying to solve. (Further, with default options, the opening example won't have the result shown: it will crash your program.)
I don't follow. Why is 15_000_000 less distracting? What problems unrelated to what we're trying to solve? And what 'default options' are you referring to?
The motivating examples read like nonsense to me. (I don't really speak Rust, but I think I'm reading them correctly? I don't know.) They seem to be saying that 1 + 1,000,000 = 1,000,000; or 1 + 100,000,000 is 16,777,216? With no remark? That's not right even for 32-bit floats.
vec![1.0; 1_000_000_000] is Rust notation for a vector that contains 1.0 one billion times. I can understand it's a bit confusing/frustrating if you're unfamiliar with Rust syntax, sorry.
+1 for TwoSum [1]. To expand on that: Kahan summation is an approximation which is a bit cheaper, but not great if some inputs can also be negative. That's because it is basically FastTwoSum(a,b) in a loop, which only works if the exponent of a >= that of b. TwoSum removes this requirement and is not that much more expensive.
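For anyone following along, the two building blocks look roughly like this (standard error-free transformations written from memory, not the article's code); both return (s, e) with a + b == s + e exactly, provided fast-math is off:

    fn fast_two_sum(a: f64, b: f64) -> (f64, f64) {
        // Requires the exponent of a to be >= that of b (e.g. |a| >= |b|).
        let s = a + b;
        let e = b - (s - a);
        (s, e)
    }

    fn two_sum(a: f64, b: f64) -> (f64, f64) {
        // No ordering requirement, at the cost of a few extra operations.
        let s = a + b;
        let bb = s - a;
        let e = (a - (s - bb)) + (b - bb);
        (s, e)
    }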
For what it's worth, with clang you can avoid the no-NaNs-or-infinities problem by using '-fassociative-math -fno-signed-zeros' instead of '-ffast-math'.
If you're going to do something like this, it seems like the TFA's suggestion of bucketizing by exponent and having one accumulator per exponent is cheaper (it doesn't require you to sort "all the way") and more correct (e.g. the input of a billion 1s was already sorted, but naively computing the sum gave an incorrect result).
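A rough sketch of that bucketing idea as I understand it (my own rendering, not the article's code): one f64 accumulator per f32 exponent value, added up from smallest to largest at the end:

    fn bucketed_sum(xs: &[f32]) -> f64 {
        // 8-bit exponent field -> 256 buckets; f64 accumulators give each
        // bucket plenty of headroom before low-order bits are lost.
        let mut buckets = [0.0f64; 256];
        for &x in xs {
            let exp = ((x.to_bits() >> 23) & 0xff) as usize;
            buckets[exp] += x as f64;
        }
        // Add buckets from smallest exponent to largest.
        buckets.iter().sum()
    }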
Sometimes I ponder the concept of a CPU built for rational numbers:
a/b + c/d -> (ad + cb) / bd
Usually in computing someone has already "tried it" (example: ternary computers), but I haven't found stories of rational CPU experiments. Surely some mad scientist tried it at some point. Hoping HN commenters can point me to some history.
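For a sense of what the hardware would have to do, here's the textbook rational add in software (my own toy; it ignores overflow and zero denominators, and overflow is exactly what makes a hardware version hard: numerators and denominators grow fast unless you reduce by the gcd every time):

    fn gcd(mut a: u64, mut b: u64) -> u64 {
        while b != 0 {
            let t = a % b;
            a = b;
            b = t;
        }
        a
    }

    // a/b + c/d = (a*d + c*b) / (b*d), then reduce by the gcd.
    fn add_rat((a, b): (i64, u64), (c, d): (i64, u64)) -> (i64, u64) {
        let num = a * (d as i64) + c * (b as i64);
        let den = b * d;
        let g = gcd(num.unsigned_abs(), den);
        (num / (g as i64), den / g)
    }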
> Rather than striving for exact values for every computation, we make it so that the expected value is exact. For example, suppose our number system were restricted to the integers and we have an incoming value of 1.1: with stochastic rounding we'd round down to 1 90% of the time and up to 2 10% of the time, giving an expected value of 1.1!
Further reading:
https://nhigham.com/2020/07/07/what-is-stochastic-rounding/