The Fundamental Axiom of Floating Point Arithmetic (johnbcoughlin.com)
172 points by johnbcoughlin on Aug 5, 2020 | 60 comments


This “fundamental axiom” is not true:

”This means that for any two floating point numbers, say x and y, any operation involving them will give a floating point result which is within a factor of 1+ϵ of the true result.”

It fails for x-y when x and y are close. Subtraction is the easiest way to get big floating point errors. The article does mention this towards the end but you need to wade through a lot of equations before you get told this very important caveat.

The more tricky way to accumulate errors is to add lots of small numbers to a big number. Just summing a list of floats can be fraught with peril! Kahan’s summation algorithm is enlightening: https://en.wikipedia.org/wiki/Kahan_summation_algorithm
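
For reference, a minimal C sketch of Kahan summation (the textbook formulation, not anything from the article):

    #include <stddef.h>

    /* Kahan (compensated) summation: carry the low-order bits lost in each
       addition in a separate compensation term c. */
    double kahan_sum(const double *a, size_t n)
    {
        double sum = 0.0, c = 0.0;
        for (size_t i = 0; i < n; i++) {
            double y = a[i] - c;     /* apply the correction from the last step */
            double t = sum + y;      /* big + small: low bits of y get lost here */
            c = (t - sum) - y;       /* recover what was lost (algebraically zero) */
            sum = t;
        }
        return sum;
    }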


"It fails for x-y when x and y are close"

I don't think that's quite right. Each individual operation is accurate within epsilon, the trouble with cancellation is that the error from operations prior to the subtraction is much larger than the result of the subtraction.

For example, if I do 1 + 2^30 - 2^30 in single precision I get zero. But each individual operation is correctly rounded: 1 + 2^30 rounds to 2^30, so the error is 1, which is less than 2^30 * epsilon. Then 2^30 - 2^30 = 0, which is exactly correct.
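
A quick C sketch of that, assuming IEEE single-precision float, for anyone who wants to try it:

    #include <stdio.h>

    int main(void)
    {
        float x = 1.0f + 1073741824.0f;        /* 1 + 2^30 rounds to 2^30 */
        printf("%g\n", x - 1073741824.0f);     /* prints 0; both steps were correctly rounded */
        return 0;
    }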


Yes. The big errors appear when you remember x is actually (1+ε)x, as explained towards the end of the article. If you assume the inputs are exact then you get an accurate result, but if you are doing error analysis that does not help you.


This is the right answer. Subtraction is accurate, as accurate as it can be given its inputs. It’s everything (combined) leading up to it that isn’t.

Every year I struggle to teach this simple concept in class :-) :-(


This is due to the non-associative nature of FP. (a+b) + c != a + (b+c).

I disagree that the "axiom" as stated is fundamental. My main argument with it is that I don't see an easy way to go from the axiom to usable theorems about FP.

Happy to be proven wrong on this, but I don't see it at this time.


I'd say not only is it fundamental, it's pretty much the only tool you have to start the analysis of an FP algorithm. If you want to analyze a naive sum algorithm for instance, you do that by recursively applying the a ⊕ b = (a + b)(1 + ε) rule and figuring out how the epsilons bubble up - the result being that the error is linear in the sum length.
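
To make the recursion concrete, here's the expansion for a three-term sum, dropping terms of order ε²:

    (a ⊕ b) ⊕ c = ((a + b)(1 + ε1) + c)(1 + ε2)
                ≈ (a + b + c) + (a + b)(ε1 + ε2) + c·ε2

Each partial sum picks up its own (1 + ε) factor, which is where the bound linear in the number of terms comes from.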


Yep, the term for this is catastrophic cancellation, which doesn’t invalidate the axiom.


> Kahan’s summation algorithm

Radford Neal (2015) has an approach to computing sums exactly (to the correctly rounded floating point number) in less than twice the time it takes to do it naively. In some situations it can do work in parallel and saturate memory bandwidth, so it takes roughly the same time: http://www.cs.toronto.edu/~radford/xsum.abstract.html

"Kahan’s (1965) method for reducing summation error (but without producing the exact result) was also tested on all systems. On modern 64-bit processors, computing the exact sum with the large superaccumulator method was faster than Kahan’s method for summations of more than about 1000 terms. Kahan’s method was significantly faster than the small superaccumulator method only for summations of less than about 100 terms."


I think you're misinterpreting the fundamental axiom, because it's still true for `x-y`. What you're thinking is that the fundamental axiom applies to the composition of operations `closest_float(x) + closest_float(y)`, but the axiom only applies to each individual operation.

The axiom can be stated more rigorously: "If f is a (possibly multivariate) function on the set of floats to the reals, the IEEE-compliant approximation of f must give the closest float for all float values of its arguments."


> "If f is a (possibly multivariate) function on the set of floats to the reals, the IEEE-compliant approximation of f must give the closest float for all float values of its arguments."

It should be noted that this is not exactly true for some IEEE functions, due to the table-maker's dilemma: computing the "closest float approximation" to an arbitrary function can be computationally infeasible if the true result is close to being halfway in-between two consecutive floats. (This applies to the default rounding mode, but can be extended by analogy to other rounding modes.)


They're correct that the "In mathematical terms" version of it is wrong (even for addition, if one is negative). It's missing fl()'s around the other a's and b's. Your statement is greatly superior to the poor man's version presented in the article.


Kahan summation is unfortunately slower than naive summation, which can be problematic. Pairwise summation [1], whose roundoff error is proportional to log n, can be used as a drop-in replacement for naive summation since it performs more or less equally well.

[1] https://en.wikipedia.org/wiki/Pairwise_summation
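
A minimal C sketch of pairwise summation (the base-case cutoff of 8 is arbitrary; real implementations use a larger block and a plain loop there):

    #include <stddef.h>

    /* Pairwise (cascade) summation: split, sum the halves recursively, add.
       Worst-case roundoff grows like O(log n) instead of O(n). */
    double pairwise_sum(const double *a, size_t n)
    {
        if (n <= 8) {                 /* small base case: plain loop */
            double s = 0.0;
            for (size_t i = 0; i < n; i++)
                s += a[i];
            return s;
        }
        size_t m = n / 2;
        return pairwise_sum(a, m) + pairwise_sum(a + m, n - m);
    }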


That's not true. The axiom is correct. Say you're using double (i.e., you have 16 decimal digits of accuracy, roughly) and you are doing ((1.0 + 1e-20) - 1.0).

That becomes (1.0 - 1.0) = 0.0 because 1e-20 was dropped during the addition. The subtraction was perfectly accurate given its inputs (1.0 - 1.0 is exactly 0.0).

The issue is that float(1.0 + 1e-20) = 1.0. But that's also very accurate! (All 16 decimal digits of 1.0 are correct.) But yeah, the 20th (decimal) digit got dropped.

You see how each individual computation is maximally accurate. So the axiom is correct. What it boils down to is that you are manipulating (adding/subtracting) numbers of very different magnitudes. So cancellations, which would be negligible if all numbers were of similar magnitude, become catastrophic. Hence the name catastrophic cancellation.
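
A two-line C demonstration, if you want to watch it happen:

    #include <stdio.h>

    int main(void)
    {
        double a = 1.0 + 1e-20;        /* rounds to exactly 1.0 */
        printf("%.17g\n", a - 1.0);    /* prints 0, not 1e-20; each step is correctly rounded */
        return 0;
    }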


The key is the framing; as stated, x and y are _exact_ floating-point values, which makes the statement true[1]. In fact, it's embarrassingly trivial, since it is just a weakening of the definition of rounding used by IEEE 754. In particular, if cancellation occurs in x - y, then the result is exact (Sterbenz' lemma).

[1] Under the assumption that no underflow or overflow occurs (which is the usual framing for this sort of naive error analysis). It's easy enough to correct for underflow by adding an absolute error term as well.


In other words, while we all know that addition and subtraction over real numbers is associative, addition and subtraction over floating point numbers is not.


> Subtraction is the easiest way to get big floating point errors.

That's why most modern floating point representations allow subnormal numbers: without them, subtracting two nearby normal numbers close to the underflow threshold can flush to zero even though the numbers aren't equal.


> It fails for x-y when x and y are close. Subtraction is the easiest way to get big floating point errors.

And note: addition might be subtraction, while subtraction might be addition.

Consider x=1.0 and y=-0.99999.

x + y = 0.00001

You did addition, but you got cancellation error. The only solution is to sort the numbers from largest to smallest dynamically during runtime.
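
A rough C sketch of that example; the printed relative error comes out on the order of 1e-12, several orders of magnitude above double's epsilon of about 2.2e-16:

    #include <stdio.h>

    int main(void)
    {
        double x = 1.0, y = -0.99999;     /* y is only stored approximately */
        double sum = x + y;               /* an addition that is really a subtraction */
        printf("sum     = %.17g\n", sum);
        printf("rel err = %.3g\n", (sum - 1e-5) / 1e-5);   /* roughly 1e-12, not 1e-16 */
        return 0;
    }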


I was recently thinking about writing something about fixed point vs floating point.

The numbers that you can represent with floating point are exponentially distributed. Floating point is well-behaved when you multiply things, and it maintains a good relative error. Addition works best when you add similar numbers. Overflow is almost impossible, but it's easy to get loss of significance.

The numbers that you can represent with fixed point are uniformly distributed. Fixed point is well-behaved when you add things, and it maintains a good absolute error. Loss of significance doesn't happen, but it's easy to get an overflow.

In both systems it's possible to accumulate error, but under normal circumstances the approximation error should average out. Fixed point makes it easier to reason about the error, floating point is a bit harder, but usually works well in practice (with some exceptions -- for example, if you add a huge array of small values you will run into trouble due to the loss of significance).
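
A tiny C illustration of that last failure mode, using single precision so it shows up quickly:

    #include <stdio.h>

    int main(void)
    {
        float big = 100000000.0f;        /* 1e8 is exactly representable as a float */
        for (int i = 0; i < 1000; i++)
            big += 1.0f;                 /* 1.0f is below half an ULP of big, so it is lost each time */
        printf("%.1f\n", big);           /* prints 100000000.0, not 100001000.0 */
        return 0;
    }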


The big problem with fixed point is reciprocals are really bad. If you have a fixed point format that goes from 0 to 1000, then 0.0125% of the numbers are below 1/8, whereas 99.2% of the numbers are above 8.

Obviously, for the reciprocal to be accurate, there needs to be the same number of representable values above 1 as below 1, like there is for floats.


Good point, indeed. Fixed point works best for multiplication with numbers that are approximately 1 in magnitude.


Fixed point also has the advantage that you can do it quickly without needing a floating point unit in your CPU. That means you can do it on microcontrollers, and you can do it in the kernel without needing to save extra user context on kernel entry (this is only relevant on architectures that have separate floating point registers).


I'm not sure that fixed point actually makes it easier to reason about the error; it's just that fixed point isn't the default these days, so you have to choose to use fixed point. That means, when you do use it, it likely suits the problem you need to solve.


I have met a surprisingly large number of people who think that fixed point arithmetic is the solution to all of their problems (probably because its use is not widespread).

The big problem with fixed point, which makes it a bad choice for most scientific computation (which is where the design decisions for the currently dominant system came from), is its lack of resolution.

You can represent only a comparatively small interval of numbers, and a few divisions or multiplications will quickly lead to large errors in the output (even if you do not hit underflow or overflow).


> I have meet a surprisingly large number of people who think that fixed point arithmetic is the solution to all of their problems

>a bad choice for most scientific computation

This is weird. Scientific computation is at most a niche.

Financial and real-world decimal calculations are the norm.

That is why a lot of people could be served better by the use of decimals or fixed point.


>Financial and real-world decimal calculations is the norm.

A significant amount of financial calculation is not in the nature of 'sum this column of numbers'. Most computation in e.g. derivative pricing has a lot more to do with scientific computing than it does with accountancy.


It might just be me: I work with people doing simulation and ML, not game or web dev.

People doing financial computations are using fixed-point for very good reasons but, other than them, the scientific people are the only ones that I know of who care deeply about the results of their numerical computations.


I'm far from a floating point expert, but the fact that the article does everything in terms of epsilons, without mentioning ULPs a single time confuses me.

In fact, that fundamental axiom of FP arithmetic is NOT in terms of some absolute value epsilonMachine. As I understand it, it says that there exists a relative epsilon value (which depends on your input numbers), and that the error of any FP operation on those numbers will be bounded by that epsilon.

What the article doesn't explain is that floating point numbers and operations are guaranteed to be accurate to the nearest half-ULP, not to the nearest absolute epsilon. And what exactly is the absolute value of an ULP? It depends!

The difference is so fundamental, I'm not sure you can have a useful mental model of floats if you don't start by addressing this.


The epsilon model is perfectly adequate, and does not assume epsilon is constant on a given machine (the article glosses over this in the first paragraph where epsilon is introduced, but it is spelled out later).

In fact the epsilon model is key to being able to do a mathematically formulated error analysis at all.


Would you mind expanding the `ULP` acronym you’re using: if it’s so fundamental to understanding this, I’d like to know what it is.


My apologies: ULP stands for "Unit in the Last Place".

The simple explanation is that an ULP is defined as the difference between two consecutive floating point numbers, which means it isn't a fixed value.

Another "intuitive" explanation, is that if you increment the binary representation of a float, the difference between the two numbers is 1 ULP.

(I'd explain more, but I have to run to a meeting! More info on Wikipedia: https://en.wikipedia.org/wiki/Unit_in_the_last_place)


Instead of calling it "ULPs" they refer to a "next floating point number" which I'm pretty sure is the same idea:

> Unlike the number line, for every representable floating point number x, there is a _next_ floating point number, and it is approximately x + ϵmachine x. So for an arbitrary real number x0, it falls between two floating point values x and x+ϵx (leaving off the subscript for conciseness). When we represent x0 in floating point, we will get one of these two values.


Machine epsilon is fine for doing error analysis. For example, I used machine epsilon instead of ULP for an article I recently wrote which walks through implementation and (numerical) analysis of various algorithms for e^x: https://www.pseudorandom.com/implementing-exp

So you don't need ULP (and if you did, you could convert to ULP from machine epsilon). But you are right that machine epsilon varies according to where you are in the interval of floating point numbers. Some authors don't mention that, but yes, since floating point values are not evenly distributed the machine epsilon for any particular x will vary. This is good to know, because even if error analysis demonstrates that your approximation function is bounded by machine epsilon, it's still going to exhibit less accuracy as you move away from the origin.


But multiplying by (1+e) is relative. I guess it's much more convenient mathematically than dealing with the absolute error of the calculation.


The equations are well integrated into the article and match its font very well. I thought it was some special font+math library, but it just turned out to be Georgia and MathJax with tasteful design choices.


As someone who spends most of my time at work writing firmware for embedded systems, I learned VERY quickly never to use floating point numbers when I could use fixed point numbers. Better to use an integer number of thousandths of a thing than try to use floating point numbers and have error. This mindset is clearly common, because it seems like the vast majority of microcontrollers have no floating point hardware anyway.

At some point I was listening to a talk where the guy called floating point numbers _"The incredibly misleading numbers that never really work the way that you want them to."_

That's probably the best description of floating point numbers I've ever heard.


My personal fundamental axiom of floating point numbers is that they're like a physical measurement of a continuous quantity. Examples:

1 + 2^30 - 2^30 = 0 is like adding a drop of water to the ocean then removing an ocean's worth of water. You're not going to be able to meter that accurately enough to leave the one drop behind.

Round-trip conversions from decimal to FP and back result in a slightly different value. Why do you expect special treatment for a number system in any particular base? Nature's physical quantities aren't in any base. And why do you care about the 17th significant figure? You can't measure that with common scientific equipment.

Cumulative errors. They're like tolerance stacking. Engineers define dimensions from a common datum instead of in a chain from each other because their physical measurements accumulate error like floating point does.

They don't allow extreme values. You're never going to need them because you can't find a 10^400 m or 10^-400 m long object in nature either.

Dollars and cents don't work right. Money isn't a continuous quantity.


> All floating point arithmetic operations are exact up to a relative error of [machine epsilon].

What about subnormal/denormal operands?


Exactly. This axiom doesn't hold for denormals, which probably don't matter for finance but really matter for calculations that regress numbers over many iterations.

OT: while obviously all fixed-precision floating point must compromise on the representation of real numbers, Posits give you more precision for the same bits, behave far more sanely (e.g. they don't round to zero or Inf), and their error bounds are much easier to understand.

EDIT: capitalize Posits


> Posits give you more precision for the same bits

Independent papers doing some validation on Posit computations are starting to come out and... it is not better, just a different trade-off. Overall it seems to be worth it for 16-bit precision (where floating point is very fragile) but highly debatable for 32 and 64 bits.

Note that many papers claiming good properties for Posits also introduce what they call a Quire, which is... an exact dot product algorithm encapsulated in an object (something that is also easily implemented with traditional floating point). Comparing a naive sum in IEEE floating point with an exact summation in Posit is an apples-to-oranges comparison (the algorithms are different) when you want to evaluate the precision of a numeric type.


> At Square, we used a long amount_cents, and we got along with our lives

how does this help if you need to divide or calculate sales tax? aren't you back in the accumulating error trap?


Accounting practice generally allows or requires rounding to some quantum value at each operation, like $0.001. The issue with using floating point for money isn't that it has errors, it's that those errors don't match the ones you get when using a paper ledger.


Firstly, if you treat floating-point without the proper care, that will still be the case if you use an integer for all currency amounts (as a number of cents) but reach for floating-point for percentage calculations and such.

Secondly, you can implement pencil-and-paper rounding rules using binary floating point! It's done by serializing to decimal text, working with the digits (perhaps as integers) and then converting back to floating point.

> those errors don’t match the ones you get when using a paper ledger.

If you have errors in your paper ledger, you're not much of an accountant.


> If you have errors in your paper ledger, you're not much of an accountant.

That’s because accounting practice has solidified certain inevitable inaccuracies as “the way things are done.” Any method that doesn’t produce the same accounting-correct but mathematically-imprecise results is “wrong,” and those inaccuracies are based on a fixed-point model.


Great article, and a wonderful explanation as to why one should avoid (at all costs) doing financial calculations using IEEE-754 floating point math.

> ... where each \epsilon is very small, on the other of \epsilon_machine.

Not sure if author is reading replies here, but I think "other" should be "order."


That depends on the type of the financial calculations you're doing. Certainly accounting and billing shouldn't be done in binary floating point formats. Financial modelling and forecasting calculations aren't exact calculations, and the intermediate library functions for binary floating point may suffer less rounding error than practical alternatives. Also, computational efficiency comes into play once you're at the scale of spending millions of dollars per year on hardware and electricity for financial modelling.

Also, since IEEE-754-2008, there are decimal IEEE-754 floating point numbers, which are useful for some types of financial calculations.


... except that every financial trading system I've ever seen is full of floats (and I've seen a lot).


Many years back I worked on a project involved in heavily government-regulated gambling (horse race betting). The rules were that every calculation, including all intermediate steps, had to be done to at least 1/10,000th-of-a-cent precision, and rounding was only allowed once, at the very final step.

(Which, since my part of that project was a Javascript/jQueryMobile mobile webapp, I managed to push all that to the backend team saying "I'm doing calculations using Javascript, I don't even have proper integers to work with here - you'll need to send me all financial figures as strings, and do all the calculations at your end according to the rules. I can accurately display strings, I will not look a judge in the eye and try to claim I can accurately calculate _anything_ monetary in Javascript.")


IEEE 754 also provides for decimal radix floating point; it's just that there is no real hardware support for it beyond some IBM and Fujitsu chips, and what you can put together on FPGAs.


I wonder if it makes sense to treat floating point numbers not as single points on the number line but as ranges. This way, when you subtract two similar numbers, the result is a range that is wide in relation to its distance from 0, explicitly encoding the larger relative rounding error.


That exists and is called interval arithmetic.
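
A bare-bones sketch of the idea in C (illustrative only; a real interval library must round lower bounds down and upper bounds up, e.g. with fesetround):

    #include <stdio.h>

    typedef struct { double lo, hi; } interval;

    /* Subtract intervals: the result must contain x - y for every x in a, y in b. */
    static interval isub(interval a, interval b)
    {
        interval r = { a.lo - b.hi, a.hi - b.lo };
        return r;
    }

    int main(void)
    {
        interval x = { 0.9999999, 1.0000001 };   /* "1.0, known to about 1e-7 relative" */
        interval y = { 0.9999998, 1.0000000 };
        interval d = isub(x, y);
        printf("[%g, %g]\n", d.lo, d.hi);        /* roughly [-1e-07, 3e-07]: the width now dwarfs the midpoint */
        return 0;
    }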


There are even more improvements (in some regard) to this, e.g. Affine Arithmetic[0, 1]. Here, error terms from different operations are accumulated symbolically so that in the case where they would cancel out (as could be useful for numerical algorithms), tighter bounds can be obtained. There are several research groups working on improving such analyses.

--

[0] - https://en.wikipedia.org/wiki/Affine_arithmetic

[1] - http://users.ece.cmu.edu/~rutenbar/pdf/rutenbar-intervalicas...

edit: formatting


And the problem with it is that it can lead to [-inf;+inf] intervals on sane computations such as Newton's method (refinements have been introduced, but nothing that fully solves the problem).


I started to teach my 12-year-old son programming, and when we started to talk about floats, his reaction to learning that computers can't do simple math with fractions properly was pretty astounding. I should've started him on Lisp instead.


I think this is a poor take on floating point numbers. Read the classic article it references instead [1]. The IEEE guarantee is not that the result is accurate to a relative epsilon but instead is the closest representable number to the true result. This is a clear guarantee but requires the reader to actually know how floating point numbers are stored (which this article fails to say!). It is also rather subtle and it needs to be explored in more depth to be of any use.

[1] https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf


> The IEEE guarantee is not that the result is accurate to a relative epsilon but instead is the closest representable number to the true result.

The latter implies the former within the magnitude range that allows normalized representation, which is the vast majority of calculations. As an introductory article for a technical audience, it does a good job of describing the abilities and limitations of floating point numbers. Going into more detail about the representation runs the risk of sacrificing clarity for precision.


I disagree and think that the article manages to fall into an "uncanny valley" of not-actually-understanding. Why else would it need the "A Cautionary Tale" section at the end? That's the first time he mentions that the previous sections only hold for positive numbers and shows that the statements made before are bad.

And personally I prefer an "exact" guarantee that the result is as close as possible rather than merely "within some relative epsilon". Compare his (incorrect) "in mathematical terms" description with the actual statement, which would be that, if re(x) gives the real number represented by a floating point number x, then:

fl(re(x) + re(y)) = x + y

where addition on the left is of reals and on the right is the floating point addition operator. This is simpler and more accurate than his inequality! And it works for sub-normals etc.


The axiom holds for all numbers and all single operations. The final section presents a gotcha that arises when reasoning about the amount of _propagated_ error in certain calculations (called catastrophic cancellation).

The "exact" guarantee that you mention is precisely the same as saying that it's "within some relative epsilon". Using machine epsilon tells you _how_ close "as close as possible" is.

Your proposed improvement to the inequality is true, but is not really useful for computing the amount of error you can expect from numerical algorithms. For that, you want some quantification of how close is "as close as possible". Having the relative error bounded by machine epsilon is an extremely strong guarantee, as the article shows.


The key thing I think your article is missing, though, is that it should explain which numbers are representable exactly as floating point. This is very important information for other reasons too, and it tells you how close "as close as possible" is. I can tell you that when I started learning about floating point numbers I knew there was some relative error and believed that operations like addition and square root were just "close enough" - in fact they're as good as possible, and the only difficulty is which numbers we're allowed to write down! Enlightenment! And everything else falls out from there.

I do think your edit is a good clarification to make. My point about the "in mathematical notation" part, though, is that it assumes a and b are exactly represented as floating point numbers (else catastrophic cancellation can happen), while the sentence after the equation implies they are not representable.


> I prefer an "exact" guarantee that the result is as close-as-possible rather than merely "within some relative epsilon".

This is the core requirement of any system that quantizes the real numbers in order to perform arithmetic. The thing that distinguishes fixed-point from floating-point systems is the quantization error that gets introduced in this process:

For floating point, the error is proportional to the magnitude of the number being represented. The article is primarily exploring the implications of this property.

For fixed point, the quantization error has an absolute magnitude that doesn’t depend on the size of the input numbers. This has a completely different set of benefits and drawbacks, but still maintains your “preferred” property.


> The moral of the story is, never use a floating point number to represent money.

That is simply not true. Floating-point arithmetic has sufficient precision to represent multiples of $0.01 with vanishingly small error even for astronomical sums of money that your ledger will never see. If it didn't, it wouldn't be able to "numerically solve differential equations, or linear systems, or land on the moon." Moreover, programs like Microsoft Excel wouldn't be able to use floating-point.

You have to make sure that all calculations that produce a currency amount are actually rounded to the closest $0.01 multiple (using cent-based dollar accounting as an example).

The main rookie mistake is to leave an amount represented as, say, $17.4537359..., and then just show it as $17.45 on paper or in the user interface.

Rounding to the closest $0.01 will thwart cumulative error. So for instance, suppose we start with 0, and then repeatedly add 0.01. The naive way would be:

   for (x = 0.0; x < WHATEVER; x += 0.01)
   {
   }
this will suffer from cumulative errors. Eventually the errors will add up to more than half a penny (0.005) and we are toast: N iterations of our loop will have resulted in a value of x that is now closer to N+1 cents or N-1 cents than it is to N cents.

However, if we do this:

   for (x = 0.0; x < WHATEVER; x += 0.01)
   {
      x = round_to_nearest_penny(x);
   }
then we are fine: our code will nicely stride from one good penny approximation to another. We only start to run aground when x becomes astronomical, representing a number of pennies that is an integer in the 14 or 15 digit range or something like that. After that we start getting into a range where consecutive pennies cannot be represented any more at all, let alone good approximations.
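
For what it's worth, one plausible definition of the round_to_nearest_penny helper assumed above (fine for amounts well below the 15-16 digit cliff mentioned):

    #include <math.h>

    /* Round to the nearest multiple of 0.01 (ties away from zero, via round()). */
    double round_to_nearest_penny(double x)
    {
        return round(x * 100.0) / 100.0;
    }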

If we are just dealing with pocket change, like a few billion dollars here and there, we are good.

So that is to say, we can't even take a column of figures that are individually the best approximations of a number of pennies and simply add them together; we have to regularly clamp to the nearest penny as we carry on the summation. It doesn't necessarily have to be after each addition, but we can't let the error stray too far from the abstract penny amount we are trying to represent. Any value we record into a ledger should be crisply clamped to the floating point system's best approximation of that dollar and cent amount.

Using (binary!) floating-point, we can even implement specific pencil-and-paper rounding methods for fractional results. This is because we can render the floating point value into text, out to several more than the required number of digits of precision, and then apply a textual technique to analyze the digits.

Using double-precision IEEE floating point, I can easily write a routine that will round 0.125 to 0.12 but 0.135 to 0.14. I.e. banker's rounding to the closest even penny.

Now these numbers are actually:

   Abstract                      Actual
   0.125                         0.125
   0.12                          0.1199999999999999956
   0.135                         0.1350000000000000089
   0.14                          0.1400000000000000133
So e.g. my code will actually be taking 0.1350000000000000089 to 0.1400000000000000133. First, I will get the value as text rounded to four digits after the decimal: "0.1350". Then I process the four digits as a string, or perhaps make an integer x out of them. I can work out that 1350 % 100 is 50, indicating that the number is "on the fence". Then I look at the hundredths digit: (x / 100) % 10 = 3. Aha, odd, so we have to round up to the even digit 4: (x / 100 + 1) * 100 = 1400.
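
A rough C sketch of that textual approach (positive amounts only, and leaning on printf's own correctly rounded decimal rendering for the first step):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Round a positive dollar amount to the nearest cent, ties to the even cent,
       by inspecting the first four decimal digits as text. */
    double round_to_even_cent(double x)
    {
        char buf[64];
        snprintf(buf, sizeof buf, "%.4f", x);      /* e.g. 0.135 -> "0.1350" */
        char *dot = strchr(buf, '.');
        long frac  = strtol(dot + 1, NULL, 10);    /* the four digits, e.g. 1350 */
        long cents = frac / 100;                   /* 13 */
        long rest  = frac % 100;                   /* 50: "on the fence" */
        if (rest > 50 || (rest == 50 && cents % 2 == 1))
            cents++;                               /* half-to-even: 13 is odd, go to 14 */
        *dot = '\0';                               /* what's left is the whole-dollar part */
        return strtol(buf, NULL, 10) + cents / 100.0;
    }

With this, 0.125 comes back as 0.12 and 0.135 as 0.14, matching the table above.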



