Floating-point math is fraught with subtle gotchas, and comparing values properly is no exception. Here we discuss common pitfalls, examine some possible solutions, and try to beat Boost.
Things you probably know about floats
If you need to represent a non-integer in a mainstream programming language, you’ll probably end up using IEEE 754 floating-point values. Since their standardization in 1985, they’ve become ubiquitous. Nearly all modern CPUs—and many microprocessors—contain special hardware (called floating-point units, or FPUs) to handle them.
Each float consists of a sign bit, some bits representing an exponent, and bits representing a fraction, also called the mantissa. Under most circumstances, the value of a float is:
where is our sign bit, is some fraction represented by the mantissa bits, is an unsigned integer represented by the exponent bits, and is half the maximum value of , i.e., 127 for a 32-bit float and 1023 for a 64-bit float.
There are also some special cases. For example, when all exponent bits are zero, the formula changes to:
Note the lack of an implicit 1 preceding the mantissa—this allows us to store small values close to zero, called denormal or subnormal values. And when all exponent bits are one, certain mantissa values represent , , and “Not a Number” (NaN), the result of undefined or unrepresentable operations such as dividing by zero.
We’ll make two observations that will prove themselves useful shortly:
Floats cannot store arbitrary real numbers, or even arbitrary rational numbers. They can only store numbers representable by the equations shown before. For example, if I declare some variable,
f
becomes 0.100000001490116119384765625, the closest 32-bit float value to 0.1.Since the equations are exponential, the distance on the number line between adjacent values increases (exponentially!) as you move away from zero. The distance between 1.0 and the next possible value is about , but the distance between adjacent floats near is roughly . This will prove to be our greatest challenge: when comparing floats, we want to handle inputs close to zero as well as we handle ones close to the Avogadro constant.
What is equality?
Since the result of every floating-point operation must be rounded to the
nearest possible value, math doesn’t behave like it does with real numbers.
Depending on your hardware, compiler, and compiler flags,
may produce a different result than .
Whenever we compare calculated values to each other,
we should provide some leeway to account for this.
Comparing their exact values with ==
won’t cut it.
Instead,
we should consider two values and equal if
for some sufficiently small .
As luck would have it, the C standard library contains a FLT_EPSILON
.
Let’s try it out!
boolalmostEqual(floata,floatb){returnfabs(a-b)<=FLT_EPSILON;}
We would hope that we’re done here, but we would be wrong.
A look at the language standards reveals that
FLT_EPSILON
is equal to
the difference between 1.0 and the value that follows it.
But as we noted before, float values aren’t equidistant!
For values smaller than 1, FLT_EPSILON
quickly becomes too large to be useful.
For values greater than 2, FLT_EPSILON
is smaller than the
distance between adjacent values, sofabs(a-b)<=FLT_EPSILON
will always be false.
To address these problems, what if we scaled proportionally to our inputs?
boolrelativelyEqual(floata,floatb,floatmaxRelativeDiff=FLT_EPSILON){constfloatdifference=fabs(a-b);// Scale to the largest value.a=fabs(a);b=fabs(b);constfloatscaledEpsilon=maxRelativeDiff*max(a,b);returndifference<=scaledEpsilon;}
This works better than our initial solution,
but it’s not immediately obvious what values of
maxRelativeDiff
we might want for different cases.
The fact that we scale it by arbitrary inputs also means it can fall prey
to the same rounding we’re worried about in the first place.
What about Boost?
Boost, the popular collection of C++ libraries, provides functions for similar purposes. After removing template boilerplate and edge case handling for and NaNs, they resemble:
floatrelative_difference(floata,floatb){returnfabs((a-b)/min(a,b));}floatepsilon_difference(floata,floatb){returnrelative_difference(a,b)/FLT_EPSILON;}
Unfortunately, these functions don’t seem to solve
our problems.
Since the division in relative_difference
often makes its result
quite small,
how do we know what a good threshold might be?
By dividing that result by FLT_EPSILON
,epsilon_difference
attempts to give an easier value to reason about.
But we just saw the dangers of FLT_EPSILON
!
This scheme becomes increasingly questionable as inputs move away from one.
What about ULPs?
It would be nice to define comparisons in terms of something more concrete than arbitrary thresholds. Ideally, we would like to know the number of possible floating-point values—sometimes called units of least precision, or ULPs—between inputs. If I have some value , and another value is only two or three ULPs away, we can probably consider them equal, assuming some rounding error. Most importantly, this is true regardless of the distance between and on the number line.
Boost offers a function called float_distance
to get the distance
between values in ULPs,
but it’s about an order of magnitude slower than the approaches
discussed so far.
With some bit-fiddling, we can do better.
Consider some positive float where every mantissa bit is one. must use the next largest exponent, and all its mantissa bits must be zero. As an example, consider 1.99999988 and 2:
Value | Bits | Exponent | Mantissa bits |
---|---|---|---|
1.99999988 | 0x3FFFFFFF | 127 | 0x7FFFFF |
2.0 | 0x40000000 | 128 | 0x000000 |
The property holds for denormals, even though they have a different value equation. Consider the largest denormal value and the smallest normal one:
Value | Bits | Exponent | Mantissa bits |
---|---|---|---|
0x007FFFFF | -126 | 0x7FFFFF | |
0x00800000 | -126 | 0x000000 |
Notice an interesting corollary: adjacent floats (of the same sign) have adjacent integer values when reinterpreted as such. This reinterpretation is sometimes called type punning, and we can use it to calculate the distance between values in ULPs.
Traditionally in C and C++, one used a union trick:
unionFloatPun{floatf;int32_ti;};FloatPunfp;fp.f=25.624f;// Read the same value as an integer.printf("%x",fp.i);
This still works in C,
but can run afoul of strict aliasing rules in C++.
A better approach is to use memcpy
.
Given the usual use of the function, one might assume that it would be less
efficient, but
int32_tfloatToInt(floatf){int32_tr;memcpy(&r,&f,sizeof(float));returnr;}
compiles to a single instruction that moves the value from a floating-point register to an integer one. This is exactly what we want.
With that problem solved, calculating the ULPs between values becomes quite straightforward:
int32_tulpsDistance(constfloata,constfloatb){// Save work if the floats are equal.// Also handles +0 == -0if(a==b)return0;constautomax=std::numeric_limits<int32_t>::max();// Max distance for NaNif(isnan(a)||isnan(b))returnmax;// If one's infinite and they're not equal, max distance.if(isinf(a)||isinf(b))returnmax;int32_tia,ib;memcpy(&ia,&a,sizeof(float));memcpy(&ib,&b,sizeof(float));// Don't compare differently-signed floats.if((ia<0)!=(ib<0))returnmax;// Return the absolute value of the distance in ULPs.int32_tdistance=ia-ib;if(distance<0)distance=-distance;returndistance;}
This code is quite portable—it only assumes that the platform supports 32-bit integers and that floats are stored in accordance with IEEE 754. We avoid comparing differently-signed values for a few reasons:
ULPs are the wrong tool to compare values near or across zero, as we’ll see below.
Almost all modern CPUs use two’s complement arithmetic, while floats usesigned magnitude. Converting one format to the other in order to meaningfully add or subtract differently-signed values requires some extra work. For the same reason, the sign of our result might not be what we expect, so we take its absolute value. We only care about the distance between our two inputs.
If the subtraction overflows or underflows, we get undefined behavior with signed integers and modular arithmetic with unsigned ones. Neither is desirable here.
We calculate the absolute value ourselves instead of using std::abs
for two reaons.
First, the integer versions of std::abs
only take types—such as
int
, long
, and longlong
—whose
sizes are platform-specific.
We want to avoid assumptions about implicit conversions
between those types and int32_t
.
The second is a strange pitfall related to the placement ofstd::abs
overloads in the C++ standard library.
If you include <cmath>
but not <cstdlib>
,
only the floating-point versions of std::abs
are provided.
Several toolchains I tested then promote the int32_t
value to a double
,
even if your target only has
a 32-bit FPU and must emulate double
using integer registers.
(As one might guess, this is terrible for performance.)
Warning flags such as -Wconversion
can help us notice this happening,
or we can just avoid all these gotchas by calculating the absolute value directly.
At any rate, this is a trivial detail.
No silver bullets
Relative epsilons—including ULPs-based ones—don’t make sense around zero. The exponential nature of floats means that many more values are gathered there than anywhere else on the number line. Despite being a fairly small value in the context of many calculations, 0.1 is over one billion ULPs away from zero! Consequently, fixed epsilons are probably the best choice when you expect the results to be small. What particular you want is entirely dependent on the calculations performed.
Armed with this knowledge, you may be tempted to write some end-all comparison function along the lines of:
boolnearlyEqual(floata,floatb,floatfixedEpsilon,intulpsEpsilon){// Handle the near-zero case.constfloatdifference=fabs(a-b);if(difference<=fixedEpsilon)returntrue;returnulpsDistance(a,b)<=ulpsEpsilon;}
But using it meaningfully is difficult without understanding the theory we’ve discussed.
Brief aside: Other ULPs-based functions
We can use the same techniques to write other useful functions,
such as one that increments a float by some number of ULPs.
Boost offers a similar family of functions
(float_next
, float_advance
, etc.),
but like float_distance
, they pay a performance cost to avoid
type punning.
One would hope we could simply get our ULPs, perform our addition, and pun the result back, e.g.,
floatulpsIncrement(floatf,int32_tulps){if(isnan(f)||isinf(f))returnf;int32_ti;memcpy(&i,&f,sizeof(float));i+=ulps;memcpy(&f,&i,sizeof(float));returnf;}
This naïve solution works for positive values, but on most hardware, “incrementing” a negative float by a positive number of ULPs will move us away from zero! This is probably not what we want. We mentioned before that floats use a signed magnitude scheme, whereas most CPUs use two’s complement. So, to operate on negative values, we need to convert from the former to the CPU’s native integer format.
staticconstint32_tint32SignBit=(int32_t)1<<31;int32_tfloatToNativeSignedUlps(floatf){int32_ti;memcpy(&i,&f,sizeof(float));// Positive values are the same in both// two's complement and signed magnitude.// For negative values, remove the sign bit// and negate the result (subtract from 0).returni>=0?i:-(i&~int32SignBit);}
After operating on the ULPs, we must convert back to signed magnitude:
floatnativeSignedUlpsToFloat(int32_tulps){if(ulps<0){ulps=-ulps;ulps|=int32SignBit;}floatf;memcpy(&f,&ulps,sizeof(float));returnf;}
With those functions defined, we can return to our goal:
floatulpsIncrement(floatf,int32_tulps){if(isnan(f)||isinf(f))returnf;int32_ti=floatToNativeSignedUlps(f);i+=ulps;returnnativeSignedUlpsToFloat(i);}
Takeaways
When comparing floating-point values, remember:
FLT_EPSILON
… isn’t float epsilon, except in the ranges [-2, -1] and [1, 2]. The distance between adjacent values depends on the values in question.When comparing to some known value—especially zero or values near it—use a fixed that makes sense for your calculations.
When comparing non-zero values, some ULPs-based comparison is probably the best choice.
When values could be anywhere on the number line, some hybrid of the two is needed. Choose epsilons carefully based on expected outputs.
Acknowledgments
Much of this was adapted from Bruce Dawson’s fantastic exploration of the topic on his blog,Random ASCII. Thanks also to coworkers Evan Thompson and Matt Drees for their input.
Afterward: performance concerns
The relatively poor performance of boost::float_distance
was a large motivation for implementing
our own ulpsDistance
.
For the sake of completeness, the following is a benchmark
(using Google’s benchmark library)
comparing the two with a handful of inputs.
#include <cstring> // For memcpy
#include <limits> // for numeric_limits<float>::infinity
#include <random>#include <benchmark/benchmark.h>
#include <boost/math/special_functions/next.hpp>
#include <boost/math/special_functions/relative_difference.hpp>usingnamespacestd;usingnamespaceboost::math;std::pair<float,float>pickInput(){staticautore=mt19937(random_device()());staticautocoinFlip=bernoulli_distribution(0.5);staticautoinputPicker=uniform_int_distribution<int>(1,10);constfloatinfinity=numeric_limits<float>::infinity();switch(inputPicker(re)){// Let's say there's a 5% chance our values are denormal.// (This is probably more pessimal than our actual data.)case1:if(coinFlip(re))return{1e-38f,float_advance(1e-38f,3)};// Intentional fall-through// Let's throw in some huge numberscase2:case3:case4:case5:return{6.022e23f,2.998e8f};break;// And so not-so-huge ones.case6:case7:case8:case9:return{1.0f,11.0f};// Let's say there's a 5% chance we have NaNs// and another 5% chance they're infinitycase10:if(coinFlip(re))return{42,numeric_limits<float>::quiet_NaN()};elsereturn{42,infinity};default:assert(0);}}__attribute__((noinline))// For visibility when benchmarkingint32_tulpsDistance(constfloata,constfloatb){// We can skip all the following work if they're equal.if(a==b)return0;constautomax=numeric_limits<int32_t>::max();// We first check if the values are NaN.// If this is the case, they're inherently unequal;// return the maximum distance between the two.if(isnan(a)||isnan(b))returnmax;// If one's infinite, and they're not equal,// return the max distance between the two.if(isinf(a)||isinf(b))returnmax;// At this point we know that the floating-point values aren't equal and// aren't special values (infinity/NaN).// Because of how IEEE754 floats are laid out// (sign bit, then exponent, then mantissa), we can examine the bits// as if they were integers to get the distance between them in units// of least precision (ULPs).static_assert(sizeof(float)==sizeof(int32_t),"What size is float?");// memcpy to get around the strict aliasing rule.// The compiler knows what we're doing and will just transfer the float// values into integer registers.int32_tia,ib;memcpy(&ia,&a,sizeof(float));memcpy(&ib,&b,sizeof(float));// If the signs of the two values aren't the same,// return the maximum distance between the two.// This is done to avoid integer overflow, and because the bit layout of// floats is closer to sign-magnitude than it is to two's complement.// This *also* means that if you're checking if a value is close to zero,// you should probably just use a fixed epsilon instead of this function.if((ia<0)!=(ib<0))returnmax;// If we've satisfied all our caveats above, just subtract the values.// The result is the distance between the values in ULPs.int32_tdistance=ia-ib;if(distance<0)distance=-distance;returndistance;}voidbenchFloatDistance(benchmark::State&state){while(state.KeepRunning()){state.PauseTiming();floata,b;std::tie(a,b)=pickInput();state.ResumeTiming();// float_distance can't handle NaN and Infs.if(!isnan(a)&&!isnan(b)&&!isinf(a)&&!isinf(b)){benchmark::DoNotOptimize(float_distance(a,b));}}}BENCHMARK(benchFloatDistance);voidbenchUlps(benchmark::State&state){while(state.KeepRunning()){state.PauseTiming();floata,b;std::tie(a,b)=pickInput();state.ResumeTiming();benchmark::DoNotOptimize(ulpsDistance(a,b));}}BENCHMARK(benchUlps);BENCHMARK_MAIN();
On my laptop (an Intel Core i7 Skylake), I get:
Benchmark Time CPU Iterations
---------------------------------------------------------
benchFloatDistance 717 ns 836 ns 850424
benchUlps 157 ns 176 ns 3914780
And on an ARMv6 board we use at work for embedded Linux platforms, I get:
Benchmark Time CPU Iterations
---------------------------------------------------------
benchFloatDistance 43674 ns 42609 ns 16646
benchUlps 4748 ns 4602 ns 151382
Actual timing values obviously depend on the type of inputs, the uniformity of the inputs (which influences branch prediction), and many other factors, but our function seems to outperform Boost alternatives in the general case.
For the typographically inclined, this is available as aPDF and itsTeX source.