Floating Point Arithmetic
Key Points
- Floating-point numbers approximate real numbers using a fixed number of bits following the IEEE 754 standard.
- Every float stores a sign, a biased exponent, and a fraction (mantissa) that together represent numbers of the form (-1)^s · 1.f · 2^{e-bias}.
- Rounding is unavoidable; the rounding error is typically bounded by about half an ulp (unit in the last place) under round-to-nearest-even.
- Naive operations like summation can accumulate significant error, but techniques like Kahan summation greatly reduce it with the same asymptotic time.
- Comparing floats with == is unsafe; use tolerances (absolute and relative) or ULP-based comparisons instead.
- Special values exist: +0/-0, ±∞, and NaN; NaNs are unordered and propagate through most operations.
- Subnormal numbers (very close to zero) preserve gradual underflow but can be slower and less precise.
- In C++, use std::numeric_limits, std::nextafter, and careful bit reinterpretation to inspect and control floating-point behavior.
Prerequisites
- Binary numbers and powers of two — Understanding how exponents and fraction bits represent powers of two is essential for IEEE 754.
- Scientific notation — Floating point generalizes scientific notation to binary; this helps with the sign-exponent-significand idea.
- Rounding and significant figures — Floating-point arithmetic rounds to a fixed precision at each step.
- C++ basics (types, functions, templates) — Code examples rely on C++ types like float/double and templated functions.
- Bitwise operations and memory layout — Inspecting IEEE fields requires safe bit reinterpretation and bit masks.
- Error metrics (absolute vs relative) — Choosing appropriate comparison strategies depends on these concepts.
- Numerical stability — Motivates why some algorithms (e.g., Kahan) reduce error without extra asymptotic cost.
- Limits and special values — Infinities and NaNs behave differently in comparisons and arithmetic.
Detailed Explanation
01 Overview
Hook → We often assume computers can store any real number exactly, but they actually use a clever coding trick that fits only a finite set of values. Concept → Floating-point arithmetic (IEEE 754) represents real numbers in binary with a fixed number of bits: one bit for the sign, several bits for the exponent (range), and the rest for the fraction/mantissa (precision). This creates a dense grid of representable numbers near zero and a sparser grid for large magnitudes. Rounding happens on nearly every operation because the exact real result may not be representable. Example → Storing one billion in a 32-bit float and then repeatedly adding 0.1 can do nothing at all: 0.1 is not exactly representable, and the spacing (ULP) around 1e9 in single precision is 64, far larger than 0.1. In double precision the additions register, but each one still rounds.
Hook → Why should we care? Because tiny rounding errors can snowball in iterative algorithms. Concept → IEEE 754 also defines special values (infinities, NaNs), standardized rounding modes, and exceptions/flags for events like overflow or division by zero. Example → Dividing by zero yields ±∞, while 0/0 yields NaN, and NaNs compare unequal to everything (even themselves), which affects conditionals and sorting if not handled.
02 Intuition & Analogies
Hook → Imagine a ruler that changes tick-mark spacing as you slide along it. Near the left end (small numbers) the ticks are very close; farther right (huge numbers), they spread out. Concept → Floating-point is like this variable-spacing ruler. The exponent decides which “zoom level” you’re on, and the fraction fine-tunes the position within that zoom. You can only land exactly on tick marks. When the true answer falls between ticks, the system rounds to the nearest tick (often “ties to even”). Example → 0.1 in decimal does not align with binary tick marks, so the computer stores the nearest tick, something like 0.10000000000000000555...
Hook → Think of money in whole cents. You can’t represent half a cent, so you round every transaction. Concept → Each arithmetic step introduces a tiny rounding error, usually bounded. Over many steps, those pennies of error can add up—unless you keep track of what is lost (compensated summation). Example → Kahan’s algorithm keeps a small side-account (the compensation) so the pennies don’t vanish; it improves the final total without changing big-O time.
Hook → What about special cases? Concept → IEEE 754 also encodes ±0, infinities, and NaN to keep calculations predictable even when things go off the rails. Example → 1.0/0.0 becomes +∞, and any arithmetic with NaN stays NaN so errors don’t get silently masked.
03 Formal Definition
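In standard IEEE 754 terms, the encoding described above can be stated compactly:

```latex
% Format parameters: precision p, exponent width k, bias = 2^{k-1} - 1.
% Stored fields: sign bit s, biased exponent e, fraction bits f = f_1 f_2 \ldots f_{p-1}.
x = (-1)^s \cdot 1.f_1 f_2 \ldots f_{p-1} \cdot 2^{\,e - \mathrm{bias}},
    \qquad 1 \le e \le 2^k - 2 \quad \text{(normalized)}
x = (-1)^s \cdot 0.f_1 f_2 \ldots f_{p-1} \cdot 2^{\,1 - \mathrm{bias}},
    \qquad e = 0,\ f \neq 0 \quad \text{(subnormal)}
% e = 0, f = 0 encodes \pm 0; e = 2^k - 1 encodes \pm\infty (f = 0) or NaN (f \neq 0).
```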
04 When to Use
Hook → Should you ever avoid floating point? Concept → Use IEEE 754 floats when you need a wide dynamic range and fast approximate real arithmetic. They are ideal for science, graphics, signal processing, machine learning, and simulation where small errors are acceptable and performance matters. Example → Physics engines, 3D rendering, and training neural networks depend on floating point for speed and scale.
Hook → But what if you need exact cents or combinatorial counts? Concept → Prefer integers or decimal fixed-point for exactness in financial apps or when rounding must follow decimal rules. Example → Currency calculations are safer with scaled integers (e.g., store cents as int64) to avoid 0.1 issues.
Hook → Can you improve reliability without changing types? Concept → Use numerically stable algorithms (sorting before summation, Kahan/Neumaier summation, Horner’s method, compensated dot products) and FMA when available. Example → Summing small terms first or using Kahan preserves low-order bits and reduces error without changing big-O time.
⚠️ Common Mistakes
- Using == to compare floats. NaNs compare false to everything and tiny rounding differences break equality. Prefer tolerance-based or ULP-based comparisons.
- Ignoring scale. Adding a tiny number to a huge one can be a no-op at that magnitude. Rescale or reorder operations (e.g., sum from smallest to largest) to reduce loss of significance.
- Subtracting nearly equal numbers (catastrophic cancellation). Reformulate expressions (e.g., use rationalized forms or series expansions) to avoid massive relative error.
- Forgetting special values. NaNs propagate; ±0 exist; infinities can silently appear. Always check inputs and guard critical branches.
- Assuming decimal literals are exact. 0.1 is not exactly representable in binary; parsing already introduces error.
- Overlooking rounding modes and denormal performance. Changing the rounding mode or handling subnormals can affect speed and results; do not rely on FTZ/DAZ defaults without understanding consequences.
- Misinterpreting machine epsilon. ε is not the smallest positive float; it’s the gap around 1.0. The smallest positive normalized and subnormal numbers are much smaller.
- Reinterpreting bits with unsafe casts. Use memcpy or std::bit_cast (C++20) to avoid undefined behavior when inspecting IEEE fields.
Key Formulas
Normalized IEEE Representation
x = (-1)^s · 1.f · 2^{e - bias}
Explanation: A finite, normalized nonzero float is encoded by sign s, exponent e (with a bias), and fraction bits f. The leading 1 is implicit, giving p bits of precision for the significand.
Subnormal Representation
x = (-1)^s · 0.f · 2^{1 - bias}
Explanation: Subnormals use an exponent of 1−bias and have no implicit leading 1. This allows gradual underflow with evenly spaced values near zero.
Machine Epsilon and Unit Roundoff
ε = 2^{1-p},  u = ε/2 = 2^{-p}
Explanation: Machine epsilon ε is the gap between 1 and the next larger representable number; unit roundoff u is roughly the maximum relative error for one rounded operation. For double, p = 53, so ε = 2^{-52} ≈ 2.22×10^{-16} and u = 2^{-53} ≈ 1.11×10^{-16}.
Standard Rounding Model
fl(x ∘ y) = (x ∘ y)(1 + δ),  |δ| ≤ u
Explanation: A floating-point operation produces the exact result multiplied by 1+δ, with |δ| bounded by the unit roundoff u. This model underlies most error analyses.
ULP Spacing (Binary)
ulp(x) = 2^{e-(p-1)}
Explanation: For a normalized x with unbiased exponent e, adjacent representable numbers are spaced by 2^{e-(p-1)}. Spacing grows with magnitude, explaining why addition of tiny terms to huge numbers may have no effect.
Naive Summation Error Bound
|fl(Σ aᵢ) − Σ aᵢ| ≤ γ_{n−1} Σ |aᵢ|,  where γ_n = nu / (1 − nu)
Explanation: Summing n numbers in floating point can accumulate error proportional to n times the unit roundoff. The bound uses γ_n = nu/(1 − nu), which grows roughly linearly with n when nu is small.
Kahan Summation Error (Informal)
|error| ≤ (2u + O(nu²)) Σ |aᵢ|
Explanation: Compensated summation dramatically reduces error compared to naive summation. The leading term behaves like one rounding, with tiny higher-order terms in nu.
Round-to-Nearest-Even Error
|fl(x) − x| ≤ ½ ulp(x)
Explanation: Under round-to-nearest-even, the maximum absolute rounding error is half an ulp at x. This is the core guarantee that keeps per-operation errors small.
Format Parameters
float (binary32): p = 24, k = 8, bias = 127;  double (binary64): p = 53, k = 11, bias = 1023
Explanation: These constants define the precision (p), exponent bits (k), and exponent bias for the two most common IEEE 754 formats, float and double.
Complexity Analysis
Code Examples
#include <bits/stdc++.h>
using namespace std;

struct DoubleParts {
    uint64_t sign;     // 0 or 1
    uint64_t exponent; // 11 bits
    uint64_t fraction; // 52 bits
};

DoubleParts dissectDouble(double x) {
    uint64_t bits = 0;
    static_assert(sizeof(double) == sizeof(uint64_t), "Unexpected double size");
    memcpy(&bits, &x, sizeof(bits)); // Safe bit reinterpretation
    DoubleParts p{};
    p.sign = (bits >> 63) & 0x1ull;
    p.exponent = (bits >> 52) & 0x7ffull;      // 11 bits
    p.fraction = bits & 0x000fffffffffffffull; // 52 bits
    return p;
}

int main() {
    cout.setf(std::ios::boolalpha);

    vector<double> samples = {1.0, -0.0, 0.1,
                              numeric_limits<double>::infinity(),
                              -numeric_limits<double>::infinity(),
                              numeric_limits<double>::quiet_NaN(),
                              numeric_limits<double>::denorm_min()};

    for (double x : samples) {
        DoubleParts p = dissectDouble(x);
        cout << "x = " << x << "\n";
        cout << "  sign:     " << p.sign << "\n";
        cout << "  exponent: 0x" << hex << setw(3) << setfill('0') << p.exponent << dec
             << " (" << p.exponent << ")\n";
        cout << "  fraction: 0x" << hex << setw(13) << setfill('0') << p.fraction << dec << "\n";

        // Classify special cases
        bool isZero = (p.exponent == 0 && p.fraction == 0);
        bool isSubnormal = (p.exponent == 0 && p.fraction != 0);
        bool isInf = (p.exponent == 0x7ff && p.fraction == 0);
        bool isNaN = (p.exponent == 0x7ff && p.fraction != 0);

        cout << "  zero? " << isZero << "\n";
        cout << "  subnormal? " << isSubnormal << "\n";
        cout << "  infinity? " << isInf << "\n";
        cout << "  NaN? " << isNaN << "\n";

        if (!isZero && !isSubnormal && !isInf && !isNaN) {
            int64_t unbiased = static_cast<int64_t>(p.exponent) - 1023; // bias for double
            cout << "  unbiased exponent (e - bias): " << unbiased << "\n";
        }
        cout << "\n";
    }
}
This program inspects the underlying IEEE 754 binary64 (double) layout using memcpy to avoid aliasing issues. It extracts the sign, exponent, and fraction fields, prints them in hex, and classifies each sample as zero, subnormal, infinity, or NaN. For normal numbers, it also prints the unbiased exponent. This helps visualize how decimal-looking values like 0.1 are actually represented and why special values have distinct bit patterns.
#include <bits/stdc++.h>
using namespace std;

template <class T>
T naive_sum(const vector<T>& a) {
    T s = 0;
    for (T v : a) s += v; // Each addition rounds; errors may accumulate
    return s;
}

template <class T>
T kahan_sum(const vector<T>& a) {
    T s = 0; // Running sum
    T c = 0; // Compensation for lost low-order bits
    for (T v : a) {
        T y = v - c;     // First, subtract the compensation
        T t = s + y;     // Add the corrected value
        c = (t - s) - y; // New compensation captures low-order roundoff
        s = t;
    }
    return s;
}

int main() {
    // Construct a numerically challenging sum: one big number plus many tiny ones
    const int n = 1000000;
    vector<double> a; a.reserve(n + 1);
    a.push_back(1e9);
    for (int i = 0; i < n; ++i) a.push_back(1e-6);

    double s1 = naive_sum(a);
    double s2 = kahan_sum(a);

    const double expected = 1000000001.0; // 1e9 plus one million terms of 1e-6 in exact arithmetic

    cout.setf(std::ios::fixed); cout << setprecision(10);
    cout << "Naive sum: " << s1 << "\n";
    cout << "Kahan sum: " << s2 << "\n";
    cout << "Expected (real): " << expected << "\n";
    cout << "Absolute error naive: " << fabs(s1 - expected) << "\n";
    cout << "Absolute error Kahan: " << fabs(s2 - expected) << "\n";
}
We compare naive accumulation to Kahan’s compensated summation on a difficult dataset (one large term and one million small terms). The naive method loses low-order bits when the running total is large: near 1e9, doubles are spaced about 1.2e-7 apart, so each addition of 1e-6 is rounded to that grid and part of every term is discarded. Kahan’s method tracks the lost bits in a compensation variable and adds them back later, significantly reducing total error without changing big-O time.
#include <bits/stdc++.h>
using namespace std;

// Absolute/relative tolerance comparison
bool approx_equal(double a, double b, double rel_tol = 1e-12, double abs_tol = 1e-15) {
    // Handle infinities and NaNs explicitly
    if (std::isnan(a) || std::isnan(b)) return false; // NaN is never equal
    if (std::isinf(a) || std::isinf(b)) return a == b; // only equal if same infinity

    double diff = fabs(a - b);
    if (diff <= abs_tol) return true; // near zero

    double max_ab = max(fabs(a), fabs(b));
    return diff <= rel_tol * max_ab;
}

// ULP distance for IEEE 754 double (orders all bit patterns monotonically)
static inline uint64_t to_ordered_uint(double x) {
    uint64_t u; memcpy(&u, &x, sizeof(u));
    // Fold the two sign halves together so that increasing numeric values
    // map to increasing integers (negatives land below, positives above)
    return (u & (1ull << 63)) ? (~u + 1ull) : (u | (1ull << 63));
}

uint64_t ulp_distance(double a, double b) {
    if (std::isnan(a) || std::isnan(b)) return UINT64_MAX; // undefined distance
    uint64_t ua = to_ordered_uint(a);
    uint64_t ub = to_ordered_uint(b);
    return (ua > ub) ? (ua - ub) : (ub - ua);
}

int main() {
    double x = 1.0;
    double y = std::nextafter(1.0, 2.0); // adjacent representable number above 1.0

    cout.setf(std::ios::boolalpha);
    cout << "x = " << setprecision(17) << x << ", y = " << y << "\n";
    cout << "approx_equal(x, y) with tight tolerances: " << approx_equal(x, y, 1e-16, 0.0) << "\n";
    cout << "ULP distance(x, y) = " << ulp_distance(x, y) << " (should be 1)\n";

    double a = 1e9;
    double b = a + 0.1; // doubles near 1e9 are ~1.2e-7 apart, so b differs from a by many ticks
    cout << "ULP distance(1e9, 1e9+0.1) = " << ulp_distance(a, b) << "\n";
}
This example shows two practical strategies for comparing doubles. The first uses a combination of absolute and relative tolerances, suitable for most numeric code and stable across scales. The second uses ULP distance by mapping the bit patterns to a monotonic integer order; it counts how many representable numbers lie between the two values. We also demonstrate std::nextafter to move by exactly one ULP.