How I Study AI - Learn AI Papers & Lectures the Easy Way

Floating Point Arithmetic

Key Points

  • Floating-point numbers approximate real numbers using a fixed number of bits following the IEEE 754 standard.
  • Every float stores a sign, a biased exponent, and a fraction (mantissa) that together represent numbers of the form (-1)^s × 1.f × 2^{e-bias}.
  • Rounding is unavoidable; the rounding error is typically bounded by about half an ulp (unit in the last place) under round-to-nearest-even.
  • Naive operations like summation can accumulate significant error, but techniques like Kahan summation greatly reduce it with the same asymptotic time.
  • Comparing floats with == is unsafe; you should use tolerances (absolute and relative) or ULP-based comparisons.
  • Special values exist: +0/-0, +∞/-∞, and NaN; NaNs are unordered and propagate through most operations.
  • Subnormal numbers (very close to zero) preserve gradual underflow but can be slower and less precise.
  • In C++, use std::numeric_limits, std::nextafter, and careful bit reinterpretation to inspect and control floating-point behavior.

Prerequisites

  • →Binary numbers and powers of two — Understanding how exponents and fraction bits represent powers of two is essential for IEEE 754.
  • →Scientific notation — Floating point generalizes scientific notation to binary; this helps with the sign-exponent-significand idea.
  • →Rounding and significant figures — Floating-point arithmetic rounds to a fixed precision at each step.
  • →C++ basics (types, functions, templates) — Code examples rely on C++ types like float/double and templated functions.
  • →Bitwise operations and memory layout — Inspecting IEEE fields requires safe bit reinterpretation and bit masks.
  • →Error metrics (absolute vs relative) — Choosing appropriate comparison strategies depends on these concepts.
  • →Numerical stability — Motivates why some algorithms (e.g., Kahan) reduce error without extra asymptotic cost.
  • →Limits and special values — Infinities and NaNs behave differently in comparisons and arithmetic.

Detailed Explanation


01 Overview

Hook → We often assume computers can store any real number exactly, but they actually use a clever coding trick that fits only a finite set of values. Concept → Floating-point arithmetic (IEEE 754) represents real numbers in binary with a fixed number of bits: one bit for the sign, several bits for the exponent (range), and the rest for the fraction/mantissa (precision). This creates a dense grid of representable numbers near zero and a sparser grid for large magnitudes. Rounding happens on nearly every operation because the exact real result may not be representable. Example → Adding 0.1 repeatedly to a large total can surprise you: 0.1 is not exactly representable in binary, and in single precision the spacing (ULP) around 1e9 is 64, so adding 0.1 to a float holding 1e9 changes nothing at all.

Hook → Why should we care? Because tiny rounding errors can snowball in iterative algorithms. Concept → IEEE 754 also defines special values (infinities, NaNs), standardized rounding modes, and exceptions/flags for events like overflow or division by zero. Example → Dividing by zero yields ±∞, while 0/0 yields NaN, and NaNs compare unequal to everything (even themselves), which affects conditionals and sorting if not handled.

02 Intuition & Analogies

Hook → Imagine a ruler that changes tick-mark spacing as you slide along it. Near the left end (small numbers) the ticks are very close; farther right (huge numbers), they spread out. Concept → Floating-point is like this variable-spacing ruler. The exponent decides which “zoom level” you’re on, and the fraction fine-tunes the position within that zoom. You can only land exactly on tick marks. When the true answer falls between ticks, the system rounds to the nearest tick (often “ties to even”). Example → 0.1 in decimal does not align with binary tick marks, so the computer stores the nearest tick, something like 0.10000000000000000555...

Hook → Think of money in whole cents. You can’t represent half a cent, so you round every transaction. Concept → Each arithmetic step introduces a tiny rounding error, usually bounded. Over many steps, those pennies of error can add up—unless you keep track of what is lost (compensated summation). Example → Kahan’s algorithm keeps a small side-account (the compensation) so the pennies don’t vanish; it improves the final total without changing big-O time.

Hook → What about special cases? Concept → IEEE 754 also encodes ±0, infinities, and NaN to keep calculations predictable even when things go off the rails. Example → 1.0/0.0 becomes +∞, and any arithmetic with NaN stays NaN so errors don’t get silently masked.

03 Formal Definition

Hook → We need a precise model to reason about errors. Concept → In IEEE 754 binary formats, a finite, normalized nonzero number x is represented as x = (-1)^s × (1 + Σ_{i=1}^{p-1} f_i 2^{-i}) × 2^{e - bias}, where s is the sign bit, e is the stored exponent (1 ≤ e ≤ 2^k - 2 for k exponent bits), the f_i are fraction bits, p is the precision (number of significand bits including the implicit 1), and bias is a fixed constant (127 for binary32, 1023 for binary64). Example → For double (binary64): k = 11, p = 53, bias = 1023. Rounding model: for a basic operation ∘ ∈ {+, −, ×, ÷}, the computed result satisfies fl(x ∘ y) = (x ∘ y)(1 + δ) with |δ| ≤ u, where u is the unit roundoff (u = 2^{-p} under round-to-nearest-even). Subnormal numbers (e = 0, f ≠ 0) drop the implicit leading 1 and use exponent 1 − bias, ensuring gradual underflow. Patterns with e = 2^k − 1 encode infinities (f = 0) and NaNs (f ≠ 0). IEEE 754 also defines flags and rounding modes (to nearest-even, toward ±∞, toward 0) and operations like fused multiply-add (FMA) that compute ax + b with a single rounding. These definitions create a portable, analyzable foundation for numerical computing.

04 When to Use

Hook → Should you ever avoid floating point? Concept → Use IEEE 754 floats when you need a wide dynamic range and fast approximate real arithmetic. They are ideal for science, graphics, signal processing, machine learning, and simulation where small errors are acceptable and performance matters. Example → Physics engines, 3D rendering, and training neural networks depend on floating point for speed and scale.

Hook → But what if you need exact cents or combinatorial counts? Concept → Prefer integers or decimal fixed-point for exactness in financial apps or when rounding must follow decimal rules. Example → Currency calculations are safer with scaled integers (e.g., store cents as int64) to avoid 0.1 issues.

Hook → Can you improve reliability without changing types? Concept → Use numerically stable algorithms (sorting before summation, Kahan/Neumaier summation, Horner’s method, compensated dot products) and FMA when available. Example → Summing small terms first or using Kahan preserves low-order bits and reduces error without changing big-O time.

⚠️Common Mistakes

  • Using == to compare floats. NaNs compare false to everything and tiny rounding differences break equality. Prefer tolerance-based or ULP-based comparisons.
  • Ignoring scale. Adding a tiny number to a huge one can be a no-op at that magnitude. Rescale or reorder operations (e.g., sum from smallest to largest) to reduce loss of significance.
  • Subtracting nearly equal numbers (catastrophic cancellation). Reformulate expressions (e.g., use rationalized forms or series expansions) to avoid massive relative error.
  • Forgetting special values. NaNs propagate; ±0 exist; infinities can silently appear. Always check inputs and guard critical branches.
  • Assuming decimal literals are exact. 0.1 is not exactly representable in binary; parsing already introduces error.
  • Overlooking rounding modes and denormal performance. Changing the rounding mode or handling subnormals can affect speed and results; do not rely on FTZ/DAZ defaults without understanding consequences.
  • Misinterpreting machine epsilon. ε is not the smallest positive float; it’s the gap around 1.0. The smallest positive normalized and subnormal numbers are much smaller.
  • Reinterpreting bits with unsafe casts. Use memcpy or std::bit_cast (C++20) to avoid undefined behavior when inspecting IEEE fields.

Key Formulas

Normalized IEEE Representation

x = (−1)^s × (1 + Σ_{i=1}^{p−1} f_i 2^{−i}) × 2^{e−bias}

Explanation: A finite, normalized nonzero float is encoded by sign s, exponent e (with a bias), and fraction bits f_i. The leading 1 is implicit, giving p bits of precision for the significand.

Subnormal Representation

x = (−1)^s × (Σ_{i=1}^{p−1} f_i 2^{−i}) × 2^{1−bias}

Explanation: Subnormals use an exponent of 1−bias and have no implicit leading 1. This allows gradual underflow with evenly spaced values near zero.

Machine Epsilon and Unit Roundoff

ε = 2^{−(p−1)},   u = ε/2 = 2^{−p}

Explanation: Machine epsilon ε is the gap between 1 and the next larger number; unit roundoff u is roughly the maximum relative error for one rounded operation. For double, p = 53, so ε = 2^{−52} and u = 2^{−53}.

Standard Rounding Model

fl(x ∘ y) = (x ∘ y)(1 + δ),   |δ| ≤ u

Explanation: A floating-point operation produces the exact result multiplied by 1 + δ with |δ| bounded by unit roundoff. This model underlies most error analyses.

ULP Spacing (Binary)

ULP(x) = 2^{e−(p−1)}

Explanation: For a normalized x with unbiased exponent e, adjacent representable numbers are spaced by 2^{e-(p-1)}. Spacing grows with magnitude, explaining why addition of tiny terms to huge numbers may have no effect.

Naive Summation Error Bound

|fl(Σ_{i=1}^{n} x_i) − Σ_{i=1}^{n} x_i| ≤ γ_n Σ_{i=1}^{n} |x_i|,   γ_n = nu / (1 − nu)

Explanation: Summing n numbers in floating point can accumulate error proportional to n times unit roundoff. The bound uses γ_n, which grows roughly linearly with n when nu is small.

Kahan Summation Error (Informal)

|Ŝ − S| ≤ (u + O(nu²)) Σ_{i=1}^{n} |x_i|

Explanation: Compensated summation dramatically reduces error compared to naive summation. The leading term behaves like one rounding, with tiny higher-order terms in nu.

Round-to-Nearest-Even Error

|x − fl(x)| ≤ (1/2) ULP(x)

Explanation: Under round-to-nearest-even, the maximum absolute rounding error is half an ulp at x. This is the core guarantee that keeps per-operation errors small.

Format Parameters

binary32: p=24, k=8, bias=127;  binary64: p=53, k=11, bias=1023

Explanation: These constants define the precision (p), exponent bits (k), and bias for the two most common IEEE 754 formats, float and double.

Complexity Analysis

Floating-point arithmetic operations like addition, subtraction, multiplication, and division execute in constant time, O(1), on typical hardware. However, constant factors vary: multiplication and division are slower than addition/subtraction; fused multiply-add (FMA) often matches multiplication latency while improving accuracy by rounding once. Operations on subnormal numbers, or changes to the rounding mode, can incur significant slowdowns depending on the CPU and settings (e.g., denormals-are-zero, flush-to-zero).

From an algorithmic perspective, numerical methods built from these primitives can scale differently in error even if time complexity is the same. For example, naive summation of n terms is O(n) time and O(1) space, as is Kahan summation. Yet their numerical error behaves very differently: naive error grows roughly like n·u (more precisely γ_n = nu/(1−nu)), while Kahan's error is close to one rounding (u plus tiny higher-order terms). Thus, when n is large or when terms vary greatly in magnitude, compensated summation is preferred despite identical big-O runtime.

Space complexity is typically O(1) for arithmetic and small kernels (e.g., dot products, reductions), but vectorized or blocked algorithms may temporarily allocate O(n) buffers. Many numerical algorithms trade space for stability (e.g., storing partial sums or using pairwise summation in a binary tree, which is O(n) time and O(log n) depth on parallel hardware). In summary, while time complexity often remains unchanged, numerically stable formulations can vastly improve accuracy without asymptotic cost, and hardware details (FMA, subnormals) affect practical performance constants.

Code Examples

Decompose a double into IEEE 754 sign, exponent, and fraction bits
#include <bits/stdc++.h>
using namespace std;

struct DoubleParts {
    uint64_t sign;     // 0 or 1
    uint64_t exponent; // 11 bits
    uint64_t fraction; // 52 bits
};

DoubleParts dissectDouble(double x) {
    uint64_t bits = 0;
    static_assert(sizeof(double) == sizeof(uint64_t), "Unexpected double size");
    memcpy(&bits, &x, sizeof(bits)); // Safe bit reinterpretation
    DoubleParts p{};
    p.sign = (bits >> 63) & 0x1ull;
    p.exponent = (bits >> 52) & 0x7ffull;      // 11 bits
    p.fraction = bits & 0x000fffffffffffffull; // 52 bits
    return p;
}

int main() {
    cout.setf(std::ios::boolalpha);

    vector<double> samples = {1.0, -0.0, 0.1,
                              numeric_limits<double>::infinity(),
                              -numeric_limits<double>::infinity(),
                              numeric_limits<double>::quiet_NaN(),
                              numeric_limits<double>::denorm_min()};

    for (double x : samples) {
        DoubleParts p = dissectDouble(x);
        cout << "x = " << x << "\n";
        cout << "  sign:     " << p.sign << "\n";
        cout << "  exponent: 0x" << hex << setw(3) << setfill('0') << p.exponent << dec
             << " (" << p.exponent << ")\n";
        cout << "  fraction: 0x" << hex << setw(13) << setfill('0') << p.fraction << dec << "\n";

        // Classify special cases
        bool isZero = (p.exponent == 0 && p.fraction == 0);
        bool isSubnormal = (p.exponent == 0 && p.fraction != 0);
        bool isInf = (p.exponent == 0x7ff && p.fraction == 0);
        bool isNaN = (p.exponent == 0x7ff && p.fraction != 0);

        cout << "  zero? " << isZero << "\n";
        cout << "  subnormal? " << isSubnormal << "\n";
        cout << "  infinity? " << isInf << "\n";
        cout << "  NaN? " << isNaN << "\n";

        if (!isZero && !isSubnormal && !isInf && !isNaN) {
            int64_t unbiased = static_cast<int64_t>(p.exponent) - 1023; // bias for double
            cout << "  unbiased exponent (e - bias): " << unbiased << "\n";
        }
        cout << "\n";
    }
}

This program inspects the underlying IEEE 754 binary64 (double) layout using memcpy to avoid aliasing issues. It extracts the sign, exponent, and fraction fields, prints them in hex, and classifies each sample as zero, subnormal, infinity, or NaN. For normal numbers, it also prints the unbiased exponent. This helps visualize how decimal-looking values like 0.1 are actually represented and why special values have distinct bit patterns.

Time: O(1) per number (constant time to dissect and print)
Space: O(1)
Kahan compensated summation vs naive summation
#include <bits/stdc++.h>
using namespace std;

template <class T>
T naive_sum(const vector<T>& a) {
    T s = 0;
    for (T v : a) s += v; // Each addition rounds; errors may accumulate
    return s;
}

template <class T>
T kahan_sum(const vector<T>& a) {
    T s = 0; // Running sum
    T c = 0; // Compensation for lost low-order bits
    for (T v : a) {
        T y = v - c;     // First, subtract the compensation
        T t = s + y;     // Add the corrected value
        c = (t - s) - y; // New compensation captures low-order roundoff
        s = t;
    }
    return s;
}

int main() {
    // Construct a numerically challenging sum: one big number plus many tiny ones
    const int n = 1000000;
    vector<double> a;
    a.reserve(n + 1);
    a.push_back(1e9);
    for (int i = 0; i < n; ++i) a.push_back(1e-6);

    double s1 = naive_sum(a);
    double s2 = kahan_sum(a);

    const double expected = 1e9 + 1.0; // one million terms of 1e-6 sum to about 1.0

    cout.setf(std::ios::fixed);
    cout << setprecision(10);
    cout << "Naive sum: " << s1 << "\n";
    cout << "Kahan sum: " << s2 << "\n";
    cout << "Expected (approx): " << expected << "\n";
    cout << "Absolute error naive: " << fabs(s1 - expected) << "\n";
    cout << "Absolute error Kahan: " << fabs(s2 - expected) << "\n";
}

We compare naive accumulation to Kahan’s compensated summation on a difficult dataset (one large term and one million small terms). The naive method loses low-order bits when the running total is large: each addition of 1e-6 must round against the coarse spacing near 1e9, and those lost bits accumulate over a million steps. Kahan’s method tracks the lost bits in a compensation variable and adds them back later, significantly reducing total error without changing big-O time.

Time: O(n) to traverse the array once
Space: O(1) extra space beyond the input array
Robust floating-point comparisons: absolute/relative and ULP-based
#include <bits/stdc++.h>
using namespace std;

// Absolute/relative tolerance comparison
bool approx_equal(double a, double b, double rel_tol = 1e-12, double abs_tol = 1e-15) {
    // Handle infinities and NaNs explicitly
    if (std::isnan(a) || std::isnan(b)) return false;  // NaN is never equal
    if (std::isinf(a) || std::isinf(b)) return a == b; // only equal if same infinity

    double diff = fabs(a - b);
    if (diff <= abs_tol) return true; // near zero

    double max_ab = max(fabs(a), fabs(b));
    return diff <= rel_tol * max_ab;
}

// ULP distance for IEEE 754 double (orders all bit patterns monotonically)
static inline uint64_t to_ordered_uint(double x) {
    uint64_t u;
    memcpy(&u, &x, sizeof(u));
    // Map negatives to the high range so that increasing numeric values
    // have increasing ordered integers
    return (u & (1ull << 63)) ? (~u + 1ull) : (u | (1ull << 63));
}

uint64_t ulp_distance(double a, double b) {
    if (std::isnan(a) || std::isnan(b)) return UINT64_MAX; // undefined distance
    uint64_t ua = to_ordered_uint(a);
    uint64_t ub = to_ordered_uint(b);
    return (ua > ub) ? (ua - ub) : (ub - ua);
}

int main() {
    double x = 1.0;
    double y = std::nextafter(1.0, 2.0); // adjacent representable number above 1.0

    cout.setf(std::ios::boolalpha);
    cout << "x = " << setprecision(17) << x << ", y = " << y << "\n";
    cout << "approx_equal(x, y) with tight tolerances: " << approx_equal(x, y, 1e-16, 0.0) << "\n";
    cout << "ULP distance(x, y) = " << ulp_distance(x, y) << " (should be 1)\n";

    double a = 1e9;
    double b = a + 0.1; // 0.1 rounds at this magnitude; the sum is close but inexact
    cout << "ULP distance(1e9, 1e9+0.1) = " << ulp_distance(a, b) << "\n";
}

This example shows two practical strategies for comparing doubles. The first uses a combination of absolute and relative tolerances, suitable for most numeric code and stable across scales. The second uses ULP distance by mapping the bit patterns to a monotonic integer order; it counts how many representable numbers lie between the two values. We also demonstrate std::nextafter to move by exactly one ULP.

Time: O(1) for each comparison and ULP distance computation
Space: O(1)
#ieee 754#floating point#machine epsilon#unit roundoff#ulps#kahan summation#numerical stability#subnormal numbers#rounding modes#fused multiply-add#nextafter#nan infinity#compensated summation