Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in $\{\pm 1, \pm i\}$
Key Summary
- •Fairy2i turns any pre-trained real-valued Transformer layer into an exactly equivalent complex form, so nothing changes before quantization.
- •It then uses a tiny 2-bit complex codebook {±1, ±i} that keeps both magnitude and phase, packing more information than real binary or ternary sets.
- •A phase-aware quantizer chooses the nearest direction (±1 or ±i) and learns simple scales for real and imaginary parts to match layer statistics.
- •To further reduce error, Fairy2i adds a second (or third) tiny 2-bit correction by recursively quantizing the residual, like tightening a screw in small turns.
- •Inference can be multiplication-free: weights are just add, subtract, swap, and sign-flip operations, with small scales applied at the end.
- •On LLaMA-2 7B, Fairy2i at an effective 2 bits (W2) reaches perplexity 7.85 and 62.00% zero-shot average, close to FP16 and well above state-of-the-art 2-bit PTQ.
- •It reuses existing checkpoints, avoiding the huge cost and instability of training low-bit models from scratch.
- •Storage is tiny: about 1 bit per original real parameter at T=1 and 2 bits at T=2, with per-tensor scale metadata.
- •Accuracy scales well from T=1 to T=2, with diminishing returns at T=3, making T=2 a sweet spot.
- •This bridges complex-number efficiency with the practical world of pre-trained LLMs, enabling fast, cheap, greener deployments on commodity hardware.
Why This Research Matters
This method makes big AI models much smaller and faster without giving up much accuracy, so they can run on everyday hardware. That lowers costs for companies and makes advanced AI tools more accessible for schools, nonprofits, and startups. It saves energy, which is good for the environment and enables greener AI deployments. It also reduces latency, so chatbots and assistants can respond more quickly and work better on-device, improving privacy. Because it reuses existing checkpoints, teams don’t have to spend weeks retraining huge models from scratch. Finally, the approach opens doors for new hardware-friendly kernels, bringing even more speed and efficiency to real-world apps.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you have a huge backpack full of books. You can carry it, but it slows you down and makes every trip longer and more tiring.
🥬 Filling (The Actual Concept): Quantization is like shrinking those books into slim notebooks so you can carry many more with the same energy.
- What it is: Quantization stores a model’s numbers using fewer bits so it uses less memory and runs faster.
- How it works: (1) Pick a small set of allowed numbers (a codebook). (2) Replace each model weight by the nearest codebook number. (3) Use simple math (often integers) instead of heavy floating-point math. (A tiny worked example appears just below.)
- Why it matters: Without quantization, big models are too slow, too expensive, and too power-hungry for many devices.
🍞 Bottom Bread (Anchor): Turning a 16-bit number into a 2-bit number is like going from a full-length novel to a summary card—much smaller and faster to flip through.
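As a toy illustration of the codebook idea (not the paper's method), the snippet below snaps a few made-up weights onto a hypothetical ternary codebook with a single per-tensor scale; every number in it is invented for the example.

```python
import numpy as np

# Hypothetical 1.58-bit ternary codebook and a handful of toy weights.
codebook = np.array([-1.0, 0.0, 1.0])
weights = np.array([0.83, -0.12, 1.45, -0.97])

# One per-tensor scale (mean absolute value is a common, simple choice).
scale = np.abs(weights).mean()

# Snap each scaled weight to its nearest codebook entry, then dequantize.
codes = codebook[np.abs(weights[:, None] / scale - codebook[None, :]).argmin(axis=1)]
dequantized = scale * codes

print("codes      :", codes)
print("dequantized:", dequantized)
print("max error  :", np.abs(weights - dequantized).max())
```

Storing only the tiny codes plus one scale is what shrinks memory; the price is the reconstruction error printed at the end.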
The world before: Large language models (LLMs) kept getting bigger and better. But that also meant heavier memory, more bandwidth, and more compute. People used quantization to reduce cost, often at 4 bits and sometimes 3 bits, with success. When pushing to 1–2 bits, things got tough: performance usually fell a lot because a tiny set of real numbers (like {+1, 0, −1}) can’t mimic the rich shapes of real model weights well enough.
🍞 Top Bread (Hook): You know how compressing a photo too much after it has already been taken can make it blurry?
🥬 Filling (The Actual Concept): Post-Training Quantization (PTQ) tries to compress the model after it’s fully trained, without updating the model.
- What it is: PTQ compresses a trained model using a small calibration set, with no or minimal fine-tuning.
- How it works: (1) Analyze activations on a few samples. (2) Pick scales and codebooks. (3) Replace weights and activations. (4) Hope the model still works well.
- Why it matters: It’s simple and cheap, but at very low bits (1–2), it often can’t dodge large accuracy drops.
🍞 Bottom Bread (Anchor): It’s like squeezing a finished poster into a tiny file after printing; details vanish because you didn’t adjust the design earlier.
🍞 Top Bread (Hook): Training for a race while wearing a heavy backpack makes the real race feel easier.
🥬 Filling (The Actual Concept): Quantization-Aware Training (QAT) practices with the quantizer inside training so the model adapts to its future tiny codebook.
- What it is: A training method that simulates quantization during learning and uses straight-through estimators to pass gradients.
- How it works: (1) Keep full-precision master weights. (2) Quantize copies for forward passes. (3) Backprop through quantizers using STE. (4) Update masters. (A short code sketch follows below.)
- Why it matters: It does better than PTQ at 1–2 bits, but training from scratch is costly and sometimes unstable with such tiny codebooks.
🍞 Bottom Bread (Anchor): Practicing piano on a keyboard with small, stiff keys helps you play a grand piano with ease later.
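The sketch below shows the straight-through estimator (STE) trick in PyTorch with a deliberately simple sign-based quantizer; the quantizer is an illustrative stand-in, not Fairy2i's complex codebook.

```python
import torch

def ste_quantize(w: torch.Tensor) -> torch.Tensor:
    """Quantize in the forward pass, pass gradients straight through.

    Illustrative only: a sign() quantizer with one per-tensor scale,
    not the paper's {+1, -1, +i, -i} complex codebook.
    """
    scale = w.abs().mean()
    w_q = scale * torch.sign(w)      # hard-quantized copy used in the forward pass
    return w + (w_q - w).detach()    # value equals w_q, gradient is identity w.r.t. w

# The full-precision "master" weight still receives useful gradients.
w = torch.randn(4, 4, requires_grad=True)
x = torch.randn(4)
loss = (ste_quantize(w) @ x).sum()
loss.backward()
print(w.grad)  # non-zero even though sign() has zero gradient almost everywhere
```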
Failed attempts: Real-valued binary or ternary codebooks are too rigid at 1–2 bits; PTQ crumples, and QAT-from-scratch eats time and compute. Another path used complex numbers (like in iFairy) with a smarter 2-bit codebook {±1, ±i}. It kept both magnitude and phase, matching weights better and doing great at low bits. But iFairy needed training from scratch in the complex world, so you couldn’t reuse powerful real LLaMA-like checkpoints easily.
🍞 Top Bread (Hook): Think of music: a note has loudness and pitch. If you only keep loudness, you miss the tune; if you only keep pitch, you miss feel.
🥬 Filling (The Actual Concept): Complex-valued neural networks handle numbers with magnitude and phase (like loudness and pitch).
- What it is: Networks that use complex numbers to represent more structure in fewer bits.
- How it works: (1) Treat data as real and imaginary parts. (2) Use complex operations that preserve phase and magnitude relationships. (3) Map tasks in ways that exploit both.
- Why it matters: At very low bits, phase gives an extra degree of freedom to fit patterns better.
🍞 Bottom Bread (Anchor): Keeping both loudness and pitch lets you recognize a song even at low audio quality.
The gap: We needed a way to keep using existing real-valued checkpoints but still enjoy complex numbers’ 2-bit superpowers. That means (1) convert real layers into an exactly equivalent complex form (no behavior change), (2) apply a stable 2-bit complex quantizer, and (3) add tiny extra corrections if needed—without retraining from scratch.
Real stakes: This matters because lighter models mean cheaper cloud bills, faster responses, greener energy use, and even running decent LLMs on laptops or phones. For students and startups, it makes advanced AI more accessible. For everyone, it means smarter assistants can live closer to you, with lower latency and better privacy.
02 Core Idea
🍞 Top Bread (Hook): Imagine translating a story into another language word-for-word so perfectly that every meaning stays the same—then using special shorthand in that new language to save space.
🥬 Filling (The Actual Concept): The key insight: Any real linear layer can be rewritten as a widely-linear complex layer that behaves exactly the same, and then quantized with a tiny complex codebook {±1, ±i} plus small residual corrections.
- What it is: A universal, lossless reparameterization from real to complex (widely-linear), followed by phase-aware 2-bit quantization and optional recursive residual quantization.
- How it works: (1) Convert real weights to an equivalent complex pair (U for linear, W for conjugate-linear), preserving outputs. (2) Quantize each complex weight by picking the nearest direction (±1 or ±i) and learning two simple scales for real and imaginary axes. (3) If needed, quantize the leftover error once or twice more and add it in. (4) Use multiplication-free adds/subs/swaps at inference.
- Why it matters: Without this, you must choose between weak PTQ at 1–2 bits or expensive QAT-from-scratch; with this, you reuse your favorite checkpoint and still enjoy near full-precision performance at 2 bits.
🍞 Bottom Bread (Anchor): It’s like perfectly converting a full-sized Lego ship into two types of snap blocks, then snapping on a couple of tiny correction pieces so the mini-ship looks and works almost exactly like the big one.
Three analogies for the same idea:
- Translation analogy: We translate real layers into complex layers without changing the meaning (outputs). Then we write the translated text using a 4-symbol shorthand (±1, ±i), and finally add a couple of footnotes (residuals) to match the original perfectly.
- Compass analogy: Each complex weight is a little arrow. We snap it to the closest compass direction (east, west, north, south) and keep separate length scales for east–west and north–south. If it’s off by a bit, we add a tiny second arrow to fix it.
- Bakery analogy: Every pastry (weight) must be boxed into one of four small boxes (±1, ±i). We use two stretching wraps (real and imaginary scales) to fit the pastry snugly, and if one corner still pokes out, we add a mini wrap (residual) until it’s neat.
Before vs. After:
- Before: Low-bit real codebooks (binary/ternary) could not match pre-trained layers well, causing big drops; complex 2-bit models performed better but needed training from scratch.
- After: We keep the beloved pre-trained real model, convert it exactly to complex, quantize it with phase-aware {±1, ±i}, and optionally add 1–2 small residual terms, restoring performance near FP16 at just 2 effective bits.
Why it works (intuition without equations):
- Complex numbers carry angle (phase) as well as size (magnitude), which lets us represent more shapes with fewer bits. The four directions ±1, ±i cover the unit circle’s axes evenly, so every complex weight has a close-by direction.
- Widely-linear maps include both a complex-linear part and a conjugate-linear part; together they exactly capture any real linear operation over paired dimensions. That’s why the pre-quantization behavior is unchanged.
- Recursive residual quantization lets us fix leftover errors with tiny extra terms; each step removes most of what remains, so T=2 is often enough.
🍞 Top Bread (Hook): Picture mixing red and blue paint to get purple, and having a recipe that can exactly undo the mix to get back red and blue.
🥬 Filling (The Actual Concept): Widely-linear transformation is that recipe: it rewrites any real linear layer as the sum of two complex parts (U and W) that together act exactly like the original.
- What it is: An exact, unique reparameterization of a real layer into complex-linear and conjugate-linear components.
- How it works: (1) Pair real channels into real/imaginary. (2) Solve for U (linear) and W (conjugate-linear) from the original real matrix. (3) Use them to compute identical outputs as before.
- Why it matters: It preserves behavior pre-quantization, letting us reuse checkpoints.
🍞 Bottom Bread (Anchor): It’s like rebuilding a Lego wall from squares into two interlocking zig-zag strips that, when joined, make the same wall.
🍞 Top Bread (Hook): You know compass directions N, S, E, W. If you force every arrow to one of those, you can still point roughly the right way.
🥬 Filling (The Actual Concept): Phase-aware quantization picks the closest of {±1, ±i} (the four directions) and learns two scales to match magnitudes.
- What it is: A 2-bit complex quantizer that keeps direction (phase) and uses simple scales for size.
- How it works: (1) Map each weight to ±1 or ±i by angle. (2) Learn one scale for the real axis and one for the imaginary axis. (3) Use STE to train with quantized copies.
- Why it matters: Using phase makes 2 bits go further than real binary or ternary.
🍞 Bottom Bread (Anchor): Choosing N, S, E, or W plus a ruler for each axis is enough to redraw most arrows very accurately.
🍞 Top Bread (Hook): When you tidy your room, you do one big sweep, then a smaller sweep to catch leftovers.
🥬 Filling (The Actual Concept): Recursive residual error quantization adds 1–2 small low-bit corrections after the first quantization.
- What it is: Iteratively quantizing the leftover difference and summing the tiny terms.
- How it works: (1) Quantize once. (2) Compute residual. (3) Quantize residual with same {±1, ±i} and scales. (4) Add it. (5) Repeat a small number of times.
- Why it matters: A tiny extra bit or two can close most of the gap to full precision.
🍞 Bottom Bread (Anchor): It’s like tightening a jar lid: the first twist gets you close, the second makes it snug.
03 Methodology
High-level pipeline: Input (pre-trained real checkpoints) → Step A: Widely-linear transformation (exact reparameterization) → Step B: Phase-aware complex quantization (2-bit {±1, ±i}) → Step C: Recursive residual quantization (optional T=2 or T=3) → Output: Multiplication-free, low-bit complex LLM ready for inference.
Step A: Widely-linear transformation (exact, behavior-preserving)
- What happens: We take each real linear layer that maps 2m real features to 2n real features and pair channels as real and imaginary parts. We solve once for two complex matrices U and W so that the complex output y equals Ux + W·conj(x). This exactly reproduces the original real layer’s outputs for all inputs (a numerical check of this equivalence appears after this step). In attention, we keep the same scores by using the real part of the Hermitian inner product, which matches the original dot product when you unstack.
- Why this step exists: Without an exact bridge, you couldn’t reuse existing real-valued checkpoints safely. You’d either approximate or retrain from scratch; both are costly or risky.
- Example with data: Suppose a real layer takes 4 numbers in and outputs 4 numbers. We group inputs as x_re=(x1,x2) and x_im=(x3,x4) to form x=x_re + i x_im. We compute U and W once from the old weight matrix. Feeding any input through either the old layer or the new widely-linear one gives identical outputs before we quantize.
Secret sandwich reminder: 🍞 Hook: Imagine switching from writing left-to-right to top-to-bottom but ending with the same poem. 🥬 Concept: Widely-linear transformation exactly rewrites the layer in complex form without changing outputs. 🍞 Anchor: You can check on a sample input; both old and new layers agree bit-for-bit before quantization.
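To make Step A concrete, here is a minimal NumPy sketch assuming one particular channel-pairing convention (the first half of each vector holds the real parts, the second half the imaginary parts); the paper's exact pairing may differ, but the spirit of the equivalence check is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2                                   # complex dims; the real layer is (2n x 2m)
A = rng.standard_normal((2 * n, 2 * m))       # stands in for a pre-trained real weight

# Assumed pairing: first m/n channels are real parts, the remaining ones imaginary parts.
A11, A12 = A[:n, :m], A[:n, m:]
A21, A22 = A[n:, :m], A[n:, m:]

# One standard widely-linear decomposition: U is complex-linear, W is conjugate-linear.
U = 0.5 * (A11 + A22) + 0.5j * (A21 - A12)
W = 0.5 * (A11 - A22) + 0.5j * (A21 + A12)

# Check exact equivalence on a random input.
x_real = rng.standard_normal(2 * m)
x = x_real[:m] + 1j * x_real[m:]

y_ref = A @ x_real                            # original real layer
y_cplx = U @ x + W @ np.conj(x)               # widely-linear complex layer
y_back = np.concatenate([y_cplx.real, y_cplx.imag])

print(np.allclose(y_ref, y_back))             # True: identical outputs before quantization
```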
Step B: Phase-aware complex quantization ({±1, ±i} with axis scales)
- What happens: For each complex weight w, pick the nearest of {+1, −1, +i, −i} by its angle (phase). Then compute two scales: one for weights mapped to ±1 (the real axis) and one for weights mapped to ±i (the imaginary axis). Replace w by s_re * b_re + i * s_im * b_im, where b_re,b_im are in {−1,0,+1} according to the chosen codeword. (A code sketch follows this step.)
- Why this step exists: At 2 bits, keeping direction information is gold. Real codebooks can’t store direction (phase), but complex ones do, capturing shape with very few symbols.
- Example with data: If w = 1.2 + 0.1i, its angle is close to +1 (the real axis). We set b(w)=+1, include it in the real-axis group to compute s_re (say s_re≈1.1), and s_im is learned from the ±i group separately. The dequantized w_hat becomes approximately 1.1 + 0i, close to the original. Another weight near +i would map to +i and use s_im instead.
Secret sandwich reminder: 🍞 Hook: You know how rounding a direction to N, S, E, W is faster than storing every compass degree? 🥬 Concept: Phase-aware quantization keeps direction via {±1, ±i} and scales axis lengths. 🍞 Anchor: A weight at 45° sits exactly between +1 and +i, so a simple tie-break picks one, and the scale nudges its length.
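Here is a minimal sketch of the quantizer described above, with one assumption made explicit: the paper learns the two axis scales during QAT, while this stand-in simply uses the mean absolute value of each group.

```python
import numpy as np

def phase_aware_quantize(w: np.ndarray):
    """Snap each complex weight to the nearest of {+1, -1, +i, -i} by phase,
    with one scale per axis (mean |.| stands in for learned scales)."""
    real_axis = np.abs(w.real) >= np.abs(w.imag)          # within 45 degrees of the real axis
    b = np.where(real_axis, np.sign(w.real), 1j * np.sign(w.imag))

    s_re = np.abs(w.real[real_axis]).mean() if real_axis.any() else 0.0
    s_im = np.abs(w.imag[~real_axis]).mean() if (~real_axis).any() else 0.0

    w_hat = np.where(real_axis, s_re * b, s_im * b)       # dequantized weights
    return b, w_hat, (s_re, s_im)

w = np.array([1.2 + 0.1j, -0.3 + 0.9j, 0.05 - 0.7j, -0.8 - 0.2j])
codes, w_hat, (s_re, s_im) = phase_aware_quantize(w)
print(codes)          # nearest directions: +1, +i, -i, -1
print(w_hat)          # approximately 1.0, 0.8i, -0.8i, -1.0
print(s_re, s_im)     # 1.0 0.8
```

Note how the first weight from the running example (1.2 + 0.1i) lands on +1 with a real-axis scale of about 1.0 here, matching the intuition in the bullet above.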
Step C: Recursive residual error quantization (T small, e.g., 2)
- What happens: After the first quantization, there’s a small difference r = w − w_hat. Apply the exact same phase-aware step to r, get a tiny correction r_hat, and add it: w_hat2 = w_hat + r_hat. Repeat once more if needed. We store the small codewords and scales per stage. (A short sketch of the loop follows this step.)
- Why this step exists: One more 1–2 bits per parameter gives a big accuracy boost and closes most of the remaining gap.
- Example with data: If the first w_hat overshoots by 0.1 − 0.05i, we quantize that residual and add it back. Now the total error might drop to a few thousandths.
Secret sandwich reminder: 🍞 Hook: Cleaning a whiteboard takes one wipe, then a second light wipe for the faint smudges. 🥬 Concept: Recursive residual quantization reuses the same tiny codebook to fix leftovers. 🍞 Anchor: After T=2, the curve on a validation plot usually hugs the full-precision line closely.
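Below is a compact sketch of the T-step loop, reusing the same illustrative quantizer as in Step B (mean-absolute-value scales again stand in for the learned ones).

```python
import numpy as np

def quantize_once(w: np.ndarray) -> np.ndarray:
    """One pass of the illustrative phase-aware {+1, -1, +i, -i} quantizer (see Step B)."""
    real_axis = np.abs(w.real) >= np.abs(w.imag)
    b = np.where(real_axis, np.sign(w.real), 1j * np.sign(w.imag))
    s_re = np.abs(w.real[real_axis]).mean() if real_axis.any() else 0.0
    s_im = np.abs(w.imag[~real_axis]).mean() if (~real_axis).any() else 0.0
    return np.where(real_axis, s_re * b, s_im * b)

def recursive_residual_quantize(w: np.ndarray, T: int = 2) -> np.ndarray:
    """Sum of T low-bit terms: quantize, subtract, quantize the leftover, repeat."""
    w_hat = np.zeros_like(w)
    for _ in range(T):
        w_hat += quantize_once(w - w_hat)   # quantize the current residual
    return w_hat

rng = np.random.default_rng(0)
w = rng.standard_normal(8) + 1j * rng.standard_normal(8)
for T in (1, 2, 3):
    err = np.abs(w - recursive_residual_quantize(w, T)).mean()
    print(f"T={T}: mean reconstruction error {err:.4f}")
```

Each extra stage spends a little more storage per complex weight to shave off what the previous stages left behind.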
Attention compatibility and kernels
- What happens: Self-attention scores use Re(q^H k), which matches the original real dot product when stacking. Softmax and value aggregation remain real-valued per part, and the output projection is widely-linear too. Because complex vectors can be unstacked back to longer real vectors, highly optimized kernels (like FlashAttention) still apply. (A quick numerical check appears below.)
- Why this step exists: No kernel lock-in; we can still use popular high-performance attention implementations.
- Example: Take q=(1+i, 2−i) and k=(−1+i, 0.5+i). The real part of q^H k equals the original stacked dot product; scores and softmax remain identical pre-quantization.
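A quick numerical check of this claim, under the same assumed channel pairing as in Step A (first half real parts, second half imaginary parts):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                                      # complex head dimension (2d real dims)
q_real = rng.standard_normal(2 * d)        # original stacked real query
k_real = rng.standard_normal(2 * d)        # original stacked real key

# Assumed pairing: first d entries are real parts, last d are imaginary parts.
q = q_real[:d] + 1j * q_real[d:]
k = k_real[:d] + 1j * k_real[d:]

score_real = q_real @ k_real               # original attention logit
score_cplx = np.real(np.vdot(q, k))        # Re(q^H k); np.vdot conjugates its first argument

print(np.isclose(score_real, score_cplx))  # True: scores are unchanged before quantization
```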
Storage and compute efficiency
- Storage: Original real layer had 4nm real parameters. Widely-linear gives 2nm complex weights. With a 4-symbol codebook, each complex weight takes 2 bits, totaling 4nm bits, i.e., 1 bit per original real parameter. With T=2, it’s 2 bits per original real parameter.
- Multiplication-free inference: Because codes are in {±1, ±i}, multiply becomes add/subtract and swap/sign-flip, with small axis scales applied after accumulation. For CPUs, four codes can be packed into one byte and looked up via LUTs. (See the sketch after this list.)
- Why this matters: Saves energy and latency, and enables fast on-device runs.
- Example: For a matrix-vector multiply, each ±1 entry adds or subtracts an activation; each ±i swaps real/imag parts and flips a sign. Then multiply once by s_re or s_im per output channel.
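To show why the inner loop needs no multiplications, here is an illustrative accumulation routine in which each code in {±1, ±i} becomes an add/subtract or a swap-plus-sign-flip; a production kernel would pack the 2-bit codes and apply the per-axis scales to each group after accumulation.

```python
import numpy as np

def accumulate_mult_free(codes, x_re, x_im):
    """Compute sum_j code_j * x_j using only adds, sign flips, and swaps.

    codes: values from {+1, -1, +1j, -1j}; x_re, x_im: activation parts.
    Illustrative Python; real kernels would use packed codes, LUTs, or SIMD.
    """
    acc_re, acc_im = 0.0, 0.0
    for c, a, b in zip(codes, x_re, x_im):
        if c == 1:         # +1: add as-is
            acc_re += a; acc_im += b
        elif c == -1:      # -1: flip both signs
            acc_re -= a; acc_im -= b
        elif c == 1j:      # +i: i*(a + ib) = -b + ia  -> swap parts, flip one sign
            acc_re -= b; acc_im += a
        else:              # -i: -i*(a + ib) = b - ia  -> swap parts, flip the other sign
            acc_re += b; acc_im -= a
    return acc_re + 1j * acc_im

rng = np.random.default_rng(2)
codes = rng.choice(np.array([1, -1, 1j, -1j]), size=16)
x = rng.standard_normal(16) + 1j * rng.standard_normal(16)

reference = np.sum(codes * x)                              # direct complex multiply
mult_free = accumulate_mult_free(codes, x.real, x.imag)    # add/sub/swap only
print(np.isclose(reference, mult_free))                    # True
```

On CPUs, the if/else branches would be replaced by a lookup table keyed by the packed 2-bit codes, as noted above.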
Secret sauce summary
- Exact equivalence: The widely-linear reparameterization is lossless, so we start from a perfect copy of the pretrained model.
- Phase-aware 2-bit codebook: {±1, ±i} uses the full 2-bit budget and preserves direction.
- Tiny, powerful correction: Residual quantization (T=2) gives a big quality jump for a tiny bit-cost.
- Efficient inference path: Mostly adds/subs/swaps plus small scalar scales, friendly to GPUs and CPUs.
04 Experiments & Results
The test: The authors measure language modeling quality and generalization under extreme compression. They use C4 perplexity (lower is better) and zero-shot accuracy on five common-sense reasoning tasks (ARC-Easy, ARC-Challenge, HellaSwag, PIQA, Winogrande). This covers both fluent prediction and out-of-distribution reasoning.
The competition: They compare against strong real-valued QAT baselines at similar bit-budgets (1-bit binary and 1.58-bit ternary), as well as well-known PTQ methods (GPTQ, QuIP#, AQLM). The reference upper bound is the original FP16 LLaMA-2 7B.
Scoreboard with context
- FP16 LLaMA-2 7B: C4 PPL 6.63; Zero-shot avg 64.72%. This is the A+ student with full-sized textbooks.
- Real-Binary QAT (1-bit): PPL 11.75; Avg 46.21%. This is like scoring a middling C when forced into a too-tiny set {+1, −1}.
- Real-Ternary QAT (1.58-bit): PPL 11.06; Avg 48.70%. A small bump over binary, still far from FP16.
- PTQ methods: QuIP# (2-bit) PPL 11.01; GPTQ (3-bit) PPL 10.61; AQLM (2-bit) PPL 8.54 and Avg 57.28%. These do reasonably well at 2–3 bits but struggle to match full precision.
- Fairy2i-W1 (effective 1 bit per original real parameter): PPL 11.03; Avg 48.66%. At about 1 bit, it matches or slightly surpasses strong real ternary at 1.58 bits, showing the efficiency of complex phase.
- Fairy2i-W2 (effective 2 bits): PPL 7.85; Avg 62.00%. This is a big leap: PPL close to FP16 (6.63) and avg accuracy near 64.72%. It surpasses leading 2-bit PTQ (AQLM PPL 8.54, Avg 57.28%) and even beats GPTQ at 3 bits on PPL (10.61).
What these numbers mean
- Perplexity: Dropping from 11-ish to 7.85 is like improving your reading guess game from guessing every other word wrong to getting most words right. It signals a strong recovery toward FP16.
- Zero-shot accuracy: 62.00% is near the 64.72% FP16 line, which is like scoring an A- next to the top student’s A.
Surprising findings
- T=2 sweet spot: Going from T=1 to T=2 yields a big jump (about 20.76% PPL improvement and roughly 19 points of average zero-shot accuracy in their ablation). But T=3 shows diminishing returns: only small extra gains for a 50% memory increase over W2. So T=2 hits a great accuracy-storage tradeoff.
- Stability with WSD LR schedule: The Warmup–Stable–Decay learning rate pattern consistently converges, with the decay phase clearly helping to reach lower final loss. This is reassuring because ultra-low-bit training can be fragile; here, it’s fairly robust.
- Efficient inference path: The method enables multiplication-free accumulation for the main loops, with only axis scales applied. On CPUs, LUT packing of 2-bit codes can further accelerate execution—practical for edge devices.
Takeaways for practitioners
- If you can afford a short QAT phase, Fairy2i-W2 likely gets you near FP16 quality at about 2 bits per original real parameter, beating well-known 2-bit PTQ methods on both PPL and zero-shot average.
- If you must be ultra-tiny, Fairy2i-W1 (about 1 bit per original real parameter) matches or edges ternary real QAT at 1.58 bits—impressive density.
- Tuning the learning rate schedule matters, but WSD gives a safe default.
05 Discussion & Limitations
Limitations
- Grouping and scaling simplicity: The paper mainly uses per-tensor scales for real/imag axes. While simple and fast, finer-grained per-channel or per-group scaling might further boost accuracy. The current choice trades a little performance for implementation ease.
- Training budget still needed: Although we skip training from scratch, QAT fine-tuning over billions of tokens is still non-trivial. Truly training-free versions might be less accurate at 1–2 bits.
- Kernel maturity: While the math allows multiplication-free inference, peak-speed kernels for all platforms (custom CUDA, vectorized CPU LUTs) may need engineering work to fully realize the promised speedups.
- Odd dimensions: The method assumes even dimensions to pair real/imag components. Padding solves this, but it’s a tiny implementation wrinkle.
Required resources
- A pre-trained real-valued checkpoint (e.g., LLaMA-2 7B). Some GPU hours for QAT fine-tuning using a solid LR schedule (like WSD). Standard deep learning stacks (PyTorch) suffice; optional custom kernels improve deployment.
When not to use
- If you need immediate, training-free compression at 2–3 bits with zero compute budget, a strong PTQ method may be simpler, though typically less accurate at 1–2 bits.
- If your platform cannot handle complex-number packing or lacks engineering time for LUT/parallel execution, you might prefer mature integer PTQ pipelines.
- If your model is very small and runs fine in FP16 or 8-bit, the complexity overhead may not be worth it.
Open questions
- Grouping strategies: Would per-group axis scales or learned rotations further tighten the fit without large overhead?
- Larger scales and modalities: How does this perform for huge models (e.g., 70B) and for multimodal transformers? Do phase-aware benefits grow with scale?
- Theory: What properties of the complex loss landscape explain the observed robustness at 2 bits? Can we prove stronger convergence or error bounds for recursive residual quantization?
- Hardware co-design: With kernels co-designed for ±1, ±i codebooks, how far can we push inference speed and energy savings on GPUs, CPUs, and NPUs?
06 Conclusion & Future Work
Three-sentence summary: Fairy2i shows that any pre-trained real Transformer layer can be turned into an exactly equivalent complex widely-linear form, so nothing changes before quantization. On top of that, a phase-aware 2-bit codebook {±1, ±i} plus one small residual correction step (T=2) restores accuracy close to FP16, beating strong 2-bit PTQ baselines. Inference can be implemented with mostly add/sub/swap operations and tiny scales, enabling fast, low-power deployment.
Main achievement: Bridging the gap between complex-number efficiency and the practical world of pre-trained checkpoints with a lossless transformation, then using a 2-bit phase-aware codebook and residual quantization to achieve near full-precision accuracy at extreme compression.
Future directions: Build specialized kernels to harvest the full multiplication-free speedups; scale to larger and multimodal models; explore smarter grouping/scaling and rotations; analyze the complex-valued loss landscape; and train longer to see if Fairy2i can even surpass the original full-precision baselines.
Why remember this: It proves you don’t have to choose between keeping your favorite real checkpoint and getting elite 2-bit performance. By converting exactly to complex, keeping phase with {±1, ±i}, and cleaning up with a tiny residual, Fairy2i makes super-lean, near-FP16 LLMs practical on everyday hardware.
Practical Applications
- •Deploy near-FP16-quality LLMs on consumer GPUs or CPUs with tight memory limits.
- •Run private on-device assistants on laptops or phones with lower latency and improved privacy.
- •Serve more concurrent users per server in the cloud by cutting model memory and compute.
- •Reduce energy costs and carbon footprint for large-scale AI services.
- •Speed up batch inference on CPUs using LUT-based kernels for 2-bit complex weights.
- •Enable edge devices (retail kiosks, robots, IoT gateways) to host stronger language models.
- •Compress domain-specialized LLMs (medical, legal) without retraining from scratch.
- •Accelerate research cycles by quickly quantizing many checkpoints and comparing results.
- •Create teachable, compact student models for classrooms and offline learning.
- •Prototype multimodal low-bit transformers by reusing real-valued backbones in complex form.