
RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes

Beginner
Yuan-Kang Lee, Kuan-Lin Chen, Chia-Che Chang et al. Ā· 1/8/2026
arXiv Ā· PDF

Key Summary

  • This paper teaches a camera to fix nighttime colors by combining a smart rule-based color trick (SGP-LRD) with a learning-by-trying helper (reinforcement learning).
  • The camera problem is Auto White Balance (AWB): making whites look white even under streetlights, neon signs, or moonlight.
  • Night scenes are hard because there’s little light, lots of noise, and many different lamps, so old methods guess poorly and deep models often fail on new cameras.
  • Their new algorithm, SGP-LRD, finds trustworthy gray-ish pixels and measures local differences to estimate the scene’s light safely in noise.
  • A reinforcement learning (RL) agent then nudges two key knobs (the gray-pixel percent N and the Minkowski order p) per image, like a careful tuner.
  • They train the agent with Soft Actor-Critic and an easy-to-hard curriculum so it learns fast from very few images (as few as five).
  • They also release LEVI, the first multi-camera nighttime dataset, to test if methods work across different sensors.
  • Results show lower color errors than previous methods at night and strong generalization to new cameras and even daytime scenes.
  • The approach is interpretable (you can see what the knobs do), data-efficient, and robust to sensor changes.
  • This is the first reinforcement learning framework for automatic white balance, opening a path to smarter, adaptable camera pipelines.

Why This Research Matters

Good nighttime color isn’t just about pretty photos—it affects safety, trust, and decision-making. Phones can produce clearer, more natural night shots without heavy training data, making everyone’s memories look real. Security cameras gain reliable colors under messy streetlights, helping people and software judge scenes more accurately. Cars and robots working in the dark need stable color cues to avoid mistakes caused by weird light tints. Because the method generalizes across different cameras, it can be deployed widely without retraining for every device. Its mix of interpretability and adaptability shows a practical path to smarter, more trustworthy camera pipelines.

Detailed Explanation


01 Background & Problem Definition

šŸž Top Bread (Hook): You know how your eyes quickly adjust when you walk from a bright kitchen into a dark backyard at night? You still know what ā€œwhiteā€ looks like even under a yellow street lamp.

🄬 Filling (The Actual Concept): Auto White Balance (AWB)

  • What it is: AWB is the camera’s skill for making white things look white no matter the lighting.
  • How it works: (1) The camera guesses the color of the light shining on the scene. (2) It applies opposite color gains (like adding blue to cancel yellow). (3) It outputs a photo where colors look natural.
  • Why it matters: Without AWB, your photos would look too yellow, too blue, or strangely green whenever lights change.

šŸž Bottom Bread (Anchor): When you take a photo under warm indoor bulbs, AWB adds some blue so a white shirt doesn’t look orange.
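
To make the gain step concrete, here is a minimal numpy sketch of a von Kries-style correction: scale each channel by a gain derived from the estimated light color. The example lamp color and toy image are made up for illustration.

```python
import numpy as np

def apply_white_balance(raw_rgb, illum_estimate):
    """Von Kries-style correction: scale each channel so the
    estimated illuminant maps to neutral gray."""
    gains = illum_estimate[1] / illum_estimate   # per-channel gains, green-normalized
    return np.clip(raw_rgb * gains, 0.0, 1.0)

# Toy example: a warm (yellowish) lamp gets canceled by boosting blue.
warm_light = np.array([0.55, 0.35, 0.10])
scene = np.random.rand(4, 4, 3) * warm_light     # gray-world scene under that lamp
balanced = apply_white_balance(scene, warm_light)
print(balanced.mean(axis=(0, 1)))                # channel means now roughly equal
```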

šŸž Top Bread (Hook): Imagine trying to read a book with a tiny flashlight—everything is dim and a bit fuzzy.

🄬 Filling (The Actual Concept): Low-light Image Processing

  • What it is: Techniques that help cameras make clear, accurate images when there isn’t much light.
  • How it works: (1) Reduce noise, (2) find trustworthy pixels, (3) avoid being tricked by dark, grainy areas.
  • Why it matters: In the dark, random sensor noise can look like real color, confusing the camera’s AWB.

šŸž Bottom Bread (Anchor): A night street photo might look purple and speckly; good low-light processing keeps the scene natural and less noisy.

šŸž Top Bread (Hook): Think of guessing what color a room’s light bulb is—warm yellow or cool white—just by looking at how objects look.

🄬 Filling (The Actual Concept): Illumination Estimation

  • What it is: Figuring out the color of the light shining on the scene so the camera can correct colors.
  • How it works: (1) Examine the image’s color patterns, (2) predict the light’s color, (3) use that to balance the photo.
  • Why it matters: If the light’s color is mis-guessed, all colors shift, and whites won’t look white.

šŸž Bottom Bread (Anchor): Under a greenish streetlamp, the camera estimates ā€œgreen lightā€ and cancels it so people’s faces don’t look sickly.

šŸž Top Bread (Hook): A great chef can cook the same dish perfectly on different stoves.

🄬 Filling (The Actual Concept): Cross-Sensor Generalization

  • What it is: Making a method work well on different camera models with different sensors.
  • How it works: (1) Avoid overfitting to one camera’s quirks, (2) rely on sensor-agnostic cues, (3) adapt per image.
  • Why it matters: A model that looks great on one phone might give weird colors on another unless it generalizes.

šŸž Bottom Bread (Anchor): A color-correction method trained on Camera A should still fix colors on Camera B without turning whites pink.

The world before: In daytime, many AWB methods already work well because there’s plenty of light and the colors are clean. But at night, lights are mixed (neon signs, sodium lamps, car headlights), pictures are dim, and sensor noise is heavy. Classic statistical AWB tricks assume clean signals and enough variety in the scene, which often isn’t true at night. Deep learning methods can learn powerful corrections, but they usually need a lot of labeled data (which is rare for nighttime) and can break when moved to a new camera (different sensors, different pipelines).

The problem: Nighttime AWB becomes unstable—small parameter changes can give very different answers, and models trained on one dataset don’t transfer well to another camera. You need good color but also robustness.

Failed attempts: Fixed-parameter statistical methods are fast but brittle in extreme low light. End-to-end deep models can be accurate in one domain but demand lots of nighttime labels and often fail on new sensors. Some methods use pseudo-labels or heavy training tricks; these can pass along errors or rely on settings that don’t generalize.

The gap: We need a method that (1) works with little data, (2) adapts its settings per image, (3) stays interpretable and sensor-agnostic, and (4) handles noisy night scenes.

Real stakes: This matters for your phone’s night photos, for security cameras that must see true colors in the dark, and for cars that need reliable color cues at night. Getting white balance right improves trust in what we see and what machines decide.

02 Core Idea

The ā€œAha!ā€ in one sentence: Treat nighttime white balance like a careful game of turning two smart knobs on a reliable color rule, using reinforcement learning to adjust them per photo until whites look right.

šŸž Top Bread (Hook): Imagine a student practicing piano: try a little faster, listen, try a little softer, listen—improve step by step.

🄬 Filling (The Actual Concept): Reinforcement Learning (RL)

  • What it is: A way for AI to learn by trying actions and getting rewards when it improves.
  • How it works: (1) See the situation (state), (2) choose an action, (3) get a reward or penalty, (4) learn which actions work best over time.
  • Why it matters: Instead of guessing the correct answer in one shot, RL gets better by practicing and adjusting.

šŸž Bottom Bread (Anchor): The agent tweaks AWB settings on a night photo, checks if the color error shrank, and keeps the helpful tweaks.

šŸž Top Bread (Hook): You know how detectives look for trustworthy clues to tell the true story?

🄬 Filling (The Actual Concept): SGP-LRD (Salient Gray Pixels with Local Reflectance Differences)

  • What it is: A statistical color method that finds reliable gray-ish spots and compares local patterns to estimate the light color, even in noise.
  • How it works: (1) Detect likely gray pixels, (2) filter out noisy or odd ones, (3) give more weight to brighter, trustworthy areas, (4) compare local neighborhoods to reduce noise effects, (5) combine everything to estimate the illumination.
  • Why it matters: At night, noise fakes many pixels; SGP-LRD keeps the dependable ones and uses spatial consistency to resist noise.

šŸž Bottom Bread (Anchor): In a dim alley, it picks stable gray patches on walls or road lines and ignores sparkly noise, then infers the lamp’s tint.

šŸž Top Bread (Hook): Think of measuring how wrong a guess is, like seeing how far your arrow missed the bullseye.

🄬 Filling (The Actual Concept): Angular Error

  • What it is: A way to measure how far the predicted light color is from the true light color.
  • How it works: (1) Compare two color directions, (2) compute the angle between them, (3) smaller angle means better.
  • Why it matters: It gives a simple score the agent can try to reduce.

šŸž Bottom Bread (Anchor): If the true light is slightly yellow and the estimate is also slightly yellow, the angle is small—good job!
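
Angular error is simple to compute; this small numpy sketch uses the standard definition found across color-constancy papers.

```python
import numpy as np

def angular_error_deg(est, gt):
    """Angle (degrees) between estimated and true illuminant color vectors."""
    est, gt = np.asarray(est, float), np.asarray(gt, float)
    cos = np.dot(est, gt) / (np.linalg.norm(est) * np.linalg.norm(gt))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

print(angular_error_deg([0.5, 0.4, 0.1], [0.5, 0.4, 0.1]))  # 0.0: perfect guess
print(angular_error_deg([0.6, 0.3, 0.1], [0.5, 0.4, 0.1]))  # ~12.0: noticeably off
```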

šŸž Top Bread (Hook): Like seasoning soup: a pinch more salt, or a little less spice, until it tastes right.

🄬 Filling (The Actual Concept): The Two Knobs N and p

  • What it is: N is the percentage of pixels we treat as gray candidates; p is a ā€œhow-strongly-to-emphasizeā€ setting (the Minkowski order) used when combining evidence.
  • How it works: (1) Lower N for scenes rich in clean gray areas; raise N if gray cues are rare. (2) Smaller p treats all candidates more evenly; larger p trusts high-confidence pixels more.
  • Why it matters: The best N and p depend on the scene; fixed choices are brittle at night.

šŸž Bottom Bread (Anchor): Under a strongly orange lamp with few clean grays, increase N to widen the search and lower p to avoid over-trusting any single noisy pixel.

Three analogies for the core idea:

  • Thermostat: The agent adjusts heat (N, p) a little up or down, checks the temperature (error), and settles when comfy (low error).
  • Radio tuner: It dials left or right (N, p) to reduce static (noise) until the music (true color) is clear.
  • Cooking: Taste the soup, tweak salt and spice (N, p), taste again, stop when it’s delicious (small error).

Before vs. After:

  • Before: One-size-fits-all parameters or big deep models that struggle on new cameras.
  • After: Per-image, gentle adjustments guided by a robust statistical core, needing very little data and traveling well across sensors.

Why it works (intuition): SGP-LRD gives a sturdy starting point by trusting spatially consistent grays and local comparisons, which naturally ignore random noise. The RL agent doesn’t try to invent colors; it only fine-tunes the two knobs to match the scene. This keeps things interpretable and sensor-agnostic while adding adaptability. With a reward tied to error reduction, the agent learns safe, helpful moves.

Building blocks we’ll use:

  • SGP-LRD: the strong, interpretable color estimator.
  • Knobs N and p: the adjustable parameters.
  • RL agent: the tuner that learns how to move the knobs.

šŸž Top Bread (Hook): When you study, you sometimes start easy and then do harder questions.

🄬 Filling (The Actual Concept): Curriculum Learning

  • What it is: Training that goes from simple tasks to harder ones.
  • How it works: (1) Practice on one image until stable, (2) rotate through a small set of images to build adaptability.
  • Why it matters: It makes learning faster and steadier with few samples.

šŸž Bottom Bread (Anchor): The agent first learns to tune a single night photo, then cycles through five different photos to handle variety.

šŸž Top Bread (Hook): Picking between exploring new paths and using what you know best is like deciding a new hiking route versus the familiar one.

🄬 Filling (The Actual Concept): Soft Actor-Critic (SAC)

  • What it is: An RL method that balances trying new actions (exploration) and sticking with good ones (exploitation).
  • How it works: (1) A policy proposes actions with some randomness, (2) critics judge action quality, (3) the system learns to choose high-reward yet still-exploratory moves.
  • Why it matters: It trains stably and sample-efficiently, perfect when you only have a few images.

šŸž Bottom Bread (Anchor): With SAC, the AWB agent learns reliable knob tweaks but keeps a tiny bit of curiosity to avoid getting stuck.

03 Methodology

At a high level: Input RAW image → SGP-LRD estimates illumination using parameters (N, p) → RL agent reads image statistics and recent moves → agent proposes small deltas to (N, p) → update and re-run SGP-LRD → repeat until stable → output white-balanced image.
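
The loop below is a hedged Python sketch of that flow, not the authors’ code: sgp_lrd, build_state, and policy are illustrative placeholders, and the knob ranges and stopping rule are assumptions.

```python
import numpy as np

def sgp_lrd(img, n, p):
    """Placeholder estimator: Minkowski-p mean over an n-fraction of pixels
    (stand-in for the real gray-pixel selection and weighting)."""
    px = img.reshape(-1, 3)
    cand = px[: max(1, int(n * len(px)))]
    return np.power(np.mean(np.power(cand, p), axis=0), 1.0 / p)

def build_state(img, history):
    """Placeholder state: mean scene color plus the most recent move."""
    last = history[-1] if history else (0.0, 0.0)
    return np.concatenate([img.mean(axis=(0, 1)), last])

def policy(state):
    """Placeholder policy: always proposes the same tiny nudges."""
    return 0.02, -0.1

def rl_awb(raw_image, n=0.10, p=4.0, max_steps=10):
    """Inference loop: observe -> nudge (N, p) -> re-estimate, until stable."""
    history = []
    for _ in range(max_steps):
        state = build_state(raw_image, history)
        d_n, d_p = policy(state)                  # small deltas to the knobs
        n = float(np.clip(n + d_n, 0.01, 1.0))    # keep knobs in valid ranges
        p = float(np.clip(p + d_p, 1.0, 16.0))
        history.append((d_n, d_p))
        if abs(d_n) < 1e-3 and abs(d_p) < 1e-2:   # settled: stop tuning
            break
    return sgp_lrd(raw_image, n, p)

print(rl_awb(np.random.rand(8, 8, 3)))
```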

Step A: Find and trust the right gray pixels

  • What happens: SGP-LRD detects likely gray pixels, removes unreliable ones with two filters (low-variance noise and color outliers), gives higher weights to brighter, cleaner pixels, and uses local neighborhoods to dampen noise.
  • Why this exists: At night, many pixels are too dark or noisy to trust; picking and weighting the right ones stops the estimate from drifting.
  • Example: In a dim street photo, SGP-LRD keeps stable crosswalk lines and ignores sparkly sensor noise in the shadows, then estimates the lamp’s yellow tint.

šŸž Top Bread (Hook): When you check a word’s meaning, you don’t just look at the word—you read nearby words too.

🄬 Filling (The Actual Concept): Gray Pixels

  • What it is: Pixels that should be neutral (Rā‰ˆGā‰ˆB) and so are good clues about the light’s color.
  • How it works: (1) Find pixels whose color channels act similarly, (2) test if they’re consistent across nearby areas, (3) keep the reliable ones.
  • Why it matters: Good gray pixels point straight at the light color; bad ones mislead the estimate.

šŸž Bottom Bread (Anchor): Chalk road markings often serve as strong gray cues; random sparkle in a dark corner does not.

šŸž Top Bread (Hook): When combining many votes, sometimes you want to give super-trustworthy voters more weight.

🄬 Filling (The Actual Concept): Minkowski Order p

  • What it is: A setting for how strongly to emphasize large, confident signals when averaging.
  • How it works: (1) Small p spreads weight more evenly, (2) large p gives more weight to high-confidence pixels.
  • Why it matters: Night scenes vary; sometimes you need to trust the few good pixels more, sometimes not.

šŸž Bottom Bread (Anchor): If you find a very clean gray patch on a sign, a larger p helps it guide the final estimate more strongly.

šŸž Top Bread (Hook): If a forest has few mushrooms, you look wider. If it’s packed, you can be picky.

🄬 Filling (The Actual Concept): Gray-Pixel Percentage N

  • What it is: The fraction of pixels treated as gray candidates.
  • How it works: (1) Lower N when there are many clean gray cues, (2) raise N to widen the net when grays are rare.
  • Why it matters: Picking too few misses clues; too many invites noise.

šŸž Bottom Bread (Anchor): In a neon-lit scene with few neutrals, a higher N helps capture the few useful gray hints.
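
The toy estimator below shows exactly where the two knobs act. It is a shades-of-gray-style simplification, not the full SGP-LRD (no noise filters, brightness weighting, or local reflectance differences), and the grayness score is an assumption for illustration.

```python
import numpy as np

def gray_pixel_estimate(img, n_percent, p):
    """Toy illuminant estimate: knob N selects candidates, knob p pools them."""
    pixels = img.reshape(-1, 3)
    # Grayness score: how close R, G, B are to each other (smaller = grayer).
    chroma = pixels / (pixels.sum(axis=1, keepdims=True) + 1e-8)
    grayness = np.abs(chroma - 1.0 / 3.0).sum(axis=1)
    # Knob N: keep only the top n_percent grayest pixels as candidates.
    k = max(1, int(n_percent * len(pixels)))
    candidates = pixels[np.argsort(grayness)[:k]]
    # Knob p: Minkowski pooling; larger p trusts strong, confident pixels more.
    est = np.power(np.mean(np.power(candidates, p), axis=0), 1.0 / p)
    return est / np.linalg.norm(est)

img = np.random.rand(32, 32, 3) * np.array([0.6, 0.35, 0.05])  # orange-ish lamp
print(gray_pixel_estimate(img, n_percent=0.05, p=1.0))  # weight candidates evenly
print(gray_pixel_estimate(img, n_percent=0.05, p=6.0))  # emphasize strong pixels
```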

Step B: Describe the image to the agent

  • What happens: We build a compact description from color statistics and recent knob moves so the agent can decide what to change next.
  • Why this exists: The agent doesn’t know the ground truth light at test time; it needs rich, label-free clues from the image and memory of recent actions.
  • Example: A histogram of color relationships (RGB-uv) plus a tiny history vector tells the agent, ā€œWe’ve already tried raising N; it helped a bit; what next?ā€

šŸž Top Bread (Hook): To summarize an entire book, you might count how often certain words appear.

🄬 Filling (The Actual Concept): RGB-uv Histogram

  • What it is: A compact summary of color ratios that captures how colors are distributed.
  • How it works: (1) Convert colors to a relative (log-chrominance) space, (2) bin them into a histogram, (3) normalize and flatten for the agent.
  • Why it matters: It gives the agent a bird’s-eye view of scene colors without needing labels.

šŸž Bottom Bread (Anchor): In a photo bathed in orange light, the histogram shows a skew toward warm tones, hinting at the needed correction.
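
A hedged sketch of such a histogram: the (u, v) definition below is one common log-chrominance convention, and the bin count and range are assumptions, not necessarily the paper’s exact settings.

```python
import numpy as np

def rgb_uv_histogram(img, bins=32, lim=2.0):
    """Log-chrominance histogram: u = log(G/R), v = log(G/B).
    A compact, label-free summary of the scene's color distribution."""
    eps = 1e-6
    r, g, b = img[..., 0] + eps, img[..., 1] + eps, img[..., 2] + eps
    u = np.log(g / r).ravel()
    v = np.log(g / b).ravel()
    hist, _, _ = np.histogram2d(u, v, bins=bins,
                                range=[[-lim, lim], [-lim, lim]])
    hist = hist / (hist.sum() + eps)   # normalize to a distribution
    return hist.ravel()                # flatten for the agent's network

feat = rgb_uv_histogram(np.random.rand(16, 16, 3))
print(feat.shape)                      # (1024,) for 32x32 bins
```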

šŸž Top Bread (Hook): When a coach makes the next play call, they consider both the field and the last few plays.

🄬 Filling (The Actual Concept): RL State

  • What it is: The information the agent sees at each step—image color statistics plus a short history of recent adjustments.
  • How it works: (1) Feed the histogram and history through two small neural branches, (2) fuse them for a final state embedding.
  • Why it matters: Without state, the agent would be guessing blindly; without history, it could repeat unhelpful moves.

šŸž Bottom Bread (Anchor): The agent knows it already tried a big increase of p last step, so it might try a small decrease now.
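
A minimal sketch of state construction under those assumptions. The real system feeds each part through a small neural branch before fusing; here we simply pad the action history to a fixed length and concatenate.

```python
import numpy as np

def build_state(hist_feat, action_history, hist_len=4):
    """State = image color statistics + a short memory of recent knob moves."""
    # Flatten the last few (d_N, d_p) moves, zero-padded to a fixed size
    # so the state vector always has the same length.
    flat = [d for move in action_history[-hist_len:] for d in move]
    flat += [0.0] * (2 * hist_len - len(flat))
    return np.concatenate([hist_feat, np.array(flat)])

hist_feat = np.random.rand(1024)               # e.g., the RGB-uv histogram
state = build_state(hist_feat, [(0.02, -0.5), (0.01, 0.2)])
print(state.shape)                             # (1032,)
```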

Step C: Choose small, safe moves

  • What happens: The policy outputs two small deltas—how to nudge N and p—squashed into valid ranges so updates are smooth.
  • Why this exists: Big jumps can overshoot; small coordinated steps converge safely.
  • Example: For a too-warm scene, the agent might slightly raise N and slightly lower p to gather more balanced gray evidence.

šŸž Top Bread (Hook): Adjusting a microscope needs tiny, careful turns.

🄬 Filling (The Actual Concept): RL Action

  • What it is: The agent’s chosen change to the parameters (N and p) at each step.
  • How it works: (1) Propose a continuous move, (2) limit it to a safe range, (3) apply it and re-run the estimator.
  • Why it matters: The path of tiny, smart moves makes the final color reliable.

šŸž Bottom Bread (Anchor): The agent decides ā€œ+0.1 for N, āˆ’0.5 for pā€ and checks if whites look whiter.
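
A tanh squash is the usual way to keep continuous actions inside a safe range; the step-size limits below are illustrative assumptions.

```python
import numpy as np

def squash_action(raw_output, max_dn=0.1, max_dp=1.0):
    """Map the policy's unbounded outputs to small, safe knob deltas:
    tanh keeps every move inside [-max, +max]."""
    d_n = np.tanh(raw_output[0]) * max_dn   # nudge to gray-pixel percent N
    d_p = np.tanh(raw_output[1]) * max_dp   # nudge to Minkowski order p
    return d_n, d_p

print(squash_action(np.array([3.0, -0.4])))  # huge raw output, still a tiny step
```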

Step D: Score progress and learn

  • What happens: We reward the agent when angular error shrinks, penalize needlessly large moves, and give a bonus when it finishes well.
  • Why this exists: Good rewards teach the agent which strategies steadily improve color without wild swings.
  • Example: If error drops from 6° to 3°, that step earns a nice reward; if it gets worse, the agent learns not to repeat that move.

šŸž Top Bread (Hook): When playing a game, you get points for good moves and lose points for risky mistakes.

🄬 Filling (The Actual Concept): RL Reward

  • What it is: A score that gets bigger when the color estimate improves and smaller for bad or too-big moves.
  • How it works: (1) Compare current error to starting error, (2) reward improvements, (3) lightly penalize large actions, (4) give a final bonus for strong overall gains.
  • Why it matters: The reward is the teacher; without it, the agent can’t learn.

šŸž Bottom Bread (Anchor): After three steady improvements, the episode ends and a bonus is awarded for a well-tuned image.
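
A hedged sketch of such a reward: the coefficient, bonus size, and bonus threshold below are placeholders, not the paper’s values.

```python
import numpy as np

def step_reward(err_prev, err_curr, action, lam=0.1):
    """Per-step reward: + improvement in angular error,
    - a small penalty for needlessly large moves."""
    improvement = err_prev - err_curr                 # positive if error shrank
    action_cost = lam * float(np.sum(np.square(action)))
    return improvement - action_cost

def final_bonus(err_init, err_final, bonus=1.0):
    """Episode-end bonus when the overall error dropped substantially."""
    return bonus if err_final < 0.5 * err_init else 0.0

print(step_reward(6.0, 3.0, np.array([0.02, -0.5])))  # ~2.97: a good move
print(final_bonus(6.0, 2.5))                          # 1.0: strong overall gain
```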

Step E: Learn efficiently from few images

  • What happens: We train with Soft Actor-Critic and a two-stage curriculum: first master one image, then cycle through five.
  • Why this exists: It stabilizes learning and extracts maximum value from scarce nighttime data.
  • Example: The agent trains on a loop of five diverse night photos, seeing each many times to learn robust adjustments.

šŸž Top Bread (Hook): Practicing scales before playing full songs makes you a steadier musician.

🄬 Filling (The Actual Concept): Soft Actor-Critic (SAC)

  • What it is: A training method that encourages both high rewards and healthy exploration.
  • How it works: (1) A policy proposes actions with some randomness, (2) twin critics rate them to avoid bias, (3) updates keep the policy effective yet curious.
  • Why it matters: It learns stable, data-efficient strategies—a must for few-shot training.

šŸž Bottom Bread (Anchor): SAC lets the AWB agent keep discovering slightly better knob settings instead of getting stuck early.
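
SAC’s key update target can be written in one line. The sketch below shows the twin-critic minimum and the entropy term with toy numbers; it is not a full agent.

```python
def sac_critic_target(reward, next_q1, next_q2, next_logp,
                      gamma=0.99, alpha=0.2):
    """Soft Actor-Critic's core idea: the twin-critic minimum curbs
    overestimation, and the entropy term (-alpha * log-prob) keeps
    the policy a little bit curious."""
    soft_value = min(next_q1, next_q2) - alpha * next_logp
    return reward + gamma * soft_value

# A rewarding step whose next action was fairly random (low log-prob)
# gets a slightly higher target, which encourages continued exploration.
print(sac_critic_target(reward=2.5, next_q1=4.0, next_q2=4.3, next_logp=-1.2))
```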

šŸž Top Bread (Hook): First ride a bike in a quiet park, then try gentle streets.

🄬 Filling (The Actual Concept): Curriculum Learning (recap)

  • What it is: Training from easy to hard to speed up mastery.
  • How it works: (1) Single image until stable, (2) rotate across five images to broaden skills.
  • Why it matters: The agent quickly learns how to tune many scenes, not just one.

šŸž Bottom Bread (Anchor): After nailing one alley scene, it learns to handle a plaza under neon signs.
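
A sketch of the two-stage schedule. The episode counts are assumptions, and run_episode is an assumed helper (passed in, not defined here) that rolls out one tuning episode and applies the SAC updates.

```python
def train_with_curriculum(run_episode, agent, images,
                          stage1_episodes=200, stage2_rounds=50):
    """Two-stage curriculum: master one image, then rotate through the set."""
    # Stage 1: practice on a single image until the policy is stable.
    for _ in range(stage1_episodes):
        run_episode(agent, images[0])
    # Stage 2: cycle through the whole small set to build adaptability.
    for _ in range(stage2_rounds):
        for img in images:
            run_episode(agent, img)

# Toy usage: count how often each of five 'images' is visited.
visits = {}
train_with_curriculum(lambda a, im: visits.update({im: visits.get(im, 0) + 1}),
                      agent=None, images=["img%d" % i for i in range(5)],
                      stage1_episodes=3, stage2_rounds=2)
print(visits)   # img0 seen 5 times, the rest twice each
```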

Secret sauce:

  • A sturdy core (SGP-LRD) that’s naturally noise-resistant.
  • Tiny, safe, per-image knob tweaks guided by clear rewards.
  • An easy-to-hard curriculum and SAC for strong learning from only five images.

04 Experiments & Results

The test: They measure how close the estimated light color is to the ground truth using angular error (in degrees). Smaller is better—like arrows closer to the bullseye. They also report statistics such as median (typical case) and worst-25% (how bad the bad cases get). They even check a ā€œreproductionā€ version that asks: if we apply the correction, do grays look truly gray?
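
These summaries are straightforward to compute; here is a small numpy sketch with made-up per-image errors.

```python
import numpy as np

def awb_error_stats(errors_deg):
    """Standard color-constancy summaries of per-image angular errors."""
    e = np.sort(np.asarray(errors_deg, float))
    worst25 = e[int(0.75 * len(e)):]            # the hardest quarter of images
    return {"mean": e.mean(),
            "median": np.median(e),             # the typical case
            "worst-25% mean": worst25.mean()}   # how bad the bad cases get

print(awb_error_stats([1.2, 2.0, 1.8, 2.4, 3.1, 6.5, 0.9, 5.2]))
```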

The competition: They compare against classic statistical methods (like edge- or gray-based approaches) and modern learning-based methods (like Cascading Convolutional Color Constancy and PCC). They follow a strict few-shot setup: only five training images per dataset for learning-based methods and their RL agent, plus a fully trained baseline as an upper bound.

The scoreboard with context:

  • In-dataset: On NCC, RL-AWB gets about 1.98° median angular error—like acing an exam where others are still missing several questions. On LEVI, it’s similarly strong (~2.24° median reproduction error) and competitive across other metrics.
  • Cross-dataset: When trained on one dataset and tested on the other, many deep models’ errors jump dramatically (like falling from an A to a D). RL-AWB stays low (e.g., around 3.03° median in one direction, 1.99° in the other), which is like keeping an A- while others drop to C or worse.
  • Daytime: With two small filter changes, the same framework also generalizes well to a daytime dataset (Gehler–Shi), showing it’s not only for night.

šŸž Top Bread (Hook): It’s like cooking on two very different stoves and still serving the same tasty dish.

🄬 Filling (The Actual Concept): LEVI Dataset

  • What it is: A new nighttime benchmark with two different camera systems and 700 RAW images, each with a color checker for ground-truth light.
  • How it works: (1) Capture night scenes under many lights, (2) include exact color references, (3) provide RAW data for fair testing.
  • Why it matters: It finally tests cross-sensor generalization in nighttime AWB—the real-world challenge.

šŸž Bottom Bread (Anchor): A method that looks great on one phone must still look great on a mirrorless camera; LEVI checks that.

Surprising findings:

  • Few-shot power: Training with only five images per dataset, RL-AWB still outperforms deeper models that usually crave lots of data.
  • Stability: The agent tends to improve within just a few steps (often three), like a careful tuner rather than a gambler.
  • Flexibility: Minor changes let it handle daytime scenes too, suggesting the approach can become a unified AWB solution for all times of day.

Big picture: The hybrid recipe—reliable statistics + gentle learned tuning—beats both purely fixed rules and pure deep nets in the harsh nighttime arena, and stays steady across cameras.

05 Discussion & Limitations

Limitations:

  • Only two knobs are tuned (N and p); more parameters exist in the pipeline that might further improve results if tuned safely.
  • Rare scenes can still be over-corrected, leading to odd colors; adding safety constraints could help.
  • Training currently mixes CPU (RL updates) and GPU (image computations), which isn’t the fastest possible pipeline.
  • The method assumes some gray-ish cues exist; scenes with zero neutral content can still be tricky.

Required resources:

  • A CPU (e.g., an i5-class machine) suffices for training with Soft Actor-Critic and a handful of images.
  • A single GPU helps speed the SGP-LRD passes during training; inference can be efficient.

When NOT to use:

  • Ultra-stylized scenes with no neutral references (e.g., clubs bathed entirely in saturated colors) may have too few clues.
  • If you need a one-shot, zero-iteration estimate with no time for tiny adjustments, a pre-fixed fast method may be preferred.
  • If you lack RAW or linear data and only have heavily processed JPEGs, results may vary.

Open questions:

  • How to safely tune more knobs at once (hierarchical or structured policies)?
  • Can we add guardrails (safe RL, preference penalties) to avoid occasional over-corrections?
  • How well does it scale to multi-illuminant scenes where different areas have truly different lights?
  • Could a full GPU pipeline with batched rollouts make training near real-time and enable a unified day-night AWB agent?

06 Conclusion & Future Work

Three-sentence summary: RL-AWB treats nighttime white balance as a step-by-step tuning game, using a robust statistical core (SGP-LRD) and a cautious RL agent to nudge two key parameters until colors look right. With a smart reward, Soft Actor-Critic training, and an easy-to-hard curriculum, it learns from very few images and generalizes across cameras. The new LEVI dataset proves the method stays stable when sensors change and even adapts to daytime with minimal tweaks.

Main achievement: The first reinforcement learning framework for AWB that combines interpretability and sensor-agnostic statistics with adaptive, per-image tuning to achieve state-of-the-art nighttime performance and strong cross-sensor robustness.

Future directions: Expand to more parameters via hierarchical/structured policies, add safety constraints to prevent rare over-corrections, build a fully GPU-resident training loop, and move toward a single agent that handles day and night seamlessly.

Why remember this: It shows a practical, data-efficient path to smarter camera pipelines—keep the sturdy classical core, add a gentle learning tuner—and it works where it’s hardest: noisy, mixed-light night scenes on different cameras.

Practical Applications

  • Smartphone night photography with truer skin tones and more natural street scenes.
  • Surveillance systems that keep reliable colors under mixed streetlights for better recognition.
  • Automotive cameras that maintain stable color perception during nighttime driving.
  • Body cams and dash cams with robust AWB across diverse cities and lighting conditions.
  • Consumer cameras that auto-adapt AWB when switching lenses or sensors.
  • Video conferencing in dim rooms with improved, consistent color balance.
  • Action cameras that handle tunnels, night rides, and rapidly changing lighting.
  • Drones performing night inspections with faithful color for safety markers or signs.
  • Retail and warehouse cameras that keep consistent colors for inventory and safety checks.
  • Medical or industrial imaging in low light where accurate color is critical.
#auto white balance Ā· #color constancy Ā· #nighttime imaging Ā· #low-light image processing Ā· #reinforcement learning Ā· #soft actor-critic Ā· #gray pixel detection Ā· #Minkowski norm Ā· #cross-sensor generalization Ā· #RGB-uv histogram Ā· #illumination estimation Ā· #curriculum learning Ā· #few-shot learning Ā· #image signal processing Ā· #SGP-LRD