Steering LLMs via Scalable Interactive Oversight
Key Summary
- The paper tackles a common problem: people can ask AI to do big, complex tasks, but they can’t always explain exactly what they want or check the results well.
- It introduces Scalable Interactive Oversight, which breaks a big, fuzzy goal into a tree of small, easy choices that regular users can answer.
- Instead of open-ended prompts, the system asks low-burden questions (like picking or ranking options) at each small step and collects preferences as it goes.
- These tiny decisions get combined into a clear, expert-level Product Requirements Document (PRD) before any heavy execution happens.
- On website-building tasks, this approach helped non-experts create PRDs that matched expert intent much better, with up to a 54% relative improvement over common baselines.
- The team shows that the interaction data itself can be used to train better question-asking via reinforcement learning from online human feedback.
- Training with both user-only rewards and expert-based rewards further boosts alignment, and the trained system asks fewer, smarter questions over time.
- Alignment keeps improving as more tree nodes are discussed, showing the method scales with task complexity.
- Ablations reveal that both low-burden feedback and tree-based preference propagation are key to the gains.
- This provides a practical path to keep humans in control as AI handles longer, more complex projects.
Why This Research Matters
This work shows how everyday people can still be the boss when AI grows more capable. By turning big, fuzzy goals into many tiny choices, non-experts can craft expert-like plans without writing long specifications. That saves time, avoids expensive rework, and keeps projects aligned with real needs. It provides a practical recipe for safer, more reliable AI collaboration in schools, startups, nonprofits, and enterprises. The method also learns from real user clicks and choices, so it gets faster and smarter over time. As AI handles longer projects, this keeps human intent front and center.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you’re trying to tell a friend how to build your dream treehouse, but you don’t know the right tools or building words. You can describe the vibe—cozy, sunny, safe—but making a full blueprint is hard.
Filling: The World Before
- What it is: Before this research, people used large language models (LLMs) to do complicated jobs, like building websites, by giving one big prompt or having a loose chat.
- How it worked: 1) The user said a high-level idea. 2) The model planned and produced huge outputs (code, specs). 3) The user tried to steer after the fact. 4) Fixing missteps was slow and confusing.
- Why it matters: Without a clear, shared plan up front, the AI could run far in the wrong direction, and fixing that later took a lot of time and expertise.
Hook: You know how ordering a custom cake is easier when the bakery gives you a checklist (size, flavor, frosting, message) instead of asking, “What do you want?”
Filling: The Problem (Two Gaps)
- What it is: As AI got better at long, multi-step tasks, people stayed the same. That created a supervision gap—humans struggled to precisely guide or check the AI.
- How it works:
- Specification gap: Users can’t or don’t want to spell out every detail (not enough time, knowledge, or both).
- Verification gap: Final outputs are huge and technical, so non-experts can’t easily tell if they’re right.
- Why it matters: If the AI commits to a wrong plan early, you get impressive-looking results that don’t match what you wanted.
Hook: Imagine sorting a giant puzzle by first grouping corner pieces, then edges, then colors. It’s way easier than dumping the box and hoping.
Filling: Failed Attempts
- What it is: Past methods mostly checked or debated answers after the AI finished, rather than shaping intent beforehand.
- How it works:
- AI critique and debate: useful for catching problems, but the feedback often arrives too late and covers outputs too large for non-experts to review easily.
- Plain multi-turn chats: still ask users to write open-ended instructions; users get tired or stuck.
- Why it matters: Post-hoc fixes are expensive; vague chats don’t reliably convert fuzzy ideas into precise plans.
Hook: Think of turning a rough wish list into a step-by-step menu, where each choice is small and simple.
Filling: The Gap
- What it is: We lacked a pre-execution interaction layer that helps users unpack their fuzzy goals into clear, checkable specs.
- How it works: If we can guide users through small, low-burden decisions and remember them, we can build a precise plan before any heavy coding.
- Why it matters: It turns weak supervision (non-expert signals) into strong guidance (expert-like specs) that the AI can follow.
Hook: Picture telling a travel agent, one small question at a time (window or aisle? beach or mountains?), and ending with a perfect trip plan without writing a novel.
Filling: Real Stakes (Daily Life)
- What it is: People want to build websites, apps, and tools without being engineers.
- How it works: By using structured, bite-sized questions, non-experts can still make expert-level decisions.
- Why it matters: You save time, avoid rework, and stay in control of what the AI builds—helpful for startups, schools, nonprofits, and anyone turning ideas into working products.
Sandwich explanations for key prerequisites and early concepts:
- Hook: You know how a teacher shows you examples and the right answers so you can learn a pattern? Filling: Supervised Learning
- What it is: A way models learn from labeled examples of what’s right.
- How it works: 1) Show inputs and correct outputs. 2) Model guesses. 3) Compare to the correct answer. 4) Adjust to be closer next time.
- Why it matters: It gives models a strong starting point for good behavior. Anchor: Learning to add by seeing many “2+3=5”-type examples.
- Hook: Imagine a coach giving thumbs-up or thumbs-down after each move you try. Filling: Online Feedback
- What it is: Feedback that arrives during interaction, not only at the end.
- How it works: 1) You do a step. 2) Someone reacts. 3) You adapt the next step.
- Why it matters: Early nudges keep you on track and prevent big mistakes. Anchor: A GPS saying “recalculating” when you miss a turn.
- Hook: Think of a family tree or a folder tree on your computer. Filling: Tree Structures
- What it is: A way to organize big things into smaller, connected parts.
- How it works: 1) Start with a root topic. 2) Split into branches (subtopics). 3) Keep splitting until items are small and clear.
- Why it matters: It makes huge problems manageable. Anchor: A website menu with sections, subsections, and pages.
- Hook: When someone asks you either/or questions, it’s easier than writing an essay. Filling: User Interaction Techniques (low-burden questions)
- What it is: Ways to ask for feedback that are easy to answer, like multiple choice or rankings.
- How it works: 1) Present clear options. 2) User picks or ranks. 3) System records preferences.
- Why it matters: Easier questions mean better, more reliable answers from non-experts. Anchor: Voting for your favorite ice cream flavors from a short list.
- Hook: Imagine telling a coder, “Make it look modern and friendly,” and they handle the rest. Filling: Vibe Coding
- What it is: Building software by describing the feel and goals in plain language instead of detailed specs.
- How it works: 1) User shares high-level intent. 2) AI plans and implements. 3) User checks the outcome.
- Why it matters: It lowers barriers for non-coders, but risks misalignment without better guidance. Anchor: Asking for a “cozy, bright” website rather than specifying every CSS rule.
02 Core Idea
Hook: You know how building a Lego castle is easy if someone first sorts the pieces into small bins and then asks you simple choices like “Tower here or here?”
Filling: The Aha! in one sentence
- What it is: Turn a fuzzy goal into a requirement tree and guide the user through tiny, easy decisions; then combine those decisions into a precise plan the model can follow.
- How it works (recipe):
- Make a tree from the initial request (big parts → small parts).
- Visit a leaf (a small decision) and ask low-burden questions (pick, rank, or say “Don’t know/Don’t care”).
- Save that preference and update the tree so later questions fit better.
- Repeat until all leaves are resolved.
- Generate a professional PRD that encodes all choices.
- Why it matters: It lets non-experts steer expert models without needing to write expert specs.
Multiple analogies
- Flight booking: An agent asks step-by-step (dates, window vs aisle, nonstop vs cheaper) and builds your perfect itinerary.
- Restaurant menu: Instead of describing a whole meal, you choose appetizer, main, and sides from curated options; the kitchen creates a great dish.
- Map navigation: A turn-by-turn GPS uses your preferences (avoid tolls, fastest route) and keeps adjusting with each choice.
Before vs After
- Before: Users typed long prompts or chatted freely; the model guessed, often making early wrong bets that were hard to undo.
- After: The model first becomes an interviewer, collecting clear, structured preferences; only then does it execute, so results match intent.
Why it works (intuition)
- Small, concrete questions are easier than big, fuzzy ones; many easy signals beat one hard signal.
- A tree focuses attention so users don’t have to hold the whole project in their head.
- Accumulated preferences propagate forward, shrinking ambiguity over time.
- Early guidance prevents costly detours in long-horizon tasks.
Building blocks (with sandwich explanations):
- Hook: Imagine you can’t eat a whole pizza at once, so you slice it. Filling: Recursive Task Decomposition
- What it is: Repeatedly breaking a big task into smaller tasks until each piece is easy.
- How it works: 1) Split the goal. 2) Split again. 3) Stop when choices are simple.
- Why it matters: It turns overwhelming decisions into bite-sized steps. Anchor: Writing a book outline before drafting chapters.
- Hook: Choosing from a short list is faster than writing an essay answer. Filling: Low-burden Feedback
- What it is: Feedback that’s quick to give (select/rank) and easy to interpret.
- How it works: 1) Offer options with pros/cons. 2) User picks or ranks. 3) System records certainty (can say Don’t Know/Don’t Care).
- Why it matters: It boosts reliability and reduces user fatigue. Anchor: Clicking radio buttons in a survey.
- Hook: When you say “less spicy” for one dish, a good chef also tones down the rest of the meal. Filling: Preference Propagation
- What it is: Using past choices to shape future questions and decisions.
- How it works: 1) Store each choice. 2) Update the plan. 3) Ask the next best question based on history.
- Why it matters: Interactions get smarter and faster; alignment steadily improves. Anchor: A shopping app recommending items that match your previous likes.
- Hook: A table of contents guides a book; a PRD guides a build. Filling: Product Requirements Document (PRD)
- What it is: A structured blueprint describing what to build and why.
- How it works: 1) Capture product overview, core features, non-functional needs, UX, and business rules. 2) Organize clearly. 3) Use it to direct implementation.
- Why it matters: It’s easier to judge a plan than thousands of lines of code. Anchor: A recipe card that any cook can follow.
- Hook: A fair referee scores how close a performance is to the target routine. Filling: Alignment Score
- What it is: A score for how well the output matches the intended requirements.
- How it works: 1) Break intent into checkable rubrics. 2) Count how many are satisfied. 3) Report the fraction.
- Why it matters: Numbers with context help compare methods fairly. Anchor: Getting 87% on a test that measures the material you meant to learn.
- Hook: Like learning a game by trying moves and seeing if your score goes up. Filling: Reinforcement Learning (for interaction)
- What it is: Training the question-asking policy using rewards from users and evaluators.
- How it works: 1) Run interactions. 2) Reward fewer “Don’t Care” answers and better PRD outcomes. 3) Update the policy to ask smarter, faster questions.
- Why it matters: The interviewer gets better over time without needing expert labels for everything. Anchor: A coach fine-tunes practice drills after watching your performance.
Core idea in one breath: Make the model a great interviewer first, so execution later goes right the first time.
03 Methodology
High-level pipeline: Input → Tree Initialization → Node-level Interaction → Preference Update (Tree evolves) → Repeat until done → PRD Generation → (Optional) RL improves the interviewer
Step A: Tree Initialization
- What happens: From the user’s first request (e.g., “I want a smart-lighting website”), the system builds a requirement tree with five top sections: Product Overview, Core Functional Modules, Non-functional Requirements, User Experience Design, and Business Rules, then adds sub-branches.
- Why it exists: Starting broad-to-specific keeps the scope organized so users face focused choices.
- Example: “Core Functional Modules” might branch into “Device Compatibility,” “Payment,” and “Review System.”
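To make the tree concrete, here is a minimal Python sketch of how a requirement tree might be represented and seeded with the five top sections. The `Node` fields and the `build_initial_tree` helper are illustrative assumptions, not the paper's implementation; in the real system an LLM proposes the sub-branches.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One requirement in the tree; leaves are the small units users decide on."""
    title: str
    children: list["Node"] = field(default_factory=list)
    resolved: bool = False          # True once the user's preference is recorded
    preference: str | None = None   # compact summary of the user's choice

TOP_SECTIONS = [
    "Product Overview",
    "Core Functional Modules",
    "Non-functional Requirements",
    "User Experience Design",
    "Business Rules",
]

def build_initial_tree(user_request: str) -> Node:
    """Seed the tree broad-to-specific; sub-branches (e.g., 'Payment' under
    'Core Functional Modules') would be proposed by the model afterwards."""
    root = Node(title=user_request)
    root.children = [Node(title=section) for section in TOP_SECTIONS]
    return root
```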
Step B: Depth-first Node Selection
- What happens: The system visits the next unresolved leaf (smallest unit) and focuses the conversation there.
- Why it exists: Small, local decisions are easier and reduce cognitive overload.
- Example: Under “Payment,” it asks which methods matter: AliPay, Apple Pay, credit card, etc.
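A depth-first pass over such a tree might look like the sketch below; the minimal `Node` type and the `next_unresolved_leaf` helper are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    children: list["Node"] = field(default_factory=list)
    resolved: bool = False

def next_unresolved_leaf(node: Node) -> Node | None:
    """Depth-first search for the next leaf that still needs a decision."""
    if not node.children:
        return None if node.resolved else node
    for child in node.children:
        found = next_unresolved_leaf(child)
        if found is not None:
            return found
    return None

# The first unresolved leaf under "Payment" becomes the next discussion focus.
payment = Node("Payment", children=[Node("Methods"), Node("Refund policy")])
print(next_unresolved_leaf(payment).title)  # -> "Methods"
```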
Step C: Low-burden Interaction at the Node
- What happens: The model asks selection/ranking questions with clear pros/cons. The user can also say “Don’t Know” (too technical) or “Don’t Care” (out of scope).
- Why it exists: This makes feedback reliable for non-experts and avoids vague, hard-to-score replies.
- Example: “Pick your top two payment methods” or “Rank these from most to least important.”
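One plausible way to represent such a question as structured data is sketched below; the `Option` and `LowBurdenQuestion` names and fields are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass

@dataclass
class Option:
    label: str
    pros: str
    cons: str

@dataclass
class LowBurdenQuestion:
    node_title: str
    prompt: str
    options: list[Option]
    # Safe exits keep the signal honest instead of forcing a guess.
    allow_dont_know: bool = True   # "this is too technical for me"
    allow_dont_care: bool = True   # "this is out of scope for my goal"

question = LowBurdenQuestion(
    node_title="Payment",
    prompt="Pick your top two payment methods.",
    options=[
        Option("AliPay", pros="popular with target users", cons="regional"),
        Option("Credit card", pros="universally accepted", cons="higher fees"),
        Option("Apple Pay", pros="fast checkout", cons="iOS only"),
    ],
)
```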
Step D: Preference Summarization
- What happens: The system summarizes the user’s answers into a compact node preference and confidence, then stores it in a global preference state.
- Why it exists: Summaries make later steps consistent and allow the plan to adapt.
- Example: “Payment: 1) AliPay (high), 2) Credit Card (medium), Apple Pay (low). Confidence: 0.8.”
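A compact way to store such summaries could look like this; the `NodePreference` record and the dictionary-based global state are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class NodePreference:
    node_title: str
    summary: str       # compact statement of what the user chose
    confidence: float  # how certain the answers seemed, between 0 and 1

# Global preference state: every resolved node contributes one entry,
# which later steps read when shaping the rest of the tree and the PRD.
preference_state: dict[str, NodePreference] = {}

pref = NodePreference(
    node_title="Payment",
    summary="1) AliPay (high), 2) Credit card (medium), 3) Apple Pay (low)",
    confidence=0.8,
)
preference_state[pref.node_title] = pref
```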
Step E: Tree Update (Preference Propagation)
- What happens: The system updates the remaining plan using the new preference signals—adding, removing, or reordering submodules when appropriate.
- Why it exists: This keeps future questions relevant and prevents re-asking what’s settled.
- Example: If the user cares deeply about guest uploads, the system may add “Tourist upload rules” under “Core Functions” and remove “Advanced moderation tiers” if marked “Don’t Care.”
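A bare-bones version of this update is sketched below: it prunes branches the user marked “Don’t Care” and adds submodules implied by earlier answers. In the paper's system an LLM decides these edits, so the `propagate_preferences` helper and its inputs are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    children: list["Node"] = field(default_factory=list)
    dont_care: bool = False  # the user marked this branch as out of scope

def propagate_preferences(node: Node, additions: dict[str, list[str]]) -> None:
    """Drop branches the user does not care about and add submodules implied
    by earlier answers, recursively through the remaining tree."""
    node.children = [c for c in node.children if not c.dont_care]
    for title in additions.get(node.title, []):
        node.children.append(Node(title))
    for child in node.children:
        propagate_preferences(child, additions)

core = Node("Core Functions", children=[Node("Advanced moderation tiers", dont_care=True)])
propagate_preferences(core, {"Core Functions": ["Tourist upload rules"]})
print([c.title for c in core.children])  # -> ['Tourist upload rules']
```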
Step F: Loop Until All Leaves Are Resolved
- What happens: Repeat B–E, collecting preferences across the project.
- Why it exists: Coverage matters; you want the entire plan captured, not just scattered parts.
- Example: After 20–40 nodes, enough detail exists to write a professional PRD.
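The outer loop can be sketched as below. Every helper is passed in as a parameter because the real selection, questioning, summarization, and tree-update steps are model-driven; the function name and the `max_nodes` budget are illustrative.

```python
def run_interview(root, select_leaf, ask_user, summarize, update_tree, max_nodes=40):
    """Steps B-E repeated until no unresolved leaf remains (or the budget runs out)."""
    preferences = {}
    for _ in range(max_nodes):
        leaf = select_leaf(root)             # Step B: next unresolved leaf, depth-first
        if leaf is None:                     # all leaves resolved -> ready for the PRD
            break
        answers = ask_user(leaf)             # Step C: selection / ranking questions
        preferences[leaf.title] = summarize(leaf, answers)  # Step D: compact summary
        leaf.resolved = True
        update_tree(root, preferences)       # Step E: preference propagation
    return preferences
```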
Step G: PRD Generation
- What happens: The system compiles all collected specifications into a cohesive PRD, resolving overlaps and ensuring consistent terminology.
- Why it exists: A single, clear blueprint is easier to evaluate (by humans or LLM judges) and to implement downstream.
- Example: The PRD sections mirror the tree and include the user’s ranked choices.
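In the simplest form, compiling could walk the tree sections in order and stitch in the stored preferences, as in the sketch below. The actual system uses an LLM to resolve overlaps and unify terminology, so `compile_prd` is only an illustrative stand-in.

```python
def compile_prd(sections: dict[str, list[str]], preferences: dict[str, str]) -> str:
    """Walk the tree sections in order and insert each node's stored preference."""
    lines = []
    for section, node_titles in sections.items():
        lines.append(f"## {section}")
        for title in node_titles:
            lines.append(f"- {title}: {preferences.get(title, 'no preference recorded')}")
    return "\n".join(lines)

print(compile_prd(
    {"Core Functional Modules": ["Payment", "Review System"]},
    {"Payment": "AliPay first, credit card second"},
))
```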
Step H (Optional): RL from Online Human Feedback
- What happens: The interviewer policy is trained using rewards like:
- User Reward (UR): penalize frequent “Don’t Care” answers;
- Outcome Reward (OR): PRD alignment score versus ground-truth intent;
- Progressive Reward (PR): whether a new node preference measurably improves interim alignment.
- Why it exists: To automate improvement of question quality and efficiency without needing expert labels for every turn.
- Example: Over training, the average number of turns per node drops while alignment rises, meaning the system asks smarter questions.
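A hedged sketch of how the three signals might be combined into one scalar reward is shown below; the weights, normalization, and function name are assumptions, not the paper's exact formulation.

```python
def interaction_reward(
    dont_care_count: int,        # "Don't Care" answers during this episode
    num_questions: int,          # questions asked during this episode
    final_alignment: float,      # rubric score of the finished PRD (0..1)
    alignment_before: float,     # interim alignment before this node was resolved
    alignment_after: float,      # interim alignment after this node was resolved
    w_user: float = 1.0, w_outcome: float = 1.0, w_progress: float = 1.0,
) -> float:
    # User Reward (UR): penalize questions the user waves off as irrelevant.
    user_reward = -dont_care_count / max(num_questions, 1)
    # Outcome Reward (OR): how well the final PRD matches the intended requirements.
    outcome_reward = final_alignment
    # Progressive Reward (PR): did resolving this node measurably improve interim alignment?
    progressive_reward = 1.0 if alignment_after > alignment_before else 0.0
    return w_user * user_reward + w_outcome * outcome_reward + w_progress * progressive_reward
```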
The secret sauce
- Early alignment: The system shapes intent before any heavy generation, preventing costly detours.
- Low-burden signals: Constrained choices create reliable, easy-to-aggregate supervision.
- Preference propagation: Each answer makes the next question sharper, compounding benefits.
- RL fine-tuning: The interviewer learns to be brief, targeted, and effective.
Sandwich highlights for key method pieces:
- Hook: It’s easier to answer, “Pick A or B,” than to write a paragraph. Filling: Low-burden Question Design
- What it is: Multiple choice/ranking with pros/cons and safe exits (“Don’t Know/Care”).
- How it works: 1) Offer curated options. 2) Capture choice + confidence. 3) Move on quickly.
- Why it matters: High-quality signals from non-experts, fast. Anchor: A guided online form.
- Hook: If you like bold colors for one room, your designer suggests bold accents elsewhere too. Filling: Preference Propagation via Tree Updates
- What it is: Using past choices to reconfigure what remains.
- How it works: 1) Summarize node preference. 2) Adjust remaining nodes. 3) Ask better next questions.
- Why it matters: Fewer irrelevant questions, quicker convergence to intent. Anchor: Streaming apps recommending content based on your last plays.
- Hook: Report cards help you and your teacher see progress. Filling: Alignment Evaluation with Rubrics
- What it is: Checking how many concrete requirements the PRD satisfies.
- How it works: 1) Break intent into atomic checks. 2) Count satisfied. 3) Compute a fraction.
- Why it matters: Fair, consistent comparisons across methods. Anchor: A checklist where more boxes ticked means better alignment.
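A toy version of rubric-based scoring is sketched below. Real rubrics come from the ground-truth intent and are judged by experts or a validated LLM judge, so the substring checks here are purely illustrative.

```python
def alignment_score(prd_text: str, rubric_checks: list) -> float:
    """Fraction of atomic requirement checks the PRD satisfies (0..1)."""
    if not rubric_checks:
        return 0.0
    satisfied = sum(1 for check in rubric_checks if check(prd_text))
    return satisfied / len(rubric_checks)

checks = [
    lambda prd: "AliPay" in prd,                 # payment preference captured?
    lambda prd: "mobile" in prd.lower(),         # mobile-first requirement covered?
    lambda prd: "accessibility" in prd.lower(),  # accessibility requirement covered?
]
print(alignment_score("PRD: AliPay checkout with a mobile-first layout", checks))  # 2/3 ≈ 0.67
```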
04 Experiments & Results
The test: Measuring alignment you can trust
- What they measured: How well the final PRD matches the target intent, using a rubric-based Alignment Score (fraction of satisfied requirements).
- Why this matters: PRDs are easier to evaluate than full codebases and zoom in on whether the plan truly reflects the user’s goals.
- Evaluation setting: “Sandwich” protocol—non-expert guides a strong model; an expert (or LLM-judge validated across models) scores the final PRD versus the ground-truth intent.
The competition: Who was compared?
- Direct vibe-coding frameworks (e.g., Codex, Claude Code, Gemini CLI) with no structured interaction.
- Vanilla multi-turn chat with free-form Q&A.
- The proposed method using the tree-based, low-burden oversight agent.
The scoreboard (with context)
- GPT-5 as the PRD generator: Proposed method averaged 0.670 vs. 0.503 (vanilla) and 0.481 (Codex). That’s like jumping from a mid B- to an A-.
- Claude-sonnet-4.5: 0.618 (ours) vs. 0.597 (Claude Code) and 0.565 (vanilla). Solid, consistent lead.
- Gemini-2.5-pro: 0.554 (ours) vs. 0.359 (vanilla) and 0.464 (CLI). That +54% relative boost over vanilla chat is a big swing from struggling to solid.
- Module-level: Biggest gains in Core Functions (Module 2), where user needs are most central; up to +0.24 over Codex on GPT-5.
Scaling with interaction
- As more tree nodes were resolved, alignment steadily rose. This shows the method scales: more guided steps → clearer intent → better PRDs.
From plan to product
- Using the PRDs to actually generate full-stack websites (e.g., via Claude Code), the implementations aligned better too: the LLM-judge score improved from 0.338 to 0.656, and the human-judge score also improved (0.453 to 0.520). Better plans translated into better products.
Ablations: What really matters?
- Low-burden feedback alone improved scores modestly—so question design helps.
- Adding tree-based preference propagation delivered large additional gains—even with a small controller model—showing the structure/state management itself is a key driver.
Surprising findings
- User-only online rewards (fewer “Don’t Care” responses) during RL training already improved alignment—proof that weak-to-strong learning can work with just non-expert feedback.
- Combining user rewards with expert-based rewards (PRD outcomes and progressive node gains) pushed performance higher still and reduced the number of turns needed. The interviewer literally learned to be more efficient.
Sandwich for the “Sandwich Protocol” itself
- Hook: A science fair where a non-expert sets up the project, a robot does the work, and a judge scores fairly.
Filling: Sandwich Protocol
- What it is: A test where a non-expert supervises a capable model, and an expert (or validated judge) evaluates the result.
- How it works: 1) Non-expert tries to steer. 2) Model produces output. 3) Expert scores alignment to the true intent.
- Why it matters: It measures how much a method amplifies weak human supervision. Anchor: A coach (non-expert) guiding a star player (model), with a referee (expert) scoring the game.
Numbers in plain words
- If alignment were a report card, the proposed method frequently lifted grades from the C/B- range to the B+/A- range across systems, and sometimes turned a struggling performance into a solid pass.
05 Discussion & Limitations
Limitations (be specific)
- Early bias risk: If initial preferences are misunderstood, the system might amplify them and steer the tree the wrong way; clarification loops help, but can’t fully eliminate this.
- Domain limits: The method is validated on website PRDs; very different domains (e.g., safety-critical medical systems) need stronger checks and expert oversight.
- User variance: Some users may overuse “Don’t Know” or “Don’t Care,” slowing progress; training helps, but UI design could further reduce friction.
- Judge reliance: LLM-judges are validated, but not perfect; human expert evaluations remain the gold standard where stakes are high.
Required resources
- A capable generator model (e.g., GPT-5 class) to produce high-quality PRDs.
- An interaction policy model (can be smaller) to manage the tree and questions.
- Optional RL training with online users and occasional expert rewards to sharpen efficiency.
When not to use
- Safety-critical or compliance-heavy contexts (medical, aviation, legal) where every requirement must be verified by domain experts.
- Extremely ambiguous goals with no stable preferences yet; first do discovery work to define aims.
Open questions
- Robustness: How to detect and correct early misalignment automatically without extra burden on users?
- Generality: How well does this approach transfer to complex, code-level supervision or non-software domains (e.g., policy drafting, scientific workflows)?
- UI innovation: Could visual sliders, drag-and-drop ranking, or templates further boost speed and accuracy versus text-only interaction?
- Reward design: What other dense, cheap rewards best correlate with final alignment without needing experts?
- Joint training: Can co-training the tree updater and generator further improve preference propagation and coherence?
06 Conclusion & Future Work
Three-sentence summary
- This paper turns fuzzy user ideas into precise, expert-like plans by guiding people through small, structured decisions in a requirement tree. By aggregating low-burden feedback before execution, it boosts alignment on complex tasks and helps non-experts produce high-quality PRDs. The same interaction signals can train the interviewer via reinforcement learning, improving both accuracy and efficiency over time.
Main achievement
- A practical, scalable pre-execution oversight layer—Scalable Interactive Oversight—that reliably amplifies weak human supervision into strong, actionable guidance.
Future directions
- Build richer UIs (clicks, sliders, previews) to reduce typing and speed decisions; broaden real-user studies; jointly train the tree updater and expand to end-to-end software delivery with requirement-level and code-level loops.
Why remember this
- As AI gets stronger, what matters is not only what it can do, but how well we can steer it. Turning many tiny, easy choices into a solid plan keeps humans in control, cuts costly detours, and makes powerful AI a safer, more dependable teammate.
Practical Applications
- Interactive website scoping: Gather client preferences and auto-generate a PRD before any coding.
- Product discovery workshops: Replace long note-taking with structured, low-burden choices that compile into specs.
- Enterprise IT intake forms: Turn business requests into clear requirement trees that downstream teams can implement.
- EdTech project builders: Help students assemble project requirements step-by-step and learn trade-offs.
- Civic tech portals: Capture community priorities (accessibility, languages, mobile-first) and create implementable plans.
- Startup MVP planning: Rapidly translate founders’ high-level vision into a concrete build plan and roadmap.
- Vendor RFP creation: Convert stakeholder preferences into precise, comparable requirement documents.
- Internal tool requests: Guide non-technical teams to specify data fields, workflows, and permissions reliably.
- Design system adoption: Standardize UX choices via structured options and propagate them across modules.
- Feature prioritization: Aggregate stakeholder rankings into a documented, defensible product scope.