
Learning Situated Awareness in the Real World

Intermediate
Chuhan Li, Ruilin Han, Joy Hsu et al. · 2/18/2026
arXiv

Key Summary

  • SAW-Bench is a new test that checks if AI can understand the world from a first-person view, like wearing smart glasses.
  • It uses 786 real videos recorded with Ray-Ban Meta (Gen 2) glasses and 2,071 multiple-choice questions across six tasks.
  • The best AI model reached 53.89% accuracy, while humans scored 91.55%, showing a large gap in observer-centered reasoning.
  • Models often confuse head turns (camera rotation) with walking (body movement), which breaks their sense of the route shape.
  • As paths get more complex (more turns), models’ accuracy drops sharply, while humans stay strong.
  • Many models don’t keep a stable memory of objects when those objects leave the camera view, causing change-blindness-like mistakes.
  • Outdoor scenes were not always harder than indoor scenes, showing that scene size isn’t the main driver of difficulty.
  • SAW-Bench focuses on six skills: Self-Localization, Relative Direction, Route Shape, Reverse Route Plan, Spatial Memory, and Spatial Affordance.
  • The benchmark encourages models to build a coherent, observer-centric understanding rather than relying on shortcuts from a few key frames.
  • This work pushes AI toward safer, more reliable behavior in robotics, AR/VR, and assistive technology by testing the skills humans use to move and act in the real world.

Why This Research Matters

If AI can keep track of where “I” am and what “I” can do, it becomes far safer and more helpful in the real world. Robots that understand observer-centric space can navigate crowded homes without bumping into people or pets. AR/VR systems that align graphics to your exact viewpoint feel natural and reduce motion sickness. Assistive wearables can give precise, timely directions like “the crosswalk button is within arm’s reach on your right” instead of vague hints. Delivery and warehouse robots can retrace routes reliably after multiple turns, cutting errors and downtime. SAW-Bench pushes AI toward these practical, human-aligned skills by testing the same mental tools people use every day to move, remember, and act.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): Imagine you’re walking to the cafeteria while wearing a camera on your glasses. Even if you look left and right, you still know where you are, which way you’re moving, and how to get back. That quiet superpower is what helps you not get lost.

🥬 Filling (The Actual Concept):

  • What it is: Situated awareness is knowing where you are, which way you’re facing, and what you can do right now, all from your own point of view.
  • How it works (step by step):
    1. You see the world from your own eyes (egocentric view).
    2. You update your inner map as you move and turn (path integration).
    3. You remember what changed and what stayed the same (spatial memory).
    4. You judge what actions are possible here and now (affordances).
  • Why it matters: Without it, you’d confuse looking left with walking left, get lost after a few turns, forget where objects were, and try actions that aren’t physically possible.

🍞 Bottom Bread (Anchor): If you spin in place to look around, you still know you haven’t actually moved across the room. That’s situated awareness keeping rotation separate from translation.
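The rotation-vs-translation split above can be sketched as a tiny path-integration loop. This is a minimal illustration, not the paper's method; the event format (`"turn"`/`"walk"` tuples) is an invented assumption:

```python
import math

# Minimal path-integration sketch (hypothetical event format): the observer
# keeps a pose (x, y, heading). Turning changes heading only; walking
# moves the body along its current heading.
def integrate(events):
    x, y, heading = 0.0, 0.0, 0.0  # start at the origin, facing +x
    for kind, amount in events:
        if kind == "turn":                   # rotation: no displacement
            heading += math.radians(amount)
        elif kind == "walk":                 # translation along heading
            x += amount * math.cos(heading)
            y += amount * math.sin(heading)
    return x, y, math.degrees(heading) % 360.0

# Spinning in place: heading returns to 0 while (x, y) never moves.
pose = integrate([("turn", 90)] * 4)
```

Four 90° turns bring the heading back around while the position stays at the origin, which is exactly the "rotation is not translation" distinction the benchmark probes.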

The World Before:

  • You know how many school tests measure facts, not how you use them in real life? Many AI benchmarks did something similar: they tested models on environment-centric skills—like distances between objects in a photo—without asking, “Where am I in this scene? What did I just do? How can I get back?”

🥬 New Concept Sandwich – Egocentric Video

  • Hook: Imagine a GoPro on your forehead recording exactly what you see.
  • What it is: Egocentric video shows the world from the camera wearer’s own perspective.
  • How it works: (1) You wear the camera; (2) it records scenes as you look and move; (3) frames show how objects shift with your head and body.
  • Why it matters: It preserves the observer’s viewpoint, which is essential for testing first-person reasoning.
  • Anchor: Turning your head makes a lamp swing across the screen even if you didn’t walk—egocentric video captures that.

The Problem:

  • Models were mostly evaluated as passive spectators (third-person), not as agents who move, turn, remember, and plan from their own viewpoint. This misses skills that robots, AR glasses, and assistive tools must have to work safely around people.

Failed Attempts:

  • Some benchmarks tested object-to-object relations, counting, or rough distances from static images or detached videos. Others stitched together multiple views but assumed the “observer” was fixed. These tests didn’t stress the core challenge: keeping track of yourself as you move in real time.

🥬 New Concept Sandwich – Multimodal Foundation Models (MFMs)

  • Hook: Think of a super student who can read text, look at pictures, and watch videos.
  • What it is: MFMs are AI systems that process several data types (like video + text) together.
  • How it works: (1) Encode visuals; (2) encode language; (3) fuse them; (4) reason to answer questions.
  • Why it matters: To understand a moving world from a first-person view, AI must combine vision and language over time.
  • Anchor: An MFM watches your hallway video and answers, “Am I near the classroom door or the stairs?”

The Gap:

  • We lacked a real-world, first-person benchmark that demands observer-centric reasoning: separating head turns from walking, remembering scene changes, tracing routes, and planning the way back.

Real Stakes (Why care?):

  • Robots need to grasp where they are and what they can reach without bumping into you.
  • AR/VR must align graphics with your exact viewpoint to avoid nausea and confusion.
  • Assistive wearables should say, “The crosswalk button is within arm’s reach on your right,” not just, “There’s a button somewhere.”
  • Navigation aids should help you reverse a route, even after several turns.

🥬 New Concept Sandwich – Observer-Centric Tasks

  • Hook: When you play a first-person video game, you must track where you are and what you can do.
  • What it is: Tasks that focus on what the world means relative to the observer (you), not just objects to each other.
  • How it works: (1) Anchor everything to the observer’s head/body; (2) update with each move or turn; (3) reason over time, not just one frame; (4) decide actions you can take.
  • Why it matters: It’s the difference between “The door is left of the desk” and “The door is to my left and within arm’s reach right now.”
  • Anchor: “From where I’m standing at the end, where was I at the start—front left or back right?” That’s observer-centric.

This paper introduces SAW-Bench to fill that missing test. It uses real egocentric videos and carefully designed, multiple-choice questions that force observer-centric reasoning. The results show a big human–AI gap, and they reveal exactly where models get confused: head turns vs body motion, error buildup across turns, object memory when out of view, and why big outdoor spaces aren’t automatically harder than cluttered indoor ones.

02Core Idea

🍞 Top Bread (Hook): You know how tracing your steps back home only works if you remember when you turned and how far you walked? That “self-aware map” in your head is the key.

🥬 Filling (The Actual Concept):

  • What it is: The core idea is a new benchmark—SAW-Bench—that tests whether AI can build and use a self-centered understanding of space from real, first-person videos.
  • How it works: (1) Record real egocentric videos across varied indoor/outdoor scenes; (2) ask six observer-centric questions per video; (3) require models to integrate head turns, body motion, memory, and feasibility; (4) score how close they get to human-level answers.
  • Why it matters: Without this test, models might look smart on detached tasks but fail in real-world navigation, assistance, and safety-critical actions.

🍞 Bottom Bread (Anchor): If an AI watches you walk an L-shaped path, SAW-Bench checks if it calls it an L (not a zigzag) even when your head was panning around.

The “Aha!” Moment (one sentence): Test AI not as a spectator but as the camera wearer—so it must keep track of itself over time.

Multiple Analogies (3 ways):

  1. GPS vs. Headlights: Old tests are like judging a car by its headlights (what it sees); SAW-Bench checks the GPS-like sense of self-location and direction.
  2. Dance Partner: It’s not enough to see your partner; you must also know where you are and how you just spun—SAW-Bench tests that embodied awareness.
  3. Breadcrumb Trail: It’s the difference between glancing at crumbs and continuously tracking the winding path they make.

Before vs. After:

  • Before: Benchmarks mostly rewarded recognizing objects and relations from a stable viewpoint, or stitching views without tracking the moving observer.
  • After: With SAW-Bench, success requires keeping a running, observer-centric coordinate system—separating camera rotation from body translation, remembering what changed, and planning reverse routes.

Why It Works (intuition, no equations):

  • Egocentric videos preserve the exact way objects slide across your view when you turn vs. when you walk. Questions that demand reverse route planning and relative direction make shortcuts (like only looking at first/last frames) fail. Memory and affordance tasks force temporal reasoning and physical constraints. Together, they pressure-test the hidden “self-map” a good model needs.

Building Blocks (the idea in pieces):

  • 🍞 New Concept Sandwich – SAW-Bench

    • Hook: Think of a report card that checks if you can navigate a maze with a camera on your head.
    • What it is: A benchmark of 786 real egocentric videos with 2,071 human-written multiple-choice questions across six observer-centric skills.
    • How it works: (1) Film varied, controlled routes with smart glasses; (2) ask questions tied to self-location, direction, path shape, reverse planning, memory, and action feasibility; (3) score models in zero-shot mode.
    • Why it matters: It reveals whether models truly understand space like people do.
    • Anchor: “From my current view, how do I get back to the start—left, straight, then left?”
  • 🍞 New Concept Sandwich – The Six Tasks (overview)

    1. Self-Localization
      • Hook: You know if you’re at a lawn’s corner or center just by looking around.
      • What: Identify where the observer is in the scene.
      • How: Use the view to place yourself among landmarks.
      • Why: Without it, other spatial reasoning collapses.
      • Anchor: “Am I along the side or near the center of the lawn?”
    2. Relative Direction
      • Hook: Remember where you started compared to where you ended.
      • What: Compare the starting spot relative to the ending viewpoint.
      • How: Track turns and steps to map start to end.
      • Why: Prevents getting lost after rotations.
      • Anchor: “From here at the end, was I back-right at the beginning?”
    3. Route Shape
      • Hook: A head pan shouldn’t turn a straight walk into a zigzag.
      • What: Identify the geometric path walked.
      • How: Separate rotation (turning) from translation (moving).
      • Why: Models often confuse these and mislabel the shape.
      • Anchor: Calling a straight walk with head turns “straight,” not “zigzag.”
    4. Reverse Route Plan
      • Hook: Retracing steps after two turns is tricky unless you tracked them.
      • What: Plan the step-by-step path back to the start.
      • How: Invert the forward moves and turns in order.
      • Why: Forces coherent memory over the whole video.
      • Anchor: “Turn around, go straight, left, straight, then left.”
    5. Spatial Memory
      • Hook: Spot what changed between earlier and later.
      • What: Detect an added/removed/moved object.
      • How: Compare views over time, not just one frame.
      • Why: Counters change blindness.
      • Anchor: “The A-frame sign appears later; that’s the change.”
    6. Spatial Affordance
      • Hook: Can I reach it without stepping?
      • What: Judge if an action is physically possible from here.
      • How: Use size/depth cues and body reach.
      • Why: Prevents unsafe or impossible moves.
      • Anchor: “I can touch the chair to my right with just my arm.”
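The Reverse Route Plan task above amounts to inverting the forward step list. A toy sketch follows; the step vocabulary is an assumption for illustration, not the benchmark's answer format:

```python
# Toy reverse-route sketch (hypothetical step vocabulary): retrace a route
# by turning around, then replaying the forward steps in reverse order
# with left/right turns mirrored.
MIRROR = {"left": "right", "right": "left", "straight": "straight"}

def reverse_route(forward_steps):
    return ["turn around"] + [MIRROR[s] for s in reversed(forward_steps)]

reverse_route(["straight", "right", "straight"])
# -> ["turn around", "straight", "left", "straight"]
```

Note why this is hard for models: producing the mirrored, reversed list requires remembering every step in order, so guessing from the first and last frames cannot work.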

In short, SAW-Bench’s innovation is to grade the model’s inner compass and memory—not just its object-spotting.

03Methodology

At a high level: Egocentric Video → Task Design (6 skills) → Recording Protocol → QA Annotation + Quality Checks → Zero-shot Model Evaluation → Accuracy + Error Analyses.

Step-by-step (like a recipe):

  1. Collect Real Egocentric Videos
  • What happens: Participants wear Ray-Ban Meta (Gen 2) smart glasses and record continuous, first-person clips across 15+ indoor/outdoor scenes (e.g., lawns, courtyards, classrooms, recreation rooms).
  • Why it’s needed: First-person capture preserves observer-centric cues: how objects move when you turn vs. when you walk.
  • Example: A participant walks straight across a plaza but looks left-right frequently, creating a challenge to separate rotation from translation.
  2. Predefine Trajectories and Coverage
  • What happens: Researchers design 37 movement templates (e.g., in-place rotations, Manhattan-style two turns, simple geometric shapes like L/U/zigzag) to control difficulty and ensure variety.
  • Why it’s needed: Guarantees a spread of route complexities to test “error accumulation” over turns.
  • Example: Two videos in the same courtyard: one straight/no head turns; another straight/with frequent head pans. Translational motion is the same; only rotation differs.
  3. Design Six Observer-Centric Tasks and Questions
  • What happens: For each video, human annotators (often the recorders) write multiple-choice questions tied to the observer’s viewpoint and motion.
  • Why it’s needed: Forces models to use egocentric reasoning rather than object-only shortcuts.
  • Example: “What’s the route shape?” or “From my end-viewpoint, where is my starting point?”
  4. Ensure Clear Ground Truth with Controlled Changes
  • What happens: For Spatial Memory, two short clips from the same scene (before/after) are concatenated after a controlled object change (e.g., place an A-frame sign later). Audio is removed to keep it vision-only.
  • Why it’s needed: Makes the “changed object” unambiguous and isolates visual memory.
  • Example: Early video: no sign by the lamp post. Later video: sign appears there.
  5. Quality Control (Video + Annotation)
  • What happens: Reviewers filter out recordings with too-rapid head motion, poor visibility, occlusions, or insufficient coverage; re-film if needed. Each QA pair is double-annotated; disagreements are resolved with guidelines.
  • Why it’s needed: Reduces noise that could confuse both humans and models; keeps answers unambiguous.
  • Example: If a turn is too jerky to judge, the clip is redone for clarity.
  6. Evaluation Protocol (Zero-shot)
  • What happens: 24 MFMs (16 open-source, 8 proprietary) answer the multiple-choice questions purely from the video frames (sampled at 2 fps for most models) without fine-tuning. A strict prompt format asks for a choice and a reasoning trace.
  • Why it’s needed: Tests raw, general-purpose ability rather than specialized training for this benchmark.
  • Example: A model must decide if the path was L-shaped or zigzag using only the provided frames.
  7. Baselines for Context
  • What happens: Compare against (a) Chance (Random), (b) Chance (Frequent answer), (c) Blind LLM (no visuals), (d) Socratic Model (caption the video, then answer from text only), and (e) Human level.
  • Why it’s needed: Shows how much visuals and egocentric cues matter; reveals how close models are to humans.
  • Example: The Blind LLM scores just above chance, proving visuals are needed.
  8. Parsing and Scoring
  • What happens: A regex extracts each model’s option choice; if it fails, a small helper model extracts it. Accuracy is computed overall and per task.
  • Why it’s needed: Makes evaluation consistent across diverse models with varying output styles.
  • Example: If a model writes, “I choose B (L-shape),” the parser logs “B.”
  9. Error Probing Analyses
  • What happens: Researchers run controlled comparisons and stratify results by turns and environment type.
  • Why it’s needed: Diagnoses where models fail: rotation vs. translation confusion, error build-up across turns, view-bound memory, indoor vs. outdoor assumptions.
  • Example: Same straight path with vs. without head pans: many top models wrongly flip to “zigzag” when head pans appear.
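Two of the pipeline steps above, frame sampling and answer parsing, can be sketched as follows. The 2 fps target and the regex-then-fallback idea come from the text; the function names, signatures, and option format are illustrative assumptions:

```python
import re

# Frame-sampling sketch: pick frame indices at a fixed target rate
# (~2 fps for most models in the paper) from a video's native fps.
def sample_indices(n_frames, native_fps, target_fps=2.0):
    step = native_fps / target_fps
    return [round(i * step) for i in range(int(n_frames / step))]

# Answer-parsing sketch: pull the chosen option letter from a free-form
# answer; the paper falls back to a small helper model when parsing fails.
def parse_choice(answer, options="ABCD"):
    m = re.search(rf"\b([{options}])\b", answer)
    return m.group(1) if m else None

sample_indices(90, 30)                 # 3 s of 30 fps video -> [0, 15, 30, 45, 60, 75]
parse_choice("I choose B (L-shape).")  # -> "B"
```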

Secret Sauce (what’s clever):

  • The benchmark is built to defeat shortcutting. Questions like Reverse Route Plan force models to remember the whole journey, not just first/last frames. Route Shape trials separate camera rotation from walking movement. Memory tasks require a persistent world model, not just a caption of the latest frame. Affordance checks force depth and reach reasoning from the current pose.

Sandwich callouts for the six tasks (operational view):

  • Self-Localization
    • Hook: You can tell if you’re at a corner just by scanning the view.
    • What: Place yourself within the scene (corner/side/center).
    • How: Match landmarks and edges from your perspective.
    • Why: Without it, directions and plans float without anchor.
    • Anchor: “Along the side of the lawn.”
  • Relative Direction
    • Hook: After walking, you can still point to where you began.
    • What: From the end view, say where the start is (e.g., back-right).
    • How: Accumulate turns and steps.
    • Why: Prevents losing the start point after rotations.
    • Anchor: “From here, the start is front-left.”
  • Route Shape
    • Hook: Head panning shouldn’t bend your path.
    • What: Identify the geometric route.
    • How: Separate rotation (pan) from translation (walk).
    • Why: Avoids mislabeling straight as zigzag.
    • Anchor: Straight walk with head turns is still “straight.”
  • Reverse Route Plan
    • Hook: To go back, invert each step/turn in reverse order.
    • What: Provide a step-by-step return path.
    • How: Reverse the forward trajectory.
    • Why: Requires coherent, global memory of the path.
    • Anchor: “Turn around, straight, right, straight, left.”
  • Spatial Memory
    • Hook: Notice what changed between earlier and later views.
    • What: Pick the object that changed.
    • How: Compare snapshots over time.
    • Why: Guards against change blindness.
    • Anchor: “A-frame sign appeared later.”
  • Spatial Affordance
    • Hook: Can you touch it without stepping?
    • What: Judge action feasibility from here.
    • How: Use depth cues and body reach.
    • Why: Avoids impossible/unsafe actions.
    • Anchor: “Yes, the chair is within arm’s reach.”
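The Route Shape distinction above, ignoring head pans when classifying the walked path, can be sketched like this. The event format and shape labels are illustrative assumptions, not the benchmark's taxonomy:

```python
# Route-shape sketch (hypothetical events): classify the path from body
# turns only; head pans rotate the camera but do not bend the route.
def route_shape(events):
    body_turns = [deg for kind, deg in events if kind == "body_turn"]
    if not body_turns:
        return "straight"
    if len(body_turns) == 1:
        return "L-shape"
    same_dir = all(t * body_turns[0] > 0 for t in body_turns)
    return "U-shape" if same_dir and len(body_turns) == 2 else "zigzag"

# A straight walk with frequent head pans is still "straight":
events = [("head_pan", -45), ("walk", 3), ("head_pan", 90), ("walk", 3)]
route_shape(events)  # -> "straight"
```

The failure mode the paper documents is, in effect, models feeding `head_pan` events into the turn list, so a straight walk with panning gets labeled "zigzag."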

04Experiments & Results

The Test (what was measured and why):

  • Models had to answer observer-centric, multiple-choice questions about egocentric videos across six tasks: Self-Localization, Relative Direction, Route Shape, Reverse Route Plan, Spatial Memory, and Spatial Affordance. Accuracy shows whether they truly keep a first-person, time-updated understanding.

The Competition (who/what was compared):

  • 24 MFMs (16 open-source; 8 proprietary) in zero-shot mode.
  • Baselines: Chance (Random), Chance (Frequent), Blind LLM (no visuals), Socratic Model (caption-then-reason), and Human level.

The Scoreboard (with context):

  • Overall human accuracy: 91.55% (an A-level performance).
  • Best model (Gemini 3 Flash): 53.89%, roughly a near-failing D next to humans, leaving a 37.66-point gap.
  • Blind LLM: 31.34%, barely above chance—proving video content is essential.
  • Socratic Model: 31.34%—captions alone drop key egocentric cues (turns, depth, timing), so no gain over blind.
  • Proprietary vs. Open-source: Proprietary models generally do better, especially on Reverse Route Plan, which stresses whole-journey memory and step inversion.

Per-task highlights:

  • Route Shape: Many models confuse head panning with walking direction; straight-with-pans is often mislabeled as zigzag.
  • Reverse Route Plan: Big gaps; it forces inverting the full sequence, not just guessing from first/last frames.
  • Relative Direction: Accuracy drops sharply as number of turns increases—error accumulation.
  • Spatial Memory & Affordance: Gaps to humans are smaller; models leverage visible depth and simple before/after cues better here.

Surprising Findings:

  1. Rotation vs. Translation Confusion: Even top models frequently treat camera rotation like body movement. In controlled tests, Gemini 3 Flash mislabeled 60% of straight-with-head-pan cases as zigzag; Qwen3-VL 235B did so in 53.3% of cases.
  2. Turns Hurt a Lot: As paths gain more turns, accuracy drops steeply. Humans remained strong (e.g., 100% on straight; 90% after two turns), but models like Gemini 2.5 Pro fell from 73.33% (straight) to 33.41% (two turns).
  3. View-bound Memory: Models often assume “not visible” means “doesn’t exist,” failing to keep a persistent world-state when objects leave the frame.
  4. Indoor vs. Outdoor: Outdoor scenes weren’t consistently harder; sometimes they were easier due to less clutter. Scene size (openness) alone doesn’t determine difficulty.
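Finding 2, error accumulation across turns, can be illustrated with a toy simulation. The route (unit-length legs with 90° turns) and the noise level are invented assumptions, not the paper's numbers:

```python
import math
import random

# Toy error-accumulation sketch: a small heading error at every turn
# compounds, so the estimated endpoint drifts further as turns pile up.
def endpoint_error(n_turns, heading_noise_deg=5.0, seed=0):
    rng = random.Random(seed)
    true_x = true_y = est_x = est_y = 0.0
    true_h = est_h = 0.0
    for _ in range(n_turns + 1):             # walk one unit segment per leg
        true_x += math.cos(true_h); true_y += math.sin(true_h)
        est_x += math.cos(est_h);   est_y += math.sin(est_h)
        true_h += math.radians(90)           # intended 90-degree turn
        est_h += math.radians(90 + rng.gauss(0, heading_noise_deg))
    return math.hypot(true_x - est_x, true_y - est_y)

endpoint_error(0)  # a single straight leg: 0.0
```

Human path integration keeps this drift small; the benchmark results suggest models' implicit heading estimates degrade much faster as turns accumulate.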

Concrete Examples:

  • Route Shape question where both Gemini 3 Flash and Qwen3-VL 235B overcount head pans as path changes, picking “zigzag” instead of “straight.”
  • Reverse Route Plan where a proprietary model correctly inverts each step (turn around, straight, right, straight, left), while an open-source model shortcut-guesses from start/end frames and gets it wrong.

Takeaway: SAW-Bench exposes a deep, still-open challenge—keeping a stable, observer-centered coordinate system and memory across rotations, translations, and time.

05Discussion & Limitations

Limitations (what this can’t do):

  • It evaluates understanding, not control: SAW-Bench checks if models can reason from video, but it doesn’t make robots move.
  • Scene coverage vs. world diversity: While videos span many places, they’re still a subset of real-world complexity; some layouts, lighting, or crowd dynamics aren’t covered.
  • Annotation scope: Questions are multiple-choice and tied to predefined trajectories and changes; this ensures clarity but narrows open-ended reasoning.
  • Temporal sampling: Most models see frames at 2 fps; fine motion cues between frames may be missed, possibly hiding some abilities (or inflating errors).

Required Resources:

  • Hardware: Smart glasses or an equivalent egocentric capture device; compute to run video MFMs.
  • People/time: Coordinated filming, double-annotation, and quality control.
  • Model access: APIs or local deployments for diverse MFMs.

When NOT to Use It:

  • If you need fine-grained physical measurements (exact meters) or 3D reconstructions; SAW-Bench is about observer-centric reasoning, not metric mapping.
  • If you’re evaluating second/third-person, static scenes; other benchmarks may fit better.
  • If you need long-horizon, hour-long narratives; SAW-Bench centers on short, controlled routes.

Open Questions:

  • How to encode a stable, egocentric coordinate system that cleanly separates rotation from translation?
  • What memory architecture best maintains a persistent world-state (objects that leave view still “exist”)?
  • Can self-supervised path integration on egocentric streams reduce error accumulation across turns?
  • How much do higher fps, inertial sensors (IMU), or depth cues help, and can models fuse them gracefully?
  • What training signals (e.g., reverse-route tasks during pretraining) boost situated awareness without overfitting to SAW-Bench?

Bottom line: SAW-Bench is a strong, necessary diagnostic, but pushing models to human-like situated awareness will likely need better egocentric representations, longer-term memory, and training signals tied directly to movement and action.

06Conclusion & Future Work

3-Sentence Summary:

  1. SAW-Bench is a real-world, first-person video benchmark that tests whether AI can keep track of itself—its location, directions, path shape, way back, memory of changes, and what it can reach—like people do.
  2. Across 24 models, even the best reached only 53.89% accuracy compared to humans at 91.55%, revealing confusion between head turns and actual movement, compounding errors with more turns, and fragile object memory.
  3. These results spotlight the missing ingredient in many MFMs: a robust, observer-centric coordinate system and persistent world-state that updates reliably over time.

Main Achievement:

  • Turning observer-centric situated awareness into a concrete, measurable target with six carefully designed tasks and real egocentric videos, exposing exactly where current MFMs fall short.

Future Directions:

  • Build representations that explicitly separate rotation from translation; develop persistent memory that survives occlusion and viewpoint changes; explore training with reverse-route and relative-direction objectives; fuse visual streams with IMU/depth when available; and scale to richer, longer, more dynamic environments.

Why Remember This:

  • Because it reframes the goal: not just seeing the world, but knowing where “I” am in it. SAW-Bench pushes AI beyond passive watching toward the lived, physical understanding people rely on to move, plan, and act safely every day.

Practical Applications

  • Evaluate and select AI models for AR glasses that must overlay arrows or labels accurately from the wearer’s viewpoint.
  • Benchmark service robots on their ability to retrace paths in homes, hospitals, or warehouses after multiple turns.
  • Train assistive navigation aids to give observer-relative instructions (e.g., “doorway is back-right”) for low-vision users.
  • Stress-test mobile robots to avoid confusing head rotations (camera pans) with actual navigation steps.
  • Improve retail inventory bots’ memory of object locations across aisles even when items leave the camera view.
  • Optimize indoor delivery drones to plan safe reverse routes to charging bases without GPS.
  • Tune VR locomotion systems to separate head looking from character movement, reducing motion sickness.
  • Design classroom telepresence robots that keep a stable map of the environment as they pan to different speakers.
  • Assess autonomous wheelchair software on reachability judgments (can the user press this button without moving forward?).
  • Evaluate body-worn cameras’ AI for reliable incident recall, distinguishing viewpoint motion from actual displacement.
Tags: situated awareness, egocentric video, observer-centric reasoning, path integration, route shape, reverse route planning, spatial memory, spatial affordance, multimodal foundation models, video question answering, first-person vision, embodied AI, camera rotation vs. translation, world-state memory, benchmark