S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Yalun Dai1*, Hao Li1,4*◆, Shulin Tian1, Runmao Yao1, Yuhao Dong1, Fangzhou Hong1◆, Zhaoxi Chen1◆, Fangfu Liu2, Baoliang Tian3, Dingwen Zhang4, Tao Wang3†, Kim-Hui Yap1†, Ziwei Liu1◆

1 Nanyang Technological University · 2 Tsinghua University · 3 ByteDance · 4 NWPU · ◆ Ropedia
* Equal contribution. † Corresponding author.

Agent Trajectory Sample

[Figure: 64 sampled video frames]
Step 00 · Question
Standing by the headphones, facing away from the entrance door, where is the camera?
Choices: front-left, front-right, back-left, back-right.

[Figure: headphones grounding, entrance door grounding, camera grounding]
Step 01 · 2D Evidence
Ground the entities before reasoning.
headphones + door + camera · 3 / 3 located

[Figure: dense point cloud with object anchors]
Step 02 · 3D Lift
Lift grounded objects into the scene frame.
3 anchors · headphones, door, camera centers

[Figure: point cloud aligned to top-down scene view]
Step 03 · Orientation
Build the egocentric frame.
Stand at headphones, face away from the door

[Figure: top-down diagram labeled facing, door, camera, left, back-right]
Step 04 · Direction
Resolve the camera quadrant.
Camera falls at the person's back-right

Final · Answer
D · back-right
Camera is at the person's back-right.
correct · GT D · 3 turns · 6 tool calls

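The orientation and direction steps above reduce to a little 2D vector geometry on the top-down view. The following is a minimal sketch, assuming the anchor centers have already been projected to top-down (x, y) coordinates in a right-handed frame. The function name `quadrant` and the sample coordinates are hypothetical and are not taken from S-Agent.

```python
def quadrant(observer, facing_away_from, target):
    """Classify `target` as front/back and left/right for an observer who
    stands at `observer` and faces directly away from `facing_away_from`.
    All points are (x, y) tuples in a right-handed top-down frame."""
    # Forward vector: from the reference object through the observer.
    fx = observer[0] - facing_away_from[0]
    fy = observer[1] - facing_away_from[1]
    # Vector from the observer to the target.
    tx = target[0] - observer[0]
    ty = target[1] - observer[1]
    forward = fx * tx + fy * ty   # dot product: > 0 means in front
    left = fx * ty - fy * tx      # 2D cross product: > 0 means to the left
    depth = "front" if forward > 0 else "back"
    side = "left" if left > 0 else "right"
    return f"{depth}-{side}"


# Hypothetical layout: headphones at the origin, door to the north,
# camera to the north-west of the headphones.
headphones, door, camera = (0.0, 0.0), (0.0, 2.0), (-1.0, 1.0)
print(quadrant(headphones, door, camera))  # back-right
```

Targets lying exactly on an axis (dot or cross product of zero) fall into the "back" or "right" bucket here; a real system would likely need an explicit tolerance.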
01 Performance Dashboard

Direct benchmark performance comparison.

A compact view of the paper's tables, with S-Agent highlighted against the strongest baselines.

Zero-Shot Results

S-Agent leads the broad zero-shot picture.

Tables 1, 2, and 6 show S-Agent ranking first on MMSI-Bench and ViewSpatial-Bench, with a 58.8 average on ReVSI.

Legend: S-Agent · Gemini 3 Pro · GPT-5.x · VST-3B / 7B · InternVL3.5-38B · Qwen3-VL / SN-SI · VLM3R-7B · Seed 1.6

[Bar charts, accuracy (%): MMSI-Bench (S-Agent, Gemini, GPT-5.4, Seed, VST-7B); ReVSI (S-Agent, InternVL, Qwen32B, GPT-5.2, VLM3R); ViewSpatial-Bench (S-Agent, VST-3B, SN-SI, VST-7B, Gemini)]

S-Agent: +1.2 on MMSI-Bench vs. Gemini 3 Pro.

SFT Results

S-Agent-8B lifts Qwen3-VL-8B.

Table 3 shows +10.5 on MMSI-Bench and +4.6 on ViewSpatial-Bench over the base Qwen3-VL-8B.

Legend: S-Agent-8B · S-Agent (Qwen3-VL) · GPT-5.4 · Qwen3-VL-8B

[Bar charts, accuracy (%): MMSI-Bench (GPT-5.4, S-Agent-8B, Qwen, S-Agent Qwen); ViewSpatial-Bench (S-Agent-8B, GPT-5.4, S-Agent Qwen, Qwen)]

S-Agent-8B: +10.5 on MMSI-Bench vs. Qwen3-VL-8B.

02 The framework

A planner, an evidence hierarchy, and a dual memory.

At each step, a tool-calling planner maps the question, observations, and memory to an evidence request; a tool or expert executes it and updates both memories. The agent terminates once evidence is sufficient.
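The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the S-Agent implementation: `Memory`, `run_agent`, `toy_planner`, and the tool names `ground` and `lift` are hypothetical, and the planner is a hard-coded stand-in for the VLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Memory:
    # Scene memory: evidence gathered so far (simplified here to one slot per tool).
    scene: Dict[str, object] = field(default_factory=dict)
    # Agent memory: ordered log of (tool name, result) pairs.
    trace: List[Tuple[str, object]] = field(default_factory=list)


Tool = Callable[[Memory], object]
# A planner returns the next tool to call, or None once evidence is sufficient.
Planner = Callable[[str, Memory], Optional[str]]


def run_agent(question: str, planner: Planner, tools: Dict[str, Tool],
              max_steps: int = 8) -> Memory:
    memory = Memory()
    for _ in range(max_steps):
        request = planner(question, memory)
        if request is None:                      # evidence judged sufficient
            break
        result = tools[request](memory)          # a tool or expert executes the request
        memory.scene[request] = result           # update scene memory
        memory.trace.append((request, result))   # update agent memory
    return memory


def toy_planner(question: str, memory: Memory) -> Optional[str]:
    # Stand-in policy: ground objects first, then lift them to 3D, then stop.
    for name in ("ground", "lift"):
        if name not in memory.scene:
            return name
    return None


tools = {
    "ground": lambda m: ["headphones", "door", "camera"],
    "lift": lambda m: {label: (0.0, 0.0, 0.0) for label in m.scene["ground"]},
}
memory = run_agent("Where is the camera?", toy_planner, tools)
print([name for name, _ in memory.trace])  # ['ground', 'lift']
```

The `max_steps` cap is an assumption added so the sketch always terminates even if the planner never declares the evidence sufficient.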

[Figure: the S-Agent pipeline, with VLM planner, three-level spatial tool hierarchy, and scene/agent memory]

The S-Agent pipeline. A VLM acts as semantic planner, spatial tools and experts as scene-specific evidence providers, and memory as the carrier of persistent 3D state across views, frames, and reasoning steps.

Level 1

2D Visual Evidence Acquisition

Pulls useful clues from many overlapping, incomplete views: picking the frames that matter, finding the objects the question asks about, and locating candidates with open-vocabulary detection.

Tools: vlm_ground · detect (GDINO) · depth · keyframe

Level 2

2D-to-3D Geometric Lifting

Turns flat image clues into 3D: depth, real-world coordinates, camera poses, and bird's-eye / new-view evidence — so scattered observations all live in one shared space.

Tools: metric_3d (DA3) · camera pose · BEV
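As an illustration of what 2D-to-3D lifting involves, the sketch below back-projects a pixel with metric depth through a standard pinhole camera model and then moves it into a shared world frame with a camera-to-world pose. The intrinsics, pose, and function names are assumptions chosen for the example; the actual interfaces of the S-Agent tools are not shown in the source.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3D point in the camera frame
    (pinhole model; fx, fy are focal lengths and cx, cy the principal point,
    all in pixels)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def camera_to_world(p_cam, R, t):
    """Apply a camera-to-world pose: p_world = R @ p_cam + t.
    R is a 3x3 rotation given as nested lists, t a length-3 translation."""
    return tuple(sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3))


# Hypothetical values: a 640x480 camera, 100 px right of the image center,
# 2 m away, with the camera placed 1 m along the world x-axis.
p_cam = backproject(u=420, v=240, depth=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_world = camera_to_world(p_cam, identity, (1.0, 0.0, 0.0))
print(p_cam, p_world)
```

Once every observation is expressed in the same world frame, detections from different frames can be compared directly, which is what makes the bird's-eye and cross-view evidence possible.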
Level 3

Spatial Knowledge Aggregation

Expert tools turn the clues into clear answers — how many, which direction, which way things face, and how big or far — handed back to the planner ready to use.

Tools: measure · count · relpos · vis_orient · obj_view

Scene Memory

Builds a growing memory organized around objects — tying repeated sightings to the same object and collecting its visual and 3D evidence. It remembers only what the question needs, not a full 3D scan of the scene.
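One way to picture an object-centric memory is the sketch below, which ties a new sighting to an existing object when the label matches and the 3D centers are close, and otherwise creates a new entry. The class names, the running-mean update, and the 0.5 m `merge_radius` are all assumptions made for illustration; the source does not describe how S-Agent associates sightings.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ObjectEntry:
    label: str
    center: tuple                               # running mean of 3D centers (meters)
    frames: list = field(default_factory=list)  # frame indices where it was seen


class SceneMemory:
    def __init__(self, merge_radius=0.5):
        self.merge_radius = merge_radius  # assumed association threshold, in meters
        self.objects = []

    def add_sighting(self, label, center, frame):
        for obj in self.objects:
            same_label = obj.label == label
            if same_label and math.dist(obj.center, center) <= self.merge_radius:
                n = len(obj.frames)
                # Fold the new observation into the running mean of the center.
                obj.center = tuple((c * n + x) / (n + 1)
                                   for c, x in zip(obj.center, center))
                obj.frames.append(frame)
                return obj
        obj = ObjectEntry(label, tuple(center), [frame])
        self.objects.append(obj)
        return obj


scene = SceneMemory()
scene.add_sighting("monitor", (0.0, 0.0, 1.0), frame=3)
scene.add_sighting("monitor", (0.1, 0.0, 1.0), frame=7)   # same monitor, seen again
scene.add_sighting("monitor", (2.0, 0.0, 1.0), frame=9)   # a different monitor
print(len(scene.objects))  # 2
```

Because only objects that tools have actually reported are stored, the memory stays limited to what the question needs rather than growing into a full scene reconstruction.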

Agent Memory

Keeps a record of the reasoning — thoughts, tool calls, results, failures, and partial conclusions — so the planner can see what's missing, recheck what's unsure, and avoid repeating or contradicting itself.

03 Results

Best overall zero-shot spatial reasoning.

Without any spatial fine-tuning, S-Agent tops MMSI-Bench and ViewSpatial-Bench, with the largest gains on motion and perspective reasoning.

MMSI-Bench — zero-shot, by reasoning dimension

Per-dimension accuracy following the MMSI-Bench taxonomy. C/O/R = camera / object / region. Δ on the S-Agent row is the absolute gain over the InternVL3.5-8B base VLM.

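| Model | C-C | O-O | R-R | C-O | O-R | C-R | Meas. | Appr. | Cam. | Obj. | MSR | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary models | | | | | | | | | | | | |
| Gemini 3 Pro | 47.3 | 48.9 | 42.0 | 43.0 | 37.6 | 60.2 | 64.1 | 39.4 | 41.9 | 47.4 | 37.9 | 45.2 |
| GPT-5.4 | 41.9 | 33.0 | 35.8 | 49.8 | 42.4 | 68.7 | 54.7 | 37.4 | 28.3 | 40.8 | 36.4 | 41.9 |
| Grok 4 | 36.6 | 35.1 | 39.5 | 34.9 | 45.9 | 50.6 | 21.9 | 22.7 | 40.5 | 43.4 | 38.4 | 37.8 |
| Open-weight general models | | | | | | | | | | | | |
| InternVL3.5-8B (base) | 29.0 | 26.6 | 29.6 | 24.4 | 31.8 | 25.3 | 29.7 | 25.8 | 14.9 | 34.2 | 36.4 | 29.0 |
| Qwen3-VL-8B-Instruct | 28.0 | 37.2 | 32.1 | 31.4 | 35.3 | 38.5 | 37.5 | 15.2 | 27.0 | 28.9 | 29.8 | 31.1 |
| Qwen3.5-9B | 34.4 | 36.2 | 34.6 | 39.5 | 38.8 | 54.2 | 56.3 | 28.8 | 36.5 | 26.3 | 28.8 | 36.5 |
| Open-weight spatial models | | | | | | | | | | | | |
| SN-SI-1.1-Qwen3VL-8B | 44.1 | 38.3 | 33.3 | 65.1 | 38.8 | 59.0 | 48.4 | 24.2 | 29.7 | 34.2 | 22.2 | 38.1 |
| VST-7B-SFT | 39.8 | 36.2 | 35.8 | 37.2 | 29.4 | 33.7 | 29.7 | 47.0 | 36.5 | 35.5 | 18.2 | 32.5 |
| S-Agent (Ours) | 46.2 | 43.6 | 37.0 | 43.0 | 43.5 | 63.9 | 57.8 | 40.9 | 46.0 | 48.7 | 44.4 | 46.4 (+17.4) |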

Accuracy in %. The largest gains over the base VLM are on the camera–region relation (+38.6) and camera motion (+31.1), categories that benefit most from accumulated geometric evidence; multi-step reasoning improves by +8.0.
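The per-dimension gains can be re-derived from the table. In the short sketch below, the numbers are copied from the InternVL3.5-8B (base) and S-Agent rows above; the variable names are illustrative.

```python
columns = ["C-C", "O-O", "R-R", "C-O", "O-R", "C-R",
           "Meas.", "Appr.", "Cam.", "Obj.", "MSR", "Avg."]
base = [29.0, 26.6, 29.6, 24.4, 31.8, 25.3, 29.7, 25.8, 14.9, 34.2, 36.4, 29.0]
s_agent = [46.2, 43.6, 37.0, 43.0, 43.5, 63.9, 57.8, 40.9, 46.0, 48.7, 44.4, 46.4]

# Absolute gain of S-Agent over its base VLM, rounded to one decimal place.
gains = {c: round(s - b, 1) for c, b, s in zip(columns, base, s_agent)}

# List the dimensions from largest to smallest gain.
for name, delta in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>6}: +{delta:.1f}")
```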

ViewSpatial-Bench (zero-shot)

Camera- and person-perspective spatial reasoning. Δ over the base VLM.

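| Model | C-OVO | C-RD | P-OVO | P-RD | P-SSRD | Avg. |
|---|---|---|---|---|---|---|
| Gemini 3 Pro | 31.6 | 61.9 | 41.1 | 74.4 | 38.9 | 50.4 |
| GPT-5.4 | 27.9 | 60.2 | 41.0 | 48.5 | 40.1 | 45.6 |
| Qwen3-VL-8B | 29.7 | 54.2 | 47.3 | 40.3 | 31.1 | 42.2 |
| VST-7B-SFT | 29.6 | 52.7 | 51.9 | 50.7 | 64.5 | 50.5 |
| S-Agent (Ours) | 55.5 | 62.5 | 42.2 | 81.1 | 60.6 | 60.0 (+14.4) |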

Best on C-OVO (55.5) and P-RD (81.1); +20.5 over GPT-5.4 on P-SSRD.

Trajectory distillation → S-Agent-8B

Fine-tuning Qwen3-VL-8B on 292K S-Agent trajectories (S-300K). Accuracy in %.

| Model | MMSI | ViewSpatial |
|---|---|---|
| GPT-5.4 (proprietary) | 41.9 | 45.6 |
| Qwen3-VL-8B (base) | 31.1 | 42.2 |
| S-Agent-8B (Ours) | 41.6 (+10.5) | 46.8 (+4.6) |

A 10.5-point lift over the base model on MMSI-Bench, reaching near parity with GPT-5.4 (41.6 vs. 41.9).

04 Reasoning trajectories

Browse real agent sessions.

Each card below is a real S-Agent run: the question, the key visual evidence, the tools it called, and the answer it submitted.

Distance · ReVSI
[Image: keyframes selected for a distance trajectory]
Question: Measuring from the closest point of each object, what is the direct distance between the computer mouse and the smoke detector (in meters)?
GT 3.0 m · S-Agent 3.09 m · T★ keyframe search → metric-depth 3D lifting → measurement expert.

Distance · ReVSI
[Image: keyframes selected for a distance trajectory]
Question: Measuring from the closest point of each object, what is the direct distance between the laptop and the desktop printer (in meters)?
GT 3.4 m · S-Agent 3.33 m · T★ keyframe search → metric-depth 3D lifting → measurement expert.

Counting · ReVSI
[Image: keyframes selected for a counting trajectory]
Question: How many monitor(s) are in the scene?
GT 3 · S-Agent 3 · T★ keyframe search → open-vocab detection → counting expert (multi-frame NMS).

Counting · ReVSI
[Image: keyframes selected for a counting trajectory]
Question: How many monitor(s) are in the scene?
GT 2 · S-Agent 2 · T★ keyframe search → open-vocab detection → counting expert (multi-frame NMS).

Route · ReVSI
[Image: keyframes selected for a route trajectory]
Question: Beginning at the range hood and facing the dishwasher, which turns navigate to the floor lamp?
GT A · S-Agent A · detection → grounding → Metric3D → spatial reconstruction.

Route · ReVSI
[Image: keyframes selected for a route trajectory]
Question: Beginning at the laptop and facing the window, which turns navigate to the closet?
GT B · S-Agent B · detection → grounding → Metric3D → spatial reconstruction.

Size · ReVSI
[Image: keyframes selected for a size trajectory]
Question: What is the longest dimension of the wet floor sign, measured in centimeters?
GT 61 cm · S-Agent 62.6 cm · detection → grounding → Metric3D → measurement expert.

Measurement · MMSI
[Image: input image for a measurement trajectory]
Question: Assuming the width of this cabinet is 10 cm, what is the length of this cabinet?
GT D · S-Agent D · detection → Metric3D geometry → measurement expert.

Measurement · MMSI
[Image: input image for a measurement trajectory]
Question: Compared to the plate under the wooden table, which has a larger radius: the ceramic pot or the plate?
GT A · S-Agent A · object detection → 3D metric probes → size comparison.

Multi-step · MMSI
[Image: input image for a multi-step trajectory]
Question: With the direction of going up the stairs facing north, which spatial statement is correct?
GT D · S-Agent D · detection → grounding → Metric3D → relative-position expert.

Multi-step · MMSI
[Image: input image for a multi-step trajectory]
Question: Given several room facts, which statement about the scene is incorrect?
GT B · S-Agent B · count, detection, grounding, metric and relative-position checks.

Multi-step · MMSI
[Image: input image for a multi-step trajectory]
Question: If the window faces north, where is the cardboard box?
GT A · S-Agent A · grounding → Metric3D reconstruction → southeast-corner resolution.

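The distance questions above ask for the gap "from the closest point of each object". Once both objects have been lifted to 3D points in a shared metric frame, that quantity is the minimum pairwise distance between the two point sets. The brute-force sketch below illustrates the idea; the function name and the sample coordinates are hypothetical, and this is not claimed to be how the S-Agent measurement expert is implemented.

```python
import math
from itertools import product


def closest_point_distance(points_a, points_b):
    """Minimum Euclidean distance between any point of object A and any point
    of object B. Brute force, O(len(A) * len(B)); fine for small point sets."""
    return min(math.dist(p, q) for p, q in product(points_a, points_b))


# Hypothetical 3D points (meters) sampled from two lifted objects.
mouse = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
smoke_detector = [(3.1, 0.0, 0.0), (3.5, 0.0, 0.0)]
print(round(closest_point_distance(mouse, smoke_detector), 2))  # 3.0
```

For dense point clouds, a spatial index such as a k-d tree would be the usual replacement for the double loop.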
Cite this work

BibTeX

@article{dai2026sagent,
  title   = {S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence},
  author  = {Dai, Yalun and Li, Hao and Tian, Shulin and Yao, Runmao and
             Dong, Yuhao and Hong, Fangzhou and Chen, Zhaoxi and Liu, Fangfu and
             Tian, Baoliang and Zhang, Dingwen and Wang, Tao and Yap, Kim-Hui and
             Liu, Ziwei},
  journal = {Technical Report},
  year    = {2026}
}