S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Yalun Dai¹*, Hao Li¹,⁴*◆, Shulin Tian¹, Runmao Yao¹, Yuhao Dong¹, Fangzhou Hong¹◆, Zhaoxi Chen¹◆, Fangfu Liu², Baoliang Tian³, Dingwen Zhang⁴, Tao Wang³†, Kim-Hui Yap¹†, Ziwei Liu¹◆

¹Nanyang Technological University ²Tsinghua University ³ByteDance ⁴NWPU ◆Ropedia
*Equal contribution. †Corresponding author.


Agent Trajectory Sample

Step 00 · Question

Standing by the headphones, facing away from the entrance door — where is the camera?

Choices: front-left, front-right, back-left, back-right.

Step 01 · 2D Evidence

Ground the entities before reasoning.

headphones + door + camera · 3 / 3 located ✓

Step 02 · 3D Lift

Lift grounded objects into the scene frame.

3 anchors · headphones, door, camera centers

Point cloud aligned to top-down scene view

Step 03 · Orientation

Build the egocentric frame.

Stand at headphones, face away from the door

Step 04 · Direction

Resolve the camera quadrant.

Camera falls at the person's back-right

D · back-right

Final · Answer

Camera is at the person's back-right.

✓ correct · GT D · 3 turns · 6 tool calls
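Once the three anchors sit in a top-down frame, the orientation and direction steps above reduce to a small piece of 2D geometry. The following is a minimal sketch, not the S-Agent implementation: the function name `egocentric_quadrant` is illustrative, the frame is assumed to have x pointing right and y pointing up, and the anchor coordinates are hypothetical because the page does not give the real ones.

```python
import math

def egocentric_quadrant(stand, away_from, target):
    """Quadrant of `target` for an observer at `stand` facing away from `away_from`.

    All points are (x, y) in a top-down frame (x right, y up).
    """
    # Forward axis: from the reference object, through the observer, outward.
    fx, fy = stand[0] - away_from[0], stand[1] - away_from[1]
    norm = math.hypot(fx, fy)
    if norm == 0:
        raise ValueError("observer and reference coincide; facing is undefined")
    fx, fy = fx / norm, fy / norm
    # Right axis: the forward axis rotated 90 degrees clockwise.
    rx, ry = fy, -fx
    # Project the offset to the target onto the two axes.
    dx, dy = target[0] - stand[0], target[1] - stand[1]
    forward = dx * fx + dy * fy
    right = dx * rx + dy * ry
    front_back = "front" if forward >= 0 else "back"
    left_right = "right" if right >= 0 else "left"
    return f"{front_back}-{left_right}"

# Hypothetical anchors: headphones at the origin, door to the south,
# camera to the south-east.
headphones, door, camera = (0.0, 0.0), (0.0, -3.0), (2.0, -1.5)
print(egocentric_quadrant(headphones, door, camera))  # back-right
```

With these made-up coordinates the observer faces north (away from the door), so a camera to the south-east lands behind and to the right.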

01 Performance Dashboard

Direct benchmark performance comparison.

A compact view of the paper tables, with S-Agent highlighted against the strongest baselines.

Zero-Shot Results

S-Agent leads the broad zero-shot picture.

Tables 1, 2, and 6 show S-Agent ranking first on MMSI-Bench and ViewSpatial-Bench, with a 58.8 average on ReVSI.

[Figure: zero-shot bar charts, accuracy (%), one panel per benchmark. Legend: S-Agent, Gemini 3 Pro, GPT-5.x, VST-3B / 7B, InternVL3.5-38B, Qwen3-VL / SN-SI, VLM3R-7B, Seed 1.6.
MMSI-Bench panel: S-Agent, Gemini, GPT-5.4, Seed, VST-7B.
ReVSI panel: S-Agent, InternVL, Qwen32B, GPT-5.2, VLM3R.
ViewSpatial-Bench panel: S-Agent, VST-3B, SN-SI, VST-7B, Gemini.]


S-Agent: +1.2 on MMSI-Bench vs. Gemini 3 Pro.

SFT Results

S-Agent-8B lifts Qwen3-VL-8B.

Table 3 shows +10.5 on MMSI-Bench and +4.6 on ViewSpatial-Bench over the base Qwen3-VL-8B.

[Figure: SFT bar charts, accuracy (%), one panel per benchmark. Legend: S-Agent-8B, S-Agent (Qwen3-VL), GPT-5.4, Qwen3-VL-8B.
MMSI-Bench panel: GPT-5.4, S-Agent-8B, Qwen, S-Agent Qwen.
ViewSpatial-Bench panel: S-Agent-8B, GPT-5.4, S-Agent Qwen, Qwen.]


S-Agent-8B: +10.5 on MMSI-Bench vs. Qwen3-VL-8B.

02 The framework

A planner, an evidence hierarchy, and a dual memory.

At each step, a tool-calling planner maps the question, observations, and memory to an evidence request; a tool or expert executes it and updates both memories. The agent terminates once evidence is sufficient.

The S-Agent pipeline: VLM planner, three-level spatial tool hierarchy, and scene/agent memory.

The S-Agent pipeline. A VLM acts as semantic planner, spatial tools and experts as scene-specific evidence providers, and memory as the carrier of persistent 3D state across views, frames, and reasoning steps.
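The control flow described above can be sketched as a loop. This is a toy illustration under stated assumptions: `plan`, `TOOLS`, and both memory layouts are hypothetical stand-ins (a rule-based function replaces the VLM planner, and the tools return canned values), so it shows only the shape of the loop and not the S-Agent implementation.

```python
def plan(question, scene_memory, agent_memory):
    """Hypothetical stand-in for the VLM planner: choose the next evidence request."""
    if "boxes" not in scene_memory:
        return {"tool": "detect", "args": {"query": question["targets"]}}
    if "points_3d" not in scene_memory:
        return {"tool": "lift", "args": {"boxes": scene_memory["boxes"]}}
    return None  # evidence is sufficient; stop and answer

# Hypothetical tools returning canned evidence, keyed by scene-memory field.
TOOLS = {
    "detect": lambda args: {"boxes": {name: (0, 0, 10, 10) for name in args["query"]}},
    "lift": lambda args: {"points_3d": {name: (0.0, 0.0, 1.0) for name in args["boxes"]}},
}

def run_agent(question, max_steps=8):
    scene_memory, agent_memory = {}, []
    for step in range(max_steps):
        request = plan(question, scene_memory, agent_memory)
        if request is None:
            break
        result = TOOLS[request["tool"]](request["args"])
        scene_memory.update(result)  # persistent scene state
        agent_memory.append({"step": step, "request": request, "result": result})
    return scene_memory, agent_memory

scene, trace = run_agent({"targets": ["headphones", "door", "camera"]})
print([entry["request"]["tool"] for entry in trace])  # ['detect', 'lift']
```

The step cap (`max_steps`) is an added safeguard for the sketch; the page only states that the agent stops once evidence is sufficient.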

Level 1

2D Visual Evidence Acquisition

Pulls useful clues from many overlapping, incomplete views: picking the frames that matter, finding the objects the question asks about, and locating candidates with open-vocabulary detection.

vlm_ground · detect (GDINO) · depth · keyframe

Level 2

2D-to-3D Geometric Lifting

Turns flat image clues into 3D: depth, real-world coordinates, camera poses, and bird's-eye / new-view evidence — so scattered observations all live in one shared space.

metric_3d (DA3) · camera pose · BEV
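The core of 2D-to-3D lifting is pinhole back-projection followed by a camera-to-world transform. Below is a minimal pure-Python sketch, assuming known intrinsics `(fx, fy, cx, cy)`, a metric depth value at the pixel, and a camera-to-world rotation `R` and translation `t`. All numbers are made up for illustration, and the page's actual tools are not reproduced here.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3D point in the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def camera_to_world(p_cam, R, t):
    """Apply a camera-to-world pose: p_world = R @ p_cam + t."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i]
        for i in range(3)
    )

# Made-up intrinsics and pose, for illustration only.
fx = fy = 500.0
cx, cy = 320.0, 240.0
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity rotation
t = (0.0, 0.0, 2.0)

p_cam = backproject(420.0, 240.0, 2.5, fx, fy, cx, cy)
p_world = camera_to_world(p_cam, R, t)
print(p_cam, p_world)  # (0.5, 0.0, 2.5) (0.5, 0.0, 4.5)
```

Applying each frame's own pose to its back-projected points is what places observations from different views in one shared coordinate system.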

Level 3

Spatial Knowledge Aggregation

Expert tools turn the clues into clear answers — how many, which direction, which way things face, and how big or far — handed back to the planner ready to use.

measure · count · relpos · vis_orient · obj_view
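As one example of such an expert, a distance measurement can be computed from the lifted 3D points of two objects. This sketch assumes each object is a small list of `(x, y, z)` points in meters. The brute-force search and the function name are illustrative assumptions, not the `measure` tool's actual algorithm.

```python
import math

def closest_point_distance(points_a, points_b):
    """Smallest Euclidean distance between any point of A and any point of B."""
    if not points_a or not points_b:
        raise ValueError("both objects need at least one 3D point")
    return min(math.dist(p, q) for p in points_a for q in points_b)

# Made-up point sets for two objects.
object_a = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
object_b = [(3.1, 0.0, 4.0), (3.3, 0.0, 4.2)]
print(round(closest_point_distance(object_a, object_b), 2))  # 5.0
```

The pairwise search costs O(|A|·|B|), which is fine for a handful of points per object; a spatial index would be the usual replacement for dense point clouds.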

Scene Memory

Builds a growing memory organized around objects — tying repeated sightings to the same object and collecting its visual and 3D evidence. It remembers only what the question needs, not a full 3D scan of the scene.
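One way to picture an object-centric memory is a store that merges a new sighting into an existing entry when the label matches and the 3D position is close, and creates a new entry otherwise. The class names, the 0.5 m merge radius, and the running-mean update are all assumptions made for this sketch.

```python
import math
from dataclasses import dataclass, field

@dataclass
class ObjectEntry:
    label: str
    center: tuple  # running mean of observed 3D centers (meters)
    sightings: list = field(default_factory=list)  # (frame_id, center) pairs

class SceneMemory:
    def __init__(self, merge_radius=0.5):
        self.merge_radius = merge_radius
        self.objects = []

    def add_sighting(self, label, center, frame_id):
        for obj in self.objects:
            if obj.label == label and math.dist(obj.center, center) <= self.merge_radius:
                obj.sightings.append((frame_id, center))
                n = len(obj.sightings)
                # Incremental update of the running mean of the object's center.
                obj.center = tuple(c + (new - c) / n for c, new in zip(obj.center, center))
                return obj
        obj = ObjectEntry(label, tuple(center), [(frame_id, center)])
        self.objects.append(obj)
        return obj

memory = SceneMemory()
memory.add_sighting("monitor", (1.0, 0.0, 2.0), frame_id=3)
memory.add_sighting("monitor", (1.1, 0.0, 2.0), frame_id=7)  # same monitor, new view
memory.add_sighting("monitor", (4.0, 0.0, 2.0), frame_id=9)  # a different monitor
print(len(memory.objects))  # 2
```

Only objects that some tool has reported ever enter the store, which mirrors the point that the memory holds what the question needs rather than a full scan.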

Agent Memory

Keeps a record of the reasoning — thoughts, tool calls, results, failures, and partial conclusions — so the planner can see what's missing, recheck what's unsure, and avoid repeating or contradicting itself.
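A record like this can be as simple as an append-only log with two queries: has this exact call been tried before, and which calls failed. The class, method, and field names below are hypothetical and chosen only for this sketch.

```python
import json

class AgentMemory:
    def __init__(self):
        self.records = []

    def log(self, tool, args, status, result=None):
        """Append one tool call with its outcome ('ok' or 'failed')."""
        self.records.append({"tool": tool, "args": args, "status": status, "result": result})

    def already_tried(self, tool, args):
        """True if an identical call was made before, whatever its outcome."""
        key = json.dumps(args, sort_keys=True)
        return any(
            r["tool"] == tool and json.dumps(r["args"], sort_keys=True) == key
            for r in self.records
        )

    def failures(self):
        return [r for r in self.records if r["status"] == "failed"]

log = AgentMemory()
log.log("detect", {"query": "door"}, "ok", result=(12, 40, 80, 200))
log.log("detect", {"query": "camera"}, "failed")
print(log.already_tried("detect", {"query": "door"}))  # True
print(len(log.failures()))                             # 1
```

A planner could consult `already_tried` before issuing a request and use `failures` to decide what to recheck with a different tool or frame.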

03 Results

Best overall zero-shot spatial reasoning.

Without any spatial fine-tuning, S-Agent tops MMSI-Bench and ViewSpatial-Bench, with the largest gains on motion and perspective reasoning.

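✓ Best overall on MMSI-Bench (46.4%), surpassing Gemini 3 Pro and GPT-5.4
✓ +17.4 pp over the base VLM on the MMSI-Bench average; +31.1 pp on camera motion
✓ S-Agent-8B, distilled into Qwen3-VL-8B, approaches GPT-5.4 on MMSI-Bench (41.6 vs. 41.9) and passes it on ViewSpatial-Bench (46.8 vs. 45.6)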

MMSI-Bench — zero-shot, by reasoning dimension

Per-dimension accuracy following the MMSI-Bench taxonomy. C/O/R = camera / object / region. Δ on the S-Agent row is the absolute gain over the InternVL3.5-8B base VLM.

Model	C-C	O-O	R-R	C-O	O-R	C-R	Meas.	Appr.	Cam.	Obj.	MSR	Avg.
Proprietary models
Gemini 3 Pro	47.3	48.9	42.0	43.0	37.6	60.2	64.1	39.4	41.9	47.4	37.9	45.2
GPT-5.4	41.9	33.0	35.8	49.8	42.4	68.7	54.7	37.4	28.3	40.8	36.4	41.9
Grok 4	36.6	35.1	39.5	34.9	45.9	50.6	21.9	22.7	40.5	43.4	38.4	37.8
Open-weight general models
InternVL3.5-8B (base)	29.0	26.6	29.6	24.4	31.8	25.3	29.7	25.8	14.9	34.2	36.4	29.0
Qwen3-VL-8B-Instruct	28.0	37.2	32.1	31.4	35.3	38.5	37.5	15.2	27.0	28.9	29.8	31.1
Qwen3.5-9B	34.4	36.2	34.6	39.5	38.8	54.2	56.3	28.8	36.5	26.3	28.8	36.5
Open-weight spatial models
SN-SI-1.1-Qwen3VL-8B	44.1	38.3	33.3	65.1	38.8	59.0	48.4	24.2	29.7	34.2	22.2	38.1
VST-7B-SFT	39.8	36.2	35.8	37.2	29.4	33.7	29.7	47.0	36.5	35.5	18.2	32.5
S-Agent (Ours)	46.2	43.6	37.0	43.0	43.5	63.9	57.8	40.9	46.0	48.7	44.4	46.4 +17.4

Accuracy in %. The gains over the base VLM are led by the camera–region relation (+38.6) and camera motion (+31.1), with multi-step reasoning up +8.0; these categories benefit most from accumulated geometric evidence.
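The Δ figures quoted for the MMSI-Bench table can be reproduced from its InternVL3.5-8B (base) and S-Agent rows. The snippet below only recomputes numbers already shown in the table; nothing in it is new data.

```python
dims = ["C-C", "O-O", "R-R", "C-O", "O-R", "C-R",
        "Meas.", "Appr.", "Cam.", "Obj.", "MSR", "Avg."]
base = [29.0, 26.6, 29.6, 24.4, 31.8, 25.3, 29.7, 25.8, 14.9, 34.2, 36.4, 29.0]
s_agent = [46.2, 43.6, 37.0, 43.0, 43.5, 63.9, 57.8, 40.9, 46.0, 48.7, 44.4, 46.4]

# Absolute gain (percentage points) of S-Agent over the base VLM, per dimension.
gains = {d: round(s - b, 1) for d, s, b in zip(dims, s_agent, base)}

# Print dimensions from largest to smallest gain.
for dim, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{dim:6s} {gain:+.1f}")
```

Sorting the differences puts C-R (+38.6) and Cam. (+31.1) at the top, and the Avg. entry matches the +17.4 printed in the S-Agent row.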

ViewSpatial-Bench (zero-shot)

Camera- and person-perspective spatial reasoning. Δ over the base VLM.

Model	C-OVO	C-RD	P-OVO	P-RD	P-SSRD	Avg.
Gemini 3 Pro	31.6	61.9	41.1	74.4	38.9	50.4
GPT-5.4	27.9	60.2	41.0	48.5	40.1	45.6
Qwen3-VL-8B	29.7	54.2	47.3	40.3	31.1	42.2
VST-7B-SFT	29.6	52.7	51.9	50.7	64.5	50.5
S-Agent (Ours)	55.5	62.5	42.2	81.1	60.6	60.0 +14.4

Best on C-OVO (55.5) and P-RD (81.1); +20.5 over GPT-5.4 on P-SSRD.

Trajectory distillation → S-Agent-8B

Fine-tuning Qwen3-VL-8B on 292K S-Agent trajectories (S-300K). Accuracy in %.

Model	MMSI	ViewSpatial
GPT-5.4 (proprietary)	41.9	45.6
Qwen3-VL-8B (base)	31.1	42.2
S-Agent-8B (Ours)	41.6 +10.5	46.8 +4.6

A 10.5-point lift over the base model on MMSI-Bench (31.1 → 41.6), close to GPT-5.4 (41.9).

04 Reasoning trajectories

Browse real agent sessions.

Each module is a real S-Agent run: the question, the key visual evidence, the tools it called, and the answer it submitted.

Distance · ReVSI

Keyframes selected for a distance trajectory

Question

Measuring from the closest point of each object, what is the direct distance between the computer mouse and the smoke detector (in meters)?

GT 3.0 m · S-Agent 3.09 m ✓ — T★ keyframe search → metric-depth 3D lifting → measurement expert.


Distance · ReVSI

Question

Measuring from the closest point of each object, what is the direct distance between the laptop and the desktop printer (in meters)?

GT 3.4 m · S-Agent 3.33 m ✓ — T★ keyframe search → metric-depth 3D lifting → measurement expert.


Counting · ReVSI

Keyframes selected for a counting trajectory

Question

How many monitor(s) are in the scene?

GT 3 · S-Agent 3 ✓ — T★ keyframe search → open-vocab detection → counting expert (multi-frame NMS).
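The page names the counting expert's deduplication step only as "multi-frame NMS". One plausible reading, sketched here as an assumption rather than the actual method, is a greedy suppression over 3D detection centers pooled from all keyframes: keep the highest-scoring detection, drop any later one within a distance threshold of a kept center, and repeat. The 0.4 m threshold and the detection tuples are invented for the example.

```python
import math

def count_objects(detections, radius=0.4):
    """Greedy 3D suppression; detections are (score, (x, y, z)) pooled over frames."""
    kept = []
    for score, center in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep a detection only if it is far from every center kept so far.
        if all(math.dist(center, k) > radius for k in kept):
            kept.append(center)
    return len(kept)

detections = [
    (0.92, (1.00, 0.8, 2.0)),  # monitor A, seen in one frame
    (0.88, (1.05, 0.8, 2.1)),  # monitor A again, seen in another frame
    (0.81, (2.00, 0.8, 2.0)),  # monitor B
    (0.77, (3.10, 0.8, 2.0)),  # monitor C
    (0.60, (3.05, 0.8, 1.9)),  # monitor C again
]
print(count_objects(detections))  # 3
```

Suppressing in 3D rather than per image is what lets the same object, seen from several keyframes, count once.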


Counting · ReVSI

Question

How many monitor(s) are in the scene?

GT 2 · S-Agent 2 ✓ — T★ keyframe search → open-vocab detection → counting expert (multi-frame NMS).


Route · ReVSI

Keyframes selected for a route trajectory

Question

Beginning at the range hood and facing the dishwasher, which turns navigate to the floor lamp?

GT A · S-Agent A ✓ — detection → grounding → Metric3D → spatial reconstruction.


Route · ReVSI

Question

Beginning at the laptop and facing the window, which turns navigate to the closet?

GT B · S-Agent B ✓ — detection → grounding → Metric3D → spatial reconstruction.


Size · ReVSI

Keyframes selected for a size trajectory

Question

What is the longest dimension of the wet floor sign, measured in centimeters?

GT 61 cm · S-Agent 62.6 cm ✓ — detection → grounding → Metric3D → measurement expert.


Measurement · MMSI

Input image for a measurement trajectory

Question

Assuming the width of this cabinet is 10 cm, what is the length of this cabinet?

GT D · S-Agent D ✓ — detection → Metric3D geometry → measurement expert.


Measurement · MMSI

Question

Compared to the plate under the wooden table, which has a larger radius: the ceramic pot or the plate?

GT A · S-Agent A ✓ — object detection → 3D metric probes → size comparison.


Multi-step · MMSI

Question

With the direction of going up the stairs facing north, which spatial statement is correct?

GT D · S-Agent D ✓ — detection → grounding → Metric3D → relative-position expert.


Multi-step · MMSI

Question

Given several room facts, which statement about the scene is incorrect?

GT B · S-Agent B ✓ — count, detection, grounding, metric and relative-position checks.


Multi-step · MMSI

Question

If the window faces north, where is the cardboard box?

GT A · S-Agent A ✓ — grounding → Metric3D reconstruction → southeast-corner resolution.


↳ Cite this work

BibTeX

@article{dai2026sagent,
  title   = {S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence},
  author  = {Dai, Yalun and Li, Hao and Tian, Shulin and Yao, Runmao and
             Dong, Yuhao and Hong, Fangzhou and Chen, Zhaoxi and Liu, Fangfu and
             Tian, Baoliang and Zhang, Dingwen and Wang, Tao and Yap, Kim-Hui and
             Liu, Ziwei},
  journal = {Technical Report},
  year    = {2026}
}