MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model

* Equal contribution


TL;DR

Are Vision-Language Models robust at multi-hop compositional spatial reasoning?

Not yet. In real-world scenarios, human instructions to embodied agents are far from simple — they require chaining multiple spatial inferences such as perspective-taking, attribute filtering, and relation comparison. However, existing benchmarks only test single-hop spatial relations (e.g., "Is X to the right of Y?"), which cannot capture this compositional complexity. We introduce MultihopSpatial, a benchmark featuring 1- to 3-hop compositional spatial queries paired with visual grounding evaluation. Our extensive evaluation of 37 VLMs reveals that multi-hop spatial reasoning remains a formidable challenge even for frontier models, and that models can often select the correct MCQ answer without genuinely locating the target — a critical blind spot we expose through our grounded metric, Acc@50IoU.

Comparison of existing benchmarks and MultihopSpatial

Figure: (Left) Existing benchmarks rely on single-hop standard MCQs without spatial localization. MultihopSpatial requires multi-hop compositional reasoning with grounded evaluation. (Right) Frontier VLMs show high MCQ accuracy (solid bars) but plummet under Acc@50IoU (striped bars), revealing a severe spatial grounding blind spot.

Multi-hop Compositional Reasoning

Real-world instructions require chaining multiple spatial inferences. MultihopSpatial introduces 1- to 3-hop queries composing attribute, position, and relation — going far beyond the single-step "left/right" questions of existing benchmarks.

Motivated by Embodied AI

Instructions to embodied agents (e.g., humanoid robots, VLAs) are inherently complex. An agent must adopt a perspective, filter candidates, and compare relations before it can act — exactly the multi-hop reasoning our benchmark evaluates.

Grounded Evaluation

Embodied agents need not only correct answers but also precise visual grounding. We propose Acc@50IoU, which requires both a correct MCQ answer and an accurate bounding box, exposing models that guess without truly locating the target.

Training Corpus for Spatial Intelligence

MultihopSpatial-Train (6,791 samples) enables RL post-training that enhances intrinsic spatial reasoning across 5 benchmarks and translates to improved downstream manipulation performance in VLA tasks.

Research Motivation

Why multi-hop spatial reasoning matters for embodied AI

Multi-hop spatial reasoning for embodied agents

Figure: A multi-hop spatial reasoning example for an embodied agent in a real-world scenario.

The development of embodied agents — particularly Vision-Language-Action (VLA) models — fundamentally relies on VLMs for spatial reasoning to interact with the physical world. In complex real-world environments, an instruction like "Could you move the round cup on my right — the one furthest away?" requires an agent to adopt an ego-centric perspective, isolate the target position, filter by attributes, and compare spatial relations.

This internal reasoning mirrors a diagnostic multi-hop multiple-choice question coupled with precise bounding box prediction. An agent can successfully navigate to and manipulate an object only if it accurately answers the query and visually grounds the target. However, existing spatial reasoning benchmarks focus almost exclusively on single-hop queries (e.g., "Is X to the right of Y?"), failing to capture the compositional, multi-step reasoning essential for real-world embodied scenarios.

MultihopSpatial

A comprehensive benchmark for multi-hop compositional spatial reasoning with visual grounding

4,500
Benchmark QA Pairs
6,791
Training Samples
3,563
Curated Images
1–3
Reasoning Hops
90%
Inter-Rater Agreement (Krippendorff's α)

MultihopSpatial is a 4,500-sample benchmark of manually annotated VQA pairs focusing on multi-hop, compositional spatial queries common in real-world scenarios. Each example provides ground-truth bounding boxes to jointly evaluate reasoning and spatial localization, which is crucial for embodied/VLA settings (e.g., grasping and action). This grounded design improves interpretability and eliminates the evaluation blind spot of random guessing.

We curate 3,563 spatially complex images from COCO and PACO-Ego4D, ensuring diverse coverage of everyday indoor/outdoor scenes and ego/exo perspectives. Upon these images, we construct 4,500 MCQs, perfectly balanced across 1- to 3-hop reasoning levels (1,500 per hop) and viewpoints (750 ego-centric and 750 exo-centric per hop).

To eliminate the reliability concerns and hallucinations inherent in AI-generated data, all QA pairs and bounding boxes were strictly annotated by ten trained human experts. Each sample underwent a rigorous multi-stage verification with three rounds of independent cross-checking, achieving high inter-annotator agreement (Krippendorff's α = 0.90). Furthermore, we introduce MultihopSpatial-Train, an auxiliary large-scale corpus of 6,791 grounded VQA samples designed to support post-training of VLMs for spatial intelligence.

Compositional structure and category definitions

Figure: Compositional structure and category definitions in MultihopSpatial.

Spatial Reasoning Categories

We define three fundamental spatial reasoning categories: Attribute (att), Position (pos), and Relation (rel). Our benchmark systematically composes these categories into multi-hop questions (e.g., 2- and 3-hop) that demand sequential intermediate inferences — identifying candidates, comparing relations, and resolving ambiguities. By increasing the hop count, we represent longer reasoning chains and escalating difficulty, enabling fine-grained diagnosis of model performance across different levels of reasoning complexity.

1-Hop: Single-step questions targeting one spatial category (pos or rel). We exclude att as a standalone category since attributes are primarily perceptual unless composed with spatial metrics. While 1-hop spatial reasoning has been explored in prior works, we deliberately include it as a controlled baseline for depth-wise comparisons against multi-hop compositions.

2-Hop: Questions combining two categories (att+pos, att+rel, or pos+rel). These queries follow a two-stage structure: (i) restricting the candidate set using one category, and (ii) identifying the target by applying the other. Both constraints must be jointly satisfied, regardless of the specific inference order.

3-Hop: Questions incorporating all three categories (att+pos+rel) in a single query. An att cue narrows candidates, after which the model reasons over pos and rel to identify the target (e.g., selecting the rightmost object that is farthest/closest). This structure mirrors how humans refer to objects in cluttered scenes, testing the disambiguation needed for embodied task execution.

Example MultihopSpatial questions

Figure: Example MultihopSpatial questions for 2-hop and 3-hop reasoning under ego-centric (top) and exo-centric (bottom) views.

Benchmark Results

Evaluating 37 VLMs across 5 categories on MultihopSpatial

Evaluation Metrics

MCQ Accuracy

Percentage of correct multiple-choice predictions. Standard but does not verify spatial localization.

Acc@50IoU

Our primary grounded metric. Correct only if the answer matches and predicted bounding box has IoU ≥ 0.5 with ground truth.

Avg. IoU

Computed over MCQ-correct samples only, isolating grounding capability from reasoning errors.
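The three metrics above can be sketched in a few lines of Python. This is an illustrative implementation, not the benchmark's evaluation code; the `[x1, y1, x2, y2]` box format and the sample field names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-Union for axis-aligned boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evaluate(samples, iou_thresh=0.5):
    """samples: dicts with 'pred_choice', 'gt_choice', 'pred_box', 'gt_box' (assumed schema)."""
    mcq_hits, grounded_hits, ious_on_correct = 0, 0, []
    for s in samples:
        if s["pred_choice"] == s["gt_choice"]:
            mcq_hits += 1
            box_iou = iou(s["pred_box"], s["gt_box"])
            ious_on_correct.append(box_iou)       # Avg. IoU is over MCQ-correct samples only
            if box_iou >= iou_thresh:
                grounded_hits += 1                # Acc@50IoU: correct answer AND IoU >= 0.5
    n = len(samples)
    return {
        "mcq_acc": mcq_hits / n,
        "acc_at_50_iou": grounded_hits / n,
        "avg_iou": sum(ious_on_correct) / len(ious_on_correct) if ious_on_correct else 0.0,
    }
```

Note how a model that answers correctly but places its box elsewhere inflates MCQ accuracy while leaving Acc@50IoU unchanged — exactly the gap the grounded metric is designed to expose.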

Main Results

Overall Performance: Gemini-3-Pro achieves the highest MCQ accuracy (64.7%) and Acc@50IoU (40.6%), while Qwen3-VL-32B-Thinking leads the open-weight category. Top performance in answer selection does not guarantee precise localization.

Metric-dependent Rankings: Claude-Opus-4.5 drops from 7th (MCQ) to 29th (Acc@50IoU), while the smaller Qwen3-VL-4B rises from 25th to 10th due to robust grounding — proving MCQ alone is misleading.

Benchmark Difficulty: The best model peaks at just 40.6% Acc@50IoU. Under 3-hop ego-centric conditions, only 3 of 37 models exceed the 25% random MCQ baseline. GPT-5.2-Thinking achieves just 8.5% and Claude-Sonnet-4.5-Thinking just 1.9%.

Each cell reports MCQ Acc. / Acc@50IoU; the Overall column additionally reports Avg. IoU (computed over MCQ-correct samples).

| Model | Overall (Acc. / Acc@50 / Avg. IoU) | 3Hop-Ego | 3Hop-Exo | 2Hop-Ego | 2Hop-Exo | 1Hop-Ego | 1Hop-Exo |
|---|---|---|---|---|---|---|---|
| Proprietary Models — Instant | | | | | | | |
| Claude-Opus-4.5 | 45.1 / 3.2 / 13.3 | 25.7 / 2.0 | 48.5 / 3.6 | 33.7 / 2.0 | 58.1 / 4.8 | 43.2 / 3.5 | 61.1 / 3.1 |
| Claude-Sonnet-4.5 | 20.9 / 0.5 / 4.2 | 6.4 / 0.4 | 18.5 / 0.9 | 12.3 / 0.0 | 34.7 / 1.6 | 13.5 / 0.0 | 40.0 / 0.1 |
| GPT-5.2 | 19.3 / 2.0 / 11.8 | 5.3 / 0.7 | 18.0 / 4.9 | 10.8 / 0.1 | 30.4 / 5.9 | 8.1 / 0.0 | 43.3 / 0.4 |
| Open-weight Models — Instant | | | | | | | |
| GLM-4.6V | 43.2 / 35.2 / 69.5 | 15.9 / 12.3 | 46.7 / 39.3 | 22.7 / 18.4 | 61.6 / 53.2 | 32.4 / 24.3 | 80.1 / 63.7 |
| Molmo2-8B | 41.8 / 0.3 / 8.8 | 15.9 / 0.3 | 44.4 / 0.4 | 21.6 / 0.3 | 60.4 / 0.1 | 32.4 / 0.4 | 76.4 / 0.3 |
| Qwen3-VL-235B | 41.3 / 34.8 / 71.1 | 14.8 / 12.3 | 42.9 / 37.6 | 21.9 / 18.5 | 58.5 / 52.7 | 30.7 / 23.5 | 79.2 / 64.4 |
| Qwen3-VL-32B | 40.9 / 33.4 / 69.6 | 13.9 / 9.7 | 43.9 / 36.4 | 21.9 / 17.2 | 59.2 / 53.1 | 30.4 / 22.3 | 76.4 / 61.6 |
| InternVL-3.5-38B | 40.8 / 9.7 / 28.7 | 17.5 / 3.2 | 44.5 / 12.3 | 24.3 / 4.5 | 56.8 / 24.5 | 31.9 / 3.3 | 69.6 / 10.4 |
| InternVL-3.5-14B | 39.7 / 7.9 / 26.2 | 17.1 / 2.5 | 44.5 / 10.1 | 22.0 / 2.8 | 56.1 / 14.4 | 30.4 / 3.3 | 68.0 / 14.0 |
| InternVL-3.5-4B | 39.7 / 10.3 / 30.0 | 15.9 / 3.1 | 45.2 / 13.5 | 23.9 / 4.5 | 54.9 / 18.7 | 31.9 / 3.6 | 66.3 / 18.4 |
| InternVL-3.5-8B | 38.3 / 9.4 / 30.0 | 14.7 / 2.9 | 41.1 / 10.7 | 23.3 / 4.0 | 50.1 / 17.2 | 30.5 / 4.5 | 70.0 / 17.3 |
| Qwen3-VL-8B | 38.0 / 31.3 / 69.5 | 12.3 / 8.8 | 42.1 / 36.1 | 18.8 / 15.2 | 55.3 / 49.1 | 26.7 / 20.8 | 72.8 / 58.0 |
| Qwen3-VL-4B | 37.8 / 31.0 / 69.9 | 15.5 / 10.9 | 40.1 / 33.7 | 20.9 / 16.0 | 53.5 / 46.9 | 26.7 / 21.2 | 70.3 / 57.2 |
| Gemma-3-IT-27B | 33.1 / 0.4 / 5.4 | 18.1 / 0.1 | 30.8 / 0.5 | 22.0 / 0.1 | 45.7 / 1.1 | 28.1 / 0.3 | 53.9 / 0.3 |
| Gemma-3-IT-12B | 29.8 / 0.4 / 5.9 | 16.9 / 0.3 | 31.6 / 0.5 | 22.1 / 0.4 | 40.8 / 0.5 | 22.9 / 0.1 | 44.5 / 0.5 |
| Gemma-3-IT-4B | 28.4 / 0.2 / 3.0 | 22.1 / 0.0 | 27.2 / 0.3 | 22.5 / 0.1 | 36.8 / 0.3 | 27.1 / 0.3 | 34.7 / 0.1 |
| Proprietary Models — Reasoning | | | | | | | |
| Gemini-3-Pro* | 64.7 / 40.6 / 55.0 | 39.7 / 18.8 | 71.1 / 45.3 | 36.8 / 20.5 | 81.2 / 55.5 | 71.1 / 41.1 | 88.4 / 62.3 |
| GPT-5.2-Thinking | 57.9 / 11.5 / 29.0 | 36.1 / 8.5 | 55.7 / 10.4 | 49.7 / 7.6 | 63.6 / 18.0 | 65.6 / 12.5 | 76.4 / 11.7 |
| Gemini-3-Flash* | 57.2 / 40.2 / 61.2 | 6.9 / 4.3 | 61.2 / 46.9 | 42.3 / 25.3 | 80.0 / 63.7 | 66.0 / 38.9 | 86.8 / 62.1 |
| Claude-Opus-4.5-Thinking | 47.0 / 4.7 / 16.7 | 25.5 / 3.5 | 49.7 / 4.7 | 35.2 / 3.1 | 60.0 / 8.8 | 45.1 / 5.2 | 66.5 / 2.9 |
| Claude-Sonnet-4.5-Thinking | 32.2 / 4.3 / 19.2 | 14.7 / 1.9 | 29.9 / 3.6 | 22.1 / 2.3 | 45.7 / 8.1 | 31.3 / 3.9 | 49.3 / 6.1 |
| Open-weight Models — Reasoning | | | | | | | |
| Qwen3-VL-32B-Thinking | 46.8 / 37.4 / 67.2 | 19.2 / 12.9 | 57.5 / 47.1 | 24.3 / 18.1 | 70.1 / 60.0 | 30.4 / 23.1 | 79.6 / 63.2 |
| Qwen3-VL-235B-Thinking | 45.1 / 36.3 / 67.8 | 17.6 / 12.7 | 51.2 / 42.3 | 24.8 / 19.3 | 67.6 / 58.7 | 31.2 / 22.8 | 78.1 / 61.9 |
| Qwen3-VL-4B-Thinking | 42.6 / 28.7 / 58.3 | 20.4 / 9.2 | 48.1 / 34.8 | 22.3 / 12.0 | 63.9 / 50.7 | 30.8 / 16.3 | 70.4 / 49.2 |
| InternVL-3.5-38B-Thinking | 42.1 / 27.4 / 57.0 | 19.5 / 10.9 | 43.3 / 32.5 | 24.7 / 15.3 | 56.8 / 39.2 | 34.8 / 20.7 | 73.6 / 45.9 |
| GLM-4.6V-Thinking | 42.0 / 34.7 / 70.1 | 14.1 / 10.7 | 46.3 / 40.0 | 19.5 / 15.3 | 63.1 / 56.1 | 31.2 / 23.3 | 77.6 / 62.7 |
| Qwen3-VL-8B-Thinking | 41.7 / 29.5 / 60.1 | 18.5 / 9.2 | 47.9 / 36.4 | 21.3 / 11.9 | 63.3 / 51.6 | 28.5 / 16.5 | 70.5 / 51.2 |
| InternVL-3.5-8B-Thinking | 40.6 / 5.2 / 22.1 | 20.5 / 1.3 | 39.9 / 6.7 | 24.1 / 1.9 | 57.3 / 10.4 | 30.9 / 3.5 | 70.5 / 7.6 |
| InternVL-3.5-4B-Thinking | 40.6 / 4.7 / 21.4 | 18.3 / 1.2 | 41.2 / 4.8 | 24.4 / 1.7 | 58.5 / 9.6 | 32.8 / 1.5 | 68.1 / 9.3 |
| InternVL-3.5-14B-Thinking | 38.2 / 11.1 / 34.7 | 14.1 / 4.7 | 42.8 / 13.6 | 21.1 / 5.6 | 55.3 / 20.5 | 27.6 / 6.4 | 68.0 / 15.9 |
| Specialized Spatial Reasoning Models | | | | | | | |
| SenseNova-InternVL3-8B | 42.3 / 17.3 / 38.8 | 20.4 / 9.1 | 45.2 / 19.7 | 25.2 / 9.2 | 55.5 / 27.2 | 34.5 / 11.2 | 73.2 / 27.2 |
| Cosmos-Reason2-8B | 37.8 / 27.9 / 61.4 | 15.2 / 10.5 | 40.7 / 31.9 | 19.5 / 13.5 | 54.5 / 43.5 | 26.5 / 17.1 | 70.1 / 51.1 |
| VST-7B-RL | 36.0 / 0.0 / 1.5 | 16.7 / 0.0 | 34.1 / 0.1 | 23.9 / 0.0 | 48.1 / 0.0 | 24.5 / 0.0 | 68.8 / 0.0 |
| SpaceQwen3-VL-2B | 33.6 / 10.1 / 31.5 | 18.5 / 4.0 | 32.5 / 9.9 | 22.8 / 4.0 | 47.2 / 22.9 | 26.1 / 4.4 | 54.3 / 15.2 |
| SpaceOm | 32.3 / 0.3 / 2.6 | 15.3 / 0.5 | 37.9 / 0.1 | 19.6 / 0.1 | 47.9 / 0.4 | 20.5 / 0.3 | 52.8 / 0.4 |
| SpatialReasoner | 31.7 / 8.7 / 29.9 | 18.0 / 6.8 | 34.0 / 10.3 | 19.6 / 4.3 | 46.0 / 13.7 | 21.3 / 5.7 | 51.5 / 11.2 |
| SpaceThinker-3B | 31.1 / 4.0 / 16.6 | 15.9 / 2.9 | 36.3 / 3.7 | 19.2 / 2.9 | 44.5 / 6.5 | 20.8 / 3.3 | 50.0 / 4.3 |
Table 1: Benchmark results across different hop counts and Ego/Exo perspectives. Green cells indicate the best performance within each model group. *Gemini-3-Pro & -Flash operate with thinking mode enabled by default and do not provide an option to fully disable it. We therefore classify them as reasoning models.

In-depth Analysis

Eight key insights from evaluating 37 VLMs on MultihopSpatial

1. Performance Degradation across Reasoning Hops

Evaluating the impact of reasoning complexity across all 37 models reveals a consistent performance degradation as the number of hops increases, confirming that compositional spatial reasoning remains a fundamental challenge for current VLMs. Both MCQ accuracy and Acc@50IoU exhibit steep declines from 1-hop to 3-hop.

Crucially, this degradation is exacerbated under ego-centric evaluation, which requires additional perspective transformation. The widening performance gap between ego-centric and exo-centric views at higher hops suggests that perspective-taking compounds with multi-step reasoning, creating a multiplicative rather than additive difficulty.

Average Performance by Hop Count

Average performance by hop count. Performance degrades consistently as reasoning hops increase.

2. Ego vs. Exo: Perspective-Taking as a Compounding Bottleneck

Beyond a simple performance drop, ego-centric evaluation fundamentally alters capability visibility. Under exo-centric conditions, our Acc@50IoU metric clearly distinguishes grounding-capable models (e.g., Qwen3-VL) from those lacking native localization (e.g., InternVL-3.5) despite their similar MCQ accuracies.

Conversely, ego-centric evaluation acts as an evaluation blind spot: it suppresses even strong grounding models to a 20–25% floor, completely masking these capability gaps. This compression highlights the necessity of evaluating perspectives jointly and reinforces Acc@50IoU as an essential metric to expose disparities invisible to conventional MCQ evaluation.

Scatter plots across perspectives

Model performance scatter plots. Ego-centric evaluation masks grounding differences visible in exo-centric conditions.

3. Instant vs. Reasoning: Diminishing Returns under Multi-Hop Pressure

While reasoning models consistently outperform instant models, this advantage diminishes under multi-hop pressure. The initial MCQ gain of up to +8 pp at 1-hop narrows considerably by 3-hop.

Critically, even with extended thinking, reasoning models plummet to sub-20% MCQ accuracy and sub-10% Acc@50IoU on 3-hop ego-centric tasks. This convergence toward the performance floor reveals that inference-time reasoning (e.g., chain-of-thought) yields diminishing returns as compositional steps accumulate. MultihopSpatial thus exposes a capability ceiling that test-time compute alone cannot resolve.

Reasoning vs. Instant Models

Reasoning vs. Instant models. The reasoning advantage shrinks with increasing hop count.

4. Open-weight vs. Proprietary Models

While proprietary models hold a slight MCQ advantage in ego-centric tasks—indicating stronger perspective-taking—open-weight models consistently dominate in Acc@50IoU. This reversal stems from a distinct asymmetry in visual grounding capabilities.

Proprietary models exhibit high variance: while the Gemini-3 series demonstrates robust localization, others (e.g., GPT-5.2, Claude) lack native grounding, significantly dragging down the group average. Conversely, open-weight families (e.g., Qwen3-VL, GLM-4.6V) maintain consistently high grounding performance, forming a dense cluster at 60%+ in exo-centric Acc@50IoU.

Open-weight vs. Proprietary models

Open-weight vs. Proprietary models across hop count and viewpoint.

5. Generalist vs. Specialist Models

Comparing specialized spatial reasoning models (SRMs) against comparable general-purpose VLMs (≤10B) reveals a counterintuitive trend: SRMs consistently underperform across all metrics and hop counts.

This shortfall primarily stems from the robust native grounding inherent in general models (e.g., Qwen3-VL), which most SRMs lack. These findings indicate that current SRMs, typically fine-tuned on single-step QA, fail to generalize to multi-hop scenarios, suggesting that general-purpose VLMs provide a stronger foundation for compositional spatial understanding.

Spatial Reasoning Models vs. General Models

Spatial Reasoning Models vs. General Models (≤10B).

6. Grounding Gap: Grounded vs. Ungrounded Accuracy

A striking disconnect exists between answer selection and spatial localization: most models fall far below the y=x line, indicating that high MCQ accuracy does not entail accurate grounding. Across all 37 models, the average ungrounded ratio reaches 59%, meaning more than half of correctly answered questions lack proper spatial localization.

This ratio varies dramatically by category: proprietary instant models exhibit a 93% ungrounded ratio, while open-weight reasoning models achieve the lowest at 43%. At the extreme, models such as Gemma-3-IT, Molmo2, and Claude-Sonnet-4.5 exceed 98% ungrounded, effectively answering through shortcuts without any spatial understanding.
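As a rough sketch — assuming the ungrounded ratio is defined as the share of MCQ-correct answers whose predicted box falls below the IoU threshold — it follows directly from the two reported metrics:

```python
def ungrounded_ratio(mcq_acc, acc_at_50iou):
    """Fraction of correctly answered questions that are NOT spatially grounded.

    Assumed identity: Acc@50IoU counts answers that are both correct and grounded,
    so (mcq_acc - acc_at_50iou) / mcq_acc is the share of correct answers whose
    box prediction misses the IoU >= 0.5 threshold.
    """
    if mcq_acc == 0:
        return 0.0
    return (mcq_acc - acc_at_50iou) / mcq_acc
```

For example, plugging in Molmo2-8B's overall scores (41.8% MCQ accuracy, 0.3% Acc@50IoU) gives an ungrounded ratio above 98%: nearly all of its correct answers come without a usable box.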

Grounded vs Ungrounded accuracy per model

Grounded (Acc@50IoU) vs. Ungrounded accuracy per model. Hatched regions indicate ungrounded responses.

Grounding Gap

Grounding Gap: MCQ Acc. vs. Acc@50IoU.

Ungrounded Ratio by Category

Ungrounded Ratio by Model Category.

7. The Limits of LLM Scaling in Spatial Reasoning

Analyzing model scale across three open-weight families reveals that scaling the language backbone alone yields limited gains. While MCQ accuracy shows modest, saturating improvements, Acc@50IoU largely plateaus (e.g., Qwen3-VL) or remains near zero (e.g., Gemma-3-IT).

A notable exception is InternVL-3.5 at 38B, which exhibits a sharp grounding jump from 5.2% to 27.4%. Crucially, this leap coincides with a significant upgrade in its vision encoder (from 300M to 6B), whereas models utilizing fixed, small vision encoders (e.g., Qwen3-VL’s 400M) saturate early. This demonstrates that multi-hop spatial reasoning depends critically on scaling the capacity of visual spatial representations, not solely the language component.

Effect of model scale

Effect of model scale on MCQ Accuracy (top) and Acc@50IoU (bottom) across overall and per-hop settings.

8. Error Analysis on Tag Compositions

Analyzing error rates across spatial tag combinations reveals that while reasoning models consistently outperform instant models, multi-tag compositions remain a critical bottleneck. Specifically, the Position-Relation (P-R) combination incurs significantly higher errors than single-tag settings, highlighting the severe difficulty of jointly handling positional localization and relational comparison.

Even specialized spatial reasoning models, which achieve the lowest overall errors, exhibit non-trivial failures on these complex compositions. This underscores that true compositional spatial reasoning remains an unresolved challenge across the current VLM ecosystem.

Error rates by tag combination

Average error rates (%) by tag combination across model categories. A, R, and P denote Attribute, Relation, and Position.

MultihopSpatial-Train: Utility as a Training Corpus

Reinforcement learning post-training enhances both spatial reasoning and downstream VLA performance

Beyond serving as a static evaluation benchmark, we investigate MultihopSpatial as a training corpus. We post-train Qwen3-VL-4B-Instruct via Group Relative Policy Optimization (GRPO) using a composite reward function that jointly evaluates format adherence, answer correctness, and spatial localization accuracy:

R = R_format + α · R_mcq + β · R_bbox

Format Reward: Binary reward (1 if output follows expected parsing format, 0 otherwise).

MCQ Reward: Discrete correctness signal (1 if predicted choice matches ground truth).

BBox Reward: Continuous GIoU-based localization score normalized to [0, 1], providing dense geometric feedback.
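A minimal sketch of this composite reward, assuming α = β = 1, a shift-and-scale mapping of GIoU (raw range [-1, 1]) into [0, 1], and a hypothetical parsed-output interface (`output["choice"]`, `output["box"]`); the paper's actual normalization and weights may differ:

```python
def giou(box_p, box_g):
    """Generalized IoU for [x1, y1, x2, y2] boxes; returns a value in [-1, 1]."""
    xi1, yi1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    xi2, yi2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, xi2 - xi1) * max(0.0, yi2 - yi1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    union = area_p + area_g - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest box C enclosing both; GIoU penalizes the empty area of C.
    xc1, yc1 = min(box_p[0], box_g[0]), min(box_p[1], box_g[1])
    xc2, yc2 = max(box_p[2], box_g[2]), max(box_p[3], box_g[3])
    area_c = (xc2 - xc1) * (yc2 - yc1)
    return iou - (area_c - union) / area_c if area_c > 0 else iou

def composite_reward(output, gt_choice, gt_box, alpha=1.0, beta=1.0):
    """R = R_format + alpha * R_mcq + beta * R_bbox (illustrative sketch)."""
    if output is None:           # unparseable rollout: format reward 0, no further credit
        return 0.0
    r_format = 1.0               # binary: output followed the expected format
    r_mcq = 1.0 if output["choice"] == gt_choice else 0.0
    r_bbox = (giou(output["box"], gt_box) + 1.0) / 2.0   # [-1, 1] -> [0, 1]
    return r_format + alpha * r_mcq + beta * r_bbox
```

The dense GIoU term is what distinguishes this from a pure MCQ reward: even a wrong box earns a graded signal proportional to how close it is, giving GRPO a gradient toward localization rather than answer-only shortcuts.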

Spatial Reasoning Benchmarks

The GRPO-trained model achieves substantial improvements on both the in-domain MultihopSpatial benchmark and five out-of-domain spatial reasoning benchmarks, demonstrating that spatial reasoning capabilities generalize beyond the training distribution.

BLINK, 3DSRBench, OmniSpatial, VSI-Bench, and SpatialMQA are out-of-domain.

| Model | MultihopSpatial (Acc. / Acc@50IoU / Avg. IoU) | BLINK | 3DSRBench | OmniSpatial | VSI-Bench | SpatialMQA |
|---|---|---|---|---|---|---|
| Qwen3-VL-4B-Instruct | 37.8 / 31.0 / 69.9 | 82.5 | 56.1 | 42.7 | 62.8 | 39.6 |
| w/ MultihopSpatial | 62.9 / 53.8 / 72.6 | 85.3 | 56.3 | 43.9 | 63.2 | 41.1 |

Table 2: Impact of MultihopSpatial-Train on spatial reasoning across diverse benchmarks.

VLM as Backbone for Vision-Language-Action

When deployed as a VLM backbone using VLM4VLA, the MultihopSpatial-trained model consistently outperforms the baseline. On CALVIN, the average task completion score rises from 3.75 to 3.98, with the performance gap widening as the task chain lengthens (from +0.6 pp on Task-1 to +7.0 pp on Task-5). The model also achieves a +4.2 pp improvement on Libero (35.8% → 40.0%).

| VLM backbone | Task-1 | Task-2 | Task-3 | Task-4 | Task-5 | CALVIN Avg. | Libero |
|---|---|---|---|---|---|---|---|
| Qwen3-VL-4B-Instruct | 92.4 | 81.8 | 74.1 | 66.8 | 59.9 | 3.75 | 35.8 |
| w/ MultihopSpatial-Train | 93.0 | 85.4 | 79.3 | 73.2 | 66.9 | 3.98 | 40.0 |

Table 3: Vision-Language-Action evaluation on CALVIN ABC-D and Libero. Each Task-k column and Libero report success rate (%). CALVIN Avg. denotes the average number of successfully completed tasks per sequence.

Qualitative Results

Illustrative failure cases revealing cascading errors in multi-hop spatial reasoning

Qualitative failure analysis

Figure: Qualitative failure analysis on a 3-hop ego-centric question. All three models recognized the position condition "in front of," but failed to incorporate it into their final predictions.

The query asks to identify the farthest square or rectangular object in front of the sitting person. However, all three models neglect the position condition, instead selecting the farthest rectangular object based solely on attribute and relation. Examining each model's rationale reveals that they explicitly acknowledge the "in front of" constraint during reasoning, yet consistently disregard it in the final answer — demonstrating a failure to maintain intermediate conditions throughout the chain. An error at any single hop inevitably propagates to an incorrect final prediction.

Additional Qualitative Examples

Browse through representative failure cases across 2-hop and 3-hop ego-centric and exo-centric questions:

Citation

@article{lee2026multihopspatial,
  title={MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model},
  author={Lee, Youngwan and Jang, Soojin and Cho, Yoorhim and Lee, Seunghwan and Lee, Yong-Ju and Hwang, Sung Ju},
  journal={arXiv preprint arXiv:2603.18892},
  year={2026},
  url={https://arxiv.org/abs/2603.18892}
}