SCP: Spatial Causal Prediction in Video

1National University of Singapore, 2Shenzhen University
3Sichuan University, 4National Tsing Hua University
CVPR Findings 2026
Intro figure
Case figure

Abstract

Spatial reasoning, the ability to understand spatial relations, causality, and dynamic evolution, is central to human intelligence and essential for real-world applications such as autonomous driving and robotics. Existing studies, however, primarily assess models on visible spatio-temporal understanding, overlooking their ability to infer unseen past or future spatial states. In this work, we introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We further construct SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding. We further analyze key factors influencing performance and propose perception-enhancement and reasoning-guided strategies toward advancing spatial causal intelligence.

Comparison with spatial reasoning benchmarks

Comparison with existing benchmarks. Modality denotes the input type. Dynamic/Static indicates whether the scene content itself undergoes temporal changes, rather than mere camera movement. View Type specifies whether the questions involve single or multiple viewpoints. Perspective denotes whether the benchmark includes explicitly designed ego (first-person) and exo (third-person) perspective settings. Causal Reasoning indicates whether the benchmark requires inferring outcomes or states driven by causal dependencies. Seen/Unseen reflects whether the queried information is directly observable within the given visual content.

| Benchmark | QA Pairs | Modality | Dynamic/Static | View Type | Perspective | Causal Reasoning | Seen/Unseen |
|---|---|---|---|---|---|---|---|
| 3DSRBench | 3,772 | Image | Static | | | | Seen |
| InternSpatial-Bench | 6,008 | Image | Static | | | | Seen |
| OmniSpatial | ~8,400 | Image | Static | | | | Seen |
| Spatial457 | 23,752 | Image | Static | | Undeclared | | Seen |
| All-Angles Bench | ~2,100 | Image | Static | | Undeclared | | Seen |
| EmbSpatial-Bench | 3,640 | Image | Static | | | | Seen |
| MMSI-Bench | 1,000 | Image | Static | | Undeclared | | Seen |
| MindCube | 21,154 | Image | Static | | Undeclared | | Unseen |
| VSI-Bench | 5,130 | Video | Static | | | | Seen |
| VLM4D | ~1,800 | Video | Dynamic | | | | Seen |
| STI-Bench | 2,064 | Video | Dynamic | | | | Seen |
| DSI-Bench | ~1,700 | Video | Dynamic | | Undeclared | | Seen |
| SCP-Bench (Ours) | 2,500 | Video | Dynamic | | | | Unseen |

Construction Pipeline

Construction pipeline figure

How Well Do Current Models Perform?

Evaluation on SCP-Bench. "Avg." indicates the overall average accuracy. For each category, the best-performing closed model and open-source model in average score are highlighted in deep blue, and best performance on each task is boxed.

| Model | Avg. | App. Ord. | Count. | Plan. | Obj. Rel. | Rel. Dist. | Rel. Size | Rel. Speed | Spat. State |
|---|---|---|---|---|---|---|---|---|---|
| Human Performance | 89.61 | 97.60 | 81.20 | 92.26 | 85.70 | 86.70 | 97.62 | 91.61 | 84.17 |
| Closed Models | | | | | | | | | |
| GPT-5 | 66.24 | 79.04 | 58.12 | 59.06 | 64.07 | 70.48 | 95.24 | 77.42 | 65.11 |
| Gemini 2.5 Pro | 55.84 | 69.28 | 54.87 | 52.76 | 46.20 | 63.47 | 88.10 | 67.10 | 62.41 |
| Gemini 2.5 Flash | 52.10 | 59.28 | 52.14 | 51.74 | 43.14 | 57.75 | 88.10 | 66.45 | 55.60 |
| Claude Sonnet 4.5 | 56.14 | 68.86 | 52.14 | 57.43 | 45.65 | 60.90 | 80.95 | 68.39 | 63.90 |
| Open-source Models | | | | | | | | | |
| Qwen3-VL-2B | 43.04 | 41.92 | 42.74 | 45.01 | 40.85 | 44.41 | 59.52 | 47.10 | 40.65 |
| Qwen3-VL-8B | 47.52 | 54.49 | 51.28 | 49.29 | 42.33 | 49.47 | 90.48 | 46.45 | 46.40 |
| Qwen3-VL-30B-A3B | 54.16 | 65.27 | 52.14 | 54.79 | 46.22 | 56.65 | 85.71 | 66.45 | 57.19 |
| Qwen3-VL-32B | 56.84 | 59.88 | 51.28 | 58.66 | 52.63 | 57.98 | 90.48 | 67.10 | 55.04 |
| Qwen3-VL-235B-A22B | 61.04 | 67.07 | 54.70 | 60.90 | 55.03 | 63.03 | 97.62 | 74.84 | 63.31 |
| Qwen3-Omni-30B-A3B | 53.60 | 63.47 | 55.56 | 53.56 | 47.03 | 53.72 | 88.10 | 65.81 | 55.40 |
| InternVL3.5-8B | 50.52 | 59.88 | 54.70 | 54.79 | 43.82 | 54.52 | 61.90 | 58.71 | 44.96 |
| InternVL3.5-38B | 53.56 | 62.28 | 53.85 | 56.01 | 46.34 | 57.98 | 90.48 | 65.81 | 48.20 |
| InternVL3.5-241B-A28B | 56.96 | 67.07 | 60.68 | 61.10 | 46.11 | 60.37 | 90.48 | 68.39 | 60.07 |
| MiniCPM-V-4.5 | 43.80 | 53.29 | 49.57 | 43.99 | 36.04 | 49.20 | 76.19 | 52.26 | 42.81 |
| DeepSeek-VL2 | 38.08 | 45.51 | 38.46 | 39.51 | 29.41 | 45.74 | 73.81 | 53.55 | 33.81 |
| NVILA-8B | 34.40 | 36.53 | 36.75 | 38.09 | 30.66 | 30.05 | 59.52 | 38.71 | 37.05 |
| NVILA-15B | 45.28 | 54.49 | 45.30 | 48.07 | 35.35 | 52.13 | 73.81 | 50.97 | 49.28 |
| LLaVA-OneVision-7B | 36.48 | 42.51 | 37.61 | 37.07 | 31.24 | 38.30 | 64.29 | 46.45 | 35.61 |
| LLaVA-OneVision-70B | 50.84 | 64.67 | 52.99 | 48.68 | 44.39 | 53.46 | 78.57 | 61.94 | 51.80 |
| LLaVA-OneVision-1.5-8B | 45.52 | 56.29 | 47.01 | 46.44 | 39.13 | 50.27 | 80.95 | 51.61 | 41.73 |
| LLaVA-NeXT-Video-7B | 36.60 | 43.11 | 25.64 | 35.44 | 29.52 | 48.40 | 54.76 | 54.84 | 32.73 |
| Spatial Models | | | | | | | | | |
| Spatial-MLLM | 39.76 | 45.51 | 28.21 | 33.81 | 38.33 | 49.73 | 66.67 | 50.97 | 32.37 |
| SpaceR | 41.36 | 52.10 | 34.19 | 40.53 | 34.90 | 45.21 | 59.52 | 54.19 | 44.60 |

Overall Evaluation Results. The table above summarizes the overall performance of MLLMs on SCP-Bench. Current systems remain far below human level, underscoring the substantial gap in spatial causal prediction. GPT-5 attains the highest accuracy (66.24%), followed by Qwen3-VL-235B (61.04%) and InternVL3.5-241B (56.96%). Notably, several open-source models match or surpass proprietary ones on specific tasks: for example, they outperform GPT-5 in Counting and Planning and achieve comparable results in Relative Size, Relative Speed, and Spatial State. At the level of question types, the difficulty landscape becomes clearer: Relative Size is consistently the easiest, whereas Object Relations, Planning, and Counting are the most challenging, as they require more abstract spatial causal reasoning and higher-order understanding of object interactions.
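Assuming the Avg. column is accuracy over all QA pairs (a micro average, since the categories have different numbers of questions) rather than the mean of the eight category scores, the scoring can be sketched as follows. The `score` helper and record format are illustrative, not the benchmark's actual harness:

```python
from collections import defaultdict

def score(records):
    """records: list of (category, is_correct) pairs.
    Returns per-category accuracy plus the overall (micro) average,
    i.e. accuracy over all QA pairs, not the mean of category scores."""
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for cat, ok in records:
        per_cat[cat][0] += int(ok)
        per_cat[cat][1] += 1
    acc = {c: 100.0 * k / n for c, (k, n) in per_cat.items()}
    overall = 100.0 * sum(int(ok) for _, ok in records) / len(records)
    return acc, overall

# Toy example with unbalanced categories: the micro average is pulled
# toward the larger category, unlike a plain mean of the two accuracies.
records = [("Count.", True)] * 6 + [("Count.", False)] * 2 + [("Rel. Size", True)] * 2
acc, overall = score(records)  # acc: 75.0 / 100.0; overall: 80.0
```

With balanced categories the micro and macro averages coincide; with unbalanced ones (as here, where some tasks have far fewer questions) they diverge.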

We further examine performance across perspectives, causal directions, and scenarios (Fig. 6). Models exhibit clear difficulty with multi-view prediction compared to single-view reasoning, indicating limited perspective correspondence. In causal directionality, models perform better when inferring past (backward) events than future (forward) ones, likely because reasoning from known outcomes is easier than anticipating unseen consequences. Finally, model performance remains relatively balanced across different scene categories, with slightly stronger results in driving-related and factory/machine environments.

Evaluation across perspectives, causal directions, and scenes

Temporal Extrapolation Horizon. We analyze how model performance varies with the temporal prediction range. Overall, accuracy remains relatively stable across horizons, averaging around 46.8%. This indicates that the dynamic frame sampling used by existing MLLMs mitigates sensitivity to differences in temporal length, and also that the current temporal segmentation range may be too narrow to induce significant variation.

Temporal extrapolation horizon analysis

Causal Consistency. To complement the quantitative results, we conduct a case study to examine whether models obey basic physical and temporal causal constraints. Even when models perceive local motion correctly, their explanations can still violate basic causal constraints, revealing a gap in causal consistency for MLLMs.

Causal consistency case study

Key Takeaways

  • SCP remains a strong challenge for current MLLMs.
  • Large open-source models are increasingly competitive with closed models.
  • Performance differences across temporal ranges are relatively small.

What Affects Spatial Causal Prediction?

Perception vs. Reasoning. In one condition, we provide the models with the unseen parts of the clips, referred to as the Gold Video, thereby removing the need for causal inference and reducing the task to pure visual understanding. In the other condition, we replace the visible parts of the clips with dense captions, thereby bypassing perception and forcing the model to rely solely on textual reasoning. Results show that perception is not the primary bottleneck and that the main limitation lies in spatial causal reasoning.

Perception vs Reasoning decomposition

"Base" is the standard evaluation performance; "Gold Video" evaluates perceptual understanding using unseen part; "Caption w/o Video" tests reasoning based on captions alone.

Single-Frame vs. Multi-Frame Reasoning. We compare model performance when given only the cutpoint frame versus the full visible part; all four models perform slightly better in the single-frame condition than in the multi-frame one. This counterintuitive result suggests that temporal cues contribute little under the base spatial causal reasoning setting, and that the observed gains stem from static spatial perception rather than genuine temporal understanding.

Visible range comparison

Only cutpoint uses the cutpoint frame as input; Full video provides the entire clip; Ground truth includes only the unseen clips adjacent to the cutpoint.
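These input conditions amount to simple slices of the frame sequence around the cutpoint. A minimal sketch, where the `build_inputs` helper is hypothetical and the slicing direction assumes a forward (future-prediction) question; for backward questions the unseen segment would precede the cutpoint:

```python
def build_inputs(frames, cut_idx, mode):
    """Slice a clip (a list of frames) into the input conditions used in
    the ablation. `cut_idx` is the index of the cutpoint frame. The frame
    type is whatever the evaluated model consumes."""
    if mode == "only_cutpoint":      # single cutpoint frame
        return [frames[cut_idx]]
    if mode == "full_video":         # the visible clip up to the cutpoint
        return frames[: cut_idx + 1]
    if mode == "ground_truth":       # unseen continuation adjacent to the cut
        return frames[cut_idx + 1 :]
    raise ValueError(mode)

clip = list(range(10))               # stand-in frames 0..9
assert build_inputs(clip, 6, "only_cutpoint") == [6]
assert build_inputs(clip, 6, "full_video") == [0, 1, 2, 3, 4, 5, 6]
assert build_inputs(clip, 6, "ground_truth") == [7, 8, 9]
```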

Visual Causal Perception. After temporally flipping the input videos, accuracy drops are generally small. This weak sensitivity to temporal inversion suggests that current models do not yet form stable representations of causal direction.

Sensitivity to temporal causality

"w/ Flip" reverses the input video temporally; "w/ CoT" applies step-by-step reasoning.

Is Visual Input Necessary? Yes. Removing video input leads to clear degradation, and dense captions only partially compensate. Visual evidence remains essential for spatial causal prediction, especially when fine-grained motion and interaction details determine the answer.

Performance with and without visual input

Key Takeaways

  • Both perception and reasoning limitations contribute to SCP errors.
  • Visual input remains essential for reliable spatial causal prediction.
  • Current models still lack stable temporal-causal logic in video reasoning.

How to Improve Spatial Causal Prediction?

Scaling Up Model Size. Larger models consistently provide the strongest gains on SCP-Bench. While small-scale increases can be noisy, substantial scale jumps deliver clear improvements, making scale-up the most reliable baseline strategy.

Model scale-up effect

Reasoning Strategies (CoT / Self-Think). Enabling think-mode does not yield consistent improvements; in fact, most models exhibit slight performance degradation, likely due to overextended reasoning chains that introduce noise and divert attention from essential spatial cues.

Thinking mode comparison

Perception Enhancement. To address the limited perceptual capability of MLLMs, we explore two mechanisms designed to enhance spatial perception: (1) generating dense captions of the input video clip to enrich scene perception, and (2) constructing spatial-interaction graphs via prompts that capture key objects, environmental elements, and their spatial and interaction relations. Both strategies yield only marginal improvements overall, with noticeable gains limited to specific tasks such as Appearance Order and Relative Speed: models fail to leverage spatial-interaction graphs for accurate spatial causal reasoning, and dense captions bring similarly limited benefit. This suggests that perception-level augmentations alone are insufficient to substantially boost spatial causal reasoning.

Perception enhancement strategies

PureText uses only the question; Caption w/ V combines dense video captions with the video and question; Caption w/o V uses only captions with the question; SpatialGraph introduces a spatial-interaction graph; Original is the baseline.
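One way to picture the spatial-interaction graph of strategy (2) is as typed edges over key objects and environment elements. The schema, entity names, and relation labels below are illustrative only, not the prompt format used in the benchmark:

```python
# A spatial-interaction graph as typed edges between scene entities.
# Node and edge contents here are made-up examples.
graph = {
    "nodes": ["forklift", "pallet", "loading dock"],
    "edges": [
        ("forklift", "approaches", "pallet"),    # interaction relation
        ("pallet", "left_of", "loading dock"),   # spatial relation
    ],
}

def relations_of(graph, entity):
    """All edges touching an entity, e.g. for serializing the graph
    into a textual prompt for the model."""
    return [e for e in graph["edges"] if entity in (e[0], e[2])]

assert relations_of(graph, "pallet") == [
    ("forklift", "approaches", "pallet"),
    ("pallet", "left_of", "loading dock"),
]
```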

Unseen Causal Scaffolds. We investigate which types of unseen spatial causal scaffolds can effectively enhance reasoning performance. Specifically, we evaluate three forms of auxiliary information: textual descriptions generated by GPT-5, future spatial images produced by FLUX.1-dev-12B, and future causal videos generated by Wan2.2-TI2V-5B. Incorporating textual future predictions consistently improves performance across all tasks, likely because MLLMs are inherently more adept at processing and reasoning over textual information. In contrast, image- and video-based scaffolds provide limited gains, likely due to input length constraints, modality noise, and the perception limitations of current MLLMs. Nevertheless, videos outperform images on dynamics-related tasks (e.g., Relative Size and Spatial State), benefiting from their richer temporal cues.

Future scaffold enhancement

Text provides textual descriptions of future spatial states; Image employs generated future spatial images; Video uses generated future causal videos.
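Assembling a scaffold-augmented query can be sketched as follows. The `with_scaffold` helper and its prompt wording are assumptions, and generating the scaffold itself (with GPT-5, FLUX.1-dev-12B, or Wan2.2-TI2V-5B) is outside the sketch:

```python
def with_scaffold(question, scaffold, kind):
    """Attach an unseen-state scaffold to a query. Textual scaffolds are
    inlined into the prompt; image/video scaffolds are returned as extra
    visual inputs to pass alongside the original clip."""
    if kind == "text":
        prompt = f"Predicted future spatial state: {scaffold}\n\n{question}"
        return prompt, []
    if kind in ("image", "video"):
        return question, [scaffold]  # handed to the model's visual channel
    raise ValueError(kind)

prompt, extra = with_scaffold(
    "Where will the cart end up?", "The cart rolls to the left wall.", "text"
)
assert prompt.startswith("Predicted future spatial state:")
assert extra == []
```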

Key Takeaways

  • Scaling up model size yields the most consistent gains on SCP.
  • Unseen spatial causal scaffolds can effectively improve performance.
  • Vanilla CoT and self-thinking provide only limited average gains.

Conclusion

We introduce Spatial Causal Prediction (SCP) and SCP-Bench, establishing a new paradigm for predictive spatial reasoning beyond visible scenes. Extensive evaluations indicate that current MLLMs remain far from human-level performance, perform better on past inference than on future prediction, and rely mainly on perceptual cues. In-depth controlled analyses show that reasoning, rather than perception, constitutes the major bottleneck. While explicit reasoning strategies and structured spatial representations bring only limited gains, scaling up model size and integrating causal scaffolds offer a promising path toward stronger SCP performance.