Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

1Tsinghua University, 2Harvard University, 3Massachusetts Institute of Technology

SandboxVLM is a plug-and-play solution for enhancing the spatial intelligence of VLMs, inspired by the coarse nature of human perception.

Abstract

Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D inputs. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving, for instance, an 8.3% gain on SAT Real over baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.
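
As a rough illustration of how these four stages fit together, the sketch below wires them into a single pipeline. The stage interfaces are injected as callables; all names here are our own placeholders for the data flow and do not reflect the paper's released code.

from dataclasses import dataclass
from typing import Any, Callable, List

# Hedged skeleton of the four-stage pipeline named in the abstract.
# Each stage is supplied by the caller; the signatures below are
# illustrative assumptions, not the paper's actual interfaces.
@dataclass
class SandboxPipeline:
    gen_multiview: Callable[[Any, Any], List[Any]]          # image, abstract control -> novel views
    elevate_proxies: Callable[[List[Any], str], List[Any]]  # views, query -> coarse 3D proxies
    vote_and_cluster: Callable[[List[Any]], List[Any]]      # proxies -> abstract 3D boxes
    reason: Callable[[Any, str, List[Any]], str]            # image, query, boxes -> answer

    def __call__(self, image: Any, query: str, control: Any) -> str:
        views = self.gen_multiview(image, control)       # stage 1: multi-view priors
        proxies = self.elevate_proxies(views, query)     # stage 2: proxy elevation
        boxes = self.vote_and_cluster(proxies)           # stage 3: voting and clustering
        return self.reason(image, query, boxes)          # stage 4: 3D-aware reasoning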

A 3D Representation for Spatial Intelligence.


Visualization of representations of a scene. (a) Input image to the system; (b) scene graph generated by an expert model; (c) reconstructed point-cloud rendering; (d) text description of 3D bounding boxes; (e) rendered proxy points; (f) 3D Sandbox. The 3D Sandbox strikes a balance between informativeness and interpretability, providing vivid spatial cues while filtering out irrelevant details.

Experimental Results

Method Overview


Given an input image and a textual query, the system builds a compact, 3D-aware, query-conditioned context for a vision-language model (VLM). A video diffusion prior first expands the input into a short multi-view sequence along imagined trajectories guided by abstract control provided by the VLM. Inside the 3D Sandbox module, an off-the-shelf depth estimator predicts per-frame depth and camera parameters, while the VLM identifies task-relevant objects that guide a 2D segmenter to produce instance masks. The masked regions are lifted into coarse 3D proxies and merged across views through a Multi-View Voting and Clustering step to form abstract 3D bounding boxes. Finally, informative renderings of these 3D abstractions are composed with the query and fed back into the VLM for spatial and physical reasoning.
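
To make the middle of this pipeline more concrete, the sketch below illustrates two of the steps just described under standard pinhole-camera assumptions: back-projecting a masked depth map into a coarse 3D proxy, and collapsing the merged multi-view proxy points into an abstract axis-aligned box. The function names, the quantile trimming, and the camera conventions are our own illustrative choices rather than the paper's implementation, which additionally votes across views to discard inconsistent proxies.

import numpy as np

def lift_mask_to_proxy(depth: np.ndarray, mask: np.ndarray,
                       K: np.ndarray, T_cam2world: np.ndarray) -> np.ndarray:
    """Back-project masked pixels of one view into world-space points (N, 3).

    Assumes a pinhole intrinsic matrix K and a 4x4 camera-to-world pose.
    """
    v, u = np.nonzero(mask)                        # pixel coordinates of the instance mask
    z = depth[v, u]
    x = (u - K[0, 2]) / K[0, 0] * z                # pinhole back-projection
    y = (v - K[1, 2]) / K[1, 1] * z
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    return (pts_cam @ T_cam2world.T)[:, :3]        # homogeneous transform to world frame

def abstract_box(points: np.ndarray, trim: float = 0.05) -> tuple[np.ndarray, np.ndarray]:
    """Reduce merged proxy points for one object to a robust axis-aligned box.

    Trimming a small quantile on each side suppresses stray depth outliers.
    """
    lo = np.quantile(points, trim, axis=0)
    hi = np.quantile(points, 1.0 - trim, axis=0)
    return lo, hi                                  # (min corner, max corner)

In a full run, lift_mask_to_proxy would be applied to each task-relevant object in each generated view, the per-object point sets merged across views, and abstract_box applied to each merged set to obtain the boxes that are finally rendered for the VLM.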

BibTeX


      @article{liu2025abstract,
        title={Abstract 3D Perception for Spatial Intelligence in Vision-Language Models},
        author={Liu, Yifan and Zhan, Fangneng and Zhou, Kaichen and Du, Yilun and Liang, Paul Pu and Pfister, Hanspeter},
        journal={arXiv preprint arXiv:2511.10946},
        year={2025}
      }