RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

Chan Hee Song *,+

The Ohio State University

Yu Su

The Ohio State University

CVPR 2025 Oral (0.74%)

*Work partly done during an NVIDIA internship. +Contact: song.1855@osu.edu

TL;DR: A large-scale 2D/3D dataset of real indoor/tabletop environments for spatial reasoning in robotics.

Overview

Spatial understanding is essential for robots to perceive, reason about, and interact with their environments. However, current vision-language models often rely on general-purpose image datasets that lack robust spatial scene understanding and reference frame comprehension (ego-, world-, or object-centric). To address this gap, we introduce RoboSpatial, a large-scale dataset of real indoor and tabletop environments captured via egocentric images and 3D scans. RoboSpatial provides 1M images, 5k 3D scans, and 3M annotated spatial relationships, enabling both 2D and 3D spatial reasoning. Models trained on RoboSpatial outperform baselines on tasks including spatial affordance prediction, spatial relationship prediction, and robot manipulation.

Approach

An overview of the RoboSpatial dataset.
We automatically generate spatial relationship annotations from existing datasets that provide 3D point clouds, egocentric images, and 3D bounding box annotations. We create question/answer pairs covering three classes of spatial relationships, three spatial reference frames, and both binary (yes/no) and numeric (e.g., 2D image points) answers. From 1M images and 5k 3D scans, we generate over 3M spatial question/answer pairs.
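To make the generation procedure concrete, here is a minimal, hypothetical sketch of how a single binary question/answer pair could be derived from two 3D bounding-box centers expressed in the egocentric (camera) frame. This is not the released pipeline: the relation test, the margin threshold, and the question template are all illustrative assumptions.

```python
# Illustrative sketch only: derive one yes/no spatial QA pair from two
# object centers given in the camera (egocentric) frame. Assumes an
# OpenCV-style frame: +x right, +y down, +z forward.

import numpy as np

def is_left_of(center_a: np.ndarray, center_b: np.ndarray, margin: float = 0.05) -> bool:
    """True if object A lies to the left of object B in the camera frame.

    'Left of' means a smaller x coordinate; a small margin avoids
    labeling near-ties as a definite relation.
    """
    return center_a[0] < center_b[0] - margin

def make_binary_qa(name_a: str, center_a: np.ndarray,
                   name_b: str, center_b: np.ndarray) -> dict:
    """Build one yes/no spatial-relationship QA pair in the egocentric frame."""
    question = f"From the camera's point of view, is the {name_a} to the left of the {name_b}?"
    answer = "yes" if is_left_of(center_a, center_b) else "no"
    return {"question": question, "answer": answer, "frame": "ego"}

# Example: a mug 0.2 m to the left and a laptop 0.3 m to the right of the optical axis.
qa = make_binary_qa("mug", np.array([-0.2, 0.0, 0.8]),
                    "laptop", np.array([0.3, 0.0, 0.9]))
print(qa)  # {'question': ..., 'answer': 'yes', 'frame': 'ego'}
```

The same pattern extends to the other relation classes, reference frames, and numeric (image-point) answers by swapping in the corresponding geometric test and answer format.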

Application Example

An illustration of a model trained on RoboSpatial being used to solve a manipulation task using spatial reasoning.

Evaluation Results

Spatial QA

Qualitative results of RoboSpatial-trained models.
In-domain (RoboSpatial-Val, top) and out-of-domain (RoboSpatial-Home and BLINK, middle and bottom) results for RoboSpatial-trained models. Two models are shown: SL (SpaceLLaVA) and RP (RoboPoint); the -FT suffix indicates fine-tuning on RoboSpatial. Correct answers are shown in green. All images except the bottom-right one in the out-of-domain rows are from RoboSpatial-Home.
| Model | RoboSpatial-Val | RoboSpatial-Home | BLINK-Spatial | SpatialBench-Position |
|---|---|---|---|---|
| Open-source (2D) | | | | |
| LLaVA-NeXT (8B) | 30.3 | 46.3 | 71.8 | 55.9 |
| + RoboSpatial | 60.5 | 59.6 | 79.0 | 70.6 |
| RoboPoint (13B) | 38.9 | 53.4 | 63.6 | 44.1 |
| + RoboSpatial | 70.6 | 63.4 | 70.6 | 64.7 |
| Open-source (3D) | | | | |
| Embodied Generalist (7B) | 42.8 | 29.8 | N/A | N/A |
| + RoboSpatial | 71.9 | 43.8 | N/A | N/A |
| Baselines | | | | |
| Molmo (7B) | 50.1 | 25.6 | 67.1 | 55.9 |
| GPT-4o | 50.8 | 47.0 | 76.2 | 70.6 |
Performance of RoboSpatial-trained models on four spatial reasoning benchmarks.

Robot Manipulation

Robot experiment results.
Robot experiments: the red dot shows the model output (if absent, the model failed to produce a valid point in the image); green dots mark cases where a model outputs multiple points. The cuRobo motion generator is used to grasp the item referenced by the predicted point (a geometric sketch of this step follows the table below). The spatial- prefix indicates a model trained with RoboSpatial.
| Model | Success Rate (%) |
|---|---|
| Open-source | |
| LLaVA-NeXT (8B) | 23.7 |
| + RoboSpatial | 52.6 |
| Baselines | |
| Molmo (7B) | 43.8 |
| GPT-4o | 46.9 |
Task success rate for robot manipulation.
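For readers curious how a predicted image point becomes a grasp, the sketch below shows the standard geometry of back-projecting a pixel, given an aligned depth image and camera intrinsics, into a 3D target in the camera frame. The hand-off to the cuRobo motion generator is left as a comment because its API is not described here; all function names and parameters are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch: pixel + depth + intrinsics -> 3D point in the camera frame.
# Assumes the depth image is in meters and aligned with the RGB image.

import numpy as np

def deproject_pixel(u: int, v: int, depth_m: float,
                    fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project pixel (u, v) with depth (meters) using the pinhole model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def point_to_grasp_target(point_uv, depth_image, intrinsics) -> np.ndarray:
    """Convert a model-predicted image point into a 3D target in the camera frame."""
    u, v = point_uv
    depth_m = float(depth_image[v, u])  # row = v, column = u
    fx, fy, cx, cy = intrinsics
    target_cam = deproject_pixel(u, v, depth_m, fx, fy, cx, cy)
    # In a real system this target would be transformed into the robot base
    # frame and passed to a motion generator (e.g., cuRobo) to plan the grasp.
    return target_cam

# Example with synthetic data: a 480x640 depth map at a constant 0.7 m.
depth = np.full((480, 640), 0.7, dtype=np.float32)
print(point_to_grasp_target((320, 240), depth, (600.0, 600.0, 320.0, 240.0)))
```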

BibTeX Citation

@inproceedings{song2025robospatial,
  author    = {Song, Chan Hee and Blukis, Valts and Tremblay, Jonathan and Tyree, Stephen and Su, Yu and Birchfield, Stan},
  title     = {{RoboSpatial}: Teaching Spatial Understanding to {2D} and {3D} Vision-Language Models for Robotics},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
  note      = {To appear},
}