RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

Chan Hee Song*, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, Stan Birchfield

The Ohio State University and NVIDIA

CVPR 2025

*Work done during an NVIDIA internship.

TL;DR A large-scale 2D/3D dataset of real indoor and tabletop environments for spatial reasoning in robotics.

Overview

Spatial understanding is essential for robots to perceive, reason about, and interact with their environments. However, current vision-language models are often trained on general-purpose image datasets that lack robust spatial scene understanding and comprehension of reference frames (ego-, object-, or world-centric). To address this gap, we introduce RoboSpatial, a large-scale dataset of real indoor and tabletop environments captured via egocentric images and 3D scans. RoboSpatial provides 1M images, 5K 3D scans, and 3M annotated spatial relationships, enabling both 2D and 3D spatial reasoning. Models trained on RoboSpatial outperform baselines on tasks including spatial affordance prediction, spatial relationship prediction, and robot manipulation.

Approach

An overview of the RoboSpatial dataset.
We automatically generate spatial relationship annotations from existing datasets that provide 3D point clouds, egocentric images, and 3D bounding box annotations. We create question/answer pairs covering three classes of spatial relationships, three spatial reference frames, and both binary (yes/no) and numeric (e.g., 2D image points) answers. From 1M images and 5K scans, we generate over 3M spatial question/answer pairs.
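To make the annotation format concrete, the sketch below derives one binary, egocentric "left of" question/answer pair from two annotated 3D box centers. This is an illustrative sketch, not the released generation pipeline: the SpatialQA record, the object names, and the camera-frame convention (+x right, +y down, +z forward) are assumptions made for the example.

    # Minimal sketch (not the released pipeline) of deriving a binary,
    # egocentric "left of" QA pair from 3D bounding box annotations.
    from dataclasses import dataclass

    import numpy as np


    @dataclass
    class SpatialQA:
        question: str
        answer: str    # "yes" / "no" for binary questions
        frame: str     # "ego", "object", or "world"
        relation: str  # e.g. "left_of"


    def left_of_egocentric(center_a: np.ndarray, center_b: np.ndarray) -> bool:
        """True if object A's 3D center is left of B's in the camera frame (+x right)."""
        return bool(center_a[0] < center_b[0])


    def make_binary_qa(name_a: str, center_a: np.ndarray,
                       name_b: str, center_b: np.ndarray) -> SpatialQA:
        question = f"From the camera's viewpoint, is the {name_a} to the left of the {name_b}?"
        answer = "yes" if left_of_egocentric(center_a, center_b) else "no"
        return SpatialQA(question, answer, frame="ego", relation="left_of")


    if __name__ == "__main__":
        mug = np.array([-0.30, 0.05, 1.20])     # camera-frame box center (meters)
        laptop = np.array([0.15, 0.02, 1.10])
        print(make_binary_qa("mug", mug, "laptop", laptop))

The same pattern extends to the other relationship classes and reference frames by swapping the predicate and the frame in which the box centers are expressed.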

Application Example

An illustration of a model trained on RoboSpatial being used to solve a manipulation task using spatial reasoning.

Evaluation Results

Spatial QA

Qualitative results of RoboSpatial-trained models.
Results of RoboSpatial-trained models on the RoboSpatial test set, RoboSpatial-Home, and BLINK. Two base models are shown: SL (SpaceLLaVA) and RP (RoboPoint); the S- prefix and FT indicate RoboSpatial-trained models. For yes/no questions, the green checkmark marks the correct answer; for spatial context questions, GT marks the correct answer. All in-domain questions are from the RoboSpatial test set; all out-of-domain images except the top right are from RoboSpatial-Home.
Model                 RoboSpatial Test   RoboSpatial-Home   BLINK-Spatial
Open-source (2D)
LLaVA-NeXT                  30.3               23.7              71.8
+ RoboSpatial               60.5               42.0              79.0
RoboPoint                   38.9               40.2              63.6
+ RoboSpatial               70.6               50.7              70.6
Open-source (3D)
Embodied Generalist         42.8               36.2               N/A
+ RoboSpatial               71.9               54.7               N/A
Baselines
Molmo                       50.1               46.9              67.1
GPT-4o                      50.8               48.6              76.2

Performance of RoboSpatial-trained models on three spatial reasoning benchmarks.
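Because the benchmark mixes binary (yes/no) and numeric (2D image point) answers, evaluation has to handle both output formats. The toy parser below illustrates one way to do that; the exact output strings produced by the released models may differ, so the prefix check and the regular expression are assumptions.

    # Illustrative only: parse the two answer formats described above
    # (binary yes/no answers and numeric 2D image points).
    import re
    from typing import List, Tuple, Union


    def parse_answer(text: str) -> Union[str, List[Tuple[float, float]]]:
        """Return 'yes'/'no' for binary answers, or a list of (x, y) image points."""
        lowered = text.strip().lower()
        if lowered.startswith("yes"):
            return "yes"
        if lowered.startswith("no"):
            return "no"
        points = re.findall(r"\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)", text)
        return [(float(x), float(y)) for x, y in points]


    print(parse_answer("Yes, the mug is left of the laptop."))   # -> 'yes'
    print(parse_answer("(412.0, 233.5), (198.2, 310.0)"))        # -> two image points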

Robot Manipulation

Robot experiment results.
Robotics experiments: the red dot shows the model's output point (if absent, the model failed to produce a valid point in the image); green dots indicate cases where a model outputs multiple points. The robot motion generator, cuRobo, is used to grasp the item referenced by the predicted point. The spatial- prefix indicates a model trained with RoboSpatial.
Model              Success Rate (%)
Open-source
LLaVA-NeXT              23.7
+ RoboSpatial           52.6
Baselines
Molmo                   43.8
GPT-4o                  46.9

Task success rate for robot manipulation.
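In the robot experiments, the model's predicted 2D image point drives grasping via the cuRobo motion generator. One plausible way to bridge the two, sketched below, is to back-project the pixel into 3D using a depth image and camera intrinsics and then transform the target into the robot base frame. This glue code is an assumption, not the paper's implementation: plan_grasp is a hypothetical stand-in for the actual cuRobo call, and the intrinsics, depth image, and extrinsics are made-up example values.

    # Sketch: turn a VLM-predicted pixel into a 3D grasp target.
    # Assumes a pinhole camera model and an aligned depth image in meters.
    import numpy as np


    def pixel_to_camera_point(u: int, v: int, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
        """Back-project the predicted pixel (u, v) to a 3D point in the camera frame."""
        z = float(depth[v, u])                  # depth at the predicted pixel (m)
        x = (u - K[0, 2]) * z / K[0, 0]
        y = (v - K[1, 2]) * z / K[1, 1]
        return np.array([x, y, z])


    def plan_grasp(target_cam: np.ndarray, cam_to_base: np.ndarray) -> np.ndarray:
        """Hypothetical stand-in for the motion-generation step (e.g. cuRobo):
        here it only transforms the grasp target into the robot base frame."""
        return (cam_to_base @ np.append(target_cam, 1.0))[:3]


    if __name__ == "__main__":
        K = np.array([[600.0,   0.0, 320.0],
                      [  0.0, 600.0, 240.0],
                      [  0.0,   0.0,   1.0]])   # example intrinsics
        depth = np.full((480, 640), 0.8)        # fake 0.8 m depth image
        cam_to_base = np.eye(4)                 # identity extrinsics for the demo
        target = pixel_to_camera_point(420, 250, depth, K)
        print("grasp target in base frame:", plan_grasp(target, cam_to_base))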

BibTeX Citation

@inproceedings{song2025robospatial,
  author    = {Song, Chan Hee and Blukis, Valts and Tremblay, Jonathan and Tyree, Stephen and Su, Yu and Birchfield, Stan},
  title     = {{RoboSpatial}: Teaching Spatial Understanding to {2D} and {3D} Vision-Language Models for Robotics},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
  note      = {To appear},
}