Chan Hee Song¹*, Valts Blukis², Jonathan Tremblay², Stephen Tyree², Yu Su¹, Stan Birchfield²

¹The Ohio State University  ²NVIDIA
CVPR 2025
*Work done during an internship at NVIDIA
TL;DR A large-scale dataset of real indoor and tabletop environments, pairing 2D images with 3D scans, for spatial reasoning in robotics.
Spatial understanding is essential for robots to perceive, reason about, and interact with their environments. However, current vision-language models often rely on general-purpose image datasets that lack robust spatial scene understanding and reference frame comprehension (ego-, object-, or world-centric). To address this gap, we introduce RoboSpatial, a large-scale dataset of real indoor and tabletop environments captured via egocentric images and 3D scans. RoboSpatial provides 1M images, 5K 3D scans, and 3M annotated spatial relationships, enabling both 2D and 3D spatial reasoning. Models trained on RoboSpatial outperform baselines on tasks including spatial affordance prediction, spatial relationship prediction, and robot manipulation.
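To make the annotation types concrete, here is a minimal sketch of what individual records might look like. The schema, field names, and values below are illustrative assumptions for exposition only, not the dataset's actual release format.

```python
# Hypothetical illustration of RoboSpatial-style annotation records.
# Field names and structure are assumptions, not the released schema;
# consult the dataset release for the actual format.

# A spatial-relationship question grounded in a specific reference frame
# (the abstract names ego-, object-, and world-centric frames).
relationship = {
    "image_id": "scene0000_frame_0042",        # egocentric RGB frame
    "scan_id": "scene0000",                    # associated 3D scan
    "task": "spatial_relationship",
    "reference_frame": "ego",                  # "ego" | "object" | "world"
    "question": "Is the mug to the left of the laptop?",
    "answer": "yes",
}

# A spatial-affordance question might instead ask where an object can go
# and answer with 2D image points (coordinates here are made up).
affordance = {
    "image_id": "scene0000_frame_0042",
    "task": "spatial_affordance",
    "question": "Where on the table can the mug be placed?",
    "answer_points": [[412, 305], [388, 297]], # (x, y) pixel coordinates
}
```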
Spatial reasoning accuracy (%):

| Model | RoboSpatial Test | RoboSpatial-Home | BLINK-Spatial |
|---|---|---|---|
| **Open-source (2D)** | | | |
| LLaVA-NeXT | 30.3 | 23.7 | 71.8 |
| + RoboSpatial | 60.5 | 42.0 | 79.0 |
| RoboPoint | 38.9 | 40.2 | 63.6 |
| + RoboSpatial | 70.6 | 50.7 | 70.6 |
| **Open-source (3D)** | | | |
| Embodied Generalist | 42.8 | 36.2 | N/A |
| + RoboSpatial | 71.9 | 54.7 | N/A |
| **Baselines** | | | |
| Molmo | 50.1 | 46.9 | 67.1 |
| GPT-4o | 50.8 | 48.6 | 76.2 |
Robot manipulation results:

| Model | Success Rate (%) |
|---|---|
| **Open-source** | |
| LLaVA-NeXT | 23.7 |
| + RoboSpatial | 52.6 |
| **Baselines** | |
| Molmo | 43.8 |
| GPT-4o | 46.9 |
@inproceedings{song2025robospatial,
author = {Song, Chan Hee and Blukis, Valts and Tremblay, Jonathan and Tyree, Stephen and Su, Yu and Birchfield, Stan},
title = {{RoboSpatial}: Teaching Spatial Understanding to {2D} and {3D} Vision-Language Models for Robotics},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2025},
note = {To appear},
}