Watch and Learn: Learning to Use Computers from Online Videos

Chan Hee Song

The Ohio State University

Yiwen Song

Google Cloud AI Research

Palash Goyal

Google Cloud AI Research

Yu Su

The Ohio State University

Oriana Riva *

Google DeepMind

Hamid Palangi *

Google Cloud AI Research

Tomas Pfister *

Google Cloud AI Research

CVPR 2026

*Joint last authors, Contact: chanhee.luke@gmail.com, {yiwensong, oriva, hamidpalangi}@google.com

TL;DR: We present a framework that leverages inverse dynamics to convert Internet videos of human computer use into executable UI trajectories, significantly improving computer-using agent performance.

An overview of Watch and Learn.

Overview

Watch and Learn (W&L) is a scalable framework that transforms everyday Internet videos of people using software into executable user interface action trajectories for training computer-using agents. Instead of relying on costly manual annotation or synthetic data that can produce oversimplified behaviors, W&L formulates trajectory extraction as an inverse dynamics problem, predicting user actions directly from consecutive screen states. This approach simplifies learning and generalizes across diverse and evolving applications. Through a task-aware retrieval and labeling pipeline, the framework produces more than 53,000 high-quality trajectories that can be used both as supervised training data and as in-context examples. Experiments on OSWorld and WindowsAgentArena show consistent improvements for both general-purpose and specialized agents with supervised fine-tuning and in-context learning. Watch and Learn demonstrates that web-scale human demonstration videos provide a practical and scalable foundation for advancing real-world computer-using agents.

Approach

An overview of Watch and Learn.
Our framework converts web-scale human demonstration videos into executable trajectories for computer-using agents (CUAs). We first collect a large-scale state-transition dataset of screen observations and user actions, and train an inverse dynamics model (IDM) to recover actions from consecutive screenshots. This IDM is then applied to tutorial videos to extract step-by-step trajectories. A retrieval module selects task-relevant or general demonstrations, which are used in two ways: (i) as in-context exemplars that provide application-specific knowledge at inference time, and (ii) as supervised training data to improve open-source CUAs.
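At a high level, the annotation step can be sketched as a loop that feeds consecutive frame pairs through the IDM. The sketch below is illustrative only: the `Frame`, `Step`, `idm.predict`, and `annotate_video` names are hypothetical placeholders, not the paper's actual implementation.

```python
# Hypothetical sketch of IDM-based video annotation (all names are placeholders).
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    before: "Frame"   # screenshot before the action
    after: "Frame"    # screenshot after the action
    action: str       # predicted action, e.g. "click(8, 45)"


def annotate_video(frames: List["Frame"], idm) -> List[Step]:
    """Recover a step-by-step trajectory from a video's frames.

    The inverse dynamics model predicts the user action that explains
    each transition between consecutive screenshots.
    """
    trajectory = []
    for before, after in zip(frames, frames[1:]):
        action = idm.predict(before, after)  # action inferred from the state transition
        trajectory.append(Step(before, after, action))
    return trajectory
```

In this formulation the IDM only needs two adjacent screen states, which is what lets the same model be applied to arbitrary tutorial videos without any per-application instrumentation.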

IDM Annotation Example

An example of our IDM being used to annotate an online video.

Action 1: click (8, 45)

Evaluation Results

Inverse Dynamics Model

Action Type | Gemini 2.5 Flash | TongUI | Ours
click(x, y) | 68% | 72% | 95%
release | 71% | 67% | 90%
scroll(scroll_y) | 55% | 75% | 93%
type(text) | 77% | 71% | 86%
wait(500ms) | 92% | 88% | 97%
move(x, y) | 65% | 61% | 89%
Action-Type Accuracy | 81.5% | 84.3% | 95.8%
Action Accuracy | 70.5% | 72.3% | 91.7%
Action type and overall accuracy of inverse dynamics models.
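The action types evaluated above form a small, flat vocabulary, so a predicted action string can be represented with a minimal schema. The parser below is an illustrative sketch under that assumption; the `Action` fields and `parse_action` helper are our own naming, not the paper's API.

```python
# Illustrative schema for the action vocabulary shown in the table above.
# `Action` and `parse_action` are assumed names for this sketch.
import re
from dataclasses import dataclass


@dataclass
class Action:
    kind: str     # e.g. "click", "scroll", "type", "wait", "move", "release"
    args: tuple   # string arguments, e.g. ("8", "45") for click(8, 45)


def parse_action(text: str) -> Action:
    """Parse an action string like 'click(8, 45)' into its type and arguments."""
    m = re.fullmatch(r"(\w+)\s*\((.*)\)", text.strip())
    if not m:
        # Bare actions without arguments, e.g. "release".
        return Action(text.strip(), ())
    kind, raw = m.groups()
    args = tuple(a.strip() for a in raw.split(",")) if raw else ()
    return Action(kind, args)
```

Distinguishing action-type accuracy from full action accuracy in the table corresponds to checking only `kind` versus checking both `kind` and `args`.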

Computer-Using Benchmarks

OSWorld results.
Qualitative examples on OSWorld. Left: the video-derived trajectory that W&L generates for the task. Right: (i) the o3 agent makes a grounding error by selecting the wrong UI element; (ii) the Jedi (o3) agent makes a planning error by entering the wrong submenu without recovering; (iii) using the video-derived trajectory, the W&L agent completes the task successfully. Images are cropped for visibility; the action coordinates correspond to the original full-resolution screenshots. More trajectory examples are in the main paper.

WindowsAgentArena (SFT)

Model | Setting / Training Data | Success Rate (%)
UI-TARS-1.5-7B | Base (No SFT) | 18.1
 | SFT w/ TongUI | 12.9 (-5.2)
 | SFT w/ W&L (IDM-labeled) | 24.0 (+5.9)
OpenCUA-7B | Base (No SFT) | 13.5
UltraCUA-7B | Base (No SFT) | 21.7
WindowsAgentArena SFT results under the 15-step evaluation limit.

OSWorld-Verified (ICL)

Category | Base Model | Method | Success Rate (%)
General Models | Gemini 2.5 Flash | Base (w/o video) | 19.0
 | | ICL w/ W&L labeled videos | 22.0 (+3.0)
 | OpenAI o3 | Base (w/o video) | 21.8
 | | ICL w/ TongUI labeled videos | 21.1 (-0.7)
 | | ICL w/ W&L labeled videos | 24.3 (+2.5)
 | Claude 4 Sonnet | Base (w/o video) | 43.9
 | | ICL w/ TongUI labeled videos | 43.4 (-0.5)
 | | ICL w/ W&L labeled videos | 45.5 (+1.6)
Agentic Framework | Jedi | Base (w/o video) | 50.6
 | | ICL w/ W&L labeled videos | 52.8 (+2.2)
OSWorld-Verified in-context learning results under the 50-step evaluation limit.

OSWorld-Verified (SFT)

Category | Base Model | Method | Success Rate (%)
Open-Source Models | Qwen 2.5VL 7B | Base (No SFT) | 1.9
 | | SFT w/ TongUI labeled | 5.4 (+3.5)
 | | SFT w/ W&L (IDM labeled) | 13.0 (+11.1)
 | UI-TARS-1.5-7B | Base (No SFT) | 27.3
 | | SFT w/ TongUI labeled | 23.8 (-3.5)
 | | SFT w/ W&L (IDM labeled) | 31.1 (+3.8)
OSWorld-Verified supervised fine-tuning results for open-source CUAs under the 50-step evaluation limit.

BibTeX Citation

@inproceedings{song2026watchandlearn,
  author    = {Song, Chan Hee and Song, Yiwen and Goyal, Palash and Su, Yu and Riva, Oriana and Palangi, Hamid and Pfister, Tomas},
  title     = {{Watch and Learn: Learning to Use Computers from Online Videos}},
  booktitle = {{Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}},
  year      = {2026},
  note      = {To appear},
}