
Open World Agents Documentation¶
A comprehensive framework for building AI agents that interact with desktop applications through vision, keyboard, and mouse control.
Open World Agents (OWA) is a monorepo containing the complete toolkit for multimodal desktop agent development. From high-performance data capture to model training and real-time evaluation, everything is designed for flexibility and performance.
🚀 Quick Start: Record → Train in 3 Steps¶
# 1. Record desktop interaction
$ ocap my-session.mcap
# 2. Process to training format
$ python scripts/01_raw_events_to_event_dataset.py --train-dir ./
# 3. Train your model
$ python train.py --dataset ./event-dataset
📖 Detailed Guide: Complete Quick Start Tutorial - Step-by-step walkthrough with examples and troubleshooting
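The three quick-start commands can also be chained from a script. What follows is a minimal sketch, assuming the commands are invoked exactly as shown above; the helper names `build_pipeline` and `run_pipeline` are hypothetical and are not part of OWA. It defaults to a dry run that only prints the commands.

```python
import subprocess


def build_pipeline(recording="my-session.mcap", train_dir="./", dataset="./event-dataset"):
    """Return the quick-start commands as argument lists (nothing is executed here)."""
    return [
        # 1. Record desktop interaction
        ["ocap", recording],
        # 2. Process to training format
        ["python", "scripts/01_raw_events_to_event_dataset.py", "--train-dir", train_dir],
        # 3. Train your model
        ["python", "train.py", "--dataset", dataset],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; when dry_run is False, also execute it and stop on the first failure."""
    for cmd in commands:
        print("$", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(build_pipeline(), dry_run=True)
```

Passing `dry_run=False` runs the steps in order; `check=True` makes `subprocess.run` raise `CalledProcessError` if a step exits with a non-zero status, so a failed recording or preprocessing step does not silently feed into training.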
Architecture Overview¶
OWA consists of the following core components:
- 🌍 Environment Framework - Universal interface for native desktop automation ("USB-C of desktop agents") with pre-built plugins for desktop control, high-performance screen capture (6x faster), and a zero-configuration plugin system
- 📊 Data Infrastructure - Complete desktop agent data pipeline from recording to training with OWAMcap format - a universal standard powered by mcap
- 🛠️ CLI Tools - Command-line utilities (owl) for recording, analyzing, and managing agent data
- 🤖 Examples - Complete implementations and training pipelines for multimodal agents
🌍 Environment Framework¶
Universal interface for native desktop automation with real-time event handling and zero-configuration plugin discovery.
Environment Navigation¶
Section | Description |
---|---|
Environment Overview | Core concepts and quick start guide |
Environment Guide | Complete system overview and usage examples |
Custom Plugins | Create your own environment extensions |
CLI Tools | Plugin management and exploration commands |
Built-in Plugins:
Plugin | Description | Key Features |
---|---|---|
Standard | Core utilities | Time functions, periodic tasks |
Desktop | Desktop automation | Mouse/keyboard control, window management |
GStreamer | High-performance capture | 6x faster screen recording |
📊 Data Infrastructure: Complete Desktop Agent Data Pipeline¶
Desktop AI needs high-quality, synchronized multimodal data: screen captures, mouse/keyboard events, and window context. OWA provides the complete pipeline from recording to training.
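As a rough illustration of what synchronized means here, the sketch below pairs each input event with the most recent screen frame captured at or before it, using nanosecond timestamps. The `Event` class, the `kind` labels, and the function names are hypothetical and do not reflect the OWAMcap schema.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Event:
    timestamp_ns: int
    kind: str  # hypothetical labels: "screen", "mouse", "keyboard"


def latest_frame_before(frame_times, t):
    """Index of the last frame captured at or before time t, or None if there is none."""
    i = bisect_right(frame_times, t) - 1
    return i if i >= 0 else None


def align(events):
    """Pair every non-screen event with the index of the frame visible when it occurred."""
    frame_times = sorted(e.timestamp_ns for e in events if e.kind == "screen")
    pairs = []
    for e in sorted(events, key=lambda e: e.timestamp_ns):
        if e.kind != "screen":
            pairs.append((e, latest_frame_before(frame_times, e.timestamp_ns)))
    return pairs
```

Binary search keeps each lookup logarithmic in the number of frames, which matters when a recording holds many frames and many input events.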
The OWA Data Ecosystem¶
🎯 Getting Started
New to OWA data? Start here:
- Why OWAMcap? - Understand the problem and solution
- Recording Data - Capture desktop interactions with ocap
- Exploring Data - View and analyze your recordings
📚 Technical Reference
Deep dive into the format and pipeline:
- OWAMcap Format Guide - Complete technical specification
- Data Pipeline - Transform recordings to training-ready datasets
🛠️ Tools & Ecosystem
- Data Viewer - Web-based visualization tool
- Comparison with LeRobot - Technical comparison with alternatives
- CLI Tools (owl) - Command-line interface for data analysis and management
🤗 Community Datasets¶
Browse Available Datasets: 🤗 datasets?other=OWA
- Growing Collection: Hundreds of community-contributed datasets
- Standardized Format: All use OWAMcap for seamless integration
- Interactive Preview: Hugging Face Spaces Visualizer
- Easy Sharing: Upload recordings directly with one command
🚀 Impact: OWA has democratized desktop agent data, growing from zero to hundreds of public datasets in the unified OWAMcap format.
🤖 Awesome Examples¶
Learn from complete implementations and training pipelines.
Example | Description | Status |
---|---|---|
Multimodal Game Agent | Vision-based game playing agent | 🚧 In Progress |
GUI Agent | General desktop application automation | 🚧 In Progress |
Interactive World Model | Predictive modeling of desktop environments | 🚧 In Progress |
Usage with LLMs | Integration with large language models | 🚧 In Progress |
Usage with Transformers | Vision transformer implementations | 🚧 In Progress |
Development Resources¶
Learn how to contribute, report issues, and get help.
Resource | Description |
---|---|
Help with OWA | Community support resources |
Installation Guide | Detailed installation instructions |
Contributing Guide | Development setup, bug reports, feature proposals |
FAQ for Developers | Common questions and troubleshooting |
License¶
This project is released under the MIT License. See the LICENSE file for details.