Open World Agents Documentation¶
Open World Agents (OWA) is a monorepo for building AI agents that interact with desktop applications. It provides data capture, environment control, and training utilities.
Quick Start¶
# 1. Record desktop interaction
$ ocap my-session.mcap
# 2. Process to training format
$ python scripts/01_raw_to_event.py --train-dir ./
# 3. Train your model
$ python train.py --dataset ./event-dataset
📖 Detailed Guide: Complete Quick Start Tutorial
Architecture Overview¶
OWA consists of the following core components:
- 🌍 Environment Framework: The "USB-C of desktop agents" - a universal interface for native desktop automation, with pre-built plugins for desktop control and high-performance screen capture, plus a zero-configuration plugin system
- 📊 Data Infrastructure: Complete desktop agent data pipeline from recording to training, built on the OWAMcap format - a universal standard powered by MCAP
- 🛠️ CLI Tools: Command-line utilities (owl) for recording, analyzing, and managing agent data
- 🤖 Examples: Complete implementations and training pipelines for multimodal agents
Project Structure¶
The repository is organized as a monorepo with multiple sub-repositories under the projects/ directory. Each sub-repository is a self-contained Python package installable via pip or uv and follows namespace packaging conventions.
open-world-agents/
├── projects/
│ ├── mcap-owa-support/ # OWAMcap format support
│ ├── owa-core/ # Core framework and registry system
│ ├── owa-msgs/ # Core message definitions with automatic discovery
│ ├── owa-cli/ # Command-line tools (ocap, owl)
│ ├── owa-env-desktop/ # Desktop environment plugin
│ ├── owa-env-example/ # Example environment implementations
│ ├── owa-env-gst/ # GStreamer-based screen capture
│ └── [your-plugin]/ # Contribute your own plugins!
├── docs/ # Documentation
└── README.md
Core Packages¶
The easiest way to get started is to install the owa meta-package, which includes all core components and environment plugins:
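(A minimal sketch - the meta-package name owa is taken from the sentence above; use either pip or the recommended uv.)

# Assumes the meta-package is published to PyPI as "owa"
$ pip install owa
# or, with uv
$ uv pip install owa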
All OWA packages use namespace packaging and are installed in the owa namespace (e.g., owa.core, owa.cli, owa.env.desktop). For more detail, see Packaging namespace packages. We recommend using uv as the package manager.
| Name | Description |
|---|---|
| owa.core | Framework foundation with registry system |
| owa.msgs | Core message definitions with automatic discovery |
| owa.cli | Command-line tools (owl) for data analysis |
| mcap-owa-support | OWAMcap format support and utilities |
| ocap 🎥 | Desktop recorder for multimodal data capture |
| owa.env.desktop | Mouse, keyboard, window event handling |
| owa.env.gst 🎥 | High-performance, hardware-accelerated screen capture |
| owa.env.example | Reference implementations for learning (not published to PyPI/Conda) |
🎥 Video Processing Packages: Packages marked with 🎥 require GStreamer dependencies. For full functionality, first run:
$ conda install open-world-agents::gstreamer-bundle

📦 Lockstep Versioning: All first-party OWA packages follow lockstep versioning, meaning they share the same version number to ensure compatibility and simplify dependency management.
🌍 Environment Framework¶
Universal interface for native desktop automation with real-time event handling and zero-configuration plugin discovery.
Environment Navigation¶
| Section | Description |
|---|---|
| Environment Overview | Core concepts and quick start guide |
| Environment Guide | Complete system overview and usage examples |
| Custom Plugins | Create your own environment extensions |
| CLI Tools | Plugin management and exploration commands |
Built-in Plugins:
| Plugin | Description | Key Features |
|---|---|---|
| Standard | Core utilities | Time functions, periodic tasks |
| Desktop | Desktop automation | Mouse/keyboard control, window management |
| GStreamer | Hardware-accelerated capture | Fast screen recording |
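The plugins above are reached through the framework's component registries. The snippet below is a minimal usage sketch, not a verified API: the import path and the component names ("std/time_ns", "desktop/mouse.click") are assumptions for illustration; see the Environment Guide for the authoritative interface.

```python
# Hypothetical usage sketch -- registry import path and component names are
# assumptions for illustration; consult the Environment Guide for the real API.
from owa.core import CALLABLES  # assumed top-level registry export

now_ns = CALLABLES["std/time_ns"]()           # Standard plugin: current time in nanoseconds
CALLABLES["desktop/mouse.click"]("left", 2)   # Desktop plugin: double-click the left button
```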
📊 Data Infrastructure¶
Desktop AI needs high-quality, synchronized multimodal data: screen captures, mouse/keyboard events, and window context. OWA provides the complete pipeline from recording to training.
🚀 Getting Started¶
New to OWA data? Start here:
- Why OWAMcap? - Understand the problem and solution
- Recording Data - Capture desktop interactions with ocap
- Exploring Data - View and analyze your recordings
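For a first look at a recording from the terminal, something like the following works (the exact owl subcommands are documented on the CLI Tools page; the one shown here is an assumption):

# Summarize topics and message counts in a recording (subcommand assumed)
$ owl mcap info my-session.mcap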
📚 Technical Reference¶
- OWAMcap Format Guide - Complete technical specification
- Data Pipeline - Transform recordings to training-ready datasets
🛠️ Tools & Ecosystem¶
- Data Viewer - Web-based visualization tool
- Data Conversions - Convert existing datasets (VPT, CS:GO) to OWAMcap
- CLI Tools (owl) - Command-line interface for data analysis and management
🤗 Community Datasets¶
Browse Datasets: 🤗 HuggingFace
- Standardized Format: All datasets use OWAMcap for seamless integration
- Interactive Preview: Hugging Face Spaces Visualizer
🤖 Examples¶
| Example | Description | Status |
|---|---|---|
| Multimodal Game Agent | Vision-based game playing agent | 🚧 In Progress |
| GUI Agent | General desktop application automation | 🚧 In Progress |
| Interactive World Model | Predictive modeling of desktop environments | 🚧 In Progress |
| Usage with LLMs | Integration with large language models | 🚧 In Progress |
| Usage with Transformers | Vision transformer implementations | 🚧 In Progress |
Development Resources¶
Learn how to contribute, report issues, and get help.
| Resource | Description |
|---|---|
| Help with OWA | Community support resources |
| Installation Guide | Detailed installation instructions |
| Contributing Guide | Development setup, bug reports, feature proposals |
| FAQ for Developers | Common questions and troubleshooting |
Features¶
🌍 Environment Framework: "USB-C of Desktop Agents"¶
- ⚡ Real-time Performance: Optimized for responsive agent interactions (GStreamer components achieve <30ms latency)
- 🔌 Zero-Configuration: Automatic plugin discovery via Python Entry Points
- 🌐 Event-Driven: Asynchronous processing that mirrors real-world dynamics
- 🧩 Extensible: Community-driven plugin ecosystem
→ View Environment Framework Guide
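To make the zero-configuration claim concrete, the sketch below shows how entry-point discovery works in general; the group name "owa.env.plugins" is an assumption for illustration, and the Custom Plugins guide documents the actual convention.

```python
# Generic illustration of entry-point based plugin discovery.
# The group name below is an assumption; OWA's real group is documented
# in the Custom Plugins guide.
from importlib.metadata import entry_points

for ep in entry_points(group="owa.env.plugins"):
    # Every installed plugin package advertises itself here -- no config files needed.
    print(f"discovered plugin {ep.name!r} -> {ep.value}")
```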
📊 Data Infrastructure: Complete Pipeline¶
- 🌐 Universal Standard: Unlike fragmented formats, OWAMcap enables seamless dataset combination for large-scale foundation models (OWAMcap)
- ⚡ High-Performance Multimodal Storage: Lightweight MCAP container with nanosecond precision for synchronized data streams (MCAP)
- 🔗 Flexible MediaRef: Smart references to both external and embedded media (file paths, URLs, data URIs, video frames) with lazy loading - keeps metadata files small while supporting rich media (OWAMcap) → Learn more
- 🤗 Training Pipeline Ready: Native HuggingFace integration, seamless dataset loading, and direct compatibility with ML frameworks (Ecosystem) → Browse datasets | Data pipeline
→ View Data Infrastructure Guide
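As a concrete (if simplified) picture of the pipeline's output, the sketch below reads events back from a recording; the import path, class, and attribute names are assumptions based on mcap-owa-support's described role rather than a verified API - see the OWAMcap Format Guide.

```python
# Hypothetical reading sketch -- OWAMcapReader, iter_messages, and the message
# attributes below are assumed names, not the verified mcap-owa-support API.
from mcap_owa.highlevel import OWAMcapReader  # assumed import path

with OWAMcapReader("my-session.mcap") as reader:
    for msg in reader.iter_messages(topics=["keyboard", "mouse"]):  # topic names illustrative
        print(msg.topic, msg.timestamp, msg.decoded)
```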
🤗 Community & Ecosystem¶
- 🌱 Growing Ecosystem: Hundreds of community datasets in unified OWAMcap format
- 🤗 HuggingFace Integration: Native dataset loading, sharing, and interactive preview tools
- 🧩 Extensible Architecture: Modular design for custom environments, plugins, and message types
- 💡 Community-Driven: Plugin ecosystem spanning gaming, web automation, mobile control, and specialized domains
License¶
This project is released under the MIT License. See the LICENSE file for details.