Open World Agents Documentation

A comprehensive framework for building AI agents that interact with desktop applications through vision, keyboard, and mouse control.

Open World Agents (OWA) is a monorepo containing the complete toolkit for multimodal desktop agent development. From high-performance data capture to model training and real-time evaluation, everything is designed for flexibility and performance.

🚀 Quick Start: Record → Train in 3 Steps

# 1. Record desktop interaction
$ ocap my-session.mcap

# 2. Process to training format
$ python scripts/01_raw_events_to_event_dataset.py --train-dir ./

# 3. Train your model
$ python train.py --dataset ./event-dataset

📖 Detailed Guide: Complete Quick Start Tutorial - Step-by-step walkthrough with examples and troubleshooting
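The three steps above can also be chained from a small driver script. The following is a minimal sketch, not part of OWA: the helper names `build_pipeline` and `run_pipeline` are hypothetical, and the commands and flags are copied verbatim from the Quick Start. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
from typing import List


def build_pipeline(recording: str, train_dir: str, dataset_dir: str) -> List[List[str]]:
    """Return the three Quick Start commands as argument lists (hypothetical helper)."""
    return [
        ["ocap", recording],
        ["python", "scripts/01_raw_events_to_event_dataset.py", "--train-dir", train_dir],
        ["python", "train.py", "--dataset", dataset_dir],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command; execute it only when dry_run is False.

    With check=True, the pipeline stops at the first step that fails.
    """
    printed = []
    for argv in commands:
        line = shlex.join(argv)
        printed.append(line)
        print(line)
        if not dry_run:
            subprocess.run(argv, check=True)
    return printed


commands = build_pipeline("my-session.mcap", "./", "./event-dataset")
lines = run_pipeline(commands)  # dry run: nothing is executed
```

Passing `dry_run=False` would execute the steps in order, which assumes `ocap` and the scripts are available in the current working directory and environment.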

Architecture Overview

OWA consists of the following core components:

  • 🌍 Environment Framework - Universal interface for native desktop automation ("USB-C of desktop agents"), with pre-built plugins for desktop control, high-performance screen capture (6x faster), and a zero-configuration plugin system
  • 📊 Data Infrastructure - Complete desktop agent data pipeline, from recording to training, built on the OWAMcap format, a universal standard powered by mcap
  • 🛠️ CLI Tools - Command-line utilities (owl) for recording, analyzing, and managing agent data
  • 🤖 Examples - Complete implementations and training pipelines for multimodal agents

🌍 Environment Framework

Universal interface for native desktop automation with real-time event handling and zero-configuration plugin discovery.

Environment Navigation

Section              | Description
Environment Overview | Core concepts and quick start guide
Environment Guide    | Complete system overview and usage examples
Custom Plugins       | Create your own environment extensions
CLI Tools            | Plugin management and exploration commands

Built-in Plugins:

Plugin    | Description              | Key Features
Standard  | Core utilities           | Time functions, periodic tasks
Desktop   | Desktop automation       | Mouse/keyboard control, window management
GStreamer | High-performance capture | 6x faster screen recording
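To make the idea of a plugin registry concrete, here is a minimal, self-contained sketch. It is an illustration only, not OWA's actual API: the `Registry` class, the `"namespace/name"` key convention, and the `tick_listener` helper are all assumptions made for this example. Components register under a string key and are looked up by that key at call time.

```python
import time
from typing import Any, Callable, Dict, List


class Registry:
    """Maps "namespace/name" keys to components (hypothetical, not OWA's class)."""

    def __init__(self) -> None:
        self._items: Dict[str, Callable[..., Any]] = {}

    def register(self, key: str) -> Callable:
        """Decorator that stores the decorated function under `key`."""
        def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
            self._items[key] = fn
            return fn
        return decorator

    def __getitem__(self, key: str) -> Callable[..., Any]:
        return self._items[key]

    def keys(self) -> List[str]:
        return sorted(self._items)


callables = Registry()


@callables.register("std/time_ns")
def time_ns() -> int:
    # A "time function" in the spirit of the Standard plugin row above.
    return time.time_ns()


@callables.register("desktop/mouse.click")
def click(x: int, y: int) -> str:
    # Stand-in only: a real plugin would drive the mouse here.
    return f"click at ({x}, {y})"


def tick_listener(callback: Callable[[int], Any], count: int, interval_s: float = 0.0) -> None:
    """Periodic-task stand-in: call `callback` `count` times, sleeping in between."""
    for i in range(count):
        callback(i)
        time.sleep(interval_s)


ticks: List[int] = []
tick_listener(ticks.append, count=3)
result = callables["desktop/mouse.click"](10, 20)
print(callables.keys(), result, ticks)
```

Looking components up by string key keeps calling code decoupled from the module that implements them, which is what lets plugins be added without changing the caller.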

📊 Data Infrastructure: Complete Desktop Agent Data Pipeline

Desktop AI needs high-quality, synchronized multimodal data: screen captures, mouse/keyboard events, and window context. OWA provides the complete pipeline from recording to training.
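As a rough illustration of what "synchronized" means here, the sketch below merges separate, time-sorted event streams into a single timeline ordered by timestamp. It does not use OWAMcap or any OWA API: the `Event` class, the topic names, and the payload strings are assumptions made for this example.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Event:
    timestamp_ns: int  # capture time in nanoseconds
    topic: str         # assumed topic names: "screen", "mouse", "keyboard"
    payload: str       # placeholder for the real message body


def merge_streams(*streams: Iterable[Event]) -> Iterator[Event]:
    """Merge streams that are each already sorted by time into one timeline."""
    return heapq.merge(*streams, key=lambda e: e.timestamp_ns)


screen = [Event(0, "screen", "frame-0"), Event(33_000_000, "screen", "frame-1")]
mouse = [Event(10_000_000, "mouse", "move 5,5"), Event(40_000_000, "mouse", "click left")]
keyboard = [Event(20_000_000, "keyboard", "press a")]

timeline = list(merge_streams(screen, mouse, keyboard))
topics = [e.topic for e in timeline]
print(topics)
```

Because every event carries a timestamp from a shared clock, a training pipeline can pair each input action with the screen frame that preceded it.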

The OWA Data Ecosystem

🎯 Getting Started - New to OWA data? Start here:

📚 Technical Reference - Deep dive into the format and pipeline:

🛠️ Tools & Ecosystem

🤗 Community Datasets

Browse Available Datasets: 🤗 datasets?other=OWA

  • Growing Collection: Hundreds of community-contributed datasets
  • Standardized Format: All use OWAMcap for seamless integration
  • Interactive Preview: Hugging Face Spaces Visualizer
  • Easy Sharing: Upload recordings directly with one command

🚀 Impact: OWA has democratized desktop agent data, growing from zero to hundreds of public datasets in the unified OWAMcap format.


🤖 Awesome Examples

Learn from complete implementations and training pipelines.

Example                 | Description                                 | Status
Multimodal Game Agent   | Vision-based game playing agent             | 🚧 In Progress
GUI Agent               | General desktop application automation      | 🚧 In Progress
Interactive World Model | Predictive modeling of desktop environments | 🚧 In Progress
Usage with LLMs         | Integration with large language models      | 🚧 In Progress
Usage with Transformers | Vision transformer implementations          | 🚧 In Progress

Development Resources

Learn how to contribute, report issues, and get help.

Resource           | Description
Help with OWA      | Community support resources
Installation Guide | Detailed installation instructions
Contributing Guide | Development setup, bug reports, feature proposals
FAQ for Developers | Common questions and troubleshooting

License

This project is released under the MIT License. See the LICENSE file for details.