Quick Start Guide

Complete step-by-step walkthrough for getting started with Open World Agents

3-Step Workflow

This guide covers the complete OWA workflow: Record → Process → Train

Overview

This guide provides detailed explanations, examples, and troubleshooting for the 3-step OWA workflow:

# 1. Record desktop interaction
$ ocap my-session.mcap

# 2. Process to training format
$ python scripts/01_raw_events_to_event_dataset.py --train-dir ./

# 3. Train your model
$ python train.py --dataset ./event-dataset

📖 Detailed Guide: Complete Quick Start Tutorial - Step-by-step walkthrough with examples and troubleshooting

Prerequisites

Installation Required

Before starting, ensure you have OWA installed. See the Installation Guide for detailed setup instructions.

For full recording capabilities:

# Install GStreamer dependencies first
conda install open-world-agents::gstreamer-bundle
pip install owa

For basic data processing:

pip install owa

Step 1: Record Desktop Interaction

Record with ocap

Use ocap (Omnimodal CAPture) to record your desktop interactions with synchronized video, audio, and input events.

$ ocap my-session.mcap

What this captures

  • Screen video with hardware acceleration
  • Keyboard events with nanosecond precision
  • Mouse interactions with exact coordinates
  • Audio recording synchronized with video
  • Everything saved in the OWAMcap format
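All of these streams share one nanosecond clock, so they can be merged into a single ordered timeline for playback or processing. A minimal sketch using hypothetical plain-dict events (the real OWAMcap format stores typed messages, not dicts like these):

```python
import heapq

# Hypothetical event records illustrating the kind of data ocap captures.
screen_events = [
    {"topic": "screen", "timestamp_ns": 1_000_000_000},
    {"topic": "screen", "timestamp_ns": 1_016_666_667},  # ~60 FPS later
]
keyboard_events = [
    {"topic": "keyboard", "timestamp_ns": 1_005_000_000, "vk": 65},
]
mouse_events = [
    {"topic": "mouse", "timestamp_ns": 1_012_000_000, "x": 320, "y": 240},
]

# Merge all streams into one timeline ordered by nanosecond timestamp,
# which is how synchronized processing walks a recording.
timeline = list(heapq.merge(
    screen_events, keyboard_events, mouse_events,
    key=lambda e: e["timestamp_ns"],
))

print([e["topic"] for e in timeline])
# ['screen', 'keyboard', 'mouse', 'screen']
```

Because every stream carries the same clock, no resampling is needed to interleave them; ordering by timestamp is sufficient.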


Step 2: Process to Training Format

Transform with Data Pipeline

Transform your recorded data into training-ready datasets using OWA's data pipeline.

$ python scripts/01_raw_events_to_event_dataset.py --train-dir ./

Processing Pipeline

  • Extracts events from the MCAP file
  • Converts format to standardized training structure
  • Handles media references and synchronization
  • Prepares data for ML frameworks

Advanced Processing

flowchart LR
    A[MCAP File] --> B[Event Dataset]
    B --> C[Binned Dataset]
    C --> D[Training Ready]

    style A fill:#e1f5fe
    style D fill:#e8f5e8


Step 3: Train Your Model

TODO: Training Implementation

This section is under development. Training scripts and detailed examples are coming soon.

Train with Processed Data

Use the processed dataset to train your desktop agent model.

$ python train.py --dataset ./event-dataset

Training Capabilities

  • Multimodal models - trained on desktop interactions
  • Learn from demonstrations - human behavior patterns
  • Application-specific agents - tailored to your use case
  • Performance evaluation - measured on real tasks
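As a sketch of what "learn from demonstrations" means at its simplest, here is a hypothetical frequency-count policy over (observation, action) pairs. The state and action names are invented for illustration; real training would fit a multimodal neural policy rather than a lookup table:

```python
from collections import Counter, defaultdict

# Hypothetical (observation, action) pairs extracted from human demonstrations.
demos = [
    ("dialog_open", "click_ok"),
    ("dialog_open", "click_ok"),
    ("dialog_open", "press_escape"),
    ("editor_focused", "type_text"),
]

# Minimal behavior-cloning baseline: for each observed state, predict the
# action humans chose most often in the demonstrations.
policy = defaultdict(Counter)
for obs, action in demos:
    policy[obs][action] += 1

def act(obs):
    return policy[obs].most_common(1)[0][0]

print(act("dialog_open"))  # click_ok
```

Even this trivial baseline makes the data requirement concrete: training needs paired observations and actions, which is exactly what the processed event dataset provides.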

Training Architecture

flowchart TD
    A[Event Dataset] --> B[Vision Encoder]
    A --> C[Action Encoder]
    B --> D[Multimodal Fusion]
    C --> D
    D --> E[Policy Network]
    E --> F[Desktop Agent]

    style A fill:#e1f5fe
    style F fill:#e8f5e8


Environment Framework Integration

Real-time Agent Interactions

While recording and training, you can also use OWA's real-time environment framework for live agent interactions:

from owa.core import CALLABLES, LISTENERS

# Real-time screen capture
screen = CALLABLES["desktop/screen.capture"]()

# Monitor user interactions
def on_key(event):
    print(f"Key pressed: {event.vk}")

listener = LISTENERS["desktop/keyboard"]().configure(callback=on_key)

# Perform desktop actions
CALLABLES["desktop/mouse.click"]("left", 2)  # Double-click
CALLABLES["desktop/keyboard.type"]("Hello World!")
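The string-keyed CALLABLES and LISTENERS lookups rely on a plugin-registry pattern: plugins register named capabilities, and callers resolve them by key at runtime. A toy re-implementation of that pattern (the `demo/echo` entry is purely illustrative, not a real OWA plugin):

```python
class Registry:
    """Minimal string-keyed registry, sketching the pattern behind
    CALLABLES/LISTENERS. Not the actual OWA implementation."""

    def __init__(self):
        self._entries = {}

    def register(self, name):
        # Decorator that stores the object under the given name.
        def decorator(obj):
            self._entries[name] = obj
            return obj
        return decorator

    def __getitem__(self, name):
        return self._entries[name]

CALLABLES = Registry()

@CALLABLES.register("demo/echo")
def echo(text):
    return text

print(CALLABLES["demo/echo"]("Hello World!"))  # Hello World!
```

Registering by name is what lets third-party environment plugins add capabilities without callers importing them directly.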


Community Resources

Datasets & Tools

Next Steps

Your Journey Continues

  1. Explore Examples: Start with Agent Examples to see complete implementations

  2. Join the Community: Browse and contribute datasets

  3. Build Custom Plugins: Extend OWA with custom environment plugins

  4. Advanced Usage: Dive into technical documentation for advanced features
