Skip to content

OWAMcap Format Guide

What is OWAMcap?

OWAMcap is a specification for using the open-source MCAP container format with Open World Agents (OWA) message definitions. It provides an efficient way to store and process multimodal desktop interaction data including screen captures, mouse events, keyboard events, and window information.

New to OWAMcap?

Start with Why OWAMcap? to understand the problem it solves and why you should use it.

Table of Contents

Getting Started

Quick Start

Try OWAMcap in 3 Steps

1. Install the packages:

$ pip install mcap-owa-support owa-msgs

2. Explore an example file with the owl CLI:

What is owl?

owl is the command-line interface for OWA tools, installed with owa-cli. See the CLI documentation for complete usage.

# Download example file
wget https://github.com/open-world-agents/open-world-agents/raw/main/docs/data/examples/example.mcap

# View file info
owl mcap info example.mcap

# List first 5 messages
owl mcap cat example.mcap --n 5

3. Load in Python:

from mcap_owa.highlevel import OWAMcapReader

with OWAMcapReader("example.mcap", decode_args={"return_dict": True}) as reader:
    for msg in reader.iter_messages(topics=["screen"]):
        screen_data = msg.decoded
        print(f"Frame: {screen_data.shape} at {screen_data.utc_ns}")
        break  # Just show first frame

Core Concepts

OWAMcap combines the robustness of the MCAP container format with OWA's specialized message types for desktop environments, creating a powerful format for recording, analyzing, and training on human-computer interaction data.

Key Terms:

Essential Terminology

  • MCAP: A modular container file format for heterogeneous, timestamped data (like a ZIP file for time-series data). Developed by Foxglove, MCAP provides efficient random access, compression, and self-describing schemas. Widely adopted in robotics (ROS ecosystem), autonomous vehicles, and IoT applications for its performance and interoperability.
  • Topic: A named channel in MCAP files (e.g., "screen", "mouse") that groups related messages
  • Lazy Loading: Loading data only when needed, crucial for memory efficiency with large datasets

What Makes a File "OWAMcap":

┌─────────────────────────────────────────────────────────────┐
│                    OWAMcap File (.mcap)                     │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │   Metadata      │  │   Timestamps    │  │  Messages   │  │
│  │   - Profile     │  │   - Nanosecond  │  │  - Mouse    │  │
│  │   - Topics      │  │     precision   │  │  - Keyboard │  │
│  │   - Schemas     │  │   - Event sync  │  │  - Window   │  │
│  └─────────────────┘  └─────────────────┘  └─────────────┘  │
└─────────────────────────────────────────────────────────────┘
                                │ References
┌─────────────────────────────────────────────────────────────┐
│                External Media Files (.mkv, .png)            │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │  Video Frames   │  │  Screenshots    │  │   Audio     │  │
│  │  - H.265 codec  │  │  - PNG/JPEG     │  │  - Optional │  │
│  │  - Hardware acc │  │  - Lossless     │  │  - Sync'd   │  │
│  └─────────────────┘  └─────────────────┘  └─────────────┘  │
└─────────────────────────────────────────────────────────────┘
  • Base Format: Standard MCAP container format
  • Profile: owa designation in MCAP metadata
  • Schema Encoding: JSON Schema
  • Message Interface: All messages implement BaseMessage from owa.core.message
  • Standard Messages: Core message types from owa-msgs package

Why MCAP?

Built as the successor to ROSBag, MCAP offers efficient storage and retrieval for heterogeneous timestamped data with minimal dependencies. It's designed for modern use cases with optimized random access, built-in compression, and language-agnostic schemas. The format has gained significant adoption across the robotics community, autonomous vehicle companies (Cruise, Waymo), and IoT platforms due to its performance advantages and excellent tooling ecosystem.

$ owl mcap info example.mcap
library:   mcap-owa-support 0.5.1; mcap 1.3.0
profile:   owa
messages:  864
duration:  10.3574349s
start:     2025-06-27T18:49:52.129876+09:00 (1751017792.129876000)
end:       2025-06-27T18:50:02.4873109+09:00 (1751017802.487310900)
compression:
        zstd: [1/1 chunks] [116.46 KiB/16.61 KiB (85.74%)] [1.60 KiB/sec]
channels:
        (1) window           11 msgs (1.06 Hz)    : desktop/WindowInfo [jsonschema]
        (2) keyboard/state   11 msgs (1.06 Hz)    : desktop/KeyboardState [jsonschema]
        (3) mouse/state      11 msgs (1.06 Hz)    : desktop/MouseState [jsonschema]
        (4) screen          590 msgs (56.96 Hz)   : desktop/ScreenCaptured [jsonschema]
        (5) mouse           209 msgs (20.18 Hz)   : desktop/MouseEvent [jsonschema]
        (6) keyboard         32 msgs (3.09 Hz)    : desktop/KeyboardEvent [jsonschema]
channels: 6
attachments: 0
metadata: 0

Key Features:

  • 🌐 Universal Standard: Unlike fragmented formats, enables seamless dataset combination for large-scale foundation models (OWAMcap)
  • High-Performance Multimodal Storage: Lightweight MCAP container with nanosecond precision for synchronized data streams (MCAP)
  • 🔗 Flexible MediaRef: Smart references to both external and embedded media (file paths, URLs, data URIs, video frames) with lazy loading - keeps metadata files small while supporting rich media (OWAMcap)Learn more
  • 🤗 Training Pipeline Ready: Native HuggingFace integration, seamless dataset loading, and direct compatibility with ML frameworks (Ecosystem)Browse datasets | Data pipeline

Core Message Types:

OWA provides standardized message types through the owa-msgs package for consistent desktop interaction recording:

Message Type Description
desktop/KeyboardEvent Keyboard press/release events
desktop/KeyboardState Current keyboard state
desktop/MouseEvent Mouse movement, clicks, scrolls
desktop/MouseState Current mouse position and buttons
desktop/RawMouseEvent High-definition raw mouse input data
desktop/ScreenCaptured Screen capture frames with timestamps
desktop/WindowInfo Active window information
class KeyboardEvent(OWAMessage):
    _type = "desktop/KeyboardEvent"

    event_type: str  # "press" or "release"
    vk: int         # Virtual key code (e.g., 65 for 'A')
    timestamp: int  # Event timestamp

# Example: User presses the 'A' key
KeyboardEvent(event_type="press", vk=65, timestamp=1234567890)

What's VK (Virtual Key Code)?

Operating systems don't directly use the physical keyboard input values (scan codes) but instead use virtualized keys called VKs. OWA's recorder uses VKs to record keyboard-agnostic data. If you're interested in more details, you can refer to the following resources:

class KeyboardState(OWAMessage):
    _type = "desktop/KeyboardState"

    buttons: List[int]  # List of currently pressed virtual key codes

# Example: No keys currently pressed
KeyboardState(buttons=[])
class MouseEvent(OWAMessage):
    _type = "desktop/MouseEvent"

    event_type: str  # "move", "click", "scroll", "drag"
    x: int          # Screen X coordinate
    y: int          # Screen Y coordinate
    button: Optional[str] = None    # "left", "right", "middle"

# Example: Mouse click at position (100, 200)
MouseEvent(event_type="click", x=100, y=200, button="left")
class MouseState(OWAMessage):
    _type = "desktop/MouseState"

    x: int                    # Current mouse X coordinate
    y: int                    # Current mouse Y coordinate
    buttons: List[str] = []   # Currently pressed mouse buttons

# Example: Mouse at position with no buttons pressed
MouseState(x=1594, y=1112, buttons=[])
class RawMouseEvent(OWAMessage):
    _type = "desktop/RawMouseEvent"

    us_flags: mouse state flags, containing movement data type (relative/absolute). Default is relative.
    last_x: can be relative or absolute, depends on us_flags
    last_y: can be relative or absolute, depends on us_flags
    button_flags: Raw button state flags from Windows RAWMOUSE structure
    button_data: Additional button data (wheel delta, etc.)
    device_handle: Raw input device handle (optional)
    timestamp: Optional timestamp in nanoseconds since epoch

# Example: Raw mouse movement
RawMouseEvent(us_flags=0x0000, last_x=15, last_y=-10, button_flags=0x0000, button_data=0)
class ScreenCaptured(OWAMessage):
    _type = "desktop/ScreenCaptured"

    utc_ns: Optional[int] = None                    # System timestamp (nanoseconds)
    source_shape: Optional[Tuple[int, int]] = None  # Original (width, height)
    shape: Optional[Tuple[int, int]] = None         # Current (width, height)
    media_ref: Optional[MediaRef] = None            # URI or file path reference
    frame_arr: Optional[np.ndarray] = None          # In-memory BGRA array (excluded from JSON)

Working with ScreenCaptured Messages

For detailed information on creating, loading, and working with ScreenCaptured messages, see the Media Handling section below. It covers MediaRef formats, lazy loading, and practical usage patterns.

class WindowInfo(OWAMessage):
    _type = "desktop/WindowInfo"

    title: str              # Window title text
    rect: List[int]         # [x, y, width, height]
    hWnd: Optional[int]     # Windows handle (platform-specific)

# Example: Browser window
WindowInfo(
    title="GitHub - Open World Agents - Chrome",
    rect=[100, 50, 1200, 800]
)

Working with OWAMcap

This section covers the essential operations for working with OWAMcap files in your applications. Whether you're processing recorded desktop sessions or creating new datasets, these patterns will help you work efficiently with the format.

Media Handling

OWAMcap's key advantage is efficient media handling through external media references. Instead of storing large image/video data directly in the MCAP file, OWAMcap stores lightweight references to external media files, keeping the MCAP file small and fast to process.

Understanding MediaRef

MediaRef is OWAMcap's way of referencing media content, powered by the mediaref package. It supports multiple formats:

  • File paths: /absolute/path or relative/path
  • File URIs: file:///path/to/file
  • HTTP URLs: https://example.com/image.png
  • Data URIs: data:image/png;base64,... (embedded content)

For videos, add pts_ns (presentation timestamp) to specify which frame.

The mediaref package is automatically installed with owa-msgs and provides efficient lazy loading and batch decoding capabilities.

from owa.core import MESSAGES
import numpy as np

ScreenCaptured = MESSAGES['desktop/ScreenCaptured']

# File paths (absolute/relative) - works for images and videos
screen_msg = ScreenCaptured(media_ref={"uri": "/absolute/path/image.png"})
screen_msg = ScreenCaptured(media_ref={"uri": "relative/video.mkv", "pts_ns": 123456})

# File URIs - works for images and videos
screen_msg = ScreenCaptured(media_ref={"uri": "file:///path/to/image.jpg"})
screen_msg = ScreenCaptured(media_ref={"uri": "file:///path/to/video.mp4", "pts_ns": 123456})

# HTTP/HTTPS URLs - works for images and videos
screen_msg = ScreenCaptured(media_ref={"uri": "https://example.com/image.png"})
screen_msg = ScreenCaptured(media_ref={"uri": "https://example.com/video.mp4", "pts_ns": 123456})

# Data URIs (embedded base64) - typically for images
screen_msg = ScreenCaptured(media_ref={"uri": "data:image/png;base64,iVBORw0KGgo..."})

# From raw image array (BGRA format required)
bgra_array = np.random.randint(0, 255, (1080, 1920, 4), dtype=np.uint8)
screen_msg = ScreenCaptured(frame_arr=bgra_array)
screen_msg.embed_as_data_uri(format="png")  # Required for serialization
# Now screen_msg.media_ref contains: {"uri": "data:image/png;base64,..."}

Why Lazy Loading Matters

Lazy Loading means frame data is only loaded when you explicitly request it. This is crucial for performance:

  • Fast: Iterate through thousands of messages instantly
  • Memory efficient: Only load frames you actually need
  • Scalable: Work with datasets larger than your RAM

Without lazy loading, opening a 1-hour recording would try to load ~200GB of frame data into memory!

# IMPORTANT: For MCAP files, resolve relative paths first
# The OWA recorder saves media paths relative to the MCAP file location
ScreenCaptured = MESSAGES['desktop/ScreenCaptured']
screen_msg = ScreenCaptured(
    media_ref={"uri": "relative/video.mkv", "pts_ns": 123456789}
)

# Must resolve external paths before loading from MCAP files
screen_msg.resolve_relative_path("/path/to/data.mcap")

# Lazy loading: Frame data is loaded on-demand when these methods are called
rgb_array = screen_msg.to_rgb_array()        # RGB numpy array (most common)
pil_image = screen_msg.to_pil_image()        # PIL Image object
bgra_array = screen_msg.load_frame_array()   # Raw BGRA array (native format)

# Check if frame data is loaded (lazy loading means it starts as None)
if screen_msg.frame_arr is not None:
    height, width, channels = screen_msg.frame_arr.shape
    print(f"Frame: {width}x{height}, {channels} channels")
else:
    print("Frame data not loaded - use load_frame_array() first")

Reading and Writing

from mcap_owa.highlevel import OWAMcapReader

with OWAMcapReader("session.mcap") as reader:
    # File metadata
    print(f"Topics: {reader.topics}")
    print(f"Duration: {(reader.end_time - reader.start_time) / 1e9:.2f}s")

    # Lazy loading advantage: Fast iteration without loading frame data
    for msg in reader.iter_messages(topics=["screen"]):
        screen_data = msg.decoded
        print(f"Frame metadata: {screen_data.shape} at {screen_data.utc_ns}")
        # No frame data loaded yet - extremely fast for large datasets

        # Only load frame data when actually needed
        if some_condition:  # e.g., every 10th frame
            frame = screen_data.to_rgb_array()  # Now frame is loaded
            break  # Just show first frame
from mcap_owa.highlevel import OWAMcapWriter
from owa.core import MESSAGES

ScreenCaptured = MESSAGES['desktop/ScreenCaptured']
MouseEvent = MESSAGES['desktop/MouseEvent']

with OWAMcapWriter("output.mcap") as writer:
    # Write screen capture
    screen_msg = ScreenCaptured(
        utc_ns=1234567890,
        media_ref={"uri": "video.mkv", "pts_ns": 1234567890},
        shape=(1920, 1080)
    )
    writer.write_message(screen_msg, topic="screen", timestamp=1234567890)

    # Write standard mouse event
    mouse_msg = MouseEvent(event_type="click", x=100, y=200)
    writer.write_message(mouse_msg, topic="mouse", timestamp=1234567891)
# Time range filtering
with OWAMcapReader("session.mcap") as reader:
    start_time = reader.start_time + 1_000_000_000  # Skip first second
    end_time = reader.start_time + 10_000_000_000   # First 10 seconds

    for msg in reader.iter_messages(start_time=start_time, end_time=end_time):
        print(f"Message in range: {msg.topic}")

# Remote files
with OWAMcapReader("https://example.com/data.mcap") as reader:
    for msg in reader.iter_messages(topics=["screen"]):
        print(f"Remote frame: {msg.decoded.shape}")
# File information
owl mcap info session.mcap

# List messages
owl mcap cat session.mcap --n 10 --topics screen --topics mouse

# Migrate between versions
owl mcap migrate run session.mcap

# Extract frames
owl mcap extract-frames session.mcap --output frames/

Storage & Performance

OWAMcap achieves remarkable storage efficiency through external video references and intelligent compression.

Compression Benefits:

Compression Performance

Compression performance varies significantly across formats. H.265 encoding achieves a 217.8× compression ratio compared to raw BGRA data while maintaining visual quality suitable for agent training, enabling practical storage of large-scale desktop interaction datasets.

Desktop screen capture at 1920 × 1080 resolution, 12 s @ 60 Hz:

Format Size per Frame Whole Size Compression Ratio
Raw BGRA 5.97 MB 4.2 GB 1.0× (baseline)
PNG 1.87 MB 1.31 GB 3.2×
JPEG (Quality 85) 191 KB 135 MB 31.9×
H.265 (keyframe 0.5s, nvd3d11h265enc) 27.8 KB avg 19.6 MB 217.8×

Compression benefit per resolution

Compression performance is resolution-dependent. Lower resolutions yield lower compression ratios.

Desktop screen capture at 600 × 800 resolution, 13 s @ 60 Hz:

Format Size per Frame Whole Size Compression Ratio
Raw BGRA 1.37 MB 1.0 GB 1.0× (baseline)
PNG 468 KB 341 MB 3.0×
JPEG (Quality 85) 64.6 KB 47.2 MB 21.7×
H.265 (keyframe 0.5s, nvd3d11h265enc) 15.3 KB avg 11.2 MB 91.7×

H.265 Configuration

The H.265 settings shown above (keyframe 0.5s, nvd3d11h265enc) are the same as those used by ocap for efficient desktop recording.

Key advantages:

  • Lightweight MCAP: very fast to parse, transfer, and back up
  • Video Compression: leverages hardware-accelerated codecs for extreme savings
  • Selective Loading: grab only the frames you need without full decompression
  • Standard Tools: preview in any video player and edit with off-the-shelf software

Advanced Topics

Extending OWAMcap

Custom Message Types:

Need to store domain-specific data beyond standard desktop interactions? OWAMcap supports custom message types for sensors, gaming, robotics, and more.

Custom Messages Documentation

📖 Custom Message Types Guide - Complete guide to creating, registering, and using custom message types in OWAMcap.

Covers: message creation, package registration, best practices, and CLI integration.

Data Pipeline Integration

See owa-data README for full pipeline documentation.

Reference

Migration & Troubleshooting

File Migration:

OWAMcap format evolves over time. When you encounter older files that need updating, use the migration tool:

When Do You Need Migration?

  • Error messages about unsupported schema versions
  • Missing fields when loading older recordings
  • Compatibility warnings from OWA tools
  • Performance issues with legacy file formats

Migration commands:

# Check if migration is needed
owl mcap info old_file.mcap  # Look for version warnings

# Preview what will change (safe, no modifications)
owl mcap migrate run old_file.mcap --dry-run

# Migrate single file (creates backup automatically)
owl mcap migrate run old_file.mcap

# Migrate multiple files in batch
owl mcap migrate run *.mcap

# Migrate with custom output location
owl mcap migrate run old_file.mcap --output new_file.mcap

Migration Safety

  • Automatic backups: Original files are preserved as .backup
  • Validation: Migrated files are automatically validated
  • Rollback: Use backup files if migration causes issues

Complete Migration Reference

For detailed information about all migration commands and options, see the OWA CLI - MCAP Commands documentation.

Common Issues:

File Not Found Errors

When video files are missing:

# Resolve relative paths
screen_msg.resolve_relative_path("/path/to/mcap/file.mcap")
# Check if external media exists
screen_msg.media_ref.validate_uri()

Memory Usage

Large datasets can consume memory:

# Use lazy loading instead of loading all frames
for msg in reader.iter_messages(topics=["screen"]):
    if should_process_frame(msg.timestamp):
        frame = msg.decoded.load_frame_array()  # Only when needed

Technical Reference

For detailed technical specifications, see:

Next Steps