
Why OWAMcap?

Desktop and GUI agent datasets use inconsistent formats, making it hard to combine data from different sources.

The Fragmentation Problem

Existing datasets each define their own format:

| Dataset | Venue | Domain | Format |
|---|---|---|---|
| VPT | - | Minecraft | MP4 + JSONL (per-frame action dictionaries) |
| CS Deathmatch | CoG '22 | CS:GO | HDF5 (screenshots + action labels) + NPY (metadata) |
| Mind2Web | NeurIPS '23 | Web | Playwright traces + DOM snapshots (JSON) + screenshots (base64 JSON) + HAR |
| OmniACT | ECCV '24 | Desktop/Web | PNG screenshots + TXT (task + PyAutoGUI script) + bounding box JSON |

Robotics faced the same problem: the Open-X Embodiment project had to manually convert 22 different robotics datasets. OWAMcap addresses this for desktop data by providing a general desktop message definition based on MCAP. To demonstrate this, we provide conversion scripts that transform VPT, CS:GO, and other existing datasets into OWAMcap, so they can be combined and used with a single training pipeline.
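As a rough illustration of what such a conversion involves, the sketch below maps two differently shaped sources onto one record type. Everything in it is hypothetical: `DesktopEvent`, both converter functions, the field names, and the sample inputs are assumptions made for this example, not the actual OWAMcap message schema or the project's conversion scripts.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DesktopEvent:
    """Hypothetical unified record, NOT the real OWAMcap message definition."""
    timestamp_ns: int  # nanoseconds since the start of the recording
    topic: str         # e.g. "keyboard" or "mouse"
    payload: dict


def from_per_frame_jsonl(text: str, fps: int) -> list[DesktopEvent]:
    """Convert per-frame action dictionaries (one JSON object per line)."""
    lines = [line for line in text.splitlines() if line.strip()]
    return [
        # Frame index -> timestamp, using integer math to avoid float drift.
        DesktopEvent(index * 1_000_000_000 // fps, "keyboard", json.loads(line))
        for index, line in enumerate(lines)
    ]


def from_labeled_rows(rows: list[tuple[float, str]]) -> list[DesktopEvent]:
    """Convert (time_in_seconds, action_label) rows."""
    return [
        DesktopEvent(round(seconds * 1e9), "mouse", {"label": label})
        for seconds, label in rows
    ]


# Two made-up sources with different native layouts.
source_a = '{"keys": ["w"]}\n{"keys": ["w", "space"]}'
source_b = [(0.025, "click")]

# After conversion, both share one schema and can be handled together.
combined = from_per_frame_jsonl(source_a, fps=20) + from_labeled_rows(source_b)
```

Once every source is expressed in the same record type, downstream code only needs to understand that one type, which is the point of a shared format.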

From Recording to Training

OWAMcap integrates with the complete OWA Data Pipeline:

# 1. Record desktop interaction
$ ocap my-session.mcap

# 2. Process to training format
$ python scripts/01_raw_to_event.py --train-dir ./

# 3. Train your model
$ python train.py --dataset ./event-dataset

📖 Detailed Guide: Complete Quick Start Tutorial - Step-by-step walkthrough with examples and troubleshooting

Key Features

  • 🌐 Universal Standard: One shared format replaces fragmented per-dataset formats, so datasets can be combined to train large-scale foundation models (OWAMcap)
  • High-Performance Multimodal Storage: Lightweight MCAP container with nanosecond precision for synchronized data streams (MCAP)
  • 🔗 Flexible MediaRef: Smart references to both external and embedded media (file paths, URLs, data URIs, video frames) with lazy loading, which keeps metadata files small while supporting rich media (OWAMcap). Learn more
  • 🤗 Training Pipeline Ready: Native HuggingFace integration, seamless dataset loading, and direct compatibility with ML frameworks (Ecosystem). Browse datasets | Data pipeline
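
The media-reference idea in the list above can be sketched in a few lines. This is a minimal illustration only: `MediaRefSketch`, its fields, and its `kind` labels are hypothetical names chosen for this example and are not the actual MediaRef API. It shows how a small reference can stand in for the media itself, with bytes decoded only when requested.

```python
import base64
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class MediaRefSketch:
    """Hypothetical stand-in for a media reference; not the real MediaRef class."""
    uri: str
    pts_ns: Optional[int] = None  # frame timestamp when pointing into a video

    @property
    def kind(self) -> str:
        """Classify the reference without touching the media itself."""
        if self.uri.startswith("data:"):
            return "embedded"
        if self.pts_ns is not None:
            return "video_frame"
        if urlparse(self.uri).scheme in ("http", "https"):
            return "remote"
        return "local_file"

    def load_embedded(self) -> bytes:
        """Decode an embedded base64 data URI only when asked (lazy loading)."""
        if self.kind != "embedded":
            raise ValueError("only embedded data URIs are handled in this sketch")
        header, _, data = self.uri.partition(",")
        if not header.endswith(";base64"):
            raise ValueError("only base64-encoded data URIs are handled")
        return base64.b64decode(data)


frame_ref = MediaRefSketch("session.mkv", pts_ns=1_500_000_000)
remote_ref = MediaRefSketch("https://example.com/frame.png")
inline_ref = MediaRefSketch("data:text/plain;base64,aGk=")
```

Storing only the URI and an optional timestamp is what keeps the metadata small: the video or image stays wherever it already lives until a consumer actually needs the pixels.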

Example

$ owl mcap info example.mcap
messages:  864 (10.36s of interaction data)
file size: 22 KiB (vs 1+ GB raw)
channels:  screen, mouse, keyboard, window

See OWAMcap Format Guide for technical details.