Why OWAMcap?
Desktop and GUI agent datasets use inconsistent formats, making it hard to combine data from different sources.
The Fragmentation Problem
Existing datasets each define their own format:
| Dataset | Venue | Domain | Format |
|---|---|---|---|
| VPT | - | Minecraft | MP4 + JSONL (per-frame action dictionaries) |
| CS Deathmatch | CoG '22 | CS:GO | HDF5 (screenshots + action labels) + NPY (metadata) |
| Mind2Web | NeurIPS '23 | Web | Playwright traces + DOM snapshots (JSON) + screenshots (base64 JSON) + HAR |
| OmniACT | ECCV '24 | Desktop/Web | PNG screenshots + TXT (task + PyAutoGUI script) + bounding box JSON |
Robotics faced the same problem: the Open X-Embodiment project had to manually convert 22 different robotics datasets before they could be used together. OWAMcap addresses this for desktop data by providing a general desktop message definition built on MCAP. To demonstrate this, we provide conversion scripts that transform VPT, CS:GO, and other existing datasets into OWAMcap, so they can be combined and used with a single training pipeline.
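The conversion idea can be illustrated with a minimal sketch. It is not one of the project's conversion scripts: the `Event` class, the `frames_to_events` function, the per-frame record layout (`keys`, `mouse.dx`, `mouse.dy`), and the 20 FPS default are all assumptions chosen for illustration. It only shows the general shape of turning per-frame action dictionaries into timestamped events on named topics.

```python
import json
from dataclasses import dataclass

NS_PER_SECOND = 1_000_000_000


@dataclass(frozen=True)
class Event:
    """Hypothetical unified event: a topic, a nanosecond timestamp, a payload."""
    topic: str
    timestamp_ns: int
    payload: dict


def frames_to_events(jsonl_text: str, fps: int = 20) -> list[Event]:
    """Convert per-frame action records (one JSON object per line) into events.

    The record layout used here is an assumption, not the real VPT schema.
    """
    lines = [line for line in jsonl_text.splitlines() if line.strip()]
    events: list[Event] = []
    for frame_index, line in enumerate(lines):
        record = json.loads(line)
        # Derive the timestamp from the frame index and the assumed frame rate.
        timestamp_ns = frame_index * NS_PER_SECOND // fps

        keys = record.get("keys", [])
        if keys:
            events.append(Event("keyboard", timestamp_ns, {"pressed": keys}))

        mouse = record.get("mouse", {})
        dx, dy = mouse.get("dx", 0), mouse.get("dy", 0)
        if dx or dy:
            events.append(Event("mouse", timestamp_ns, {"dx": dx, "dy": dy}))
    return events
```

Once every source dataset is mapped onto the same event shape, downstream code no longer needs to know which dataset a sample came from.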
From Recording to Training
OWAMcap integrates with the complete OWA Data Pipeline:
# 1. Record desktop interaction
$ ocap my-session.mcap
# 2. Process to training format
$ python scripts/01_raw_to_event.py --train-dir ./
# 3. Train your model
$ python train.py --dataset ./event-dataset
📖 Detailed Guide: Complete Quick Start Tutorial - Step-by-step walkthrough with examples and troubleshooting
Key Features
- 🌐 Universal Standard: One shared format in place of fragmented ones, so datasets from different sources can be combined for training large-scale foundation models (OWAMcap)
- ⚡ High-Performance Multimodal Storage: Lightweight MCAP container with nanosecond precision for synchronized data streams (MCAP)
- 🔗 Flexible MediaRef: Smart references to both external and embedded media (file paths, URLs, data URIs, video frames) with lazy loading, which keeps metadata files small while supporting rich media (OWAMcap) → Learn more
- 🤗 Training Pipeline Ready: Native HuggingFace integration, seamless dataset loading, and direct compatibility with ML frameworks (Ecosystem) → Browse datasets | Data pipeline
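To make the media-reference idea concrete, here is a minimal sketch of a lazily loaded reference. It is not the OWAMcap `MediaRef` implementation: the class name `MediaRefSketch`, its fields `uri` and `pts_ns`, and its methods are hypothetical, and remote URLs and video-frame decoding are deliberately left out. The point is only that the stored record is a short string, and bytes are read when requested.

```python
import base64
from dataclasses import dataclass
from pathlib import Path
from typing import Optional
from urllib.parse import unquote_to_bytes


@dataclass(frozen=True)
class MediaRefSketch:
    """Hypothetical media reference: a URI plus an optional video timestamp."""
    uri: str
    pts_ns: Optional[int] = None  # set only when pointing at a video frame

    @property
    def is_embedded(self) -> bool:
        return self.uri.startswith("data:")

    @property
    def is_remote(self) -> bool:
        return self.uri.startswith(("http://", "https://"))

    @property
    def is_video_frame(self) -> bool:
        return self.pts_ns is not None

    def load_bytes(self) -> bytes:
        """Resolve the reference only when the media is actually needed."""
        if self.is_video_frame or self.is_remote:
            # Frame decoding and network access are outside this sketch.
            raise NotImplementedError("not covered by this sketch")
        if self.is_embedded:
            header, _, payload = self.uri.partition(",")
            if header.endswith(";base64"):
                return base64.b64decode(payload)
            return unquote_to_bytes(payload)
        # Anything else is treated as a local file path.
        return Path(self.uri).read_bytes()
```

A reference such as `MediaRefSketch("session.mkv", pts_ns=1_500_000_000)` costs a few dozen bytes in the metadata file, regardless of how large the referenced video is.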
Example
$ owl mcap info example.mcap
messages: 864 (10.36s of interaction data)
file size: 22 KiB (vs 1+ GB raw)
channels: screen, mouse, keyboard, window
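A summary of this kind follows directly from timestamped messages. The sketch below is not how `owl mcap info` is implemented; `merge_streams`, `summarize`, and the `(timestamp_ns, topic, payload)` tuple layout are assumptions for illustration. It merges per-channel streams by nanosecond timestamp and derives the message count, duration, and per-channel totals.

```python
import heapq
from collections import Counter


def merge_streams(*streams):
    """Merge per-channel message lists into one timeline.

    Each stream is a list of (timestamp_ns, topic, payload) tuples that is
    already sorted by timestamp.
    """
    return list(heapq.merge(*streams, key=lambda message: message[0]))


def summarize(messages):
    """Compute message count, duration in seconds, and per-channel counts."""
    if not messages:
        return {"messages": 0, "duration_s": 0.0, "channels": {}}
    first_ns = messages[0][0]
    last_ns = messages[-1][0]
    return {
        "messages": len(messages),
        "duration_s": (last_ns - first_ns) / 1e9,
        "channels": dict(Counter(topic for _, topic, _ in messages)),
    }
```

Because every channel shares one nanosecond clock, interleaving screen frames with input events is a plain sorted merge.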
See OWAMcap Format Guide for technical details.