
OWA Data Pipeline: From Raw MCAP to VLA Training

Quick Demo: 3 Commands to VLA Training

Step 1: Process raw MCAP files

python scripts/01_raw_events_to_event_dataset.py \
  --train-dir /data/mcaps/game-session \
  --output-dir /data/event-dataset \
  --rate mouse=60 --rate screen=20 \
  --keep_topic screen --keep_topic keyboard

🔄 Raw Events to Event Dataset
📁 Loading from: /data/mcaps/game-session
📊 Found 3 train, 1 test files
✓ Created 24,907 train, 20,471 test examples
💾 Saving to /data/event-dataset
✓ Saved successfully
🎉 Completed in 3.9s (0.1min)

Step 2: Create time bins (optional)

python scripts/02_event_dataset_to_binned_dataset.py \
  --input-dir /data/event-dataset \
  --output-dir /data/binned-dataset \
  --fps 10 \
  --filter-empty-actions

🗂️ Event Dataset to Binned Dataset
📁 Loading from: /data/event-dataset
📊 Found 3 files to process
✓ Created 2,235 binned entries for train split
✓ Created 1,772 binned entries for test split
💾 Saving to /data/binned-dataset
✓ Saved 4,007 total binned entries
🎉 Completed in 4.0s (0.1min)

Step 3: Train your model

>>> from datasets import load_from_disk
>>> from owa.data import create_binned_dataset_transform
>>>
>>> # Load and transform dataset
>>> dataset = load_from_disk("/data/binned-dataset")
>>> transform = create_binned_dataset_transform(
... encoder_type="hierarchical",
... instruction="Complete the computer task"
... )
>>> dataset.set_transform(transform)
>>>
>>> # Use in training
>>> for sample in dataset["train"].take(1):
... print(f"Images: {len(sample['images'])} frames")
... print(f"Actions: {sample['encoded_events'][:3]}...")
... print(f"Instruction: {sample['instruction']}")
Images: 12 frames
Actions: ['<EVENT_START>mouse_move<EVENT_END>', '<EVENT_START>key_press:w<EVENT_END>', '<EVENT_START>mouse_click:left<EVENT_END>']...
Instruction: Complete the computer task

That's it! Your MCAP recordings are now ready for VLA training.


The OWA Data Pipeline is a streamlined 2-stage processing system that transforms raw MCAP recordings into training-ready datasets for Vision-Language-Action (VLA) models. This pipeline bridges the gap between desktop interaction capture and foundation model training.

Pipeline Architecture

graph LR
    A[Raw MCAP Files] --> B[Stage 1: Event Dataset]
    B --> C[Stage 2: Binned Dataset]
    B --> D[Dataset Transforms]
    C --> D
    D --> E[VLA Training Ready]

    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#ffebee

Key Features:

  • 🔄 Flexible: skip binning and use the Event Dataset directly, or follow the traditional Binned Dataset approach
  • 💾 Storage Optimized: since the event/binned datasets store only references to media, the entire pipeline is space-efficient.
    /data/
    ├── mcaps/           # Raw recordings (400MB)
    ├── event-dataset/   # References only (20MB)
    └── binned-dataset/  # Aggregated refs (2MB)
    
  • 🤗 Native HuggingFace: the event/binned datasets are true HuggingFace datasets.Dataset objects that support set_transform(), not wrappers.
    # Since event/binned datasets are true HuggingFace datasets,
    # they can be loaded directly into training pipelines
    from datasets import load_from_disk
    dataset = load_from_disk("/data/event-dataset")
    dataset = load_from_disk("/data/binned-dataset")
    
    # Transform to VLA training format is applied on-the-fly during training
    from owa.data import create_binned_dataset_transform
    transform = create_binned_dataset_transform(
        encoder_type="hierarchical",
        instruction="Complete the computer task",
    )
    dataset.set_transform(transform)
    
    # Use in training
    for sample in dataset["train"].take(1):
        print(f"Images: {len(sample['images'])} frames")
        print(f"Actions: {sample['encoded_events'][:3]}...")
        print(f"Instruction: {sample['instruction']}")
    
  • ⚡ Compute-Optimized, On-the-Fly Processing: media is not loaded during the preprocessing stage; during training, only the required media is loaded on demand.
    python scripts/01_raw_events_to_event_dataset.py

    🔄 Raw Events to Event Dataset
    📁 Loading from: /data/mcaps/game-session
    📊 Found 3 train, 1 test files
    ✓ Created 24,907 train, 20,471 test examples
    💾 Saving to /data/event-dataset
    ✓ Saved successfully
    🎉 Completed in 3.9s (0.1min)

Stage 1: Raw MCAP → Event Dataset

Purpose

Extract and downsample raw events from MCAP files while preserving temporal precision and event context.

Script UsageΒΆ

python scripts/01_raw_events_to_event_dataset.py \
  --train-dir /path/to/mcap/files \
  --output-dir /path/to/event/dataset \
  --rate mouse=60 --rate screen=20 \
  --keep_topic screen --keep_topic keyboard

Key ParametersΒΆ

| Parameter | Description | Example |
| --- | --- | --- |
| --train-dir | Directory containing MCAP files | /data/recordings/ |
| --output-dir | Output directory for Event Dataset | /data/event-dataset/ |
| --rate | Rate limiting per topic (Hz) | mouse=60 screen=20 |
| --keep_topic | Topics to include in dataset | screen keyboard mouse |

Output SchemaΒΆ

The Event Dataset uses a flat structure optimized for temporal queries:

{
    "file_path": Value("string"),      # Source MCAP file path
    "topic": Value("string"),          # Event topic (keyboard, mouse, screen)
    "timestamp_ns": Value("int64"),    # Timestamp in nanoseconds
    "message_type": Value("string"),   # Full message type identifier
    "mcap_message": Value("binary"),   # Serialized McapMessage bytes
}
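
Because the schema is flat, event-level queries reduce to standard HuggingFace datasets operations. A minimal sketch (the path and topic name are placeholders):

from datasets import load_from_disk

# Load the Event Dataset produced by stage 1
dataset = load_from_disk("/data/event-dataset")

# Order events by time, then keep only keyboard events
events = dataset["train"].sort("timestamp_ns")
keyboard_events = events.filter(lambda ex: ex["topic"] == "keyboard")
print(f"{len(keyboard_events)} keyboard events")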

When to Use Event Dataset

  • High-frequency training: When you need precise temporal resolution
  • Custom binning: When you want to implement your own temporal aggregation (see the sketch after this list)
  • Event-level analysis: When studying individual interaction patterns
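
For the custom-binning case, here is a minimal sketch of temporal aggregation over the flat schema; the 100 ms bin width and topic-only grouping are illustrative choices, not the library's own binning logic:

from collections import defaultdict
from datasets import load_from_disk

dataset = load_from_disk("/data/event-dataset")

BIN_NS = 100_000_000  # 100 ms bins, i.e. 10 fps

# Group events by (source file, bin index); the bin index is derived
# directly from the nanosecond timestamp
bins = defaultdict(list)
for example in dataset["train"]:
    key = (example["file_path"], example["timestamp_ns"] // BIN_NS)
    bins[key].append(example["topic"])

print(f"{len(bins)} non-empty bins")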

Stage 2: Event Dataset → Binned Dataset

Purpose

Aggregate events into fixed-rate time bins for uniform temporal sampling, separating state (screen) from actions (keyboard/mouse); at --fps 10, each bin spans 100 ms. This format is equivalent to that of most existing VLA datasets, such as LeRobotDataset.

Script UsageΒΆ

python scripts/02_event_dataset_to_binned_dataset.py \
  --input-dir /path/to/event/dataset \
  --output-dir /path/to/binned/dataset \
  --fps 10 \
  --filter-empty-actions

Key ParametersΒΆ

| Parameter | Description | Default |
| --- | --- | --- |
| --fps | Binning frequency (frames per second) | 10 |
| --filter-empty-actions | Remove bins with no actions | False |
| --input-dir | Event Dataset directory | Required |
| --output-dir | Output directory for Binned Dataset | Required |

Output SchemaΒΆ

The Binned Dataset organizes events into temporal bins with state-action separation:

{
    "file_path": Value("string"),      # Source MCAP file path
    "bin_idx": Value("int32"),         # Time bin index
    "timestamp_ns": Value("int64"),    # Bin start timestamp
    "state": Sequence(feature=Value("binary"), length=-1),    # Screen events
    "actions": Sequence(feature=Value("binary"), length=-1),  # Action events
}
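
A quick way to sanity-check a binned entry, using plain dataset indexing (the path is a placeholder):

from datasets import load_from_disk

dataset = load_from_disk("/data/binned-dataset")

# Each row is one time bin: screen events in `state`, input events in `actions`
row = dataset["train"][0]
print(f"bin {row['bin_idx']} starts at {row['timestamp_ns']} ns")
print(f"{len(row['state'])} state events, {len(row['actions'])} action events")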

When to Use Binned Dataset

  • Traditional VLA training: When following established vision-language-action patterns
  • Fixed-rate processing: When you need consistent temporal sampling
  • State-action separation: When your model expects distinct state and action inputs
  • Efficient filtering: When you want to remove inactive periods

Dataset Transforms: The Magic Layer

Dataset transforms provide the crucial bridge between stored data and training-ready format. They are applied on demand during data loading, not during preprocessing.
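
Under the hood this relies on datasets.Dataset.set_transform(), which takes a batched callable (column names mapped to lists of values) and runs it only when rows are accessed. A minimal sketch with a toy transform to make that contract explicit; the owa.data transforms follow the same shape:

from datasets import load_from_disk

dataset = load_from_disk("/data/binned-dataset")

def toy_transform(batch):
    # `batch` maps column names to lists of values for the requested rows;
    # nothing is computed until a row is actually accessed
    return {"num_actions": [len(actions) for actions in batch["actions"]]}

dataset.set_transform(toy_transform)
print(dataset["train"][0]["num_actions"])  # computed on access, not at save time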

Unified Transform Interface

Both Event Dataset and Binned Dataset support the same transform interface:

Event Dataset:

from datasets import load_from_disk
from owa.data import create_event_dataset_transform

# Load dataset
dataset = load_from_disk("/path/to/event-dataset")

# Create transform
transform = create_event_dataset_transform(
    encoder_type="hierarchical",
    load_images=True,
    encode_actions=True,
)

# Apply transform
dataset.set_transform(transform)

# Use in training
for sample in dataset["train"]:
    images = sample["images"]          # List[PIL.Image]
    events = sample["encoded_events"]  # List[str]

Binned Dataset:

from datasets import load_from_disk
from owa.data import create_binned_dataset_transform

# Load dataset
dataset = load_from_disk("/path/to/binned-dataset")

# Create transform
transform = create_binned_dataset_transform(
    encoder_type="hierarchical",
    instruction="Complete the computer task",
    load_images=True,
    encode_actions=True,
)

# Apply transform
dataset.set_transform(transform)

# Use in training
for sample in dataset["train"]:
    images = sample["images"]          # List[PIL.Image]
    actions = sample["encoded_events"] # List[str]
    instruction = sample["instruction"] # str
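
Because the transformed object is still a regular HuggingFace dataset, it plugs straight into a standard PyTorch DataLoader. A sketch, assuming PyTorch is available; collate_keep_lists is an illustrative helper that keeps variable-length fields as Python lists:

from torch.utils.data import DataLoader

def collate_keep_lists(samples):
    # `samples` is a list of per-sample dicts produced by the transform;
    # keep variable-length images/events as lists rather than stacking tensors
    return {key: [sample[key] for sample in samples] for key in samples[0]}

loader = DataLoader(dataset["train"], batch_size=8, collate_fn=collate_keep_lists)

batch = next(iter(loader))
print(len(batch["images"]), "samples in the first batch")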

Transform ParametersΒΆ

| Parameter | Description | Options | Default |
| --- | --- | --- | --- |
| encoder_type | Event encoding strategy | hierarchical, json, flat | hierarchical |
| load_images | Load screen images | True, False | True |
| encode_actions | Encode action events | True, False | True |
| instruction | Task instruction (Binned only) | Any string | "Complete the task" |
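
For text-only debugging passes, image decoding can be skipped entirely. A sketch combining the parameters above (the chosen values are illustrative):

from owa.data import create_binned_dataset_transform

# Action-only configuration: JSON event encoding, no image decoding
transform = create_binned_dataset_transform(
    encoder_type="json",
    instruction="Complete the computer task",
    load_images=False,
    encode_actions=True,
)
dataset.set_transform(transform)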

ReferencesΒΆ