Data Conversion Guide : VPT to OWAMcap¶
This document explains how to convert Video PreTraining (VPT) data format to the Open World Agents MCAP format (OWAMcap).
The full example script can be found here. The converted dataset is uploaded in the OWA huggingface repo.
Overview¶
The VPT dataset consists of paired MP4 video files and JSONL files containing keyboard and mouse actions. The conversion process transforms these into OWAMcap format, which is used for storing multimodal interaction data in Open World Agents.
The conversion script handles: - Filter validation for 5-minute VPT recordings - Mapping VPT keyboard actions to OWA virtual key codes - Converting mouse movements to OWA mouse events - Synchronizing video frames with input events - Creating proper timestamps for all events
Requirements¶
- VPT dataset with paired MP4 and JSONL files
- OWA environment with
mcap_owa
package installed
Conversion Process¶
The conversion involves these key steps:
- Validation: Only JSONL files with exactly 6000 lines (5 minutes of 50ms ticks) are processed
- Window Setup: A virtual window is created with 1280x720 resolution
- Input Handling:
- Mouse is pinned to center of screen, with relative movements recorded
- Only navigation-related keyboard inputs are mapped (WASD, Space, Shift, Ctrl)
- Timing: Events are spaced at 50ms intervals, with precise timing for mouse movements
Key Mapping¶
The script maps VPT keyboard inputs to OWA virtual key codes:
VPT Key | OWA Virtual Key |
---|---|
key.keyboard.w | VK.KEY_W |
key.keyboard.a | VK.KEY_A |
key.keyboard.s | VK.KEY_S |
key.keyboard.d | VK.KEY_D |
key.keyboard.space | VK.SPACE |
key.keyboard.left.shift | VK.LSHIFT |
key.keyboard.left.control | VK.LCONTROL |
Usage¶
- Set the
VPT_FOLDER_PATH
variable to the location of your VPT dataset - Run the script to generate a list of valid VPT files for conversion
- The script will convert each valid file to OWAMcap format
# Example configuration
VPT_FOLDER_PATH = Path("~/data/Video-Pre-Training/data/").expanduser()
VPT_TARGET_LIST_FILE = "./vpt_target_files.txt"
Example Command¶
Output¶
For each valid VPT file pair (MP4 + JSONL), the script generates a corresponding .mcap
file containing:
- Window information
- Keyboard events (press/release)
- Mouse events (movement)
- Screen events linking to the original MP4 file
Limitations¶
- Only navigation-related keys are mapped (not inventory, hotbar, etc.)
- Mouse is assumed to be pinned to center of screen
- Original VPT timestamps are not used; instead, events are spaced at fixed 50ms intervals
Technical Details¶
Event Timing¶
- Each tick is 50ms (50,000,000 nanoseconds)
- Mouse pin movements are assumed to take 1ms
- Timestamps start from Unix epoch (0) and increment by tick duration
Resolution¶
The VPT dataset uses 1280x720 resolution, which is maintained in the conversion.
Implementation Details¶
The conversion script (00_vpt_to_owamcap.py
) performs the following steps:
- Generates a list of valid target VPT files that have both MP4 and JSONL components
- For each file pair, creates an OWAMcap file with proper event timing
- Converts keyboard and mouse events from VPT format to OWA format
- Links the video frames to the original MP4 file
Key Components¶
# Key constants
VPT_INTERVAL_TICK_NS = 50_000_000 # 50 ms interval per tick
VPT_EXPECTED_TICKS = 6000 # 5 minutes of 50ms ticks
VPT_MOUSE_PIN_NS = 1_000_000 # 1 ms for mouse pin movement
VPT_X_RESOLUTION = 1280
VPT_Y_RESOLUTION = 720
The script maintains the keyboard state between ticks to properly generate press and release events, simulating continuous interaction.