
Open-sourcing a Dataset for Multimodal Desktop Agents

As of now (March 22, 2025), there are few datasets available for building multimodal desktop agents.

Even scarcer are datasets that (1) contain high-frequency screen data, (2) keep keyboard/mouse input timestamp-aligned with other modalities such as screen recordings, and (3) include human demonstrations.
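To make "timestamp-aligned" concrete, here is a minimal sketch of what aligning input events to screen frames can look like. It is not the OWAMcap API: the InputEvent class, the align_events_to_frames helper, and the 60 Hz frame times are all hypothetical names and values chosen for illustration. The only assumption is that every frame and every event carries a timestamp from the same clock.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class InputEvent:
    """Hypothetical keyboard/mouse event; not an OWAMcap type."""
    timestamp_ns: int
    kind: str  # illustrative labels such as "key_press" or "mouse_move"


def align_events_to_frames(frame_times_ns, events):
    """Group events under the latest frame captured at or before each event.

    frame_times_ns must be sorted in ascending order.
    Events that occur before the first frame are dropped.
    """
    aligned = {i: [] for i in range(len(frame_times_ns))}
    for event in events:
        # bisect_right returns the count of frames with time <= event time,
        # so subtracting 1 gives the index of the most recent frame.
        idx = bisect_right(frame_times_ns, event.timestamp_ns) - 1
        if idx >= 0:
            aligned[idx].append(event)
    return aligned


# Assumed 60 Hz capture: one frame roughly every 16.67 ms.
frame_times_ns = [i * 16_666_667 for i in range(4)]
events = [
    InputEvent(5_000_000, "key_press"),
    InputEvent(20_000_000, "mouse_move"),
    InputEvent(34_000_000, "key_release"),
]
aligned = align_events_to_frames(frame_times_ns, events)
```

With a shared clock, each input event can be attached to the frame the user was actually looking at, which is what makes human demonstrations usable as training data. Without alignment, actions and observations drift apart as recordings get longer.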

To address this gap, open-world-agents provides the following three solutions:

  1. File Format - OWAMcap: A high-performance, self-contained, flexible container file format for multimodal desktop log data, built on the open-source mcap container format. Learn more...

  2. Desktop Recorder - owl mcap record your-filename.mcap: A powerful, efficient, and easy-to-use desktop recorder that captures keyboard/mouse and high-frequency screen data.

  3. 🤗 Hugging Face Integration: Upload a dataset recorded with a simple owl mcap record to Hugging Face and share it with everyone! The era of open-source desktop data is near, and contributing to it is effortless. Preview the dataset at Hugging Face Spaces.