Human-Computer Interaction
Benchmarking Multimodal Agents with OSWorld
Abstract Summary: OSWorld introduces a scalable, real computer environment for evaluating multimodal agents on complex computer tasks across different operating systems.
- Develops OSWorld for evaluating multimodal agents in realistic computing environments.
- Includes a diverse benchmark of 369 tasks using real web and desktop applications.
- Offers a unified environment for task setup, execution-based evaluation, and interactive learning.
- Reveals substantial capability gaps between leading AI models and human performance.
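As a rough illustration of what execution-based evaluation means, the sketch below scores an agent by inspecting the state it leaves behind rather than by comparing its output text. This is not OSWorld's actual API: the `Task` record, `run_task`, `success_rate`, and the toy file-rename task are hypothetical names chosen for the example, and a temporary directory stands in for a real desktop environment.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Sequence


@dataclass
class Task:
    """Hypothetical task record (not OSWorld's schema)."""
    instruction: str                 # natural-language goal given to the agent
    setup: Callable[[Path], None]    # prepares the initial state
    check: Callable[[Path], bool]    # inspects the final state for success


# An agent here is any callable that receives the instruction and a workspace.
Agent = Callable[[str, Path], None]


def run_task(task: Task, agent: Agent) -> bool:
    """Set up a fresh workspace, let the agent act, then check the final state."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        task.setup(workdir)
        agent(task.instruction, workdir)
        return task.check(workdir)


def success_rate(tasks: Sequence[Task], agent: Agent) -> float:
    """Fraction of tasks whose final state passes its checker."""
    results = [run_task(task, agent) for task in tasks]
    return sum(results) / len(results)


# Toy task: the workspace starts with notes.txt, which must be renamed.
def setup_notes(workdir: Path) -> None:
    (workdir / "notes.txt").write_text("draft")


def check_renamed(workdir: Path) -> bool:
    return (workdir / "report.txt").exists() and not (workdir / "notes.txt").exists()


rename_task = Task("Rename notes.txt to report.txt", setup_notes, check_renamed)


def scripted_agent(instruction: str, workdir: Path) -> None:
    """Stand-in for a real agent: performs the rename directly."""
    (workdir / "notes.txt").rename(workdir / "report.txt")


def idle_agent(instruction: str, workdir: Path) -> None:
    """Stand-in for an agent that fails to act."""


if __name__ == "__main__":
    print(success_rate([rename_task], scripted_agent))
    print(success_rate([rename_task], idle_agent))
```

Because the checker looks only at the final state, any sequence of actions that achieves the goal counts as a success, which is the property that makes this style of evaluation suit open-ended computer tasks.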
OSWorld is an important step toward multimodal generalist agents that can work across a diverse range of real-world computer applications.