OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
The study introduces OSWorld, a benchmark and scalable real computer environment
for evaluating multimodal agents.
- OSWorld facilitates the development of agents for open-ended computer tasks.
- Supports task setup and execution-based evaluation on real operating systems: Ubuntu, Windows, and macOS.
- Assesses agents on 369 computer tasks involving web and desktop apps, file I/O, and multi-app workflows.
- Presents a stark performance contrast: whereas humans succeed 72.36% of the time, the best model achieves just 12.24%.
- Reveals that LLM/VLM-based agents struggle mainly with GUI grounding and operational knowledge.
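To put the human and best-model success rates in perspective, the short sketch below computes the gap between them. It uses only the two percentages quoted in the list; the function name `performance_gap` is a hypothetical helper introduced for illustration and is not part of OSWorld.

```python
from typing import Tuple

# Success rates quoted in the summary (percent of tasks completed).
HUMAN_SUCCESS = 72.36
BEST_MODEL_SUCCESS = 12.24


def performance_gap(human: float, model: float) -> Tuple[float, float]:
    """Return (gap in percentage points, human-to-model success ratio).

    Hypothetical helper for illustration; not an OSWorld API.
    """
    return human - model, human / model


gap_points, ratio = performance_gap(HUMAN_SUCCESS, BEST_MODEL_SUCCESS)
print(f"Gap: {gap_points:.2f} percentage points")
print(f"Humans succeed about {ratio:.1f}x as often as the best model")
```

On these figures the gap is roughly 60 percentage points, with humans succeeding close to six times as often as the best model.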
Opinion: OSWorld is a game-changer for advancing AI autonomy in technology use. By providing a rich diversity of tasks and rigorous assessment methods, it stands to accelerate the creation of more capable and versatile virtual assistants.