OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

The study introduces OSWorld, a comprehensive benchmark for multimodal agents operating in real computer environments.

Facilitates the development of agents for open-ended computer tasks.
Runs tasks in real computer environments across Ubuntu, Windows, and macOS.
Assesses agents on 369 computer tasks involving web and desktop apps, file I/O, and multi-app workflows.
Reports a stark performance gap: humans succeed 72.36% of the time, whereas the best model achieves just 12.24%.
Reveals key challenges for LLM/VLM-based agents: GUI grounding and operational knowledge.
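
To put the reported success rates in perspective, the gap between humans and the best model can be worked out directly. The sketch below is only illustrative: the two percentages come from the summary above, while the function name `success_gap` and the constant names are hypothetical and not part of OSWorld.

```python
# Success rates reported in the summary above (percent of tasks solved).
HUMAN_SUCCESS_PCT = 72.36
BEST_MODEL_SUCCESS_PCT = 12.24


def success_gap(human_pct: float, model_pct: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, human-to-model ratio).

    Hypothetical helper for illustration; not an OSWorld API.
    """
    if model_pct <= 0:
        raise ValueError("model_pct must be positive to compute a ratio")
    return human_pct - model_pct, human_pct / model_pct


gap_points, ratio = success_gap(HUMAN_SUCCESS_PCT, BEST_MODEL_SUCCESS_PCT)
print(f"Gap: {gap_points:.2f} percentage points")
print(f"Humans succeed about {ratio:.1f}x as often as the best model")
```

With the reported figures, this works out to a gap of 60.12 percentage points, with humans succeeding roughly 5.9 times as often as the best model.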

Opinion: OSWorld could be a game-changer for advancing AI autonomy in everyday computer use. With its diverse tasks and rigorous assessment methods, it stands to accelerate the creation of more capable and versatile virtual assistants.
