Research.AI
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

The study introduces OSWorld, a benchmark for multimodal agents that operate in real computer environments.

  • Supports the development of agents for open-ended computer tasks.
  • Runs tasks in real computer environments across Ubuntu, Windows, and macOS.
  • Assesses agents on 369 computer tasks involving web apps, desktop apps, file I/O, and multi-app workflows.
  • Reveals a stark performance gap: humans complete 72.36% of the tasks, while the best model completes just 12.24%.
  • Identifies GUI grounding and operational knowledge as key challenges for LLM/VLM-based agents.
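
To put the reported success rates in perspective, the gap can be worked out directly. The snippet below is a minimal sketch, not code from the paper: the function name `compare_success_rates` and its return format are hypothetical, and only the two percentages come from the summary above.

```python
def compare_success_rates(human_pct: float, model_pct: float) -> dict:
    """Compare two success rates given as percentages.

    Hypothetical helper for illustration only; it is not part of OSWorld.
    Returns the absolute gap in percentage points and the human-to-model ratio.
    """
    if model_pct <= 0:
        raise ValueError("model_pct must be positive to compute a ratio")
    return {
        "gap_points": round(human_pct - model_pct, 2),  # absolute difference
        "ratio": round(human_pct / model_pct, 2),       # how many times higher
    }


# Figures reported in the summary: humans 72.36%, best model 12.24%.
result = compare_success_rates(72.36, 12.24)
print(result)
```

By this arithmetic, humans lead the best model by about 60 percentage points, or a success rate nearly six times higher.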

Opinion: OSWorld is a game-changer for advancing AI autonomy in technology use. By providing a rich diversity of tasks and rigorous assessment methods, it stands to accelerate the creation of more capable and versatile virtual assistants.
