AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Language Models
Web Agents
Artificial Intelligence
Benchmarking
Task Performance
arXiv:2407.15711
Abstract
Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform well, they exhibit low precision since they tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that web navigation remains a major challenge.