
The paper, ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL, illuminates the potential of Large Language Models (LLMs) in agent tasks that entail multi-turn decision-making (e.g., web interactions, tool usage, customer support). Existing Reinforcement Learning (RL) methods for LLMs generally optimize single-turn rewards, which are insufficient for complex, multi-stage tasks. The ArCHer framework employs two RL algorithms in parallel: a high-level, value-based RL algorithm that aggregates reward over multiple utterances, and a low-level RL algorithm that trains the token-level policy within each turn. In essence, this hierarchical design combines utterance-level credit assignment across turns with token-level policy optimization within each turn.
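To make the two levels concrete, here is a minimal sketch of the hierarchical idea, not the authors' implementation: an utterance-level critic trained with a simple one-step TD target, and a token-level policy-gradient loss that reuses the critic's utterance-level advantage. The names `critic`, `policy`, `policy.log_probs`, and the `rollout` format are hypothetical, and the actual ArCHer algorithm uses off-policy actor-critic machinery not shown here.

```python
import torch

GAMMA = 0.99  # discount applied across turns (utterances), not tokens

# Each element of `rollout` is one turn:
#   (obs, action_tokens, reward, next_obs, done)
# where `obs` is the dialogue state before the agent's utterance and
# `action_tokens` are the tokens of the utterance it generated.

def utterance_level_critic_loss(critic, rollout):
    """High level: TD(0) regression on utterance-level values.

    The critic only sees turn boundaries, so credit assignment spans
    whole utterances rather than individual tokens.
    """
    loss = 0.0
    for obs, action_tokens, reward, next_obs, done in rollout:
        with torch.no_grad():
            target = reward + GAMMA * (1.0 - done) * critic(next_obs)
        loss = loss + (critic(obs) - target) ** 2
    return loss / len(rollout)

def token_level_policy_loss(policy, critic, rollout):
    """Low level: policy gradient on the tokens within each utterance.

    The utterance-level critic supplies the advantage, so every token in a
    turn shares one advantage signal; the per-token log-probabilities come
    from the policy LM itself (hypothetical `log_probs` helper).
    """
    loss = 0.0
    for obs, action_tokens, reward, next_obs, done in rollout:
        with torch.no_grad():
            advantage = reward + GAMMA * (1.0 - done) * critic(next_obs) - critic(obs)
        token_logprobs = policy.log_probs(obs, action_tokens)  # shape: [num_tokens]
        loss = loss - advantage * token_logprobs.sum()
    return loss / len(rollout)
```

The key design point this sketch illustrates is the separation of timescales: the critic's Bellman backups step from one utterance to the next, while gradients for the language model flow only through the tokens inside a single turn.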
This paper represents an important step forward in training LLMs for real-world tasks involving multiple turns and complex decision-making. It provides a foundation for future research in which models could tackle even more complex tasks with greater autonomy.