Step-Audio 2 Technical Report

multi-modal
language model
audio understanding
speech recognition
reinforcement learning
paralinguistic information
retrieval-augmented generation
arXiv:2507.16632
Abstract
This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.
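
As a rough illustration of the tool-calling and retrieval idea the abstract describes, the sketch below routes a tool request to either a web-search or an audio-search handler and prepends the result to the user turn. Everything in it is hypothetical: the `ToolCall` structure, the tool names, and the stub handlers are assumptions made for illustration and are not taken from the Step-Audio 2 codebase.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToolCall:
    """Hypothetical structure for a tool request emitted by the model."""
    name: str
    query: str


def web_search(query: str) -> str:
    # Stub: a real system would query a search engine and return text snippets.
    return f"[web results for: {query}]"


def audio_search(query: str) -> str:
    # Stub: a real system would retrieve a reference voice clip for the description.
    return f"[audio clip matching: {query}]"


# Registry of available tools (names are assumptions, not the model's real tool set).
TOOLS: Dict[str, Callable[[str], str]] = {
    "web_search": web_search,
    "audio_search": audio_search,
}


def dispatch(call: ToolCall) -> str:
    """Run the requested tool, or report that it is unknown."""
    handler = TOOLS.get(call.name)
    if handler is None:
        return f"[unknown tool: {call.name}]"
    return handler(call.query)


def build_context(user_turn: str, call: ToolCall) -> str:
    """Prepend retrieved content to the user turn before generation."""
    return f"{dispatch(call)}\n{user_turn}"
```

In this sketch, grounding the reply on `web_search` output stands in for hallucination mitigation, while `audio_search` stands in for fetching a reference voice when a timbre switch is requested.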