In the publication "Tuning Large Language Model for End-to-End Speech Translation," Zhang and colleagues introduce LST, a model designed to improve end-to-end speech translation (E2E-ST). The model couples a speech frontend with an LLM backend through an adapter, and is trained with a two-stage process to align the speech and text modalities for the translation task.
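The described pipeline (speech frontend → adapter → LLM backend) can be sketched as a simple data-flow, using placeholder dimensions; the sizes, the single-linear-layer adapter, and the linear output head below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the paper's actual sizes are not given here.
T, d_speech, d_llm, vocab = 50, 256, 1024, 32000

# Speech frontend: acoustic frames -> speech features
# (stand-in for a pretrained speech encoder).
speech_features = rng.normal(size=(T, d_speech))

# Adapter: a learned projection bridging the speech feature space and
# the LLM embedding space (a single linear map here for illustration).
W_adapter = rng.normal(size=(d_speech, d_llm)) * 0.02
llm_inputs = speech_features @ W_adapter        # (T, d_llm)

# LLM backend: consumes the adapted embeddings and predicts
# target-language tokens (stand-in: one linear head over the vocabulary).
W_head = rng.normal(size=(d_llm, vocab)) * 0.02
logits = llm_inputs @ W_head                    # (T, vocab)

print(llm_inputs.shape, logits.shape)
```

In this framing, the adapter is the only component that must learn a mapping between modalities; the two-stage training the authors describe would decide which of these components are frozen or updated at each stage.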
This paper represents a notable step in fine-tuning LLMs for multimodal translation, demonstrating that a pretrained LLM can be adapted to consume speech input directly and thereby narrow the gap between spoken-language input and machine translation.