Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Abstract

The paper introduces Sarathi-Serve, an efficient LLM inference scheduler that builds on techniques from Sarathi to improve throughput without compromising latency. Using chunked-prefills, stall-free scheduling, and larger batch sizes, it demonstrates significant throughput gains on A100 GPUs.

Analytical Insights:

  • Chunked-prefills and stall-free batching, adapted from Sarathi, let the scheduler admit new prefill work without pausing ongoing decodes (see the sketch after this list).

  • Serving throughput improvements of up to 2.6x on a single A100 GPU and up to 6.9x for larger models.

  • Meets target latency thresholds while sustaining large batch sizes by bounding the number of tokens processed per scheduler iteration.

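The core scheduling idea can be illustrated with a short Python sketch: each iteration has a fixed token budget, ongoing decodes are admitted first so they never stall, and the remaining budget is filled with prefill chunks from running or newly admitted requests. The `Request` class, `build_batch` function, and `TOKEN_BUDGET` value below are hypothetical illustrations under these assumptions, not the authors' implementation.

```python
# Minimal sketch of chunked-prefill + stall-free batching with a per-iteration
# token budget. All names and the budget value are hypothetical, not taken
# from the Sarathi-Serve codebase.
from collections import deque
from dataclasses import dataclass

TOKEN_BUDGET = 512  # max tokens processed per iteration (assumed value)


@dataclass
class Request:
    rid: int
    prompt_len: int        # total prefill tokens for this request
    prefill_done: int = 0  # prefill tokens processed so far

    @property
    def in_decode(self) -> bool:
        return self.prefill_done >= self.prompt_len


def build_batch(running: list[Request], waiting: deque[Request]) -> list[tuple[Request, int]]:
    """Assemble one iteration's batch under the token budget."""
    batch: list[tuple[Request, int]] = []
    budget = TOKEN_BUDGET

    # 1. Ongoing decodes first: one token each, never preempted (stall-free).
    for req in running:
        if req.in_decode and budget > 0:
            batch.append((req, 1))
            budget -= 1

    # 2. Continue prefill chunks of requests already admitted.
    for req in running:
        if not req.in_decode and budget > 0:
            chunk = min(budget, req.prompt_len - req.prefill_done)
            batch.append((req, chunk))
            budget -= chunk

    # 3. Admit new requests with prefill chunks sized to the leftover budget.
    while waiting and budget > 0:
        req = waiting.popleft()
        running.append(req)
        chunk = min(budget, req.prompt_len - req.prefill_done)
        batch.append((req, chunk))
        budget -= chunk

    return batch
```

Because prefills are chunked to fit whatever budget remains after decodes are scheduled, a long prompt never monopolizes an iteration, which is what keeps decode latency bounded while batches stay large.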
This approach marks a significant step toward efficient large-scale deployment of LLM applications on GPU infrastructure, improving both throughput and latency management.