Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Abstract

The paper introduces Sarathi-Serve, an efficient LLM inference scheduler that builds on techniques from Sarathi to improve throughput without compromising latency. Using chunked-prefills, stall-free scheduling, and larger batch sizes, it demonstrates significant throughput gains on A100 GPUs.

Analytical Insights:

  • Chunked-prefills and stall-free batching, adapted from Sarathi, let the scheduler admit new prefill work without pausing ongoing decodes (see the sketch after this list).

  • Serving throughput improvements of up to 2.6x on a single A100 GPU and up to 6.9x for larger models.

  • Meets target latency thresholds while sustaining large batch sizes by bounding the number of tokens processed per scheduler iteration.

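The core scheduling idea can be illustrated with a short Python sketch: each iteration has a fixed token budget, ongoing decodes are admitted first so they never stall, and the remaining budget is filled with prefill chunks from running or newly admitted requests. The `Request` class, `build_batch` function, and `TOKEN_BUDGET` value below are hypothetical illustrations under these assumptions, not the authors' implementation.

```python
# Minimal sketch of chunked-prefill + stall-free batching with a per-iteration
# token budget. All names and the budget value are hypothetical, not taken
# from the Sarathi-Serve codebase.
from collections import deque
from dataclasses import dataclass

TOKEN_BUDGET = 512  # max tokens processed per iteration (assumed value)


@dataclass
class Request:
    rid: int
    prompt_len: int        # total prefill tokens for this request
    prefill_done: int = 0  # prefill tokens processed so far

    @property
    def in_decode(self) -> bool:
        return self.prefill_done >= self.prompt_len


def build_batch(running: list[Request], waiting: deque[Request]) -> list[tuple[Request, int]]:
    """Assemble one iteration's batch under the token budget."""
    batch: list[tuple[Request, int]] = []
    budget = TOKEN_BUDGET

    # 1. Ongoing decodes first: one token each, never preempted (stall-free).
    for req in running:
        if req.in_decode and budget > 0:
            batch.append((req, 1))
            budget -= 1

    # 2. Continue prefill chunks of requests already admitted.
    for req in running:
        if not req.in_decode and budget > 0:
            chunk = min(budget, req.prompt_len - req.prefill_done)
            batch.append((req, chunk))
            budget -= chunk

    # 3. Admit new requests with prefill chunks sized to the leftover budget.
    while waiting and budget > 0:
        req = waiting.popleft()
        running.append(req)
        chunk = min(budget, req.prompt_len - req.prefill_done)
        batch.append((req, chunk))
        budget -= chunk

    return batch
```

Because prefills are chunked to fit whatever budget remains after decodes are scheduled, a long prompt never monopolizes an iteration, which is what keeps decode latency bounded while batches stay large.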
This approach marks a significant step toward efficient large-scale deployment of LLM applications on GPU infrastructure, improving both throughput and latency management.