Sarathi-Serve: Optimizing LLM Inference Throughput

Sarathi-Serve introduces scheduling techniques that raise LLM inference throughput without sacrificing latency:
- Chunked-prefills: Splits a request's prefill phase into smaller chunks that are spread across scheduler iterations, so long prompts do not stall the processing of other requests (a minimal sketch follows this list).
- Stall-free Scheduling: Admits new requests by coalescing their prefill chunks into ongoing decode batches, so decoding continues without pauses.
- Large Batch Processing: Sustains large batch sizes to maximize throughput while keeping per-token latency in check.
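The batching idea can be illustrated with a short Python sketch. It assumes a simple `Request` record and a fixed per-iteration `token_budget`; these names and the scheduling loop are illustrative assumptions, not Sarathi-Serve's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Minimal illustration only: Request and token_budget are assumed names
# for this sketch, not Sarathi-Serve's real API.

@dataclass
class Request:
    rid: int
    prompt_len: int           # total prefill tokens for this request
    prefill_done: int = 0     # prefill tokens processed so far

    @property
    def in_decode(self) -> bool:
        return self.prefill_done >= self.prompt_len


def build_batch(running: List[Request], waiting: List[Request],
                token_budget: int) -> List[Tuple[Request, int]]:
    """Assemble one iteration's batch under a fixed per-iteration token budget.

    Decodes are admitted first (one token each) so ongoing generation never
    stalls; any leftover budget is filled with prefill *chunks* from new or
    partially prefilled requests (chunked-prefills).
    """
    batch: List[Tuple[Request, int]] = []
    budget = token_budget

    # 1. Stall-free: every running decode contributes exactly one token.
    for req in running:
        if req.in_decode and budget > 0:
            batch.append((req, 1))
            budget -= 1

    # 2. Fill the remaining budget with prefill chunks.
    for req in running + waiting:
        if budget <= 0:
            break
        remaining_prefill = req.prompt_len - req.prefill_done
        if remaining_prefill > 0:
            chunk = min(remaining_prefill, budget)
            batch.append((req, chunk))
            req.prefill_done += chunk
            budget -= chunk

    return batch


if __name__ == "__main__":
    running = [Request(rid=0, prompt_len=16, prefill_done=16)]   # already decoding
    waiting = [Request(rid=1, prompt_len=4096)]                  # long new prompt
    # With a budget of 256 tokens per iteration, the long prompt is split
    # across iterations instead of blocking request 0's decoding.
    for step in range(3):
        batch = build_batch(running, waiting, token_budget=256)
        print(step, [(r.rid, n) for r, n in batch])
```

In the actual system the per-iteration token budget is chosen to respect latency targets, and requests whose prefill completes move into the decode pool; this sketch only shows how a single hybrid batch is assembled.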
Impact:
Sarathi-Serve's batching and scheduling deliver up to a 2.6x throughput improvement on a single A100 GPU and up to 6.9x on multi-GPU deployments, while meeting tight latency requirements even for large models.
- Increases throughput by up to 6.9x for larger models.
- Maintains latency within stringent service level objectives.
- Handles many concurrent requests efficiently.
Sarathi-Serve's performance in demanding LLM applications makes it well suited to large-scale serving environments.