The YOCO model fundamentally changes how large language models use GPU resources by caching key-value pairs only once, substantially reducing memory demands while preserving global attention capability. Comprehensive experiments confirm that this design improves GPU memory efficiency, increases throughput, and reduces prefill latency. YOCO also extends to a 1M-token context length, making it a significant advance in how such models are trained and served.
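The core mechanism, a single shared key/value cache consumed by every subsequent attention layer, can be illustrated with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names and dimensions are invented, and a plain causal Transformer stack stands in for YOCO's self-decoder (which, in the paper, uses efficient attention variants). The point it shows is that the KV projections run once, so the cache is one layer deep no matter how many cross-decoder layers follow.

```python
# Minimal sketch of the "cache KV once" idea (illustrative, not YOCO's actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossDecoderLayer(nn.Module):
    """Holds only a query projection; attends to the shared KV cache."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, shared_k, shared_v, allowed):
        b, t, d = x.shape
        h = self.n_heads
        q = self.q_proj(x).view(b, t, h, d // h).transpose(1, 2)
        # Attention against the *shared* keys/values produced once upstream.
        attn = F.scaled_dot_product_attention(q, shared_k, shared_v, attn_mask=allowed)
        x = x + self.out_proj(attn.transpose(1, 2).reshape(b, t, d))
        return x + self.ffn(x)


class YOCOSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_self=2, n_cross=4):
        super().__init__()
        # Self-decoder stand-in: a plain causal Transformer stack for brevity.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.self_decoder = nn.TransformerEncoder(layer, n_self)
        # KV projections are applied exactly once; this is the only KV cache.
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.cross_decoder = nn.ModuleList(
            [CrossDecoderLayer(d_model, n_heads) for _ in range(n_cross)]
        )
        self.n_heads = n_heads

    def forward(self, x):
        b, t, d = x.shape
        # True on/below the diagonal = position may be attended to (causal).
        allowed = torch.tril(torch.ones(t, t, dtype=torch.bool, device=x.device))
        h = self.self_decoder(x, mask=~allowed)  # encoder mask: True = blocked
        # Build the single shared KV cache from the self-decoder output.
        k = self.k_proj(h).view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2)
        v = self.v_proj(h).view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2)
        for layer in self.cross_decoder:
            h = layer(h, k, v, allowed)  # every layer reuses the same K/V
        return h


x = torch.randn(2, 16, 256)
print(YOCOSketch()(x).shape)  # torch.Size([2, 16, 256])
```

Because `k` and `v` are computed a single time, the cache footprint stays at one layer's worth of keys and values regardless of the number of cross-decoder layers, which is the source of the memory savings described above.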
This design is pivotal for the development of more efficient and scalable AI systems, especially in resource-constrained environments.