The YOCO model fundamentally changes how large language models use GPU resources by caching key-value pairs only once, substantially reducing memory demands while preserving global attention capability. Comprehensive experiments confirm that this design improves GPU memory efficiency, increases throughput, and reduces prefill latency. YOCO also extends to a 1M-token context length, making it a significant advance in how such models are trained and served.
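The core mechanism, a single shared key/value cache consumed by every subsequent attention layer, can be illustrated with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names and dimensions are invented, and a plain causal Transformer stack stands in for YOCO's self-decoder (which, in the paper, uses efficient attention variants). The point it shows is that the KV projections run once, so the cache is one layer deep no matter how many cross-decoder layers follow.

```python
# Minimal sketch of the "cache KV once" idea (illustrative, not YOCO's actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossDecoderLayer(nn.Module):
    """Holds only a query projection; attends to the shared KV cache."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, shared_k, shared_v, allowed):
        b, t, d = x.shape
        h = self.n_heads
        q = self.q_proj(x).view(b, t, h, d // h).transpose(1, 2)
        # Attention against the *shared* keys/values produced once upstream.
        attn = F.scaled_dot_product_attention(q, shared_k, shared_v, attn_mask=allowed)
        x = x + self.out_proj(attn.transpose(1, 2).reshape(b, t, d))
        return x + self.ffn(x)


class YOCOSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_self=2, n_cross=4):
        super().__init__()
        # Self-decoder stand-in: a plain causal Transformer stack for brevity.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.self_decoder = nn.TransformerEncoder(layer, n_self)
        # KV projections are applied exactly once; this is the only KV cache.
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.cross_decoder = nn.ModuleList(
            [CrossDecoderLayer(d_model, n_heads) for _ in range(n_cross)]
        )
        self.n_heads = n_heads

    def forward(self, x):
        b, t, d = x.shape
        # True on/below the diagonal = position may be attended to (causal).
        allowed = torch.tril(torch.ones(t, t, dtype=torch.bool, device=x.device))
        h = self.self_decoder(x, mask=~allowed)  # encoder mask: True = blocked
        # Build the single shared KV cache from the self-decoder output.
        k = self.k_proj(h).view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2)
        v = self.v_proj(h).view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2)
        for layer in self.cross_decoder:
            h = layer(h, k, v, allowed)  # every layer reuses the same K/V
        return h


x = torch.randn(2, 16, 256)
print(YOCOSketch()(x).shape)  # torch.Size([2, 16, 256])
```

Because `k` and `v` are computed a single time, the cache footprint stays at one layer's worth of keys and values regardless of the number of cross-decoder layers, which is the source of the memory savings described above.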
This design is pivotal for the development of more efficient and scalable AI systems, especially in resource-constrained environments.