Together AI Open-Sources OSCAR: A Breakthrough in 2-Bit KV Cache Quantization for Long-Context LLM Serving
Together AI introduces OSCAR, an attention-aware 2-bit KV cache quantization system that significantly reduces memory usage and increases throughput for long-context LLM serving.

Long-context inference has become a major bottleneck in serving large language models (LLMs). The KV cache grows in proportion to context length, batch size, and model depth, so at long contexts it consumes a substantial fraction of GPU memory. To address this, Together AI has open-sourced OSCAR, an attention-aware 2-bit KV cache quantization system for efficient and accurate long-context LLM serving.

The KV cache is a critical component of LLM serving, but its memory requirements have been a significant obstacle to scaling up batch sizes and context lengths. Previous quantization methods have struggled to reach 2-bit precision without sacrificing accuracy or requiring custom serving layouts. OSCAR overcomes these limitations by applying a data-aware rotation that redistributes outlier energy across all channels, so that quantization error is pushed into low-importance directions.

The key observation behind OSCAR is that the rotation applied before quantization should be derived from attention statistics rather than from the raw distribution of KV activations. For keys, OSCAR estimates the empirical query covariance and uses its eigenvectors as the rotation basis, which minimizes the error in the attention logits. For values, it uses the score-weighted value covariance to determine the rotation basis. This is what allows OSCAR to remain accurate even at 2-bit precision.

OSCAR has been integrated into SGLang's production serving stack as an INT2 KV-cache mode that is fully compatible with paged attention. The system uses a three-region KV cache layout per request and achieves significant memory savings. In evaluations on four model configurations, OSCAR outperforms competing methods across the reported benchmarks.

The performance benefits are substantial. At batch size 32 and 100K context length, OSCAR achieves a 6.17× speedup over BF16 on Qwen3-4B-Thinking and a 7.83× speedup on GLM-4.7-FP8. The speedup increases with context length, which makes OSCAR attractive for applications that require long-context inference. Because the release is open source, researchers and practitioners can adopt it in their own serving stacks and build on it.
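
To make the memory pressure concrete, the following is a minimal back-of-the-envelope sketch, not part of OSCAR or SGLang. It uses the standard estimate that the KV cache holds one key and one value vector per token, per layer, per KV head. The model shape (`num_layers`, `num_kv_heads`, `head_dim`) is a hypothetical example rather than the real configuration of any model named above, and the estimate ignores quantization metadata such as scales and zero points, as well as any tokens a real system keeps at higher precision.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bits_per_value):
    """Approximate KV cache size in bytes (quantization metadata ignored)."""
    # Factor of 2: one tensor for keys and one for values.
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return num_values * bits_per_value / 8


# Hypothetical model shape, chosen only for illustration.
cfg = dict(num_layers=36, num_kv_heads=8, head_dim=128)

# Same workload as the quoted benchmark setting: batch 32, 100K context.
bf16 = kv_cache_bytes(**cfg, seq_len=100_000, batch_size=32, bits_per_value=16)
int2 = kv_cache_bytes(**cfg, seq_len=100_000, batch_size=32, bits_per_value=2)

GIB = 1024 ** 3
print(f"BF16 KV cache: {bf16 / GIB:.1f} GiB")
print(f"INT2 KV cache: {int2 / GIB:.1f} GiB")
print(f"Reduction:     {bf16 / int2:.0f}x")
```

Under these assumptions, going from 16-bit to 2-bit storage shrinks the raw cache by a factor of 8, and doubling the context length or the batch size doubles the cache. The measured speedups quoted above are end-to-end figures and need not match this raw storage ratio.
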
Source: MarkTechPost