KV Cache Compression: TurboQuant, OSCAR, EpiCache Compete

AI News Desk

MarkTechPost

Jun 18, 2026

KV cache compression methods, including TurboQuant, OSCAR, and EpiCache, aim to reduce memory usage in long-context large language models.

Long-context large language models face a memory bottleneck because they cache key and value vectors for every token during decoding. This cache grows linearly with sequence length and batch size and, at long contexts or large batches, can rival or exceed the memory taken by the model weights. For example, Llama-3.1-70B's KV cache costs about 0.31 MB per token in 16-bit precision; at 128K tokens that is ~40 GB per sequence, and at 1M tokens it exceeds 300 GB.
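The per-token figure can be reproduced with a short calculation. The sketch below assumes the publicly documented Llama-3.1-70B shape (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and 16-bit storage; these values and the function names are illustrative assumptions, not taken from the article.

```python
# Back-of-the-envelope KV cache sizing.
# Assumed architecture values (public Llama-3.1-70B config, not from the article).
NUM_LAYERS = 80      # transformer blocks
NUM_KV_HEADS = 8     # key/value heads (grouped-query attention)
HEAD_DIM = 128       # dimension per head
BYTES_PER_VALUE = 2  # fp16 / bf16 storage


def kv_bytes_per_token(layers=NUM_LAYERS, kv_heads=NUM_KV_HEADS,
                       head_dim=HEAD_DIM, bytes_per_value=BYTES_PER_VALUE):
    """Bytes of KV cache added per token of one sequence (keys plus values)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(num_tokens, batch_size=1):
    """Total KV cache size in GiB for a batch of equal-length sequences."""
    return kv_bytes_per_token() * num_tokens * batch_size / 2**30


print(f"{kv_bytes_per_token() / 2**20:.4f} MiB per token")  # 0.3125
print(f"{kv_cache_gib(128 * 1024):.0f} GiB at 128K tokens")  # 40
print(f"{kv_cache_gib(1024 * 1024):.0f} GiB at 1M tokens")   # 320
```

Because the total scales with `batch_size` as well as `num_tokens`, serving several long sequences at once multiplies these numbers accordingly.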

Current approaches to reducing KV cache size include token eviction, quantization, low-rank projection, merging, and architectural sharing. Recent work has focused on ultra-low-bit quantization. TurboQuant (Google and NYU) and OSCAR (Together AI) attack the problem from opposite directions, the first without any calibration data and the second with an offline calibration pass, while Apple's EpiCache targets a different problem: managing the cache across long multi-turn conversations.

Most KV quantizers struggle with outlier channels: a few channels whose disproportionately large magnitudes dominate the quantization range and leave little resolution for everything else. KIVI established a baseline by quantizing keys per-channel and values per-token, cutting end-to-end peak memory by about 2.6×. TurboQuant handles outliers without data calibration, achieving near-lossless recall at 4× compression and quality neutrality at 3.5 bits per channel.
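To see why the grouping axis matters, here is a minimal sketch on synthetic data. It is not KIVI's implementation: the "keys" are random noise with one channel shifted by a large constant (an assumed stand-in for an outlier channel), and the uniform quantizer and all names are hypothetical.

```python
import random


def quantize_group(values, bits):
    """Asymmetric uniform quantize-then-dequantize of one group of floats."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


def quantize_per_token(matrix, bits):
    """One quantization range per row (token)."""
    return [quantize_group(row, bits) for row in matrix]


def quantize_per_channel(matrix, bits):
    """One quantization range per column (channel)."""
    columns = [quantize_group(list(col), bits) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


def mean_squared_error(a, b):
    total = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


random.seed(0)
TOKENS, CHANNELS, OUTLIER_CHANNEL = 64, 16, 3
# Synthetic keys: unit-scale noise, plus one channel that is consistently large.
keys = [[random.gauss(0, 1) for _ in range(CHANNELS)] for _ in range(TOKENS)]
for row in keys:
    row[OUTLIER_CHANNEL] += 50.0

err_token = mean_squared_error(keys, quantize_per_token(keys, bits=2))
err_channel = mean_squared_error(keys, quantize_per_channel(keys, bits=2))
print(f"per-token MSE:   {err_token:.3f}")
print(f"per-channel MSE: {err_channel:.3f}")
```

With per-token grouping, every row's range is stretched by the outlier, so the ordinary channels collapse onto one or two levels. Per-channel grouping gives the outlier channel its own range and leaves the others unaffected, which is why the per-channel error comes out lower on this toy data.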

OSCAR derives an attention-aware rotation from an offline calibration pass and reports essentially full-precision results at 2.28 bits. It ships as a complete system, with up to 7.83× job-level throughput and roughly 8× KV-cache memory reduction at 100K context. TurboQuant offers broader generality: it needs no calibration, so it can be applied to any model as-is.
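The general idea of rotating a vector before quantizing it can be illustrated with a small, data-oblivious example. This is a generic sketch, not the procedure from OSCAR or TurboQuant: the randomized Hadamard rotation, the uniform quantizer, and every name below are assumptions chosen for illustration.

```python
import random


def hadamard(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of 2.

    The normalized transform is its own inverse.
    """
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = n ** 0.5
    return [x / norm for x in out]


def quantize(vec, bits):
    """Uniform quantize-then-dequantize over the vector's own min/max range."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in vec]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
DIM, BITS = 64, 3
signs = [random.choice((-1.0, 1.0)) for _ in range(DIM)]  # random sign flips

token = [random.gauss(0, 1) for _ in range(DIM)]
token[3] = 50.0  # a single outlier channel

# Baseline: quantize the vector directly.
direct = quantize(token, BITS)

# Rotate (sign flip, then Hadamard), quantize, then undo the rotation.
rotated = hadamard([s * x for s, x in zip(signs, token)])
restored = [s * x for s, x in zip(signs, hadamard(quantize(rotated, BITS)))]

mse_direct = mse(token, direct)
mse_rotated = mse(token, restored)
print(f"direct MSE:  {mse_direct:.3f}")
print(f"rotated MSE: {mse_rotated:.3f}")
```

The rotation spreads the outlier's energy across all coordinates, so the quantization range shrinks and the step size with it. Because the rotation is orthonormal, the error measured after undoing it equals the error made in the rotated space. A calibrated rotation, as the article describes for OSCAR, would choose the transform from data instead of at random.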

EpiCache, a training-free KV-cache management framework, addresses extended multi-turn conversations, reporting up to 40% higher accuracy than eviction baselines and near-full-cache accuracy at 4–6× compression.

Why this matters: The development of efficient KV cache compression methods like TurboQuant, OSCAR, and EpiCache is crucial for the deployment of long-context large language models. These models have the potential to revolutionize applications such as chatbots, language translation, and text summarization.

However, their high memory requirements can be a significant barrier to adoption. By reducing the memory usage of these models, these compression methods can enable faster and more cost-effective deployment. Moreover, the competition among these methods is driving innovation, with potential future developments including the combination of calibration-aware rotation with optimal scalar quantization.

As the field continues to evolve, it will be important to monitor the trade-offs between accuracy, memory usage, and computational cost, and to consider the implications for developers, businesses, and consumers.

Source: MarkTechPost