KV Cache Compression: TurboQuant, OSCAR, EpiCache Compete
KV cache compression methods, including TurboQuant, OSCAR, and EpiCache, aim to reduce memory usage in long-context large language models.

Long-context large language models face a memory bottleneck because they cache key and value vectors for every token during decoding. This cache grows linearly with sequence length and batch size, and at long contexts or large batches it can exceed the memory taken by the model weights. For example, Llama-3.1-70B's KV cache costs about 0.31 MB per token; at 128K tokens, that is ~40 GB, and at 1M tokens, it exceeds 300 GB.
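The per-token figure can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes Llama-3.1-70B's published configuration (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) and a 16-bit cache; none of these values are stated in the article, and the function names are illustrative. Under those assumptions the results match the quoted numbers when read in binary units (MiB, GiB).

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token, per sequence."""
    # Factor of 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama-3.1-70B configuration (not stated in the article):
# 80 layers, 8 KV heads (grouped-query attention), head dim 128, 16-bit cache.
PER_TOKEN = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

MIB, GIB = 1024**2, 1024**3


def kv_cache_gib(num_tokens, batch_size=1):
    """Total cache size in GiB; linear in both tokens and batch size."""
    return PER_TOKEN * num_tokens * batch_size / GIB


print(f"per token: {PER_TOKEN / MIB:.2f} MiB")
for tokens in (128 * 1024, 1024 * 1024):
    print(f"{tokens:>9,} tokens: {kv_cache_gib(tokens):.0f} GiB")
```

Because the formula is linear in every factor, dropping the cache from 16 bits to roughly 2 to 4 bits per value shrinks these totals by the same ratio, which is the lever the quantization methods below pull.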
Current approaches to reducing KV cache size include token eviction, quantization, low-rank projection, merging, and architectural sharing. Recent work has focused on ultra-low-bit quantization. Google and NYU's TurboQuant and Together AI's OSCAR tackle the problem from opposite directions, while Apple's EpiCache targets a different problem: managing the cache across long multi-turn conversations.
Most KV quantizers struggle with outlier channels: a few channels with disproportionately large magnitudes stretch the quantization range and leave little resolution for everything else. KIVI established a baseline by quantizing keys per-channel and values per-token, cutting end-to-end peak memory by about 2.6×. TurboQuant handles outliers without data calibration, achieving near-lossless recall at 4× compression and quality neutrality at 3.5 bits per channel.
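A toy example shows why the grouping axis matters when one channel is an outlier. This is a minimal sketch, not KIVI's implementation: the 4×4 key matrix is made up, the quantizer is a plain asymmetric uniform quantizer at 2 bits, and all function names are hypothetical. Per-token grouping lets the outlier set the range of every row, while per-channel grouping confines it to its own column.

```python
def quantize_group(values, bits):
    """Asymmetric uniform quantize-dequantize of one group of floats."""
    levels = 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels
    if scale == 0:
        return list(values)
    return [lo + round((v - lo) / scale) * scale for v in values]


def quantize_per_token(matrix, bits):
    # One (min, max) range per row, i.e. per token.
    return [quantize_group(row, bits) for row in matrix]


def quantize_per_channel(matrix, bits):
    # One (min, max) range per column, i.e. per channel.
    columns = [quantize_group(list(col), bits) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


def mse(a, b):
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Hypothetical keys: 4 tokens x 4 channels; channel 2 is the outlier channel.
keys = [
    [0.1, -0.2, 50.0, 0.3],
    [-0.3, 0.2, 50.4, -0.1],
    [0.2, 0.1, 49.6, -0.2],
    [-0.1, -0.3, 50.2, 0.1],
]

per_token_mse = mse(keys, quantize_per_token(keys, bits=2))
per_channel_mse = mse(keys, quantize_per_channel(keys, bits=2))
print(f"per-token MSE:   {per_token_mse:.4f}")
print(f"per-channel MSE: {per_channel_mse:.4f}")
```

In the per-token case every small value in a row collapses onto the same quantization level, because the outlier stretches the step size far beyond their spread; per-channel grouping avoids that for data shaped like this.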
OSCAR uses an attention-aware rotation learned in an offline calibration pass, achieving essentially full-precision results at 2.28 bits. It ships as a complete system, reporting up to 7.83× higher job-level throughput and roughly 8× KV-cache memory reduction at 100K context. TurboQuant offers broader generality: it works on any model without calibration or modification.
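The general idea of rotating before quantizing can be illustrated without reproducing either method. The sketch below is an assumption-laden stand-in: it uses a fixed Walsh-Hadamard transform rather than OSCAR's calibrated rotation or TurboQuant's own procedure, a plain 4-bit uniform quantizer, and a synthetic 128-dimensional key with one planted outlier. All names are hypothetical. An orthonormal rotation spreads the outlier's energy across every coordinate, so the quantization range shrinks and the step size with it.

```python
import math
import random


def hadamard_rotate(vec):
    """Orthonormal Walsh-Hadamard transform; len(vec) must be a power of two.

    The normalized transform is its own inverse, so applying it twice
    returns the original vector (up to floating-point error).
    """
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [x / norm for x in out]


def quantize_uniform(vec, bits):
    """Asymmetric uniform quantize-dequantize over the whole vector."""
    levels = 2**bits - 1
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / levels
    if scale == 0:
        return list(vec)
    return [lo + round((x - lo) / scale) * scale for x in vec]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic key vector (illustrative only): small values plus one outlier.
rng = random.Random(0)
key = [rng.uniform(-0.3, 0.3) for _ in range(128)]
key[5] = 20.0

# Baseline: quantize the raw vector.
direct = quantize_uniform(key, bits=4)
# Rotate, quantize in the rotated basis, then rotate back.
rotated = hadamard_rotate(quantize_uniform(hadamard_rotate(key), bits=4))

mse_direct = mse(key, direct)
mse_rotated = mse(key, rotated)
print(f"direct MSE:  {mse_direct:.4f}")
print(f"rotated MSE: {mse_rotated:.4f}")
```

Because the rotation is orthonormal, the error measured after rotating back equals the error introduced in the rotated basis, so any gain comes purely from the narrower range. A calibrated rotation, as the article describes for OSCAR, would choose the basis from data instead of fixing it in advance.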
EpiCache, a training-free KV-cache management framework, addresses extended multi-turn conversations, reporting up to 40% higher accuracy than eviction baselines and near-full-cache accuracy at 4–6× compression.
Why this matters: The development of efficient KV cache compression methods like TurboQuant, OSCAR, and EpiCache is crucial for the deployment of long-context large language models. These models have the potential to revolutionize applications such as chatbots, language translation, and text summarization.
However, the high memory requirements of long-context models can be a significant barrier to adoption. By shrinking the KV cache, compression methods can make deployment faster and more cost-effective. Moreover, the competition among these methods is driving innovation, with potential future developments including the combination of calibration-aware rotation with optimal scalar quantization.
As the field continues to evolve, it will be important to monitor the trade-offs between accuracy, memory usage, and computational cost, and to consider the implications for developers, businesses, and consumers.
Source: MarkTechPost