Together AI Trims KV Cache for Open Weight Models

San Francisco-based Together AI has released details about a new KV cache compression system that may make the hardware-hungry frontier AI labs take notice.

Together AI is a research-led company building a full-stack AI platform for the enterprise, using open source technologies. This latest piece of the puzzle shrinks the memory cache that holds the user’s context data, a major reason commercial LLM providers have to buy up all the GPUs.

OSCAR (Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization) hits the long-sought-after goal of carving LLM KV caches down to 2-bit precision, from the full 16-bit precision used by today’s models to represent context data.

Such quantization significantly cuts memory needs and speeds processing while, the company says, maintaining the accuracy of the context data.

According to the company, this approach yields an 8x reduction in the memory needed for the KV cache, as well as a 3x decoding speedup for 100K-token context windows. Do the math and you’ll find a single 80GB H100 can run 256 concurrent requests with OSCAR.
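The 8x figure follows directly from the bit widths, since 16 divided by 2 is 8. As a rough sketch of the arithmetic, the Python below estimates KV cache size for a 100,000-token context. The model shape (36 layers, 8 KV heads, head dimension 128) is a hypothetical chosen for illustration, not a figure from Together AI, and the calculation ignores any per-group scale or offset metadata a real quantizer stores alongside the 2-bit values.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits):
    """Bytes needed to cache keys and values for `tokens` tokens."""
    # Factor of 2: every token stores one key vector and one value vector per layer.
    values_per_token = 2 * layers * kv_heads * head_dim
    return tokens * values_per_token * bits / 8

# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=36, kv_heads=8, head_dim=128)

fp16 = kv_cache_bytes(100_000, bits=16, **shape)
int2 = kv_cache_bytes(100_000, bits=2, **shape)

print(f"16-bit cache: {fp16 / 2**30:.2f} GiB")
print(f" 2-bit cache: {int2 / 2**30:.2f} GiB")
print(f"reduction:    {fp16 / int2:.0f}x")
```

Because the cache grows linearly with both context length and the number of concurrent requests, the memory saved translates more or less directly into room for more simultaneous users on the same GPU.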

The work could not only cut per-user hardware requirements for large-scale LLM providers, but could also prove pivotal for smaller-scale deployments, easing the path for enterprise-run AIs.

A Tiny Vector

KV caches hold the user-supplied context. Each token is transformed into key and value vectors, which the model’s attention mechanism uses to relate that token to every other token in the context. These vectors are typically stored at 16-bit precision. Once they are in the cache, the LLM doesn’t need to recompute them for each new token it generates.

With the costs of GPUs and power ever escalating, some researchers have been looking for ways to cut the size of these caches through a technique called quantization, a form of compression. With compression comes data loss, and researchers whittling 16 bits down to 4 or even 2 bits have been beset by myriad challenges.

The chief challenge is that the values a model generates are rarely evenly distributed, so they can’t be easily compressed. When context is encoded into the KV cache, some values often fall far outside the expected range.

“Under low-bit quantization, these outliers dominate the quantization scale, compressing most normal values into only a few effective quantization levels and substantially degrading attention quality,” the researchers write in the ArXiv paper.
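The effect the researchers describe can be reproduced with a toy example. The sketch below applies plain min-max uniform quantization to four levels (2 bits); this is a generic textbook scheme used here for illustration, not OSCAR’s quantizer, and the sample values are made up.

```python
def quantize_2bit(values):
    """Uniform min-max quantization: map each value to an integer code 0..3."""
    lo, hi = min(values), max(values)
    # Four levels span three intervals; fall back to 1.0 if all values are equal.
    step = (hi - lo) / 3 or 1.0
    return [round((v - lo) / step) for v in values]

# Made-up sample: seven well-behaved values, then the same values plus one outlier.
normal = [-1.0, -0.6, -0.2, 0.1, 0.4, 0.7, 1.0]
with_outlier = normal + [40.0]

print(quantize_2bit(normal))        # codes spread across all four levels
print(quantize_2bit(with_outlier))  # the outlier stretches the scale
```

Without the outlier, the seven values use all four codes. With it, the scale stretches to cover 40.0, every normal value collapses to code 0, and the differences between them are lost.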

Today’s algorithms stretch the axis so that all possible values are accommodated, but this can obliterate the nuanced details elsewhere on the axis. Other approaches to the problem simply create an axis that accommodates all the values evenly, which for mathematical reasons confuses matters further and, given only two bits, breaks down completely.

Rather than working out how to treat values as they arrive, OSCAR uses an “attention-aware” calibration framework to calculate, ahead of time, specific directions of interest to the model. In effect, OSCAR’s novelty is that it separates the calibration process from the actual inferencing.
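As OSCAR’s full name indicates, the calibrated transform is a rotation. The article does not spell out how that rotation is computed, so the sketch below substitutes a fixed 45-degree rotation in two dimensions as a stand-in. It illustrates only the general property that makes rotations attractive here: they can spread an outlier’s magnitude across channels while leaving the query-key dot product, and therefore the attention score, unchanged. The vectors and the angle are illustrative assumptions.

```python
import math

def rotate(vec, theta):
    """Apply a 2-D rotation (an orthogonal transform) to a 2-element vector."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = vec
    return [c * x - s * y, s * x + c * y]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

theta = math.pi / 4   # illustrative angle, not a calibrated one
key = [8.0, 0.1]      # made-up key vector: one channel holds an outlier
query = [0.5, -0.3]   # made-up query vector

key_r = rotate(key, theta)
query_r = rotate(query, theta)

# Peak magnitude before and after: the outlier is spread over both channels.
print(max(map(abs, key)), max(map(abs, key_r)))
# Attention score before and after: identical up to floating-point rounding.
print(dot(query, key), dot(query_r, key_r))
```

A smaller peak magnitude means a tighter quantization range, so more of the four available levels land where the data actually is.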

It learns these directions from a set of test queries run beforehand. So far, they have been precomputed for Qwen3-4B/8B/32B, GLM-4.7-FP8, and MiniMax-M2.7.

To handle the conversion of 2-bit values, the team built an “INT2 attention kernel” using Triton, a programming language for building deep learning primitives.
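Part of what such a kernel has to do is read values that are packed several to a byte, since no hardware type is 2 bits wide. The sketch below shows one common packing convention, four 2-bit codes per byte with the first code in the lowest bits. It is written in plain Python for readability and is an assumption about the general technique, not a description of the layout Together AI’s Triton kernel uses.

```python
def pack_int2(codes):
    """Pack 2-bit codes (ints 0..3) four to a byte, first code in the lowest bits."""
    if len(codes) % 4:
        raise ValueError("length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)

def unpack_int2(packed):
    """Recover the 2-bit codes from packed bytes, in their original order."""
    return [(byte >> shift) & 0b11 for byte in packed for shift in (0, 2, 4, 6)]

codes = [0, 1, 2, 3, 3, 2, 1, 0]
packed = pack_int2(codes)
print(len(codes), "codes stored in", len(packed), "bytes")
print(unpack_int2(packed) == codes)
```

A GPU kernel would do the same shifts and masks in registers, fusing the unpacking with the attention arithmetic so the full-precision values never have to be written back to memory.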

Ready for Open Weight Models

OSCAR has been integrated into the SGLang serving framework for running LLMs, which means it can help models that SGLang supports, including open-weight powerhouses like Qwen, Llama, DeepSeek, Mistral, Google’s Gemma and Microsoft’s Phi.

Together AI is not the only company interested in cutting KV cache size. In March, Google released details of its TurboQuant compression, which shrinks 16-bit vectors to 3 bits.

OSCAR works differently than TurboQuant, Together AI researchers point out. TurboQuant is an online quantizer, whereas OSCAR does preparatory work ahead of the inferencing. The two could possibly be merged for further efficiencies, where TurboQuant goes to work on OSCAR’s attention-aware processing, they noted.
