AI labs paying for premium GPUs just to hold all their users’ contextual data may get a break at the bank. Google researchers have sussed out how to store all that data in pricey VRAM a lot more efficiently.
With their new TurboQuant framework, 16-bit vectors can be reduced to 3 bits, shrinking the memory footprint of the GPU’s KV cache six-fold – with no loss in accuracy.
TurboQuant will make “semantic search at Google’s scale faster and more efficient,” wrote Google Research Scientist Amir Zandieh and Google Fellow Vahab Mirrokni, in a blog post. “They allow for building and querying large vector indices with minimal memory, near-zero preprocessing time, and state-of-the-art accuracy.”
Just as an actual turbocharger makes cars faster by making better use of the underlying hardware, so too does TurboQuant promise more efficient use of the expensive GPUs.
Vector Bloat
The work is built on a set of quantization algorithms that shrink data size while maintaining the fidelity of what the data represents as well as possible. Think of it as a real-life example of the fictional Pied Piper’s lossless data compression technology in the HBO show Silicon Valley.
Vectors are the predominant way AI models represent information. A vector is a set of related numbers that represents language, images and other forms of data for the computer.
Vectors are also memory hogs. AI researchers have been eagerly looking for ways to shrink them, which would bring a corresponding reduction in how much VRAM they occupy. NVIDIA, for instance, recently proposed a way for LLMs to self-compress by shedding unnecessary information.
The goal of TurboQuant is to reduce the size of those vectors while maintaining full accuracy of what is being represented.
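For readers who want to see what quantization means in practice, here is a minimal sketch of a generic uniform quantizer that squeezes each number in a vector into a 3-bit code. It is not TurboQuant's algorithm; the function names and the sample vector are made up for illustration.

```python
def quantize(values, bits=3):
    """Map floats to integer codes in [0, 2**bits - 1] on a min/max grid."""
    levels = (1 << bits) - 1          # 7 steps for 3 bits, so 8 possible codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale           # lo and scale must be stored with the codes


def dequantize(codes, lo, scale):
    """Rebuild approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


# Hypothetical 8-number vector, purely for illustration.
vector = [0.12, -0.50, 0.33, 0.90, -0.07, 0.41, -0.88, 0.25]
codes, lo, scale = quantize(vector, bits=3)
restored = dequantize(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, round(max_error, 3))
```

Each restored number lands within half a grid step of the original. Note that this naive scheme has to keep `lo` and `scale` next to every block of codes, a small bookkeeping cost on top of the 3 bits per number.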
How Does Your Engine Feel?
The framework introduces two novel techniques. One is “PolarQuant,” a superior coordinate system for storing vector data, which is currently stored in simple X,Y-style Cartesian coordinates. Polar coordinates describe how far a point is from the center, and at what angle it is.
“This is comparable to replacing ‘Go 3 blocks East, 4 blocks North’ with ‘Go 5 blocks total at a 37-degree angle’,” the researchers write. Polar coordinates provide more useful information about the location of a vector without the additional metadata needed for a simpler coordinate system.
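The researchers' street-grid analogy can be checked with a few lines of arithmetic. The sketch below converts "3 blocks East, 4 blocks North" into a distance and an angle and back again; the function names are illustrative. The quoted 37 degrees matches the angle measured from due north, which is an assumption about the convention the researchers had in mind. Measured from the east axis, as math libraries do, the same direction is about 53 degrees.

```python
import math


def to_polar(east, north):
    """Return (distance, angle in degrees measured counterclockwise from east)."""
    radius = math.hypot(east, north)
    angle_from_east = math.degrees(math.atan2(north, east))
    return radius, angle_from_east


def to_cartesian(radius, angle_from_east):
    """Invert to_polar: recover the (east, north) offsets."""
    theta = math.radians(angle_from_east)
    return radius * math.cos(theta), radius * math.sin(theta)


radius, angle = to_polar(3, 4)
bearing_from_north = 90 - angle   # same direction, measured from due north
east, north = to_cartesian(radius, angle)
print(radius, round(angle, 2), round(bearing_from_north, 2))
```

Both descriptions carry the same information: the round trip through `to_cartesian` returns the original 3 and 4, up to floating-point rounding.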
The other innovation is the use of a technique from geometry called the Johnson-Lindenstrauss Transform (JLT), which maps a set of coordinates onto a lower-dimensional set while preserving the distances between points. TurboQuant then reduces each resulting number to a single error-correcting bit, with zero memory overhead.
TurboQuant first compresses vector data into a more statistically predictable pattern of polar coordinates, then uses JLT to correct any leftover errors.
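To give a feel for how a single bit per number can still carry useful information, here is a rough sketch of a sign-bit random projection. A stored vector is multiplied by a random Gaussian matrix and only the signs of the results are kept; the similarity between a query and the stored vector can then be estimated from those signs. This is a textbook-style illustration, not the TurboQuant authors' code, and every name and size in it is an assumption. The number of projection rows is set far higher than any practical setting purely so the estimate lands visibly close to the exact value.

```python
import math
import random


def random_projection(rows, dim, rng):
    """Random Gaussian matrix of the kind used in Johnson-Lindenstrauss projections."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


def project(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def sign_bits(matrix, vec):
    """Keep only the sign of each projected coordinate: one bit per row."""
    return [1 if p >= 0 else -1 for p in project(matrix, vec)]


def estimate_dot(matrix, query, key_bits, key_norm):
    """Estimate query . key from the key's sign bits and its stored norm."""
    projected_query = project(matrix, query)
    total = sum(p * b for p, b in zip(projected_query, key_bits))
    return math.sqrt(math.pi / 2) * key_norm * total / len(matrix)


def unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


rng = random.Random(0)            # fixed seed so the run is repeatable
dim, rows = 16, 4000              # illustrative sizes, not taken from the paper
key = unit([rng.uniform(-1, 1) for _ in range(dim)])
query = unit([rng.uniform(-1, 1) for _ in range(dim)])
matrix = random_projection(rows, dim, rng)

bits = sign_bits(matrix, key)     # all that is stored for the key, plus its norm
exact = sum(q * k for q, k in zip(query, key))
approx = estimate_dot(matrix, query, bits, key_norm=1.0)
print(round(exact, 3), round(approx, 3))
```

The query stays at full precision while the stored vector is reduced to signs, which is why the estimate remains unbiased even though each stored coordinate is a single bit.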
Results at the Finish Line
In benchmark tests involving answering questions, generating code and summarization, TurboQuant could accurately recall information while using less memory than KIVI, the currently recognized standard for KV cache data compression. KIVI can offer lossless compression at 5-bit vectorization, but TurboQuant can do the job with just 3.5 bits.
In a test designed to see if a model can pinpoint one specific, tiny piece of information buried inside a massive amount of text (the “needle-in-haystack” task), TurboQuant could always correctly identify the needed information in the compressed data set.
Other AI researchers have been kicking the tires of this speed machine. Anonymous AI developer Buun posted benchmarks (retweeted by Hugging Face) that showed, for Meta’s Llama, TurboQuant offers superior performance over the standard Q8_0 compression algo, at least for context windows larger than 128,000 tokens.
Researchers from a similar AI quantization project, RaBitQ, voiced concerns about Google’s work, noting that TurboQuant has a similar setup to RaBitQ in that both use JLT and that both randomize the data to shake it into a more statistically coherent pattern. The TurboQuant authors downplayed the comparison in their paper, the RaBitQ researchers contended.
More Context, More Money
The researchers didn’t say specifically how much this approach would save on GPU purchases. But some back-of-the-envelope math could offer an idea.
Each user with a 100,000-token context window in standard 16-bit precision would require roughly 30GB of KV cache memory, based on how the KV cache of a typical large transformer grows with context length. The system must also hold the model weights themselves: a 70-billion-parameter model requires about 140GB. Altogether, a single inference job would require approximately 170GB of VRAM.
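The 30GB and 140GB figures can be roughly reproduced with the usual KV cache sizing formula. The sketch below assumes a model shaped like Meta's 70-billion-parameter Llama releases (80 layers, 8 key-value heads, 128 dimensions per head); the article does not name a specific model, so treat those numbers as an assumption.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    # 2 = one key vector plus one value vector per layer, per KV head, per token
    values = 2 * layers * kv_heads * head_dim * tokens
    return values * bits_per_value / 8


def weight_bytes(parameters, bits_per_value):
    """Bytes needed to hold the model weights."""
    return parameters * bits_per_value / 8


GB = 1e9
# Assumed model shape: 80 layers, 8 KV heads, head dimension 128.
kv_gb = kv_cache_bytes(100_000, layers=80, kv_heads=8, head_dim=128,
                       bits_per_value=16) / GB
weights_gb = weight_bytes(70e9, bits_per_value=16) / GB
print(round(kv_gb, 1), round(weights_gb, 1), round(kv_gb + weights_gb, 1))
```

Under those assumptions the cache comes to just under 33GB and the weights to 140GB, in line with the rounded figures above.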
Using NVIDIA H100 GPUs, each with 80GB of VRAM, this workload translates to roughly 2–3 GPUs per user. TurboQuant’s six-fold cache compression would reduce the context memory from ~30GB to about 5GB, bringing total memory usage down to roughly 145GB. (This estimate does not factor in built-in overhead, crafty parallelization techniques and other miscellaneous what-have-yous.)
So maybe you’ll need two GPUs instead of three? With NVIDIA H100s costing at least $25,000 each, this technique is worth investigating.
And while the savings may not initially be dramatic, they will increase as context windows get bigger and as models themselves get smaller. The researchers envision a time when 1 million tokens-per-context will be routine; in this environment, TurboQuant could cut a lab’s NVIDIA bill in half.
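The back-of-the-envelope math above fits in a few lines. This sketch uses only the article's own round numbers (80GB per H100, 140GB of weights, 30GB of cache per 100,000 tokens, six-fold compression) and assumes the cache grows in direct proportion to context length. Like the estimate itself, it ignores overhead and parallelization tricks.

```python
import math

GPU_VRAM_GB = 80              # one NVIDIA H100
WEIGHTS_GB = 140              # 70-billion-parameter model at 16 bits
KV_GB_PER_100K_TOKENS = 30    # article's rough figure at 16-bit precision
COMPRESSION = 6               # six-fold KV cache compression


def gpus_needed(context_tokens, compressed):
    """Whole GPUs required to hold the weights plus one user's KV cache."""
    kv_gb = KV_GB_PER_100K_TOKENS * context_tokens / 100_000
    if compressed:
        kv_gb /= COMPRESSION
    return math.ceil((WEIGHTS_GB + kv_gb) / GPU_VRAM_GB)


for tokens in (100_000, 1_000_000):
    print(tokens, gpus_needed(tokens, False), gpus_needed(tokens, True))
```

At 100,000 tokens the count drops from three GPUs to two; at 1 million tokens it drops from six to three.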
Hitting the Theoretical Bottom
The researchers argue that TurboQuant is no mere collection of hacks, but rather a set of “fundamental algorithmic contributions backed by strong theoretical proofs.”
The work is important not only for AI, but for search in general. Users increasingly expect search services to connect the dots and provide more context for their questions. Increased use of vectorization will help in delivering the goods.
“As AI becomes more integrated into all products, from LLMs to semantic search, this work in fundamental vector quantization will be more critical than ever,” the researchers conclude. And who knows, such lossless fidelity may lead to fewer hallucinations as well.
The researchers plan to present this work on fundamental vector quantization at the International Conference on Learning Representations next month.

