Google TurboQuant AI Memory Compression Explained

On March 25, 2026, Google published a research blog post. By the end of that trading day, Micron was down 4%, Western Digital had shed 4.4%, Seagate slid 5.6%, and SanDisk, the biggest casualty, collapsed 6.5%. The next morning, South Korea woke up to SK Hynix falling 6% and Samsung dropping nearly 5%. Kioxia, Japan’s flash memory giant, lost close to 6%.

All of that, from a blog post.

The paper describes an algorithm called TurboQuant. It’s a compression method developed by Amir Zandieh, a research scientist at Google, and Vahab Mirrokni, a vice president and Google Fellow, alongside collaborators at Google DeepMind, KAIST, and New York University. And if the market’s reaction felt like an overreaction, well, that’s a more interesting debate than it first appears.

The Problem TurboQuant Actually Solves

To understand why this matters, you need to understand what a key-value cache is, because that’s exactly what TurboQuant targets.

When a large language model generates text, it doesn’t start from scratch with every new word it produces. It stores the context (the calculations from every previous token) in a high-speed data structure called the key-value cache, or KV cache. Think of it like a short-term memory buffer. The longer the conversation or document the model is processing, the larger that buffer grows. And it grows fast.
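The caching pattern can be sketched in a few lines. Everything below (the dimensions, the random weights, the single attention head) is a toy illustration of how a KV cache grows by one entry per generated token, not any production implementation:

```python
import numpy as np

d = 8                                  # head dimension (hypothetical)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache = np.empty((0, d))             # grows by one row per token
v_cache = np.empty((0, d))

def decode_step(x):
    """One attention step: append this token's K/V, attend over all history."""
    global k_cache, v_cache
    k_cache = np.vstack([k_cache, x @ Wk])
    v_cache = np.vstack([v_cache, x @ Wv])
    q = x @ Wq
    scores = q @ k_cache.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax over every cached position
    return weights @ v_cache

for _ in range(4):                     # "generate" four tokens
    out = decode_step(rng.standard_normal(d))
print(k_cache.shape)                   # one cached K row per token seen
```

The point of the cache is the last line of `decode_step`: attention reads every stored row, so nothing from earlier tokens can be recomputed away; it has to sit in memory.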

This is why running AI models at scale is so brutally expensive. The KV cache doesn’t live on your hard drive. It sits in GPU memory, the premium real estate of modern computing. As models gain the ability to handle longer and longer context windows (some now handle millions of tokens), the memory demands balloon to the point where a significant chunk of your hardware investment exists just to hold that cache.
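A back-of-envelope calculation shows why. The model dimensions below are hypothetical (roughly 70B-class), not figures from the paper, but the arithmetic is the standard KV-cache sizing formula:

```python
# Hypothetical 70B-class model serving a single 128K-token sequence.
layers, heads, head_dim = 80, 64, 128
seq_len = 128 * 1024
bytes_per_value = 2                    # fp16/bf16, i.e. 16-bit precision

# Keys AND values, for every layer, head, position, and head dimension.
values = 2 * layers * heads * head_dim * seq_len
cache_bytes = values * bytes_per_value
print(cache_bytes / 2**30)             # ≈ 320 GiB for one sequence
```

Hundreds of gibibytes of GPU memory for a single long-context conversation, before you serve a second user, is the cost structure TurboQuant is aimed at.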

TurboQuant’s answer is to compress it, aggressively. Standard KV cache values are stored at 16 bits of precision. TurboQuant squeezes them down to 3 bits per value. That’s more than a fivefold reduction, and according to Google’s benchmarks across five standard evaluation frameworks (LongBench, Needle in a Haystack, ZeroSCROLLS, RULER, and L-Eval), the model’s accuracy doesn’t budge.

That last part is the piece that makes engineers stop and take notice.

How It Actually Works

Most compression techniques have a dirty secret: they only look as good as their headline numbers. Traditional quantization methods shrink data vectors, yes, but they require storing additional normalization constants alongside the compressed values so the system can accurately decompress them later. Those constants typically add one to two extra bits per value, quietly eating into whatever compression ratio you thought you had.
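The overhead is easy to quantify. The block size and constant widths below are illustrative choices, not values from the paper, but they show how per-block constants amortize into the "one to two extra bits" the text describes:

```python
# Naive block quantization: each block of values shares a full-precision
# scale and zero point so the system can decompress accurately later.
block_size = 16                        # values per block (illustrative)
payload_bits = 3                       # bits per quantized value
scale_bits = 16                        # fp16 scale stored per block
zero_point_bits = 16                   # fp16 zero point stored per block

overhead_per_value = (scale_bits + zero_point_bits) / block_size
effective_bits = payload_bits + overhead_per_value
print(effective_bits)                  # 5.0 bits, not the nominal 3
```

Larger blocks amortize the constants better but quantize less accurately, which is exactly the trade-off TurboQuant claims to escape by needing no per-block constants at all.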

TurboQuant sidesteps this through a two-stage process. The first stage, called PolarQuant, which will appear in its own paper at AISTATS 2026, converts data vectors from standard Cartesian coordinates into polar form. Separating each vector into a magnitude and a set of angles sounds like a math lecture, but the practical consequence is meaningful: angular distributions in these models follow predictable, concentrated patterns. That predictability means the system can skip expensive per-block normalization entirely.
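The coordinate change itself is textbook math. The sketch below is a generic Cartesian-to-hyperspherical conversion illustrating the "one magnitude plus angles" decomposition, not the paper's actual implementation (sign handling for the final angle is omitted for brevity):

```python
import numpy as np

def to_polar(x):
    """Split vector x into one magnitude and n-1 angles (hyperspherical form)."""
    r = np.linalg.norm(x)
    angles = []
    for i in range(len(x) - 1):
        # angle between coordinate x[i] and the remaining tail of the vector
        tail = np.linalg.norm(x[i:])
        angles.append(np.arccos(x[i] / tail) if tail > 0 else 0.0)
    return r, np.array(angles)

x = np.array([1.0, 1.0, np.sqrt(2.0)])
r, angles = to_polar(x)
print(r)                    # 2.0: the single magnitude
print(np.degrees(angles))   # the angular part of the representation
```

Once the data is in this form, the magnitude and the angles can be quantized separately, and if the angles cluster in a narrow, predictable range, they need far fewer bits than raw coordinates would.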

The second stage applies QJL, a technique based on the Johnson-Lindenstrauss transform, which was published at AAAI 2025. QJL mops up the small residual error left by the first stage and encodes it as a single sign bit per dimension. The result is a representation where nearly every bit is doing useful work, with no bits wasted on overhead and no normalization constants bloating the storage.
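One-bit Johnson-Lindenstrauss-style sketching has a simple classical form: project onto random directions and keep only the signs. The code below is that generic sign-random-projection trick, shown to illustrate how a single sign bit per dimension can still carry geometric information; it is not the paper's exact QJL estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 4096             # original dimension, number of 1-bit measurements
S = rng.standard_normal((m, d))   # random Gaussian projection matrix

def one_bit_sketch(x):
    return np.sign(S @ x)   # one sign bit per projected dimension

a = rng.standard_normal(d)
b = a + 0.3 * rng.standard_normal(d)      # a nearby vector

# The fraction of disagreeing sign bits estimates the angle between vectors.
agreement = np.mean(one_bit_sketch(a) == one_bit_sketch(b))
est_angle = np.pi * (1 - agreement)
true_angle = np.arccos(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(est_angle, true_angle)              # the two should be close
```

Because angles (and hence inner products) survive the sketch, a sign-bit encoding of the residual can correct the first stage's error without storing any extra constants.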

At 4-bit precision, TurboQuant delivered up to an eight-times speedup in computing attention on Nvidia H100 GPUs compared to an uncompressed 32-bit baseline. On needle-in-a-haystack retrieval tasks, where a model must locate a specific piece of information buried inside a massive block of text, it achieved perfect scores at 3-bit compression. The paper is scheduled for formal presentation at ICLR 2026 in April.

The work builds on years of incremental research from the same Google team. It’s not a sudden revelation that appeared out of nowhere. But the combination of these two earlier techniques into a single, deployable algorithm with no accuracy penalty is genuinely new.

Why the Market Panicked, and Whether It Should Have

Here’s where it gets interesting.

The memory chip industry has had a rough few years. It went through a painful supply glut, a brutal downturn, and then, starting in late 2023, a spectacular recovery fueled entirely by AI. SK Hynix became the semiconductor industry’s darling almost overnight because of its high-bandwidth memory (HBM) chips, which are essential for training massive AI models. Micron’s stock tripled. Samsung poured investment into next-generation DRAM. The story was simple: AI needs memory, and it needs more of it every year.

TurboQuant threatens to complicate that story. Wells Fargo analyst Andrew Rocha put it plainly. If broadly adopted, the algorithm “directly attacks the cost curve” for AI memory and quickly raises the question of how much capacity the industry actually needs. That’s the fear that sent stocks sliding.

But the counter-argument is persuasive, and several analysts made it loudly. KC Rajkumar at Lynx Equity Strategies pointed out that, given extreme supply constraints, memory demand over the next three to five years is in no real danger. An analyst at Citrini Research was blunter on X: “It’s like saying Aramco should crash because Toyota came out with a next-generation hybrid engine.”

That analogy is worth sitting with. Efficiency improvements don’t automatically shrink total demand. They often expand it. When storage got cheaper, people stored more. When bandwidth improved, applications consumed more of it. Forbes published an analysis suggesting that by lowering the hardware barrier to running AI locally, TurboQuant might actually accelerate the proliferation of on-device and edge AI deployments, which could paradoxically drive total chip consumption higher in the long run.

The distinction investors may have missed is between unit demand and aggregate demand. TurboQuant reduces how much memory a single model instance requires. It does nothing to reduce the number of model instances the world wants to run.

What TurboQuant Can’t Do

To be fair to the skeptics, there are real limits here, and they matter.

TurboQuant only works during AI inference, which is the process of running a trained model to generate responses. It offers zero relief for the training phase, where a model learns from data in the first place. Training is the memory-hungry, months-long, multi-billion-dollar process that requires the largest, most expensive clusters of GPUs on the planet. That market, the one driving demand for SK Hynix’s HBM3E chips, is entirely untouched by this paper.

The AI infrastructure spending committed by major companies isn’t going anywhere. Meta recently committed up to $27 billion in a deal with Nebius for dedicated compute. Google, Microsoft, and Amazon are collectively planning hundreds of billions in data center capital expenditure through 2026. These budgets exist primarily to build and train models, not just run them. TurboQuant doesn’t touch that demand curve.

The algorithm is also, right now, a laboratory result. It exists in a paper and a GitHub repository. Deploying compression at this level across production inference systems, including managing the edge cases, the model-specific quirks, and the hardware-level implementations, takes significant engineering work. The road from “benchmarks look great” to “this is running in Google’s data centers at scale” is not short.

What the Future Looks Like

The honest answer is that both things are probably true simultaneously.

TurboQuant will compress memory requirements for AI inference at the companies that adopt it. That’s real. Over time, if the technique becomes standard practice across the industry the way 8-bit quantization eventually did, it will meaningfully reduce how much DRAM a given inference cluster needs. Memory companies that bet everything on AI inference demand will feel some of that pressure.

At the same time, the AI inference market itself is growing faster than any efficiency gain can offset in the near term. Every new model that ships, every new application that adds an AI feature, every enterprise that builds internal tools on top of API access: each of those is adding demand. The question isn’t whether TurboQuant reduces per-unit memory needs. It does. The question is whether aggregate demand grows faster than efficiency gains reduce it, and history suggests the answer is almost always yes.

What’s genuinely worth watching is what happens with on-device AI. Right now, running capable AI models on laptops, smartphones, or edge hardware requires either compromising on model quality or purchasing expensive hardware. 

If TurboQuant’s approach becomes widely implemented in local inference runtimes, the floor for running a capable AI model drops significantly. That expands the market possibly to hundreds of millions of consumer devices that weren’t previously viable. Those devices don’t use HBM. They use commodity LPDDR memory. So even a version of the future where TurboQuant hurts high-end GPU memory could be one where it expands the lower end of the market considerably.

What This Moment Actually Tells Us

The real story here isn’t just about one algorithm. It’s about where the AI industry is in its maturation.

For the past three years, the primary competition in AI infrastructure was about raw scale: who had the most GPUs, who could store the most data, who could process the longest context windows. Hardware was the bottleneck, and spending was the solution. TurboQuant represents something different: the industry starting to get smart about what it already has.

Efficiency research like this gets more important, not less, as inference costs become the dominant recurring expense in AI products. Training a model once is a capital cost. Serving it millions of times a day is the operational cost that determines whether an AI product is actually viable as a business. That’s the economic pressure that makes TurboQuant relevant — not as a threat to memory companies, but as a signal that the AI industry is entering a phase where doing more with less matters as much as buying more.

The stocks have already partially recovered. The panic was, by most accounts, overdone. 

But the broader shift it pointed to — from brute-force scaling to thoughtful efficiency — is real, and it’s probably just getting started.
