DeepSeek-V4 is the open-weights frontier model that reset the long-context bar in 2026. The headline number is a 50x KV cache reduction at one-million-token context relative to a standard BF16 baseline. That number is misleading on its own, because it is the multiplicative product of five separate architectural choices, none of which would get there alone. This is the plain-language version of how those choices fit together.
The five things that compound
- Hybrid attention: Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA).
- Mixed precision per tensor role: FP8 main, BF16 for rotary positional dims, FP4 for the indexer.
- Muon optimizer: orthogonalized gradient steps instead of AdamW.
- Manifold-constrained hyper-connections: residuals projected onto a learned manifold instead of summed naively.
- FP4 quantization-aware training: the model is trained with FP4 in the loop, not quantized after the fact.
Each of these by itself is a one-to-three-times improvement somewhere. The product is the 50x.
Compressed Sparse Attention, the CSA layer
Standard attention is O(n²) in sequence length because every query attends to every key. At one-million-token context the KV cache that has to sit in GPU memory becomes the limit, not the compute. CSA attacks the cache directly.
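To see why the cache becomes the limit, here is a back-of-the-envelope sketch of the KV cache size for plain, uncompressed attention in BF16. The layer count, head count, and head width are hypothetical round numbers chosen for illustration, not V4's real configuration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence under standard attention."""
    # Factor 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical dense model (NOT V4's real configuration):
# 60 layers, 8 KV heads of width 128, BF16 (2 bytes per element).
for seq_len in (32_000, 256_000, 1_000_000):
    gib = kv_cache_bytes(seq_len, n_layers=60, n_kv_heads=8, head_dim=128) / 2**30
    print(f"{seq_len:>9,} tokens -> {gib:6.1f} GiB")
```

With these made-up dimensions, a single one-million-token sequence needs roughly 229 GiB of cache, which is more than any single GPU holds. The cache grows linearly with context length, so every factor shaved off the per-token cost applies directly to that total.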
The mechanism, simplified:
- Group every four consecutive tokens into one compressed key-value entry. The cache is now four times smaller per layer.
- A lightweight network, the "lightning indexer," scores each compressed entry against the current query and picks the top 1,024 (Pro) or top 512 (Flash).
- The query attends only to those selected entries, plus a sliding window of recent uncompressed tokens so the local detail is preserved.
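The three steps above can be sketched for a single query in NumPy. This is a toy illustration under stated assumptions, not DeepSeek's implementation: mean pooling stands in for the learned compression, a plain dot product stands in for the lightning indexer, and the function name `csa_attend` and its small default sizes are hypothetical.

```python
import numpy as np


def csa_attend(q, keys, values, group=4, top_k=2, window=4):
    """Toy single-query CSA-style attention. Returns (output, selected entry ids)."""
    n, d = keys.shape
    n_full = (n // group) * group  # drop a ragged tail for simplicity
    # 1. Compress: mean-pool each run of `group` tokens (pooling is an assumption).
    ck = keys[:n_full].reshape(-1, group, d).mean(axis=1)
    cv = values[:n_full].reshape(-1, group, d).mean(axis=1)
    # 2. Index: score every compressed entry against the query, keep the top_k.
    #    A plain dot product stands in for the learned lightning indexer.
    scores = ck @ q
    k = min(top_k, len(scores))
    chosen = np.sort(np.argsort(scores)[-k:])
    # 3. Attend over the selected entries plus a window of recent raw tokens.
    k_all = np.concatenate([ck[chosen], keys[-window:]])
    v_all = np.concatenate([cv[chosen], values[-window:]])
    logits = k_all @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ v_all, chosen


rng = np.random.default_rng(0)
keys = rng.normal(size=(32, 8))
values = rng.normal(size=(32, 8))
q = rng.normal(size=8)

out, chosen = csa_attend(q, keys, values)
print(out.shape, chosen)
```

In this toy run the query reads 6 vectors (2 selected compressed entries plus 4 recent tokens) instead of all 32. The cost of attention itself no longer grows with history length; only the indexer's scoring pass does.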
The HCA layer does the same thing, more aggressively: 128 tokens per compressed entry instead of 4. CSA and HCA are interleaved so the model gets both fine and coarse views of history.
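A quick count makes the two granularities concrete. The snippet below only does the division implied by the grouping sizes in the text; the helper name `compressed_entries` is made up for illustration.

```python
import math


def compressed_entries(seq_len, tokens_per_entry):
    """Number of compressed KV entries needed to cover seq_len tokens."""
    return math.ceil(seq_len / tokens_per_entry)


seq_len = 1_000_000
csa_entries = compressed_entries(seq_len, 4)    # fine view: 4 tokens per entry
hca_entries = compressed_entries(seq_len, 128)  # coarse view: 128 tokens per entry
print(csa_entries, hca_entries)  # 250000 7813
```

At one million tokens a CSA layer still holds a quarter of a million entries, which is why it needs the indexer to pick a small subset, while an HCA layer holds fewer than eight thousand.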
The Muon optimizer
AdamW is the default for trillion-parameter training. Muon replaces it for most parameters. The intuition is that AdamW's per-element adaptive step sizes only partly correct for badly scaled update directions, and that a structural operation on the whole weight matrix does that job better.
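A simplified sketch of the inner loop in NumPy (momentum omitted; the textbook cubic Newton-Schulz iteration is used here):
import numpy as np
# Assumes W (weight matrix), g (its raw gradient matrix), and lr are already defined.
X = g / (np.linalg.norm(g) + 1e-7)  # Frobenius-normalize so every singular value is at most 1
for _ in range(10):  # Newton-Schulz iteration: no explicit SVD is ever computed
    X = 1.5 * X - 0.5 * X @ X.T @ X  # pushes each singular value toward 1
g_orth = X  # approximates the polar factor U @ V.T from the SVD g = U S V.T: shape only
W = W - lr * g_orth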
All singular values are pushed to approximately one. The singular vectors, which set the directions of the update, are preserved; the magnitude along each of those directions is normalized. Empirically this stabilizes training at scale and converges faster per FLOP. The polar-factor trick is the math that turns "Muon" from a paper curiosity into an optimizer DeepSeek shipped a frontier model on.
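The singular-value claim can be checked numerically. The sketch below uses the textbook cubic Newton-Schulz iteration, which is an assumption here since the coefficients V4 uses are not given in this article, and compares its result against the exact polar factor from an SVD. The example matrix is arbitrary and `newton_schulz_polar` is an illustrative name.

```python
import numpy as np


def newton_schulz_polar(g, iters=10):
    """Approximate the polar factor U @ V.T of g without computing an SVD."""
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius norm bounds every singular value by 1
    for _ in range(iters):
        # Maps each singular value s to 1.5*s - 0.5*s**3, whose stable fixed point is 1.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x


g = np.array([[2.0, 1.0], [1.0, 3.0], [0.0, 1.0]])  # arbitrary example gradient
approx = newton_schulz_polar(g)

u, s, vt = np.linalg.svd(g, full_matrices=False)
exact = u @ vt  # exact polar factor, computed here for comparison only

print(np.linalg.svd(approx, compute_uv=False))  # singular values of the update
print(np.max(np.abs(approx - exact)))           # gap to the exact polar factor
```

The iteration uses only matrix multiplies, which is the practical reason to prefer it over an SVD on accelerators. Gradients with very small singular values would need more iterations than this well-conditioned example does.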
FP4 KV cache, and why it does not destroy the model
A four-bit float has only sixteen bit patterns, so it can represent at most sixteen distinct values. Storing a key or value tensor in FP4 sounds like it should be catastrophic for attention quality. It is not, for two reasons that compound:
- FP4 is used only for the indexer's keys and values, not for the main attention path. The indexer's job is to score and rank compressed entries, not to attend to them directly. Rank is robust to quantization in a way that softmax-weighted summation is not.
- The model is trained with FP4 in the loop via quantization-aware training. The weights learn around the FP4 grid rather than being projected onto it after the fact.
Combined with the four-tokens-to-one CSA grouping and the per-layer mixed precision (FP8 main, BF16 rotary, FP4 indexer), you get the 50x KV cache reduction without the quality collapse you would expect from naive FP4.
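The rank-robustness argument can be tried on random data. The sketch below rounds stand-in indexer keys to a 4-bit grid and compares the top-16 selection before and after. Several things are assumptions: the E2M1 value set is one common FP4 layout and may not be the one V4 uses, a single per-tensor scale replaces the finer per-block scales real formats use, and the data is Gaussian noise rather than real activations.

```python
import numpy as np

# Non-negative magnitudes of the E2M1 4-bit float layout (an assumed FP4 variant).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4(x):
    """Round x to the nearest E2M1 value after scaling its largest magnitude to 6."""
    scale = np.abs(x).max() / E2M1[-1]
    if scale == 0:
        return x.copy()
    mags = np.abs(x) / scale
    nearest = E2M1[np.abs(mags[..., None] - E2M1).argmin(axis=-1)]
    return np.sign(x) * nearest * scale


rng = np.random.default_rng(0)
q = rng.normal(size=64)
index_keys = rng.normal(size=(256, 64))  # stand-in for the indexer's keys

top_full = set(np.argsort(index_keys @ q)[-16:].tolist())
top_fp4 = set(np.argsort(fake_quant_fp4(index_keys) @ q)[-16:].tolist())
print(f"top-16 overlap: {len(top_full & top_fp4)} of 16")
```

The same round-to-grid function is what a quantization-aware training loop would apply in the forward pass, so the weights are optimized against the values they will actually see at inference time.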
How V4 compares to V3.2
| Axis | V3.2 | V4-Pro |
|---|---|---|
| KV cache at 1M context (relative) | 1.0× | 0.1× |
| FLOPs at 1M context (relative) | 1.0× | 0.27× |
| Attention type | MLA | CSA + HCA hybrid |
| Optimizer | AdamW | Muon (most params) |
| KV cache precision | BF16 | FP8 / BF16 / FP4 mixed |
| Residual design | Standard | Manifold-constrained hyper-connections |
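Note the baseline: the table is relative to V3.2, which already shrinks its cache with MLA, so the 0.1× here is a 10x reduction over V3.2 and is not the same measurement as the 50x headline, which is taken against a standard BF16 baseline. At shorter contexts the gap narrows. The CSA grouping and FP4 indexer help most when the cache dominates memory, which happens past roughly 100k tokens. Under 32k tokens V4 is incrementally better than V3.2; over 256k tokens it is in a different cost regime.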
The unsexy lessons
Three takeaways that generalize beyond DeepSeek:
One. The cache, not the compute, is the long-context bottleneck. Optimization research that ignores cache size will keep losing to research that does not.
Two. Multiplicative wins from independent axes beat any single architectural breakthrough. There is no "one big idea" in V4; there are five medium ones that touch different layers of the system.
Three. Open weights changed the dynamics. The architectural details above are not a leak. They are in the technical report, in the code, and reproducible by anyone with a few thousand GPUs and the patience to run it.
What to read next
The interactive walkthrough with diagrams of each layer, the CSA selection animation, and the five-way comparison to V3.2, Subquadratic SSA, Mamba, and Kimi Linear is at the DeepSeek-V4 lab. The companion Subquadratic note covers the closed-weights long-context competitor. The LeWorldModel lab covers the world-model angle on the same problem family.