Vol. XII · No. 05 · May 2026
Jake Cuth.

DeepSeek-V4 architecture, in plain language.


DeepSeek-V4 is the open-weights frontier model that reset the long-context bar in 2026. The headline number is a 50x KV cache reduction at one-million-token context relative to a standard BF16 baseline. That number is misleading on its own, because it is the multiplicative product of five separate architectural choices, none of which would get there alone. This is the plain-language version of how those choices fit together.

The five things that compound

  1. Hybrid attention: Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA).
  2. Mixed precision per tensor role: FP8 main, BF16 for rotary positional dims, FP4 for the indexer.
  3. Muon optimizer: orthogonalized gradient steps instead of AdamW.
  4. Manifold-constrained hyper-connections: residuals projected onto a learned manifold instead of summed naively.
  5. FP4 quantization-aware training: the model is trained with FP4 in the loop, not quantized after the fact.

Each of these by itself is a one-to-three-times improvement somewhere. The product is the 50x.
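As a rough sanity check on that arithmetic, here is a small sketch. It assumes the five factors are equal, which the text does not claim; only the 50x target and the count of five come from the article.

```python
import math

TARGET = 50   # headline reduction quoted in the text
AXES = 5      # number of compounding choices listed above

def per_axis_factor(target: float, axes: int) -> float:
    """Equal per-axis factor whose product over `axes` axes equals `target`.

    The equal split is an illustrative assumption, not a measured breakdown.
    """
    return target ** (1.0 / axes)

def combined(factors) -> float:
    """Multiplicative product of independent improvement factors."""
    return math.prod(factors)

if __name__ == "__main__":
    f = per_axis_factor(TARGET, AXES)
    print(f"equal split: {f:.2f}x per axis")
    print(f"ceiling if every axis reached 3x: {combined([3] * AXES):.0f}x")
```

Under the equal-split assumption each axis needs roughly 2.2x, which sits inside the one-to-three-times range, and five axes at the top of that range would cap out at 243x.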

Compressed Sparse Attention, the CSA layer

Standard attention is O(n²) in sequence length because every query attends to every key. At one-million-token context the KV cache that has to sit in GPU memory becomes the limit, not the compute. CSA attacks the cache directly.
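To see why the cache becomes the limit, a back-of-the-envelope estimate helps. The sketch below uses the usual sizing formula for a standard attention layout; the layer count, KV head count, and head dimension are hypothetical values picked for illustration, not V4's configuration.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * seq_len * layers * kv_heads * head_dim * bytes_per_elem

# Hypothetical model shape, chosen only for illustration (not V4's configuration).
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128
BF16_BYTES = 2

baseline = kv_cache_bytes(1_000_000, LAYERS, KV_HEADS, HEAD_DIM, BF16_BYTES)
reduced = baseline / 50  # the headline 50x reduction quoted in the text

print(f"BF16 baseline: {baseline / 1e9:.1f} GB")
print(f"after 50x:     {reduced / 1e9:.1f} GB")
```

With these made-up dimensions, a single one-million-token sequence needs about 246 GB of cache in BF16, more than one accelerator holds, and about 5 GB after a 50x reduction.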

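The mechanism, simplified: CSA stores one compressed cache entry for every four tokens of history, so the attention cache holds a quarter as many entries as there are tokens.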

The HCA layer does the same thing, more aggressively: 128 tokens per compressed entry instead of 4. CSA and HCA are interleaved so the model gets both fine and coarse views of history.
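The grouping ratios above translate directly into cache entry counts. The sketch below counts entries for both ratios and uses mean pooling as a stand-in compressor; the text does not describe how V4 actually merges tokens, so `mean_pool` is a hypothetical placeholder for a learned compression step.

```python
from math import ceil

CSA_GROUP = 4     # tokens per compressed entry in CSA (from the text)
HCA_GROUP = 128   # tokens per compressed entry in HCA (from the text)

def compressed_entries(seq_len: int, group: int) -> int:
    """Number of cache entries when every `group` tokens share one entry."""
    return ceil(seq_len / group)

def mean_pool(cache, group):
    """Stand-in compressor: average each run of `group` token vectors.

    Assumption for illustration only; the real compressor is presumably learned.
    """
    pooled = []
    for start in range(0, len(cache), group):
        chunk = cache[start:start + group]
        dim = len(chunk[0])
        pooled.append([sum(vec[i] for vec in chunk) / len(chunk) for i in range(dim)])
    return pooled

if __name__ == "__main__":
    n = 1_000_000
    print("CSA entries:", compressed_entries(n, CSA_GROUP))
    print("HCA entries:", compressed_entries(n, HCA_GROUP))
```

At one million tokens, the fine view keeps 250,000 entries and the coarse view fewer than 8,000, which is why interleaving the two is cheap relative to a full cache.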

The Muon optimizer

AdamW is the default for trillion-parameter training. Muon replaces it for most parameters. The intuition is that AdamW's per-element adaptive step sizes are doing partial work that you can do better with a structural intervention on the whole matrix.

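A runnable sketch of the inner loop in Python with NumPy. Newton-Schulz approximates the polar factor directly, without computing an SVD; momentum is omitted, and the cubic variant used here is an assumption, since the text does not say which iteration V4 uses:

import numpy as np

def muon_step(W, g, lr, iters=10):
    X = g / (np.linalg.norm(g) + 1e-7)       # Frobenius scaling: singular values start in (0, 1]
    for _ in range(iters):                   # cubic Newton-Schulz pushes them toward 1
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return W - lr * X                        # X approximates the polar factor U @ V.T: shape only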

In the exact polar factor every singular value equals one; the Newton-Schulz iteration gets close to that in a handful of steps. The step direction is preserved; the step magnitudes across directions are normalized. Empirically this stabilizes training at scale and converges faster per FLOP. The polar-factor trick is the math that turns "Muon" from a paper curiosity into an optimizer DeepSeek shipped a frontier model on.
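One common way to approximate the polar factor is the cubic Newton-Schulz iteration, X ← 1.5·X − 0.5·X·Xᵀ·X; whether V4 uses this exact variant is an assumption. Because the update acts on each singular value independently (s ← 1.5·s − 0.5·s³), the squashing can be watched on plain scalars:

```python
def newton_schulz_scalar(s: float, iters: int = 10) -> float:
    """Apply the cubic Newton-Schulz map to one singular value.

    Assumes 0 < s <= 1, which holds after dividing the gradient by its
    Frobenius norm. The cubic variant is an illustrative choice.
    """
    for _ in range(iters):
        s = 1.5 * s - 0.5 * s ** 3
    return s

if __name__ == "__main__":
    for s0 in (0.9, 0.5, 0.1, 0.01):
        print(f"{s0:>5} -> {newton_schulz_scalar(s0):.4f}")
```

Mid-sized singular values settle at one well within ten iterations, while very small ones such as 0.01 are still well below one after ten steps, so the result is an approximation of the polar factor and not an exact orthogonalization.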

FP4 KV cache, and why it does not destroy the model

Four-bit floats have sixteen bit patterns, so at most sixteen distinct values. Storing a key or value tensor in FP4 sounds like it should be catastrophic for attention quality. It is not, for two reasons that compound:

Combined with the four-tokens-to-one CSA grouping and the mixed precision per tensor role (FP8 main, BF16 rotary, FP4 indexer), you get the 50x KV cache reduction without the quality collapse you would expect from naive FP4.
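To make the sixteen-value budget concrete, here is a minimal sketch of rounding a block of numbers to a 4-bit float grid. It assumes the E2M1 layout (eight magnitudes, each with a sign bit) and one shared scale per block; the text does not say which FP4 format or scaling scheme V4 uses, so both are assumptions.

```python
import math

# Non-negative magnitudes of the E2M1 4-bit float layout (an assumed format).
# With the sign bit this gives 16 bit patterns, including +0 and -0.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_block(block):
    """Round a block to the FP4 grid with one shared scale, then dequantize."""
    peak = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = peak / E2M1_MAGNITUDES[-1] if peak > 0 else 1.0
    out = []
    for x in block:
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        out.append(math.copysign(nearest * scale, x))
    return out

if __name__ == "__main__":
    print(quantize_block([6.0, 3.1, -0.2, 1.2]))
```

The grid is dense near zero and coarse near the top, so a shared scale keeps the largest value in each block exact while small values lose the most relative precision.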

How V4 compares to V3.2

Axis                                 V3.2        V4-Pro
KV cache at 1M context (relative)    1.0×        0.1×
FLOPs at 1M context (relative)       1.0×        0.27×
Attention type                       MLA         CSA + HCA hybrid
Optimizer                            AdamW       Muon (most params)
KV cache precision                   BF16        FP8 / BF16 / FP4 mixed
Residual design                      Standard    Manifold-constrained hyper-connections

At shorter contexts the gap narrows. The CSA grouping and FP4 indexer help most when the cache dominates memory, which happens past roughly 100k tokens. Under 32k tokens V4 is incrementally better than V3.2; over 256k tokens it is in a different cost regime.

The unsexy lessons

Three takeaways that generalize beyond DeepSeek:

One. The cache, not the compute, is the long-context bottleneck. Optimization research that ignores cache size will keep losing to research that does not.

Two. Multiplicative wins from independent axes beat any single architectural breakthrough. There is no "one big idea" in V4; there are five medium ones that touch different layers of the system.

Three. Open weights changed the dynamics. The architectural details above are not a leak. They are in the technical report, in the code, and reproducible by anyone with a few thousand GPUs and the patience to run it.

What to read next

The interactive walkthrough with diagrams of each layer, the CSA selection animation, and the five-way comparison to V3.2, Subquadratic SSA, Mamba, and Kimi Linear is at the DeepSeek-V4 lab. The companion Subquadratic note covers the closed-weights long-context competitor. The LeWorldModel lab covers the world-model angle on the same problem family.

llm deepseek attention muon fp4