What is the DeepSeek-V4 architecture?

DeepSeek-V4 is a 1.6-trillion-parameter mixture-of-experts large language model. It uses a hybrid attention design (Compressed Sparse Attention plus Heavily Compressed Attention), the Muon optimizer for most weights, FP4 quantization for the lightning indexer's part of the KV cache, and manifold-constrained hyper-connections in place of standard residuals. The combination yields roughly 50x KV cache reduction relative to a BF16 GQA8 baseline at one-million-token context.

What is Compressed Sparse Attention (CSA) in DeepSeek-V4?

CSA is the first of two attention layer types. It groups every four consecutive tokens into a single compressed key-value entry, then a lightning indexer selects the top 1,024 (Pro) or top 512 (Flash) most relevant entries for each query. The query attends only to those selected entries plus a sliding window of recent uncompressed tokens, rather than to every previous token.

What is the Muon optimizer?

Muon is the optimizer DeepSeek-V4 uses in place of AdamW for most parameters. It orthogonalizes the gradient before stepping: the gradient matrix M, with singular value decomposition M = USV^T, is replaced by an approximation of its polar factor UV^T. That approximation is computed with a ten-iteration hybrid Newton-Schulz scheme rather than an explicit SVD, and the step is taken in that shape-only direction with all singular values normalized to one. The result is faster convergence and improved stability at trillion-parameter scale.

What is FP4 KV cache and why does it matter?

FP4 is a four-bit floating-point format. DeepSeek-V4 stores the lightning indexer's keys and values in FP4 rather than FP8 or BF16, cutting that part of the KV cache by 2x or 4x respectively. Combined with the CSA grouping (4 tokens per entry) and the per-layer mixed-precision design, this is one of the three multiplicative factors behind the 50x total KV cache reduction at one-million-token context.

What is different between DeepSeek-V4 and V3.2?

V4 differs from V3.2 across five compounding axes: hybrid CSA plus HCA attention rather than single-type attention, Muon optimizer rather than AdamW, mixed precision (FP8 main / BF16 rotary / FP4 indexer) rather than uniform precision, FP4 quantization-aware training rather than post-hoc quantization, and manifold-constrained hyper-connections rather than standard residuals. At one-million-token context V4-Pro uses roughly ten percent of V3.2's KV cache and twenty-seven percent of its FLOPs.

Where can I read the DeepSeek-V4 paper?

DeepSeek published a technical report alongside the model release. The full paper covers the attention design, optimizer math, mixed-precision schedule, and benchmark results. The interactive walkthrough at jakecuth.com/work/deepseek-v4-lab/ summarizes the key claims with diagrams and the five-architecture comparison.

DeepSeek-V4 Architecture Explained · Muon, CSA, FP4 KV Cache

DeepSeek-V4 is the open-weights frontier model that reset the long-context bar in 2026. The headline number is a 50x KV cache reduction at one-million-token context relative to a BF16 GQA8 baseline. That number is misleading on its own, because it is the multiplicative product of five separate architectural choices, none of which would get there alone. This is the plain-language version of how those choices fit together.

The five things that compound

Hybrid attention: Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA).
Mixed precision per tensor role: FP8 main, BF16 for rotary positional dims, FP4 for the indexer.
Muon optimizer: orthogonalized gradient steps instead of AdamW.
Manifold-constrained hyper-connections: residuals projected onto a learned manifold instead of summed naively.
FP4 quantization-aware training: the model is trained with FP4 in the loop, not quantized after the fact.

Each of these by itself is a one-to-three-times improvement somewhere. The product is the 50x.

Compressed Sparse Attention, the CSA layer

Standard attention is O(n²) in sequence length because every query attends to every key. At one-million-token context the KV cache that has to sit in GPU memory becomes the limit, not the compute. CSA attacks the cache directly.

The mechanism, simplified:

Group every four consecutive tokens into one compressed key-value entry. The cache is now four times smaller per layer.
A lightweight network, the "lightning indexer," scores each compressed entry against the current query and picks the top 1,024 (Pro) or top 512 (Flash).
The query attends only to those selected entries, plus a sliding window of recent uncompressed tokens so the local detail is preserved.
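
The three steps above can be sketched in plain Python. This is a toy under stated assumptions, not DeepSeek's implementation: mean pooling stands in for the learned compression, a dot product stands in for the lightning indexer, and the name csa_select and its parameters are hypothetical.

```python
def csa_select(keys, query, group=4, top_k=2, window=4):
    """Return (selected compressed-entry indices, recent raw-token indices)."""
    n = len(keys)
    n_groups = n // group  # this sketch compresses complete groups only

    # Step 1: compress each run of `group` tokens into one entry.
    # Assumption: mean pooling; the real compression is learned.
    compressed = []
    for gi in range(n_groups):
        block = keys[gi * group:(gi + 1) * group]
        compressed.append([sum(col) / group for col in zip(*block)])

    # Step 2: score every compressed entry against the query and keep the top-k.
    # Assumption: a plain dot product plays the role of the indexer.
    scores = [sum(q * c for q, c in zip(query, entry)) for entry in compressed]
    ranked = sorted(range(n_groups), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:top_k])

    # Step 3: always keep a sliding window of the most recent raw tokens.
    recent = list(range(max(0, n - window), n))
    return selected, recent


# 16 tokens with 2-dimensional keys, arranged as four groups of four.
keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4 + [[3.0, 0.0]] * 4 + [[0.0, 4.0]] * 4
query = [1.0, 0.5]

selected, recent = csa_select(keys, query)
print(selected)  # [2, 3]
print(recent)    # [12, 13, 14, 15]
```

With top_k at 1,024 or 512 and a context of hundreds of thousands of entries, the same shape of computation means each query touches a small, fixed number of entries no matter how long the history grows.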

The HCA layer does the same thing, more aggressively: 128 tokens per compressed entry instead of 4. CSA and HCA are interleaved so the model gets both fine and coarse views of history.
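
To make the fine and coarse views concrete, here is the entry-count arithmetic at one-million-token context. It is a back-of-the-envelope sketch: "one million" is taken as exactly 10**6 tokens, a trailing partial group is assumed to get its own entry, and cache_entries is a hypothetical helper.

```python
import math


def cache_entries(context_tokens, tokens_per_entry):
    """Compressed entries needed to cover the context (last one may be partial)."""
    return math.ceil(context_tokens / tokens_per_entry)


CONTEXT = 1_000_000  # assumption: exactly 10**6 tokens

csa_entries = cache_entries(CONTEXT, 4)    # fine view: 4 tokens per entry
hca_entries = cache_entries(CONTEXT, 128)  # coarse view: 128 tokens per entry

# Share of the CSA entries a single query attends to after top-k selection.
pro_fraction = 1024 / csa_entries
flash_fraction = 512 / csa_entries

print(csa_entries, hca_entries)                     # 250000 7813
print(f"{pro_fraction:.2%} {flash_fraction:.2%}")  # 0.41% 0.20%
```

So a CSA layer holds a quarter of a million entries per layer but reads well under one percent of them per query, while an HCA layer holds fewer than eight thousand.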

The Muon optimizer

AdamW is the default optimizer for training at trillion-parameter scale. Muon replaces it for most parameters. The intuition is that AdamW's per-element adaptive step sizes do only part of a job that a structural intervention on the whole matrix does better.

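A runnable NumPy sketch of the inner loop. The classic cubic Newton-Schulz iteration stands in here for the hybrid variant, which goes straight to the polar factor without ever forming U, S, or V:

import numpy as np

def muon_step(W, g, lr, iters=10):
    X = g / (np.linalg.norm(g) + 1e-7)     # Frobenius-normalize: singular values now lie in (0, 1]
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X    # Newton-Schulz: pushes every singular value toward 1
    return W - lr * X                      # X approximates the polar factor U @ V.T (shape only)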

All singular values are squashed to one. The singular vectors, which carry the direction of the step, are preserved; the magnitudes along them are normalized. Empirically this stabilizes training at scale and converges faster per FLOP. The polar-factor trick is the math that turns "Muon" from a paper curiosity into an optimizer DeepSeek shipped a frontier model on.
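
Why does iterating squash the singular values to one? Assume the classic cubic Newton-Schulz update X <- 1.5 X - 0.5 X X^T X (the coefficients of the hybrid variant are not given here, so treat this as a stand-in). If X = U diag(s) V^T, then X X^T X = U diag(s^3) V^T, so the update leaves U and V alone and sends each singular value s to 1.5 s - 0.5 s^3. A scalar sketch of that map, with a hypothetical function name:

```python
def newton_schulz_scalar(sigma, iters=10):
    """Apply the cubic Newton-Schulz map to a single singular value."""
    for _ in range(iters):
        sigma = 1.5 * sigma - 0.5 * sigma ** 3
    return sigma


# Starting points stand for singular values after the gradient is normalized.
for start in (0.9, 0.5, 0.1, 0.01):
    print(start, round(newton_schulz_scalar(start), 4))
```

Under this plain cubic map, values that start at 0.1 or above land on one within ten iterations, while a value that starts at 0.01 only gets to roughly 0.53. Very small singular values converge slowly, which is one plausible motivation for tuned or hybrid schedules.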

FP4 KV cache, and why it does not destroy the model

A four-bit float has only sixteen bit patterns, so it can represent at most sixteen distinct values. Storing a key or value tensor in FP4 sounds like it should be catastrophic for attention quality. It is not, for two reasons that compound:

FP4 is used only for the indexer keys and values, not the main attention path. The indexer's job is to score and rank compressed entries, not to attend to them directly. Rank is robust to quantization in a way that softmax-weighted summation is not.
The model is trained with FP4 in the loop via quantization-aware training. The weights learn around the FP4 grid rather than being projected onto it after the fact.
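
The first reason can be illustrated with a toy ranking experiment. The assumptions are loud ones: the grid below is the common E2M1 FP4 encoding (which FP4 variant DeepSeek uses is not stated here), the scaling is one scale per vector, the scorer is a dot product, and the keys are made up. It shows the mechanism on five vectors and says nothing about the real model's quality.

```python
import math

# Magnitudes representable in the E2M1 FP4 encoding (assumed variant).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_fp4(vec):
    """Round each element to the nearest FP4 magnitude after per-vector scaling."""
    amax = max(abs(v) for v in vec)
    if amax == 0.0:
        return [0.0 for _ in vec]
    scale = amax / FP4_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for v in vec:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag * scale, v))
    return out


def rank(keys, query):
    """Indices of keys sorted by dot-product score, best first."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)


keys = [
    [0.9, 0.1, -0.3],
    [0.2, 0.8, 0.5],
    [-0.6, 0.4, 0.1],
    [0.7, 0.6, 0.2],
    [0.1, -0.2, 0.9],
]
query = [1.0, 0.5, 0.25]

exact_order = rank(keys, query)
fp4_order = rank([quantize_fp4(k) for k in keys], query)
print(exact_order)  # [3, 0, 1, 4, 2]
print(fp4_order)    # [3, 0, 1, 4, 2]
```

Individual scores move (the best key goes from 1.05 to about 1.09), but the order does not, and order is all a top-k selector consumes.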

Combined with the four-tokens-to-one CSA grouping and the per-layer mixed precision (FP8 main, BF16 rotary, FP4 indexer), you get the 50x KV cache reduction without the 50x quality drop you would expect from naive FP4.

How V4 compares to V3.2

Axis	V3.2	V4-Pro
KV cache at 1M context (relative)	1.0×	0.1×
FLOPs at 1M context (relative)	1.0×	0.27×
Attention type	MLA	CSA + HCA hybrid
Optimizer	AdamW	Muon (most params)
KV cache precision	BF16	FP8 / BF16 / FP4 mixed
Residual design	Standard	Manifold-constrained hyper-connections

At shorter contexts the gap narrows. The CSA grouping and FP4 indexer help most when the cache dominates memory, which happens past roughly 100k tokens. Under 32k tokens V4 is incrementally better than V3.2; over 256k tokens it is in a different cost regime.

The unsexy lessons

Three takeaways that generalize beyond DeepSeek:

One. The cache, not the compute, is the long-context bottleneck. Optimization research that ignores cache size will keep losing to research that does not.

Two. Multiplicative wins from independent axes beat any single architectural breakthrough. There is no "one big idea" in V4; there are five medium ones that touch different layers of the system.

Three. Open weights changed the dynamics. The architectural details above are not a leak. They are in the technical report, in the code, and reproducible by anyone with a few thousand GPUs and the patience to run it.

What to read next

The interactive walkthrough with diagrams of each layer, the CSA selection animation, and the five-way comparison to V3.2, Subquadratic SSA, Mamba, and Kimi Linear is at the DeepSeek-V4 lab. The companion Subquadratic note covers the closed-weights long-context competitor. The LeWorldModel lab covers the world-model angle on the same problem family.

llm deepseek attention muon fp4

NOTE 006 2026-05-17 · llm · architecture

DeepSeek-V4 architecture, in plain language.