Model Release Synthesized from 2 sources

DeepSeek V4 Pro Ships 1.6T MoE After 484-Day Dev Cycle

Key Points

• V4 Pro: 1.6T MoE, 48/256 active experts, MIT license
• CSA/HCA cuts FLOPs to 27% and KV cache to 10% of DeepSeek-V3.2 levels at 1M tokens
• Huawei Ascend compatibility targets Chinese hardware independence
• Both Base and Instruct variants ship at launch
• #2 open-weight tier per independent benchmarks
• 58-page technical report covers 484-day development

References (2)

[1] DeepSeek V4 Pro and Flash Released with 1.6T MoE, 1M Context (Latent Space)
[2] DeepSeek V4 technical report reveals 484-day development cycle (量子位 QbitAI)

What does 484 days of methodical engineering produce that a three-month sprint cannot? DeepSeek's answer arrived this week in the form of V4 Pro and V4 Flash, a dual-model release that resets timeline expectations for frontier AI development while delivering performance competitive with Gemini 3.1-class and GPT-5.4-class systems.

The wait was not for lack of ambition. DeepSeek published a 58-page technical report that lays out the design decisions made across roughly 16 months of iteration, a level of transparency rare in open-weight releases. The document describes V4 Pro's 1.6-trillion-parameter Mixture of Experts architecture, which activates 48 of its 256 experts for each token. The smaller Flash variant distills this to 284 billion active parameters across 8 experts. Both models support a 1-million-token context through DeepSeek's Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) mechanisms.
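To make the "48 of 256 experts per token" figure concrete, here is a minimal sketch of generic top-k expert routing. The article does not describe V4's actual gating function, so the softmax-over-selected-logits scheme, the function name route_token, and the random stand-in logits are all assumptions for illustration, not DeepSeek's implementation.

```python
import math
import random

NUM_EXPERTS = 256   # total experts, per the article
TOP_K = 48          # experts active per token, per the article


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only,
    so they sum to 1. Real routers may normalize differently.
    """
    # Indices of the top_k largest logits.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i],
                    reverse=True)[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


rng = random.Random(0)
# Stand-in for a learned router's output for one token (hypothetical values).
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
assignment = route_token(logits)

# Fraction of experts that do any work for this token: 48 / 256.
active_fraction = len(assignment) / NUM_EXPERTS
```

Under this scheme only 18.75% of the experts run for any given token, which is why a sparse model's per-token compute can be far smaller than its total parameter count suggests.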

The CSA/HCA breakthrough is the headline efficiency story. At the full 1-million-token context length, these techniques cut inference cost to 27% of the FLOPs and 10% of the KV cache memory consumed by DeepSeek-V3.2. That is a meaningful advance given that long-context inference has historically punished even well-funded deployments. The report attributes this to manifold-constrained hyper-connections, a training methodology DeepSeek first introduced in a January 2026 paper and has now validated at production scale.
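The reported ratios translate into reduction factors with simple arithmetic. The sketch below uses only the two percentages quoted above; the 100 GB baseline is a made-up illustrative figure, not a number from the report, and the helper names are hypothetical.

```python
FLOPS_RATIO = 0.27      # V4 FLOPs relative to DeepSeek-V3.2 at 1M tokens (from the article)
KV_CACHE_RATIO = 0.10   # V4 KV cache relative to DeepSeek-V3.2 at 1M tokens (from the article)


def scaled_cost(baseline, ratio):
    """Scale a baseline cost by the reported ratio."""
    return baseline * ratio


def reduction_factor(ratio):
    """How many times smaller the new cost is than the baseline."""
    return 1.0 / ratio


# Hypothetical baseline: suppose a V3.2-style KV cache for one 1M-token
# request took 100 GB. This figure is illustrative, not from the report.
baseline_kv_gb = 100.0
v4_kv_gb = scaled_cost(baseline_kv_gb, KV_CACHE_RATIO)  # about 10 GB

kv_factor = reduction_factor(KV_CACHE_RATIO)    # about 10x less cache memory
flops_factor = reduction_factor(FLOPS_RATIO)    # about 3.7x fewer FLOPs
```

In other words, the cache saving is the larger of the two: roughly a tenfold reduction in memory against a bit under a fourfold reduction in compute.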

Performance benchmarks tell a nuanced story. Independent evaluators place V4 Pro at the #2 open-weight tier, competitive with Kimi K2.6 and GLM-5.1, with particular strength in long-context reasoning and agentic coding tasks. It trails the closed frontier—GPT-5.x and Opus 4.7 remain ahead—but the gap narrows significantly in agentic scenarios where the million-token context becomes a genuine advantage rather than a spec sheet flex.

Perhaps the most consequential detail sits not in the model card but in the hardware compatibility matrix. DeepSeek V4 ships with native support for Huawei Ascend chips, making it the first major open-weight release to explicitly target Huawei's CANN software stack rather than NVIDIA's CUDA. The geopolitical framing, Chinese AI independence from export-controlled H100s, is accurate but undersells the engineering challenge. Ascend chips are available at roughly a quarter of the supply of H100s, and building a frontier model that runs well on them requires fundamentally rethinking memory bandwidth utilization and operator fusion, not just porting.

The release also breaks a pattern. DeepSeek shipped V3 as Base-only, inviting the community to fine-tune an instruct model. V4 arrives with both Base and Instruct variants at launch, a practical decision that accelerates deployment timelines for anyone building production systems. The MIT license covers both, removing the licensing ambiguity that has complicated enterprise adoption of earlier DeepSeek releases.

What does 484 days of patient engineering unlock? The technical report itself may be the answer. Multiple researchers have already cited it as among the best-written model papers of the year—not for its conclusions, but for documenting methodology that competitors can study and the field can build upon. In an era when most frontier labs treat architecture details as competitive moats, DeepSeek's disclosure strategy suggests a different theory of value: the discipline to ship the documentation, not just the weights.