DP Attention: No More Replicated KV Cache

2026-06-03

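In inference for large MoE models, the conventional TP + EP setup leaves every GPU holding a complete copy of the KV cache, which wastes a large amount of GPU memory. DP Attention removes that redundancy by having each GPU serve only its own requests. This post compares the two schemes on a running example (4 GPUs, 8 experts), with interactive animations.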

Conventional scheme: TP (Attention) + EP (Experts)

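In the conventional scheme, Attention is split with Tensor Parallelism (each GPU computes only 1/N of the heads) and the experts are split with Expert Parallelism (each GPU hosts a subset of the experts). The problem: every GPU has to hold the complete KV cache of all requests, which is N-fold redundancy. This is the case when the cache cannot be partitioned by head, as with the MLA attention in DeepSeek models, where all heads share one compressed KV latent.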

The animation below walks through one full layer of the conventional scheme: initial redundant KV cache → Attention (TP) + AllReduce → All-to-All dispatch → expert computation → All-to-All combine:

Communication cost per layer:

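**Attention communication:** AllReduce (2× per layer, merging the partial-head outputs)
**Expert communication:** All-to-All (2× per layer, dispatch + combine)
**KV Cache:** every GPU stores the KV of all requests → one copy per GPU, i.e. as many redundant copies as the TP/EP degree (4 in this example)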
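To make "merging the partial-head outputs" concrete, here is a minimal NumPy sketch, not taken from any real inference engine. It assumes plain multi-head attention without masking; the function names, shapes, and weights are all made up for illustration, and the AllReduce is simulated by summing the per-rank partial outputs in one process. Note that every rank has to process the tokens of every request, which is why all requests' state ends up on every GPU.

```python
import numpy as np

def attention_heads(x, wq, wk, wv, wo, n_heads):
    """Unmasked multi-head self-attention. x: [seq, d_model]; weights cover n_heads heads."""
    seq = x.shape[0]
    d_head = wq.shape[1] // n_heads
    # Project and split into heads: [n_heads, seq, d_head]
    q = (x @ wq).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ wk).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ wv).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)      # [n_heads, seq, seq]
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerically stable softmax
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    ctx = (probs @ v).transpose(1, 0, 2).reshape(seq, n_heads * d_head)
    return ctx @ wo                                            # [seq, d_model]

def tp_attention(x, wq, wk, wv, wo, n_heads, n_ranks):
    """Each simulated rank owns n_heads / n_ranks heads and produces a partial output."""
    heads_per_rank = n_heads // n_ranks
    cols = heads_per_rank * (wq.shape[1] // n_heads)
    partials = []
    for rank in range(n_ranks):
        sl = slice(rank * cols, (rank + 1) * cols)
        # Column slice of Q/K/V weights, matching row slice of the output projection.
        partials.append(
            attention_heads(x, wq[:, sl], wk[:, sl], wv[:, sl], wo[sl, :], heads_per_rank)
        )
    # Stand-in for AllReduce(sum): every rank would end up with this sum.
    return np.sum(partials, axis=0)

# Hypothetical sizes, chosen only so the example runs quickly.
rng = np.random.default_rng(0)
d_model, n_heads, d_head, seq = 16, 8, 4, 5
x = rng.standard_normal((seq, d_model))
wq = 0.1 * rng.standard_normal((d_model, n_heads * d_head))
wk = 0.1 * rng.standard_normal((d_model, n_heads * d_head))
wv = 0.1 * rng.standard_normal((d_model, n_heads * d_head))
wo = 0.1 * rng.standard_normal((n_heads * d_head, d_model))

full = attention_heads(x, wq, wk, wv, wo, n_heads)
tp = tp_attention(x, wq, wk, wv, wo, n_heads, n_ranks=4)
print(np.allclose(full, tp))
```

The partial outputs only equal the full result after they are summed, so the collective cannot be skipped under TP.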

DP Attention scheme: DP (Attention) + EP (Experts)

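DP Attention switches Attention to Data Parallel: each GPU serves only its own requests (one request per GPU in this example) and computes the full Attention, all heads, on its own. The Attention parameters are replicated on every GPU, but they make up a small share of the model (about 1%), so the cost of replicating them is negligible.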

Key benefit: each GPU stores only the KV cache of its own request, so there is zero redundancy, and the Attention phase needs no cross-GPU communication at all.
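The memory trade described above can be put into a few lines of arithmetic. The sketch below is a simplified model, not a measurement: it takes the article's premise that under TP every GPU holds the full KV cache, it ignores expert weights and activations, and the function name and all sizes (in GiB) are hypothetical.

```python
def attention_bytes_per_gpu(scheme, n_gpus, attn_weights, kv_per_request, n_requests):
    """Per-GPU memory for Attention weights plus KV cache under a simplified model."""
    if scheme == "tp":
        weights = attn_weights // n_gpus               # weights sharded across GPUs
        kv = kv_per_request * n_requests               # full cache on every GPU (premise)
    elif scheme == "dp":
        weights = attn_weights                         # weights replicated on every GPU
        kv = kv_per_request * n_requests // n_gpus     # each GPU keeps only its own requests
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return weights + kv

# Hypothetical numbers: 4 GPUs, 8 GiB of Attention weights, 4 requests of 4 GiB KV each.
tp_gib = attention_bytes_per_gpu("tp", 4, attn_weights=8, kv_per_request=4, n_requests=4)
dp_gib = attention_bytes_per_gpu("dp", 4, attn_weights=8, kv_per_request=4, n_requests=4)
print(tp_gib, dp_gib)  # 18 12
```

In this model TP costs W/N + K per GPU and DP costs W + K/N, where W is the Attention weight size and K the total KV cache, so DP comes out ahead whenever K exceeds W.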

The animation below shows the DP Attention scheme: initial KV cache with zero redundancy → Attention (DP, zero communication) → All-to-All dispatch → expert computation → All-to-All combine:

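Note the contrast: compared with the conventional scheme, DP Attention does no communication anywhere in the Attention phase. A layer is left with only the experts' 2× All-to-All and no AllReduce at all.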

Data-flow walkthrough: how a batch moves through one layer (step by step)

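Looking only at "compute + communication" is not very intuitive, so the step-by-step animation below follows the data itself: a batch of 4 sequences (1 per GPU) comes in and passes through Attention (DP, zero communication) → Router → All-to-All dispatch → Expert → All-to-All combine. Click Next → to go one step at a time. The point to watch: in the Attention phase each sequence is computed locally on its own GPU, whereas in the MoE phase a sequence is broken into tokens, scattered to different experts by the All-to-All, and then gathered back.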
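The dispatch and combine steps can also be traced in code. This is a single-process NumPy sketch under loud assumptions: top-1 routing with no gating weights (real MoE models typically route each token to several experts and weight the results), experts reduced to one linear map each, and the All-to-All simulated with Python lists. `moe_layer`, `router_w`, and `expert_w` are illustrative names, not an existing API.

```python
import numpy as np

N_GPUS, N_EXPERTS = 4, 8
EXPERTS_PER_GPU = N_EXPERTS // N_GPUS   # GPU g hosts experts 2g and 2g+1

def moe_layer(tokens_per_gpu, router_w, expert_w):
    """tokens_per_gpu: one [n_tok, d] array per GPU (the sequence that GPU owns)."""
    # Router + dispatch: each token is sent to the GPU hosting its chosen expert.
    inbox = [[] for _ in range(N_GPUS)]
    for src, toks in enumerate(tokens_per_gpu):
        expert_ids = np.argmax(toks @ router_w, axis=1)   # top-1 expert per token
        for pos, (e, vec) in enumerate(zip(expert_ids, toks)):
            inbox[int(e) // EXPERTS_PER_GPU].append((src, pos, int(e), vec))

    # Expert compute on the destination GPU, then combine: each result goes
    # back to the source GPU and the token's original position.
    outputs = [np.zeros_like(toks) for toks in tokens_per_gpu]
    for dst in range(N_GPUS):
        for src, pos, e, vec in inbox[dst]:
            outputs[src][pos] = vec @ expert_w[e]
    load = [len(box) for box in inbox]                    # tokens received per GPU
    return outputs, load

# Hypothetical sizes: 4 sequences of 3 tokens, hidden size 6.
rng = np.random.default_rng(0)
d = 6
tokens_per_gpu = [rng.standard_normal((3, d)) for _ in range(N_GPUS)]
router_w = rng.standard_normal((d, N_EXPERTS))
expert_w = rng.standard_normal((N_EXPERTS, d, d))

outputs, load = moe_layer(tokens_per_gpu, router_w, expert_w)
print(load)  # how many tokens each GPU's experts received after dispatch
```

The `(src, pos)` pair carried with each token is what lets the combine step restore the original sequence order; the per-GPU load is generally uneven because routing depends on the data.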

Comparison summary

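| Metric | ❌ Conventional TP + EP | ✓ DP Attention + EP |
| --- | --- | --- |
| Attention communication | AllReduce (2× per layer) | None (each GPU computes on its own) |
| Expert communication | All-to-All (2× per layer) | All-to-All (2× per layer) |
| KV Cache copies | One per GPU, i.e. the TP/EP degree (redundant) | 1 copy (zero redundancy) |
| Total KV Cache footprint | × the TP/EP degree | Minimal |
| Attention parameters | Sharded by TP | Replicated on every GPU (~1% of parameters) |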

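**Core trade-off:** DP Attention replicates a small set of Attention parameters and in exchange removes both the KV cache redundancy and the Attention communication. In MoE models the KV cache is often the largest consumer of inference memory, so the trade pays off well. This is also why models such as DeepSeek-V2/V3 use DP Attention at inference time.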

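The body of this post is rendered from markdown; the 2 interactive animations are embedded as isolated iframes through a custom {% anim %} tag and come from the Arkive tutorial.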