DP Attention vs TP Attention

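Interactive figure: DP Attention vs TP Attention comparison

Adjustable parameters: Num Heads, TP Degree, Seq Length. The two panels show how memory and communication overhead change as these are varied.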

TP Attention (traditional)

Metrics: KV Cache, Comm, Compute

Each GPU holds a full copy of the KV cache
All-reduce after attention and after MLP

DP Attention (optimized)

Metrics: KV Cache, Comm, Compute

Each GPU maintains only its local KV cache
All-to-All between attention and MLP
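
The KV cache difference between the two panels can be roughed out with a small calculation. This is only a sketch under simplifying assumptions, not the formula the interactive figure uses: it takes the TP panel's statement at face value (every GPU keeps the full cache) and assumes DP attention splits the batch of requests evenly across GPUs. The names `Workload`, `kv_bytes_tp`, `kv_bytes_dp`, and the `kv_bytes_per_token` value are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    num_sequences: int       # concurrent requests in the batch (assumed)
    seq_len: int             # tokens cached per request
    kv_bytes_per_token: int  # KV cache bytes per token over all layers (placeholder)


def kv_bytes_tp(w: Workload, degree: int) -> int:
    """Per-GPU KV cache if every rank keeps a full copy of the cache.

    `degree` is unused on purpose: with full replication the per-GPU
    footprint does not shrink as more GPUs are added.
    """
    return w.num_sequences * w.seq_len * w.kv_bytes_per_token


def kv_bytes_dp(w: Workload, degree: int) -> int:
    """Per-GPU KV cache if each rank caches only its own share of requests."""
    local_sequences = -(-w.num_sequences // degree)  # ceiling division
    return local_sequences * w.seq_len * w.kv_bytes_per_token


if __name__ == "__main__":
    w = Workload(num_sequences=64, seq_len=4096, kv_bytes_per_token=65536)
    for degree in (1, 2, 4, 8):
        tp = kv_bytes_tp(w, degree)
        dp = kv_bytes_dp(w, degree)
        print(f"degree={degree}: TP {tp / 2**30:.2f} GiB, DP {dp / 2**30:.2f} GiB")
```

Under these assumptions the replicated footprint stays flat while the local footprint falls roughly in proportion to the number of GPUs. Communication and compute costs are left out because they depend on details the figure does not spell out.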