Interactive: DP Attention vs. TP Attention Comparison

Adjust the parameters to see how memory and communication overhead change.

TP Attention (traditional)

[Bar chart: KV Cache, Comm, Compute]

Each GPU holds a full copy of the KV cache
All-reduce after attention and after the MLP

DP Attention (optimized)

[Bar chart: KV Cache, Comm, Compute]

Each GPU maintains only its local KV cache
All-to-all between attention and MLP
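
The KV cache difference between the two panels can be sketched with a little arithmetic. The snippet below is a minimal illustration, not the widget's actual model: it assumes the TP case replicates the whole cache on every GPU (as the TP panel states) and that DP attention splits tokens evenly across GPUs. The names `KVConfig`, `kv_dim_per_token`, and both helper functions are hypothetical, and the numbers in the usage example are placeholders rather than measurements of any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVConfig:
    """Hypothetical cache shape; all fields are illustrative assumptions."""
    num_layers: int
    kv_dim_per_token: int      # cached elements per token, per layer
    bytes_per_elem: int = 2    # e.g. a 16-bit dtype


def kv_bytes_per_token(cfg: KVConfig) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    return cfg.num_layers * cfg.kv_dim_per_token * cfg.bytes_per_elem


def tp_kv_bytes_per_gpu(cfg: KVConfig, total_tokens: int, num_gpus: int) -> int:
    """TP attention with a replicated cache: every GPU stores every token,
    so the per-GPU footprint does not shrink as num_gpus grows."""
    return total_tokens * kv_bytes_per_token(cfg)


def dp_kv_bytes_per_gpu(cfg: KVConfig, total_tokens: int, num_gpus: int) -> int:
    """DP attention: each GPU stores only its own requests' tokens.
    Assumes an even split; the busiest GPU holds ceil(total / num_gpus)."""
    local_tokens = -(-total_tokens // num_gpus)  # ceiling division
    return local_tokens * kv_bytes_per_token(cfg)


if __name__ == "__main__":
    cfg = KVConfig(num_layers=60, kv_dim_per_token=576)  # placeholder values
    tokens, gpus = 1_000_000, 8
    tp = tp_kv_bytes_per_gpu(cfg, tokens, gpus)
    dp = dp_kv_bytes_per_gpu(cfg, tokens, gpus)
    print(f"TP per GPU: {tp / 1e9:.2f} GB")
    print(f"DP per GPU: {dp / 1e9:.2f} GB")
    print(f"Reduction:  {tp / dp:.1f}x")
```

Under these assumptions the per-GPU cache shrinks roughly in proportion to the number of GPUs. The sketch ignores the communication side: the all-reduce and all-to-all costs named in the panels depend on batch size, hidden size, and interconnect, which are not modeled here.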