调整参数观察内存和通信开销的变化
每个 GPU 持有完整 KV cache 副本All-reduce after attention + MLP
每个 GPU 只维护本地 KV cacheAll-to-All between attention & MLP