CUDA Graph: A Deep Dive

2026-06-03

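An interactive visual guide to GPU execution optimization. Every GPU kernel launch costs CPU time; CUDA Graph records an entire kernel sequence into a graph that can then be replayed with a single launch. The body of this article is markdown, and the key mechanisms come with interactive animations.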

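1. The Problem: Death by a Thousand Kernel Launches

Every GPU kernel launch carries CPU overhead. For large kernels this is negligible, but modern deep learning runs hundreds of small kernels per step.

The animation below compares CPU/GPU timelines without and with a graph. Without a graph, the CPU spends most of its time in driver overhead (the red gaps) while the GPU sits idle waiting for the next launch command. With a graph, a single launch triggers the whole kernel sequence.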
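The timeline can also be approximated with a few lines of arithmetic. The sketch below is a toy model, not a measurement: it assumes every kernel takes the same time, every launch costs the same CPU time, and a graph launch costs about as much as one kernel launch. The function names and the numbers (300 kernels, 5 μs each, 10 μs per launch) are made up for illustration.

```python
def eager_step_us(num_kernels, kernel_us, launch_us):
    """Toy estimate of one eager step: the CPU issues launches one by one."""
    cpu = 0.0  # time at which the CPU has finished issuing the current launch
    gpu = 0.0  # time at which the GPU has finished the current kernel
    for _ in range(num_kernels):
        cpu += launch_us
        # A kernel starts once it has been issued AND the previous one is done
        gpu = max(gpu, cpu) + kernel_us
    return gpu


def graph_step_us(num_kernels, kernel_us, launch_us):
    """Toy estimate with a graph: one launch, then kernels run back to back."""
    return launch_us + num_kernels * kernel_us


# Hypothetical workload: many tiny kernels, launch cost dominates
small_eager = eager_step_us(300, kernel_us=5, launch_us=10)
small_graph = graph_step_us(300, kernel_us=5, launch_us=10)

# Hypothetical workload: a few large kernels, launch cost is hidden
large_eager = eager_step_us(3, kernel_us=4000, launch_us=10)
large_graph = graph_step_us(3, kernel_us=4000, launch_us=10)

print(f"small kernels: {small_eager / small_graph:.2f}x")
print(f"large kernels: {large_eager / large_graph:.2f}x")
```

Under these assumptions the launch-bound workload roughly halves its step time, while the workload made of large kernels gains nothing, which is the pattern the rest of the article relies on.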

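2. How CUDA Graph Works: Three Phases

CUDA Graph works in three phases:

Capture: record all GPU operations into a graph structure. No actual computation runs during this phase.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    out = model(input)
Instantiate: compile the graph into an executable form, validating dependencies and allocating resources. This usually happens automatically when capture ends.
Replay: execute the whole graph with a single CPU launch. This is the fast path.
input.copy_(new_data)
g.replay()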

The captured graph records every operation and its dependencies as a DAG, for example: Input Buffer → {LayerNorm, Attention} → MatMul → GELU → Output Buffer.
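Put together, the three phases might look like the sketch below. It assumes a CUDA-capable GPU and a PyTorch build with CUDA Graph support. The `Linear` layer, the tensor sizes, and the function name `capture_and_replay` are illustrative choices, not part of the original article. Warmup runs on a side stream, which is how the PyTorch documentation structures it.

```python
import torch


@torch.no_grad()
def capture_and_replay():
    """Capture one forward pass, replay it on new data, compare with eager."""
    model = torch.nn.Linear(512, 512).cuda()
    static_input = torch.zeros(4, 512, device="cuda")

    # Warm up on a side stream so lazy initialization happens before capture
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    # Capture (instantiation happens when the context manager exits)
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = model(static_input)

    # Replay: refresh the input buffer in place, then launch the graph once
    new_data = torch.randn(4, 512, device="cuda")
    static_input.copy_(new_data)
    g.replay()

    # The graph's output buffer should match an ordinary eager forward pass
    return torch.allclose(static_output, model(new_data), atol=1e-4)


if __name__ == "__main__" and torch.cuda.is_available():
    print("graph output matches eager:", capture_and_replay())
```

Note that `static_input` and `static_output` are created once and reused; only their contents change between replays.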

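3. Graph-Friendly vs. Graph-Unfriendly Code

The basic rule: everything that was captured must be exactly the same on every replay. Same operations, same shapes, same memory addresses.

In the interactive checker below, click each line of code to see whether it is graph-friendly:

Operations that break graph capture	Graph-friendly patterns
Data-dependent control flow	Fixed-shape tensor ops (matmul, conv)
Dynamic tensor shapes	In-place buffer updates (`.copy_()`)
Memory allocation/deallocation	Static control flow (no runtime branches)
CPU-GPU synchronization (`.item()`, `.cpu()`)	Preallocated memory pools
print statements on GPU tensors	Fixed kernel grid dimensions
Variable-length loops	cuDNN/cuBLAS calls with fixed parameters
`cudaMalloc` / `cudaFree`	Padding to the maximum shape + masking
`cudaStreamSynchronize`	Conditionals evaluated outside the graph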

4. The Memory Model: Why Addresses Matter

During capture, the GPU memory layout is recorded:

0x7f00_0000 → input_buffer  [4, 512]
0x7f00_2000 → weight_1      [512, 512]
0x7f01_0000 → intermediate  [4, 512]
0x7f01_2000 → output_buffer [4, 512]

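During replay the same addresses must be used; only the data in them may change. If a tensor is reallocated at a new address (for example 0x7f02_0000 → new_alloc [8, 512]), the graph does not follow it: replay still reads and writes the recorded addresses, so it produces stale or wrong results, or touches memory that has since been freed.

The key rule: you may change the data at a recorded address (via .copy_()), but not the address itself. In practice that means no reallocation, no shape changes, and no new tensors standing in for captured ones at replay time.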
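The address rule is easy to get a feel for with a CPU-only toy model. The sketch below is not CUDA and not PyTorch: `ToyGraph` is a hypothetical stand-in that, like a captured graph, remembers which buffers it was recorded with rather than their values. Plain Python lists play the role of GPU buffers.

```python
class ToyGraph:
    """Toy stand-in for a captured graph: it remembers buffers, not values."""

    def __init__(self, src, dst):
        self.src = src  # plays the role of the recorded input address
        self.dst = dst  # plays the role of the recorded output address

    def replay(self):
        # Re-run the recorded work on whatever the recorded buffers hold now
        for i, value in enumerate(self.src):
            self.dst[i] = 2 * value


static_input = [0.0, 0.0, 0.0]
static_output = [0.0, 0.0, 0.0]
graph = ToyGraph(static_input, static_output)

# In-place update (the analogue of .copy_()): the graph sees the new data
static_input[:] = [1.0, 2.0, 3.0]
graph.replay()
in_place_result = list(static_output)

# Rebinding the name creates a new object; the graph never learns about it
static_input = [10.0, 20.0, 30.0]
graph.replay()
rebound_result = list(static_output)

print(in_place_result)  # computed from the in-place update
print(rebound_result)   # unchanged: the graph still read the old buffer
```

The second replay silently recomputes from the old buffer, which mirrors what happens when a captured tensor is replaced by a freshly allocated one.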

5. Strategies for Efficient Use

Strategy 1: Graph Pool / Cache

Cache one graph per known batch-size bucket:

import torch

# Cache graphs keyed by padded batch size (model is assumed to be defined)
graph_cache = {}

def next_power_of_2(n):
    # Smallest power of two >= n, for n >= 1
    return 1 << (n - 1).bit_length()

def get_or_capture_graph(batch_size):
    # Round up to the nearest power of two and look up by the padded size
    padded = next_power_of_2(batch_size)
    if padded not in graph_cache:
        static_input = torch.zeros(padded, 512, device='cuda')
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):
            static_output = model(static_input)
        graph_cache[padded] = (g, static_input, static_output)
    return graph_cache[padded]

# Usage: batch sizes 1-32 need only 6 graphs
# Sizes: 1, 2, 4, 8, 16, 32
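The bucketing arithmetic can be checked without a GPU. The sketch below counts how many distinct graphs batch sizes 1 to 32 map to under power-of-two padding, and how much of a padded batch is wasted in the worst case. `padding_waste` is a helper introduced here for illustration.

```python
from collections import Counter


def next_power_of_2(n):
    """Smallest power of two >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def padding_waste(batch_size):
    """Fraction of the padded batch that is padding rather than real rows."""
    padded = next_power_of_2(batch_size)
    return (padded - batch_size) / padded


# How many batch sizes in 1..32 land in each bucket
buckets = Counter(next_power_of_2(b) for b in range(1, 33))

print(sorted(buckets))             # the distinct graph sizes to capture
print(buckets[32])                 # how many batch sizes share the 32 bucket
print(f"{padding_waste(17):.1%}")  # worst case just above a power of two
```

Six graphs cover all 32 batch sizes, at the cost of up to nearly half the rows being padding just above a bucket boundary. Finer-grained buckets trade more captured graphs for less wasted compute.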

Strategy 2: Segmented Pipeline

Only the static, repeated core runs as a graph; the dynamic pre- and post-processing runs eagerly:

`dynamic tokenize → dynamic pad/batch → ⚡ CUDA Graph (transformer layers) → dynamic decode → dynamic output`

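Strategy 3: LLM Decode Mode (the Best Fit)

# LLM inference: prefill is dynamic, decode is static

# Phase 1: prefill (variable sequence length, no graph)
kv_cache = model.prefill(prompt_tokens)  # dynamic shape

# Phase 2: decode (one token in, one token out, graphed)
# capture_decode_step is a placeholder: it captures one decode step and
# returns the graph together with its static input and output buffers
decode_graph, token_input, logits_output = capture_decode_step(model, kv_cache)

for step in range(max_new_tokens):
    # Every step runs exactly the same computation!
    token_input.copy_(last_token)
    decode_graph.replay()  # ⚡ a single launch runs the whole layer stack
    next_token = sample(logits_output)
    if next_token == eos:
        break
    last_token = next_token  # feed the sampled token into the next step

Why this works so well: every decode step runs exactly the same operations on exactly the same shapes (one token); only the data changes. This assumes the KV cache is preallocated at a fixed size, so that its address and shape stay constant too. That is precisely the case CUDA Graph is built for.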

6. When Does CUDA Graph Actually Help?

Scenario	Without graph	With graph	Speedup
LLM decode (small model)	~2.1 ms/token	~0.7 ms/token	3x
ResNet-50 inference	~4.2 ms	~2.8 ms	1.5x
Large matmul (4096×4096)	~12 ms	~11.5 ms	1.04x
Training step (many small kernels)	~8 ms	~5 ms	1.6x

Rules of thumb:

Many small kernels (<100 μs each) → large gains (2-5x)

Medium kernels (100 μs - 1 ms) → moderate gains (1.3-2x)

A few large kernels (>10 ms each) → almost no gain (<1.1x)

7. Refactoring Patterns: Before & After

Pattern 1: Dynamic Shapes → Padded Static Shapes

# Before (dynamic): batch size changes every iteration
for batch in loader:
    out = model(batch)  # batch.shape[0] varies!

# After (graph-friendly): fixed buffer, smaller batches are padded
static_input = torch.zeros(MAX_BATCH, DIM, device='cuda')
# ... capture model(static_input) into graph once, producing static_output ...
for batch in loader:
    static_input[:len(batch)].copy_(batch)
    graph.replay()
    out = static_output[:len(batch)]

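Pattern 2: Conditional Logic → Multiple Graphs

# Before (data-dependent branch)
def forward(x):
    if x.mean() > 0:  # forces a GPU sync!
        return path_a(x)
    else:
        return path_b(x)

# After (graph-friendly): capture each path as its own graph
# (capture is a placeholder helper that returns a captured graph)
graph_a = capture(path_a, static_input)
graph_b = capture(path_b, static_input)

# Evaluate the condition outside the graphs
if condition:  # decided before replay
    graph_a.replay()
else:
    graph_b.replay()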

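Pattern 3: Internal Allocation → Preallocated Pool

# Before: allocates on every call
def forward(x):
    tmp = torch.empty_like(x)  # fresh allocation on every call!
    tmp = x * 2                # ...and another one here
    return tmp + 1

# After (graph-friendly): allocate once, e.g. in __init__
self.tmp = torch.empty(FIXED_SHAPE, device='cuda')

def forward(self, x):
    # Reuse the existing buffer; out= writes into it without ever
    # reading its uninitialized contents
    torch.mul(x, 2, out=self.tmp)
    return self.tmp.add_(1)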

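8. Decision Flowchart: Should You Use CUDA Graph?

Answer the following questions in order; CUDA Graph is recommended only if every answer is yes:

Does the workload repeat with the same shapes? (No → skip)
Are the kernels small (<1 ms each)? (No → probably not worth it)
Can all memory be preallocated? (No → refactor first)
Is the captured region free of data-dependent branches? (No → split into segments)
Will it be replayed many times (>100)? (No → skip)

All yes → use CUDA Graph! Expected speedup: 1.5-5x.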
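The checklist translates directly into a small function. This is a sketch: the `Workload` fields and the `cuda_graph_advice` name are made up here, and the thresholds (1 ms, 100 replays) are simply the ones from the questions above, not measured cut-offs.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    fixed_shapes: bool                 # same shapes on every run?
    mean_kernel_ms: float              # typical duration of one kernel
    can_preallocate: bool              # can all memory be allocated up front?
    has_data_dependent_branches: bool  # branches on GPU values inside?
    expected_replays: int              # how often the graph would be replayed


def cuda_graph_advice(w):
    """Walk the questions in order and return the first applicable advice."""
    if not w.fixed_shapes:
        return "skip"
    if w.mean_kernel_ms >= 1.0:
        return "probably not worth it"
    if not w.can_preallocate:
        return "refactor first"
    if w.has_data_dependent_branches:
        return "split into segments"
    if w.expected_replays <= 100:
        return "skip"
    return "use CUDA Graph"


# Hypothetical LLM decode loop: tiny kernels, static shapes, many replays
decode = Workload(True, 0.02, True, False, 10_000)
# Hypothetical large matmul: shapes are fixed but the kernel is big
matmul = Workload(True, 12.0, True, False, 10_000)

print(cuda_graph_advice(decode))
print(cuda_graph_advice(matmul))
```

Returning the first failing check, rather than a bare yes/no, keeps the "what to do next" hint from each question.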

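9. Common Pitfalls and Fixes

Pitfall: forgetting to warm up before capture. CUDA lazily initializes its context, cuDNN algorithm selection, and JIT compilation. If you capture before warming up, these one-time operations can break the capture or get baked into the graph.

# Always warm up before capture, on a side stream as in the PyTorch docs
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):  # PyTorch's examples use 3; DDP setups need 11
        _ = model(static_input)
torch.cuda.current_stream().wait_stream(s)
# Now start the capture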

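Pitfall: forgetting .clone() when using graph outputs. The output tensor is a fixed buffer; if you do not clone it before the next replay, its data gets overwritten.

graph.replay()
result = static_output.clone()  # clone if you need to keep this result
graph.replay()  # static_output now holds new data; result is unaffected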

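Pitfall: cuDNN benchmark mode. In benchmark mode cuDNN may pick different algorithms on different runs, so the graph can end up capturing a suboptimal one.

# Disable before capture, or warm up long enough for the choice to stabilize
torch.backends.cudnn.benchmark = False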

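Pitfall: memory fragmentation from graph pools. Each graph gets its own private memory pool, so many graphs means fragmented GPU memory. Use a shared pool instead; this is safe as long as the graphs are replayed in the order they were captured and never concurrently.

# Share one memory pool across several graphs
pool = torch.cuda.graph_pool_handle()
with torch.cuda.graph(g1, pool=pool):
    ...
with torch.cuda.graph(g2, pool=pool):
    ...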

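Pitfall: gradient accumulation in graphs. If you graph a training step, gradients live in the same static buffers across replays, so where you zero them (relative to the capture) decides whether each replay overwrites or accumulates.

# Set grads to None right before capture (as the PyTorch docs do), so that
# backward() allocates .grad from the graph's pool and each replay overwrites it
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    loss = model(static_input)
    loss.backward()
    # Leave optimizer.step() out if it has dynamic behavior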

10. Cheat Sheet

# Basic capture
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    out = model(inp)
g.replay()

# With a shared memory pool
pool = torch.cuda.graph_pool_handle()
with torch.cuda.graph(g, pool=pool):
    ...

# High-level helper
model = torch.cuda.make_graphed_callables(model, (sample_input,))

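Pre-graph checklist:

All tensor shapes are fixed
No .item() / .cpu() / print inside
No dynamic memory allocation inside
No data-dependent branches
Warmup iterations done
cudnn.benchmark disabled or stabilized
Outputs cloned before the next replay
Will be replayed >100 times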

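The body of this article is rendered from markdown; its 2 interactive animations are embedded in isolated iframes via a custom {% anim %} tag and come from the Arkive tutorial.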