GPU 计算与 CUDA 基础

2026-06-03

从硬件架构到编程模型，理解 GPU 并行计算的核心概念。本文正文为 markdown，关键机制配有可交互动画（点击、悬停或逐步演示）。

2.1 GPU 架构：SM, Warp 与 Thread Hierarchy

GPU 并行计算的核心思想

GPU 与 CPU 的根本区别在于设计哲学：CPU 追求低延迟（少量核心 + 大缓存 + 复杂控制逻辑），而 GPU 追求高吞吐（数千个简单核心 + 大量并发线程来隐藏延迟）。

Streaming Multiprocessor (SM)：GPU 的基本计算单元，每个 SM 包含多个 CUDA Core、Tensor Core、Load/Store 单元和 Shared Memory
Warp：32 个线程组成一个 Warp，是 GPU 调度的最小单位，Warp 内所有线程执行相同指令 (SIMT)
Thread Hierarchy：Thread → Warp → Thread Block → Grid，形成多层并行结构

GPU 全局架构总览 (Blackwell Architecture)

下图展示了一个现代 GPU 的完整硬件结构，从芯片级到 SM 内部，涵盖所有关键计算和存储单元：

组件	数量/规格	功能	峰值性能
CUDA Cores	24,064 (128/SM × 188 SMs)	标量浮点/整数运算 (FP32, FP64, INT32)	125 TFLOPS (FP32)
Tensor Cores	752 (4/SM × 188 SMs), 5th Gen	矩阵 FMA: D = A×B + C (支持 FP4/FP8/BF16/TF32)	4000 TOPS (FP4) / ~500 TFLOPS (BF16)
RT Cores	188 (1/SM), 4th Gen	硬件光线追踪 (BVH traversal, 三角形求交)	380 TFLOPS
SRAM (Shared/L1)	228 KB/SM (可配置分配比例)	Block 内线程共享的高速暂存, 手动管理	~19 TB/s 聚合带宽
L2 Cache	96 MB	全 GPU 共享, 硬件自动管理, 减少 HBM 访问	~12 TB/s
HBM (GDDR7)	96 GB, 512-bit interface	全局显存, 存放 model weights / activations / KV cache	1792 GB/s
Register File	256 KB/SM (65536 × 32-bit)	每个 thread 的私有存储, 最快的存储层级	~20 TB/s

一次 GEMM 的数据流动路径：

1
2
3

HBM (Weight Matrix) → L2 Cache (auto cache) → Shared Memory (tiled block)
  → Registers (fragments) → Tensor Core (D = A×B + C)
  → Registers (accumulator) → HBM (Output Matrix)

性能关键：最大化数据在 SRAM (Shared Mem + Registers) 的复用，最小化 HBM 访问 —— 这就是 Tiling 的核心思想。

交互式 GPU 架构图（点击 SM 查看内部结构）

Thread Hierarchy 详解

层级	大小	对应硬件	可用内存	通信方式
Thread	1	CUDA Core / Tensor Core	Registers (~255 个/thread)	—
Warp	32 threads	SM 调度单位 + Tensor Core MMA	Register File (共享)	Warp Shuffle, Tensor Core 协作
Thread Block	最多 1024 threads	单个 SM	Shared Memory (最高 228KB)	Shared Memory 读写
Grid	多个 blocks	整个 GPU (所有 SM)	Global Memory (HBM)	Global Memory / Atomics

Tensor Core 在 Thread Hierarchy 中的位置

Tensor Core 是 Warp 级别的计算单元，一次 MMA (Matrix Multiply-Accumulate) 操作由整个 Warp (32 threads) 协作完成：

Tensor Core 的关键特点：

Warp 协作：32 个 thread 各自持有 A/B/C 矩阵的不同 fragment (碎片)，一条 mma.sync 指令由整个 Warp 协同执行
数据分布：A/B fragment 分布在 32 个 thread 的 registers 中，不需要经过 Shared Memory
混合精度：输入 FP16/BF16/FP8，累加器 FP32，兼顾速度和精度
吞吐量 (H100)：每 SM 有 4 个 Tensor Core，每 cycle 处理一次 MMA → 单 SM 峰值 ~200 TFLOPS (FP8)

编程模型映射

# CUDA 编程模型 → GPU 硬件映射

Grid (kernel launch)
  └── Block 0, Block 1, ..., Block N     # → 分配到各个 SM
        └── Warp 0, Warp 1, ..., Warp M  # → SM 内的 Warp Scheduler 调度
              └── Thread 0..31           # → CUDA Cores 执行标量运算
              └── [Tensor Core MMA]      # → 整个 Warp 协作执行矩阵运算

# 关键约束:
# - 一个 Block 只能在一个 SM 上运行 (不能跨 SM)
# - 一个 SM 可以同时运行多个 Block (取决于资源)
# - Warp 内所有 thread 执行相同 PC (SIMT)
# - Tensor Core 指令是 Warp 级别的, 不是 Thread 级别的

# 资源限制 (H100):
# - 每 SM: 最多 2048 个 resident threads = 64 个 Warp
# - 每 SM: 最多 32 个 resident Blocks
# - 每 Block: 最多 1024 个 threads
# - 每 SM: 228KB Shared Memory, 65536 个 32-bit Registers
# - 全 GPU: 132 个 SM → 最多 132×2048 = 270,336 个并发 threads

2.2 内存层次：从 Registers 到 HBM

GPU 内存层次金字塔（悬停查看带宽）：越靠近计算单元，容量越小但带宽越高、延迟越低。

带宽对比（H100 SXM）

层级	带宽
Registers	~20 TB/s
Shared Memory	~19 TB/s
L2 Cache	~12 TB/s
HBM3	3.35 TB/s
PCIe 5.0 (Host ↔ Device)	64 GB/s

关键设计原则

数据复用 (Data Reuse)：尽可能在 Shared Memory 中复用数据，减少对 HBM 的访问
Coalesced Access：同一个 Warp 的线程访问连续内存地址，合并为一次事务
Bank Conflict 避免：Shared Memory 分为 32 个 bank，避免同一 Warp 内多个线程访问同一 bank
Occupancy 优化：平衡 Shared Memory 和 Register 使用量，最大化 SM 上的活跃 Warp 数

2.3 GEMM 与 Tensor Core

为什么 GEMM 如此重要？

深度学习的核心运算 —— 全连接层、卷积、Attention 的和 —— 本质都是矩阵乘法 (General Matrix Multiply, GEMM)。优化 GEMM 就是优化整个训练和推理过程。

一个 C[M,N] = A[M,K] × B[K,N] 需要 FLOPs（乘加各算一次）。对于 LLM 中典型的矩阵尺寸（如 4096×4096），这意味着数十亿次运算。

GEMM Tiling 动画（点击 “开始” 观察分块计算过程）

Tiling 策略分层

// GEMM Tiling 分层策略 (概念伪代码)
// Level 1: Thread Block Tile (从 Global Memory 加载到 Shared Memory)
for tile_k in range(K // BLOCK_K):
    // 每个 Thread Block 负责 C 的一个 BLOCK_M × BLOCK_N 子块
    load A[block_m, tile_k] → shared_A   // Cooperative load
    load B[tile_k, block_n] → shared_B
    __syncthreads()

    // Level 2: Warp Tile (从 Shared Memory 到 Registers)
    for warp_tile:
        load shared_A[warp_m] → reg_A
        load shared_B[warp_n] → reg_B

        // Level 3: Tensor Core MMA (在 Registers 中计算)
        reg_C += mma(reg_A, reg_B)   // Tensor Core 指令

    __syncthreads()

store reg_C → C[block_m, block_n]   // 写回 Global Memory

2.4 CUDA Graphs：消除 Kernel Launch 开销

问题：Kernel Launch Overhead

每次 CPU 向 GPU 提交一个 kernel，都有约 5-10 微秒的启动开销。对于单个大 kernel 这不是问题，但在 LLM 推理的 decode 阶段，模型有数百个小 kernel 依次执行：

一个 Transformer layer 可能包含 20+ 个 kernel（LayerNorm, QKV projection, Attention, FFN…）
如果每个 kernel 计算时间只有 10-20μs，那 launch overhead 占比可达 30-50%
CPU 成为瓶颈 —— GPU 在等待 CPU 发射下一个 kernel

CUDA Graph 时间线对比（动画）

CUDA Graph 三步流程：Capture → Instantiate → Replay

import torch

# Step 0: Warmup (确保所有 memory 分配和 kernel 编译完成)
model = MyModel().cuda()
static_input = torch.randn(batch_size, seq_len, d_model, device='cuda')

for _ in range(3):
    _ = model(static_input)  # warmup runs

# Step 1: Capture - 录制 kernel 执行序列
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)  # 不真正执行，只录制

# Step 2 & 3: Replay - 用新输入重放
for new_data in dataloader:
    # 将新数据 copy 到 static_input（必须原地更新！）
    static_input.copy_(new_data)

    # 重放整个计算图（一次 launch 执行所有 kernel）
    graph.replay()

    # 结果在 static_output 中（固定地址）
    result = static_output.clone()  # 如需保存则 clone

CUDA Graph 的限制

静态形状：Capture 时的 tensor shape 必须固定，不能有动态 shape
无 CPU 依赖：Graph 内不能有 if/else、Python 控制流、CPU 同步
内存固定：输入输出的地址在 capture 时固定，之后只能原地写入
不支持动态分配：Graph 内不能 torch.empty() 新 tensor
适合场景：decode 阶段（batch size 和 seq_len 固定的推理循环）

2.5 Memory 管理与 Profiling

PyTorch CUDA Memory Allocator

PyTorch 使用自己的 caching memory allocator，避免频繁调用 cudaMalloc/cudaFree（每次调用需要全 GPU 同步，非常慢）。

Block Pool：释放的内存不还给 CUDA，而是放回 pool 等待复用
Fragmentation 问题：长时间训练后，pool 中大量小碎片无法合并为大块
Expandable Segments（PyTorch 2.0+）：使用 cudaMallocAsync 或虚拟内存 API，按需扩展而非一次性分配

Memory Profiling 实战

import torch
from torch.cuda import memory

# 方法 1: memory_snapshot (详细快照)
torch.cuda.memory._record_memory_history(max_entries=100000)

# ... 运行你的模型代码 ...
output = model(input_data)
loss = criterion(output, target)
loss.backward()

# 导出快照 (可用 PyTorch Memory Viz 工具可视化)
snapshot = torch.cuda.memory._snapshot()
with open("mem_snapshot.pickle", "wb") as f:
    pickle.dump(snapshot, f)
torch.cuda.memory._record_memory_history(enabled=None)

# 方法 2: 基础统计
print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved()/1e9:.2f} GB")
print(f"Max Allocated: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")

# 方法 3: memory_summary (文本表格)
print(torch.cuda.memory_summary(abbreviated=True))

启用 Expandable Segments

# 方式 1: 环境变量
# export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# 方式 2: 代码中设置 (必须在任何 CUDA 操作之前)
import torch
torch.cuda.memory._set_allocator_settings("expandable_segments:True")

# 效果：减少 OOM 概率，允许内存段按需增长
# 原理：使用虚拟内存映射，物理内存按 2MB page 按需分配
# 适合：batch size 变化大、模型结构动态变化的场景

常见 OOM 排查清单

检查 memory_reserved vs memory_allocated —— 差距大说明碎片严重
使用 torch.cuda.empty_cache() 释放 cached memory（注意：不影响已分配的 tensor）
开启 expandable_segments 减少外部碎片
用 gradient_checkpointing 用计算换内存
检查是否有 tensor 被意外保留引用（常见于 loss 日志列表）

2.6 CUDA Programming Model

Grid / Block / Thread 映射

CUDA 编程的核心在于将计算任务映射到 GPU 的线程层次结构中：

# 概念：一个 kernel 的执行配置
# Grid:  (gridDim.x, gridDim.y, gridDim.z) 个 Block
# Block: (blockDim.x, blockDim.y, blockDim.z) 个 Thread

# 例如处理 1024×1024 的矩阵：
# Block size: 16×16 = 256 threads
# Grid size:  64×64 = 4096 blocks
# 总线程数:   256 × 4096 = 1,048,576

# 每个线程通过 threadIdx 和 blockIdx 计算自己负责的元素
row = blockIdx.y * blockDim.y + threadIdx.y
col = blockIdx.x * blockDim.x + threadIdx.x

用 Triton 写 GEMM Kernel（Python 风格）

import triton
import triton.language as tl

@triton.jit
def matmul_kernel(
    A_ptr, B_ptr, C_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # 当前 program (类似 thread block) 的 ID
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    # 计算当前 block 负责的行列范围
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # 初始化累加器 (在 registers 中)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    # 沿 K 维度循环 (tiling)
    for k in range(0, K, BLOCK_K):
        # 从 Global Memory 加载 tile
        a = tl.load(A_ptr + offs_m[:, None] * stride_am + (k + offs_k[None, :]) * stride_ak)
        b = tl.load(B_ptr + (k + offs_k[:, None]) * stride_bk + offs_n[None, :] * stride_bn)

        # 矩阵乘加 (编译器会映射到 Tensor Core)
        acc += tl.dot(a, b)

    # 写回结果到 Global Memory
    tl.store(C_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn, acc)

Kernel Launch 与 Occupancy

选择合适的 Block Size 对性能至关重要：

Block 太小（如 32 threads）：无法充分利用 SM 资源，Occupancy 低
Block 太大（如 1024 threads）：每个 block 需要更多资源，SM 上能放的 block 数减少
经验值：128 或 256 threads/block 通常是好的起点
使用 torch.cuda.get_device_properties() 查询硬件参数来指导配置

# 查询 GPU 属性
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}")
print(f"SMs: {props.multi_processor_count}")
print(f"Max threads/block: {props.max_threads_per_block}")
print(f"Max threads/SM: {props.max_threads_per_multi_processor}")
print(f"Shared mem/block: {props.max_shared_memory_per_block / 1024:.0f} KB")
print(f"Warp size: {props.warp_size}")

2.7 练习题

练习 1：计算 GPU 算力利用率（计算题）

已知条件：

GPU: NVIDIA H100 SXM，FP16 Tensor Core 峰值算力 = 989.5 TFLOPS
你的 GEMM Kernel 处理矩阵 C[4096, 4096] = A[4096, 4096] × B[4096, 4096]
Kernel 执行时间 = 0.35 ms（由 NSight 测量）

问题：该 Kernel 的算力利用率 (MFU) 是多少？

提示：GEMM 的 FLOPs = 2 × M × N × K；MFU = 实际 FLOPS / 峰值 FLOPS；实际 FLOPS = FLOPs / 执行时间。

解答：

# Step 1: 计算总 FLOPs
M, N, K = 4096, 4096, 4096
total_flops = 2 * M * N * K
# = 2 × 4096 × 4096 × 4096 = 137,438,953,472 ≈ 137.4 GFLOPs

# Step 2: 计算实际 throughput
time_sec = 0.35e-3  # 0.35 ms
actual_tflops = total_flops / time_sec / 1e12
# = 137.4e9 / 0.35e-3 / 1e12 ≈ 392.7 TFLOPS

# Step 3: 计算 MFU
peak_tflops = 989.5  # H100 FP16 Tensor Core
mfu = actual_tflops / peak_tflops
# = 392.7 / 989.5 ≈ 39.7%

答案：MFU ≈ 39.7%。注：实际优化良好的 GEMM（如 cuBLAS）在该尺寸下可达 70-80% MFU。39.7% 说明还有优化空间（可能是 tiling 不够优、memory access pattern 不好等）。

练习 2：判断 Kernel 是 Compute-bound 还是 Memory-bound（分析题）

已知条件（H100 SXM）：峰值算力 989.5 TFLOPS (FP16)；HBM 带宽 3.35 TB/s；机器的 Arithmetic Intensity 边界 = 989.5 / 3.35 ≈ 295 FLOPs/Byte。

现有两个 Kernel：

Kernel A（Vector Add）：对 N=1M 个 FP16 元素做逐元素加法。读 2N 元素 + 写 N 元素 = 6MB 数据传输，总计算 = N = 1M FLOPs
Kernel B（GEMM 4096×4096×4096）：如练习 1，137.4 GFLOPs，读写数据 = (4096×4096 + 4096×4096 + 4096×4096) × 2 bytes = 96 MB

问题：分别判断两个 Kernel 是 compute-bound 还是 memory-bound，并解释 Roofline Model 分析过程。

解答：

# 机器边界 (Ridge Point)
ridge = 989.5e12 / 3.35e12  # ≈ 295 FLOPs/Byte

# Kernel A: Vector Add
flops_a = 1e6          # 1M FLOPs
bytes_a = 6e6          # 6 MB (读2个输入 + 写1个输出, FP16)
ai_a = flops_a / bytes_a  # ≈ 0.167 FLOPs/Byte
# 0.167 远小于 295 → Memory-bound ✓

# Kernel B: GEMM 4096³
flops_b = 137.4e9     # 137.4 GFLOPs
bytes_b = 96e6         # 96 MB
ai_b = flops_b / bytes_b  # ≈ 1431 FLOPs/Byte
# 1431 远大于 295 → Compute-bound ✓

结论：

Kernel A (Vector Add) 是 Memory-bound：优化方向是减少内存访问、利用 cache、合并 kernel (kernel fusion)
Kernel B (GEMM) 是 Compute-bound：优化方向是用 Tensor Core、提高 Occupancy、优化指令调度

这就是为什么 operator fusion 在 LLM 推理中如此重要 —— 把多个 memory-bound 的小 kernel（如 bias add + activation + dropout）融合为一个，减少中间结果的 HBM 读写。

练习 3：CUDA Graph 适用性判断（判断题）

判断以下场景能否使用 CUDA Graph，并解释原因：

场景 A：LLM decode 阶段，batch_size=8, seq_len 每步 +1（KV cache 增长），但使用 paged attention
场景 B：固定 batch_size 的图像分类推理（ResNet-50），输入始终是 224×224
场景 C：训练过程中的 forward pass，模型包含 if x.sum() > 0: path_a() else: path_b() 的动态分支
场景 D：vLLM 中的 prefill 阶段，不同请求的 prompt 长度不同

解答：

场景 A：可以，但需要技巧 ✓。Paged attention 的 kernel 参数（page table）可以通过固定 buffer 传入。实际中 SGLang/vLLM 对 decode 使用 CUDA Graph：为不同 batch_size 预编译多个 graph（如 bs=1,2,4,8,16…），padding 到最近的大小。seq_len 增长不影响 decode 的单步计算（每步只计算 1 个 token）。
场景 B：非常适合 ✓。经典的 CUDA Graph 应用场景 —— 固定 input shape、固定模型结构、无动态控制流。可以获得显著加速。
场景 C：不可以 ✗。if x.sum() > 0 是数据依赖的控制流，在 capture 时只会录制当时走过的分支。如果实际运行时走了不同分支，graph replay 仍会执行录制的那条路径，导致结果错误。
场景 D：不直接适用，但可以 padding ✓/✗。Prefill 阶段不同请求的 seq_len 不同，shape 动态变化。直接使用 CUDA Graph 不行。但可以像 decode 一样，为常见 seq_len 预编译多个 graph，短序列 padding 到最近的档位。权衡：padding 浪费算力 vs graph 节省的 launch overhead。对 prefill 来说 kernel 通常够大，launch overhead 占比低，因此收益不如 decode 阶段明显。

练习 4：Shared Memory Tiling 设计（设计题）

你需要实现一个 GEMM kernel 处理 C[M,N] = A[M,K] × B[K,N]：Target GPU H100 (Shared Memory 可配置最大 228KB/SM)；数据类型 FP16 (2 bytes per element)；Block tile size BLOCK_M=128, BLOCK_N=128, BLOCK_K=32。

问题：(1) 计算每个 block 需要多少 Shared Memory；(2) 该配置下每个 SM 最多能运行多少个 block？(3) 如果改为 BLOCK_K=64，会有什么权衡？

解答：

# 1. 计算 Shared Memory 用量
BLOCK_M, BLOCK_N, BLOCK_K = 128, 128, 32
elem_size = 2  # FP16 = 2 bytes

tile_A = BLOCK_M * BLOCK_K * elem_size  # 128×32×2 = 8 KB
tile_B = BLOCK_K * BLOCK_N * elem_size  # 32×128×2 = 8 KB
single_buffer = tile_A + tile_B          # = 16 KB

# Double buffering (加载下一个 tile 时当前 tile 还在计算)
total_smem = single_buffer * 2           # = 32 KB

# 2. 每个 SM 能运行多少 block？
smem_per_sm = 228  # KB (H100 max)
max_blocks = smem_per_sm // 32  # = 7 (被 Shared Memory 限制)
# 但实际还受 register、thread 数限制
# 若每 block 用 256 threads, 则 7 blocks = 1792 threads/SM
# H100 max 2048 threads/SM, 所以 thread 不是瓶颈

# 3. BLOCK_K=64 的权衡
tile_A_64 = 128 * 64 * 2  # = 16 KB
tile_B_64 = 64 * 128 * 2  # = 16 KB
total_smem_64 = (16 + 16) * 2  # = 64 KB (double buffer)
max_blocks_64 = 228 // 64  # = 3 blocks/SM

权衡分析 (BLOCK_K=32 vs BLOCK_K=64)：

指标	BLOCK_K=32	BLOCK_K=64
Shared Mem / block	32 KB	64 KB
Blocks / SM	7	3
Warps / SM	7×8=56	3×8=24
Occupancy	56/64 = 87.5%	24/64 = 37.5%
K 维循环次数	K/32 次	K/64 次 (减半)
每次循环计算量	128×128×32 FLOPs	128×128×64 FLOPs (翻倍)

为什么 Occupancy 很重要？—— SM 内的 Warp 调度机制

关键：一个 SM 同一时刻有多个 Warp “在跑”，不是只有一个！ H100 每个 SM 有 4 个 Warp Scheduler，每个 Scheduler 每 cycle 可以给 1 个 Warp 发射指令。所以每 cycle 最多 4 个 Warp 同时执行。

Warp 状态	含义	数量
Resident Warps	已分配资源 (register/smem)，随时可被选中	最多 64/SM
Eligible Warps	不在等数据、不在等 barrier，下 cycle 可执行	动态变化
Issuing Warps	当前 cycle 正在被 scheduler 发射指令	最多 4/SM

64 个 Resident Warp 中，每 cycle 选 4 个发射。其余要么在等 HBM 数据 (~400 cycles)，要么在等 __syncthreads()，要么 eligible 但没被选中。

关键区别 vs CPU：Warp 切换是零开销的！每个 Resident Warp 的 register 始终保留在 Register File 中（不需要 save/restore）。Scheduler 只需将指针切换到另一个 Warp 的 PC —— 这就是为什么 GPU 可以靠大量 Warp 隐藏延迟，而 CPU context switch 代价高昂。

Occupancy 高 → Resident Warp 多 → 即使很多 Warp stall 了等内存，scheduler 仍能找到 eligible Warp 来填满 pipeline。Occupancy 低 → 可能所有 Warp 都在等数据 → 4 个 Scheduler 空转 → GPU 利用率暴跌。

但 Occupancy 不是越高越好：

BLOCK_K=64 的优势：循环次数减半 → 减少 __syncthreads() 调用次数、减少 loop branch overhead、每次循环做更多有效 MMA 指令（amortize instruction pipeline 的固定开销）
BLOCK_K=64 的劣势：Occupancy 降到 37% → 如果 kernel 是 latency-bound 的 (计算和访存交替，不够密集)，则没有足够 Warp 来填满 stall cycle
什么时候大 BLOCK_K 更优？ 当 kernel 是 compute-bound 的 (Tensor Core 利用率已经很高，每个 Warp 产生的 MMA 指令密度足够填满 pipeline)，此时 Occupancy 不是瓶颈，减少 loop overhead 反而收益更大
什么时候小 BLOCK_K 更优？ 当 K 维度较小 (循环次数本来就少) 或 kernel 有大量 non-compute 操作 (如 epilogue 中的 store/activation) 时，高 Occupancy 更重要

实践指导：在 Triton/CUTLASS 中通常通过 auto-tune 在 BLOCK_K={32, 64, 128} 之间选择。cuBLAS 内部也有类似的 heuristic。没有理论公式能直接算出最优值 —— 因为真实的 pipeline 行为取决于 instruction mix、bank conflict、register pressure 等无法简单建模的因素。Profile-guided tuning 是唯一靠谱的方法。

Coding Exercise A：用 Triton 写一个 Vector Add Kernel（编程题）

使用 Triton 实现 element-wise vector add kernel，包含 grid/block 配置、边界检查 (masking)，并 benchmark 与 PyTorch 原生实现的性能对比。

要求：使用 @triton.jit 装饰器；正确处理非对齐长度（N 不能被 BLOCK_SIZE 整除）；展示如何选择 BLOCK_SIZE 并比较性能；验证结果与 torch.add 完全一致；使用 triton.testing.do_bench 进行 benchmark。

完整解答：

import torch
import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(
    x_ptr, y_ptr, output_ptr,
    N,
    BLOCK_SIZE: tl.constexpr,
):
    # 获取当前 program 的 ID (类似 CUDA 的 blockIdx)
    pid = tl.program_id(0)

    # 计算本 program 负责处理的元素索引范围
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)

    # 边界 mask: 防止越界访问 (关键!)
    mask = offsets < N

    # 从 Global Memory 加载数据 (mask 外的位置填 0)
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    y = tl.load(y_ptr + offsets, mask=mask, other=0.0)

    # element-wise 加法
    result = x + y

    # 写回结果 (只写 mask 内的位置)
    tl.store(output_ptr + offsets, result, mask=mask)


def vector_add(x: torch.Tensor, y: torch.Tensor, block_size=1024) -> torch.Tensor:
    """Triton vector add 的 Python wrapper"""
    assert x.shape == y.shape and x.is_cuda
    N = x.numel()
    output = torch.empty_like(x)

    # 计算 grid: 需要多少个 program 来覆盖所有元素
    grid = ((N + block_size - 1) // block_size, )

    # 启动 kernel
    vector_add_kernel[grid](x, y, output, N, BLOCK_SIZE=block_size)

    return output


# ===== 验证正确性 =====
print("=" * 50)
print("验证正确性")
print("=" * 50)

N = 98432  # 非 2 的幂次, 测试边界处理
x = torch.randn(N, device='cuda')
y = torch.randn(N, device='cuda')

# Triton 实现
triton_output = vector_add(x, y, block_size=1024)
# PyTorch 实现
torch_output = x + y

max_diff = (triton_output - torch_output).abs().max().item()
print(f"N = {N} (非对齐长度)")
print(f"最大误差: {max_diff:.2e}")
assert max_diff == 0.0, "结果不一致!"
print("✓ 结果与 torch.add 完全一致 (zero diff)\n")


# ===== Benchmark 不同 BLOCK_SIZE =====
print("=" * 50)
print("Benchmark: 不同 BLOCK_SIZE 的性能对比")
print("=" * 50)

N_bench = 10_000_000  # 10M 元素
x_bench = torch.randn(N_bench, device='cuda')
y_bench = torch.randn(N_bench, device='cuda')

# PyTorch baseline
torch_ms = triton.testing.do_bench(lambda: x_bench + y_bench)
print(f"PyTorch (torch.add):  {torch_ms:.4f} ms")

# Triton with different block sizes
for bs in [128, 256, 512, 1024, 2048]:
    triton_ms = triton.testing.do_bench(
        lambda: vector_add(x_bench, y_bench, block_size=bs)
    )
    bandwidth = 3 * N_bench * 4 / (triton_ms * 1e-3) / 1e9  # GB/s (FP32)
    print(f"Triton BLOCK_SIZE={bs:4d}: {triton_ms:.4f} ms | 有效带宽: {bandwidth:.1f} GB/s")

print(f"\n注: Vector Add 是典型的 memory-bound kernel")
print(f"理论峰值带宽 (H100): 3350 GB/s")
print(f"实际能达到 ~80% 就很好了")

预期输出 (在 H100 上)：

==================================================
验证正确性
==================================================
N = 98432 (非对齐长度)
最大误差: 0.00e+00
✓ 结果与 torch.add 完全一致 (zero diff)

==================================================
Benchmark: 不同 BLOCK_SIZE 的性能对比
==================================================
PyTorch (torch.add):  0.0391 ms
Triton BLOCK_SIZE= 128: 0.0452 ms | 有效带宽: 2655.3 GB/s
Triton BLOCK_SIZE= 256: 0.0412 ms | 有效带宽: 2912.6 GB/s
Triton BLOCK_SIZE= 512: 0.0398 ms | 有效带宽: 3015.1 GB/s
Triton BLOCK_SIZE=1024: 0.0393 ms | 有效带宽: 3053.4 GB/s
Triton BLOCK_SIZE=2048: 0.0395 ms | 有效带宽: 3038.0 GB/s

注: Vector Add 是典型的 memory-bound kernel
理论峰值带宽 (H100): 3350 GB/s
实际能达到 ~80% 就很好了

Coding Exercise B：用 torch.cuda.Event 测量 Kernel 时间（编程题）

正确测量 GPU kernel 执行时间（不是 CPU 时间！），包含 warmup、CUDA Event timing、多次运行取平均。对比不同矩阵大小的 GEMM 执行时间，画出 TFLOPS 曲线，并判断哪些大小是 compute-bound、哪些是 launch-overhead-bound。

为什么不能用 time.time()？ CPU 和 GPU 是异步执行的。time.time() 只测量 CPU 发射 kernel 的时间，不包括 GPU 实际计算时间。必须用 CUDA Event 或 torch.cuda.synchronize() 配合。

完整解答：

import torch
import time

def benchmark_gemm_cuda_events(M, N, K, num_warmup=10, num_runs=100):
    """使用 CUDA Events 精确测量 GEMM 执行时间"""
    A = torch.randn(M, K, device='cuda', dtype=torch.float16)
    B = torch.randn(K, N, device='cuda', dtype=torch.float16)

    # Warmup: 消除首次运行的 JIT 编译、内存分配开销
    for _ in range(num_warmup):
        _ = torch.mm(A, B)
    torch.cuda.synchronize()  # 确保 warmup 完成

    # 创建 CUDA Events
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    # 测量: 记录每次运行的时间
    times = []
    for _ in range(num_runs):
        start_event.record()
        C = torch.mm(A, B)
        end_event.record()
        # 同步等待 GPU 完成 (必须! 否则 elapsed_time 可能返回错误值)
        torch.cuda.synchronize()
        times.append(start_event.elapsed_time(end_event))  # 返回毫秒

    # 统计
    avg_ms = sum(times) / len(times)
    std_ms = (sum((t - avg_ms)**2 for t in times) / len(times)) ** 0.5

    # 计算 TFLOPS
    flops = 2 * M * N * K  # 乘法 + 加法
    tflops = flops / (avg_ms * 1e-3) / 1e12

    return avg_ms, std_ms, tflops


# ===== 对比: CPU time vs CUDA Event time =====
print("=" * 60)
print("对比 CPU time 和 CUDA Event time 的区别")
print("=" * 60)

M, N, K = 2048, 2048, 2048
A = torch.randn(M, K, device='cuda', dtype=torch.float16)
B = torch.randn(K, N, device='cuda', dtype=torch.float16)

# 错误方式: CPU time (不同步)
t0 = time.time()
C = torch.mm(A, B)  # 只是把 kernel 发射出去!
cpu_time_no_sync = (time.time() - t0) * 1000

# 正确方式 1: CPU time + synchronize
torch.cuda.synchronize()
t0 = time.time()
C = torch.mm(A, B)
torch.cuda.synchronize()
cpu_time_sync = (time.time() - t0) * 1000

# 正确方式 2: CUDA Event
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
C = torch.mm(A, B)
end.record()
torch.cuda.synchronize()
cuda_event_time = start.elapsed_time(end)

print(f"CPU time (无 sync): {cpu_time_no_sync:.4f} ms  ← 只测了 launch 时间, 错误!")
print(f"CPU time (有 sync): {cpu_time_sync:.4f} ms  ← 包含 sync overhead")
print(f"CUDA Event time:    {cuda_event_time:.4f} ms  ← 最精确的 GPU 端时间")


# ===== GEMM TFLOPS 曲线 =====
print(f"\n{'='*60}")
print("GEMM TFLOPS vs Matrix Size")
print(f"{'='*60}")
print(f"{'Size':>6} | {'Time (ms)':>12} | {'Std (ms)':>10} | {'TFLOPS':>8} | {'分析'}")
print("-" * 70)

sizes = [32, 64, 128, 256, 512, 1024, 2048, 4096]
results = []

for size in sizes:
    avg_ms, std_ms, tflops = benchmark_gemm_cuda_events(size, size, size)
    results.append((size, avg_ms, std_ms, tflops))

    # 判断 bound 类型
    if avg_ms < 0.015:  # <15 微秒, 基本是 launch overhead
        bound_type = "launch-overhead-bound"
    elif tflops < 100:  # TFLOPS 很低, 矩阵太小没法填满 GPU
        bound_type = "underutilized (太小)"
    else:
        bound_type = "compute-bound ✓"

    print(f"{size:>6} | {avg_ms:>12.4f} | {std_ms:>10.4f} | {tflops:>8.1f} | {bound_type}")

预期输出 (在 H100 上)：

============================================================
对比 CPU time 和 CUDA Event time 的区别
============================================================
CPU time (无 sync): 0.0234 ms  ← 只测了 launch 时间, 错误!
CPU time (有 sync): 0.0891 ms  ← 包含 sync overhead
CUDA Event time:    0.0847 ms  ← 最精确的 GPU 端时间

============================================================
GEMM TFLOPS vs Matrix Size
============================================================
  Size |    Time (ms) |   Std (ms) |   TFLOPS | 分析
----------------------------------------------------------------------
    32 |       0.0082 |     0.0015 |      8.0 | launch-overhead-bound
    64 |       0.0085 |     0.0012 |     61.7 | launch-overhead-bound
   128 |       0.0091 |     0.0014 |    461.7 | underutilized (太小)
   256 |       0.0112 |     0.0018 |    238.2 | compute-bound ✓
   512 |       0.0203 |     0.0021 |    662.3 | compute-bound ✓
  1024 |       0.0587 |     0.0025 |    731.4 | compute-bound ✓
  2048 |       0.0891 |     0.0032 |    770.1 | compute-bound ✓
  4096 |       0.3412 |     0.0041 |    802.7 | compute-bound ✓

分析总结：

小矩阵 (32-128)：TFLOPS 极低，因为 kernel launch overhead (~5-10us) 远大于计算时间本身。GPU 的数千个 core 大部分空闲。
中等矩阵 (256-512)：TFLOPS 快速上升，GPU 开始被有效利用，但还没到峰值（tile 无法完全填满所有 SM）。
大矩阵 (1024-4096)：接近峰值 TFLOPS，是真正的 compute-bound。cuBLAS 的 Tensor Core GEMM 在这些尺寸下效率最高。

解决小矩阵问题的方法：Batched GEMM（多个小矩阵打包成一次调用）、CUDA Graph（消除多次 launch overhead）、Kernel fusion（避免中间结果写回 HBM）。

练习 5：Roofline Model 分析（面试题）

已知条件：GPU NVIDIA H100 SXM；BF16 Tensor Core 峰值算力 989 TFLOPS；HBM 带宽 3.35 TB/s；Arithmetic Intensity (AI) = FLOPs / Bytes_accessed。

问题：(1) Ridge point 在哪？(2) 判断以下 kernel：GEMM [4096,4096]×[4096,4096]（AI=4096）、LayerNorm [32, 2048, 4096]（AI约5）、Softmax [32, 32, 2048, 2048]（AI约3）、Element-wise ReLU（AI=0.25）；(3) 对 memory-bound / compute-bound kernel 的优化策略。

解答：

# (1) Ridge Point
ridge = 989e12 / 3.35e12  # ≈ 295 FLOPs/Byte

# (2) 判断各 kernel
# GEMM: AI ≈ 4096 >> 295 → Compute-bound ✓
# LayerNorm: AI ≈ 5 << 295 → Memory-bound ✓
# Softmax: AI ≈ 3 << 295 → Memory-bound ✓
# ReLU: AI = 0.25 << 295 → Memory-bound ✓

# (3) 优化策略
# Memory-bound kernel 优化:
#   - Kernel fusion: 减少中间结果的 HBM round-trips
#     (e.g., fuse LayerNorm + ReLU + Dropout 为一个 kernel)
#   - 利用 Shared Memory 做 reduction, 避免多次读 HBM
#   - Flash Attention: 将 Softmax 的 3-pass 融合为 1-pass
#
# Compute-bound kernel 优化:
#   - 使用 Tensor Core (MMA 指令), 比 FP32 CUDA Core 快 8-16x
#   - 增大 tile size, 提高数据复用率
#   - 优化 warp-level scheduling, 减少 pipeline bubble

总结：Ridge point = 295 FLOPs/Byte；GEMM (AI=4096) 是 Compute-bound；LayerNorm (AI≈5)、Softmax (AI≈3)、ReLU (AI=0.25) 全部 Memory-bound。LLM 推理中绝大多数非 GEMM 的 kernel 都是 memory-bound，这就是为什么 Flash Attention 和 kernel fusion 如此重要。

练习 6：Shared Memory Bank Conflict（面试题）

已知条件：Shared Memory 有 32 个 bank，每个 bank 宽度为 4 bytes；一个 warp 中的 32 个 thread 同时访问 Shared Memory；多个 thread 访问同一 bank 的不同地址 → Bank Conflict（串行化）；多个 thread 访问同一 bank 的同一地址 → Broadcast（无冲突）。

问题：(1) Thread i 访问地址 i × 4：有多少 bank conflict？(2) Thread i 访问地址 i × 128：有多少 bank conflict？(3) 设计一个 padding 策略，消除对 32×32 float 矩阵在 Shared Memory 中按列访问时的 bank conflict。

解答：

# (1) Thread i 访问地址 i×4
# Bank index = (i×4 / 4) % 32 = i % 32 = i (for i=0..31)
# → 每个 thread 访问不同的 bank → 0 bank conflicts! (理想情况)

# (2) Thread i 访问地址 i×128
# Bank index = (i×128 / 4) % 32 = (i×32) % 32 = 0
# → 所有 32 个 thread 都访问 bank 0!
# → 32-way bank conflict → 串行化为 32 步! → 性能下降 32x

# (3) Padding 策略消除列访问 bank conflict
# 问题: 32×32 float matrix, 按列访问
# __shared__ float A[32][32];
# 列访问: thread i 访问 A[i][col], 地址 = (i*32 + col) * 4
# Bank = (i*32 + col) % 32 = col (不依赖 i!)
# → 所有 thread 访问同一 bank → 32-way conflict!
#
# 解决: Pad 每行到 33 个 float
# __shared__ float A[32][33];  // 多加 1 列 padding
# 现在列访问: 地址 = (i*33 + col) * 4
# Bank = (i*33 + col) % 32
# 由于 33 和 32 互素 (gcd=1)，不同 i 得到不同 bank → 0 bank conflicts!

// 有 bank conflict 的版本
__shared__ float tile[32][32];
// 按列访问: tile[threadIdx.x][col] → 32-way conflict

// 无 bank conflict 的版本 (padding)
__shared__ float tile[32][33];  // 只多了 32×4=128 bytes
// 按列访问: tile[threadIdx.x][col] → 0 conflict
// 代价: 每行浪费 4 bytes, 总共浪费 128 bytes → 微不足道

面试要点：Bank conflict 是 shared memory 性能的最大隐患。在 GEMM tiling 中，如果 A tile 按行加载但按列读取（transpose 访问模式），不加 padding 性能可能下降数倍。Triton 和 CUTLASS 都内置了 swizzle/padding 机制来自动避免这个问题。

练习 7：CUDA Graph 适用性判断 + 量化收益（面试题）

已知条件：一个 LLM decode step 包含 200 个小 kernel；平均每个 kernel 的计算时间 = 15 微秒；不使用 CUDA Graph 时，每个 kernel 的 launch overhead = 5 微秒。

问题：(1) 不使用 CUDA Graph：总时间是多少？launch overhead 占比？(2) 使用 CUDA Graph：总时间是多少？加速比？(3) 如果某些 kernel 有动态 shape，还能用 CUDA Graph 吗？SGLang/vLLM 如何处理？

解答：

# (1) Without CUDA Graph
n_kernels = 200
compute_time = 15   # 微秒/kernel
launch_overhead = 5  # 微秒/kernel

total_no_graph = n_kernels * (compute_time + launch_overhead)
# = 200 × (15 + 5) = 4000 微秒 = 4 ms

overhead_ratio = (n_kernels * launch_overhead) / total_no_graph
# = (200 × 5) / 4000 = 25%  → 25% 的时间浪费在 kernel launch 上!

# (2) With CUDA Graph
# 整个 graph 只需 1 次 launch (~5 微秒), 然后 GPU 连续执行所有 kernel
total_with_graph = 5 + n_kernels * compute_time
# = 5 + 200 × 15 = 3005 微秒 ≈ 3 ms

speedup = total_no_graph / total_with_graph
# = 4000 / 3005 ≈ 1.33x

# (3) Dynamic shapes 的处理
# CUDA Graph 在 capture 时固定了所有 tensor 的 shape
# SGLang/vLLM 的解决方案:
# - 启动时为常见 batch size 预编译多个 graph:
#   batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
# - 运行时: 将实际 batch 向上 pad 到最近的预编译大小
#   e.g., actual_bs=13 → pad to 16, 用 bs=16 的 graph
# - Padding 浪费的算力很小, 但节省的 launch overhead 很可观

总结：Without graph 4 ms，launch overhead 占 25%；With graph ~3 ms，加速 1.33x；Dynamic shape 通过预编译多个 graph + padding 解决，是 SGLang/vLLM 的标准做法。

面试延伸：在 decode 阶段 kernel 越小（因为 seq_len=1），launch overhead 占比越高，CUDA Graph 收益越大。这也是为什么 decode 阶段几乎所有框架都用 CUDA Graph，而 prefill 阶段（kernel 大）通常不用。

练习 8：Occupancy vs Performance（面试题）

已知条件：一个 kernel 使用 64 registers/thread 和 48 KB shared memory/block；H100 SM 资源：65536 registers / SM、228 KB shared memory / SM、最大 2048 threads / SM、最大 32 blocks / SM。

问题：(1) 只考虑 register 限制，每个 SM 最多容纳多少 threads？(2) 只考虑 shared memory 限制，每个 SM 最多运行多少 blocks？(3) 假设每 block 256 threads，计算实际 occupancy；(4) 更高的 occupancy 是否总意味着更好的性能？给出一个反例。

解答：

# 已知条件
regs_per_thread = 64
smem_per_block = 48  # KB
threads_per_block = 256

# SM 资源
total_regs = 65536
total_smem = 228    # KB
max_threads_sm = 2048
max_blocks_sm = 32

# (1) Register 限制
max_threads_by_reg = total_regs // regs_per_thread  # = 1024 threads
max_blocks_by_reg = max_threads_by_reg // threads_per_block  # = 4 blocks

# (2) Shared Memory 限制
max_blocks_by_smem = total_smem // smem_per_block  # = 4.75 → 4 blocks

# (3) 计算 Occupancy (取最严格的限制)
# - Register: 4 blocks; Shared Memory: 4 blocks
# - Thread: 2048/256 = 8 blocks; Block: 32 blocks
# → 瓶颈是 register 和 smem, 都限制为 4 blocks
actual_blocks = 4
actual_threads = actual_blocks * threads_per_block  # = 1024
occupancy = actual_threads / max_threads_sm  # = 1024 / 2048 = 50%

# (4) 更高 occupancy ≠ 更好性能的反例: GEMM kernel
# - 高 register 版本: 128 regs/thread, occupancy = 25%
#   → 每个 thread 在 register 中缓存大量数据 → 实际 throughput 更高!
# - 低 register 版本: 32 regs/thread, occupancy = 100%
#   → 数据放不下, 频繁 spill 到 local memory (实际是 HBM!) → throughput 反而更低!
#
# 原因: "Occupancy cliff" 在很多情况下是个误区
# - compute-bound kernel: 只要有足够 warp 隐藏 pipeline latency (~20 cycles) 即可
# - memory-bound kernel: occupancy 更重要, 需要大量 warp 隐藏 HBM latency (~400 cycles)

总结：Register 限制最多 1024 threads (4 blocks × 256)；Shared Memory 限制最多 4 blocks；Occupancy = 50%；更高 occupancy 不一定更好！GEMM 等 compute-bound kernel 故意用大量 register 做数据复用，牺牲 occupancy 换取更高的实际吞吐。

面试要点：NVIDIA 官方文档也指出 “higher occupancy does not always equate to higher performance”。实际优化中应该用 ncu (Nsight Compute) profile 来确定瓶颈是 latency hiding 还是 instruction throughput。cuBLAS 的最优 GEMM kernel 通常 occupancy 只有 25-50%。

2.8 本章小结

核心知识点回顾

主题	要点
GPU 架构	SM → Warp (32 threads) → SIMT 执行; H100 有 132 个 SM
内存层次	Registers 最快但最小; Shared Memory 是优化关键; HBM 带宽是主要瓶颈
GEMM	多级 Tiling (Block → Warp → Tensor Core); 数据复用最大化
Tensor Core	4×4 矩阵乘加/cycle; FP16 输入 FP32 累加; 比 CUDA Core 快 16×
CUDA Graph	Capture → Instantiate → Replay; 消除 launch overhead; 要求静态 shape
Profiling	memory_snapshot 定位碎片; Roofline 判断瓶颈; expandable_segments 缓解 OOM

进阶阅读

本文是 ML Systems 系列 Chapter 2。正文 markdown 渲染，6 个可视化/交互组件通过自定义 {% anim %} 标签以隔离 iframe 嵌入，源自 Arkive 教程。