RLHF Systems: From Theory to Production

2026-06-03

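From RL fundamentals through PPO and GRPO to production frameworks such as veRL, SLIME, and OpenRLHF, plus the training-inference mismatch problem, this article gives a systems-level view of how modern LLM alignment training works. The body is written in markdown, and key mechanisms come with interactive animations (click the buttons to step through them).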

1. RL Fundamentals Review

Markov Decision Process (MDP)

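In the RLHF setting, LLM generation is modeled as an MDP:

MDP element	LLM counterpart	Notes
State	Token sequence generated so far	prompt + tokens generated so far
Action	Next token	chosen from the vocabulary
Policy	LLM output distribution	softmax over logits
Reward	Reward model score	usually given only at the end of the sequence
Transition	Deterministic concatenation	$s_{t+1} = (s_t, a_t)$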
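To make the mapping concrete, here is a minimal sketch of one generation episode viewed as an MDP. The toy vocabulary, the uniform `toy_policy`, and the placeholder end-of-sequence reward of 1.0 are illustrative assumptions, not part of any real model or reward model.

```python
import random

VOCAB = ["the", "cat", "sat", "<eos>"]  # hypothetical toy vocabulary

def toy_policy(state):
    """Policy: a probability distribution over the next token (uniform here)."""
    return [1.0 / len(VOCAB)] * len(VOCAB)

def rollout(prompt, max_new_tokens=5, seed=0):
    rng = random.Random(seed)
    state = list(prompt)                       # state = prompt + tokens so far
    trajectory = []
    for _ in range(max_new_tokens):
        probs = toy_policy(state)
        action = rng.choices(VOCAB, weights=probs, k=1)[0]   # action = next token
        next_state = state + [action]          # deterministic transition: concatenation
        done = action == "<eos>" or len(next_state) - len(prompt) == max_new_tokens
        reward = 1.0 if done else 0.0          # sparse reward, only at the end (placeholder score)
        trajectory.append((tuple(state), action, reward))
        state = next_state
        if done:
            break
    return trajectory

traj = rollout(["the"])
for state, action, reward in traj:
    print(state, "->", action, "reward:", reward)
```

Every intermediate step carries zero reward, which is why the credit for the final score has to be spread over earlier tokens by an advantage estimator.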

Policy Gradient

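The core idea of policy gradient: make good actions more likely and bad actions less likely.

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi}(s_t, a_t)\right]$$

Here $A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$ is the advantage function: it measures how much better choosing action $a_t$ in state $s_t$ is than the policy's average behavior in that state.

Intuition for the advantage function:

$A > 0$: the action is better than average → increase its probability
$A < 0$: the action is worse than average → decrease its probability
$A = 0$: the action is exactly average → no change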
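The three cases can be checked numerically on a tiny softmax policy. This is a minimal sketch assuming a three-action categorical policy parameterized directly by its logits; `pg_step` and the learning rate are hypothetical names and values chosen for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pg_step(logits, action, advantage, lr=0.5):
    """One gradient-ascent step on advantage * log pi(action) w.r.t. the logits."""
    probs = softmax(logits)
    grad_logp = -probs
    grad_logp[action] += 1.0   # d log softmax(z)[a] / dz = onehot(a) - probs
    return logits + lr * advantage * grad_logp

logits = np.zeros(3)
p_before = softmax(logits)[0]
p_up = softmax(pg_step(logits, action=0, advantage=+1.0))[0]
p_down = softmax(pg_step(logits, action=0, advantage=-1.0))[0]
p_same = softmax(pg_step(logits, action=0, advantage=0.0))[0]
print(p_before, p_up, p_down, p_same)
```

A positive advantage raises the probability of the sampled action, a negative one lowers it, and a zero advantage leaves the policy unchanged.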

Generalized Advantage Estimation (GAE)

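GAE balances bias and variance by taking an exponentially weighted average of multi-step TD errors:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.

With $\lambda = 0$ it reduces to the one-step TD estimate (low variance, high bias); with $\lambda = 1$ it reduces to the Monte Carlo estimate (high variance, low bias).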

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Compute Generalized Advantage Estimation for one episode."""
    T = len(rewards)
    advantages = [0.0] * T
    last_gae = 0.0

    for t in reversed(range(T)):
        if t == T - 1:
            next_value = 0.0  # terminal state
        else:
            next_value = values[t + 1]

        delta = rewards[t] + gamma * next_value - values[t]
        advantages[t] = delta + gamma * lam * last_gae
        last_gae = advantages[t]

    return advantages

2. PPO in Detail

Clipped Surrogate Objective

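PPO's core innovation is to limit the size of each policy update through clipping, which avoids catastrophically large steps:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio between the new and old policies.

When $r_t(\theta)$ moves too far from 1 in the direction that would increase the objective (that is, the new policy differs too much from the old one), the clipped term takes over, the gradient for that sample becomes zero, and further optimization in that direction stops.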
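The flat regions created by the clip are easy to see by evaluating the per-sample objective on a few ratios. This is a small NumPy sketch; `clipped_objective` is a hypothetical helper name, and $\epsilon = 0.2$ is just a commonly used value.

```python
import numpy as np

def clipped_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

ratios = np.array([0.5, 0.9, 1.0, 1.1, 1.5])
pos = clipped_objective(ratios, advantage=1.0)    # flat once ratio > 1 + eps
neg = clipped_objective(ratios, advantage=-1.0)   # flat once ratio < 1 - eps
print("A = +1:", pos)
print("A = -1:", neg)
```

Note the asymmetry: with a negative advantage and a ratio of 1.5 the objective is not clipped, so a step that made a bad action more likely is still penalized in full.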

[Interactive demo: drag the slider to change the parameters and see how the clipped objective limits the policy update.]

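Full PPO Loss

The total PPO loss has three parts:

$$L^{\mathrm{total}}(\theta) = -L^{\mathrm{CLIP}}(\theta) + c_1 L^{\mathrm{VF}}(\theta) - c_2 S[\pi_\theta]$$

$L^{\mathrm{CLIP}}$: the policy update term (introduced above)
$L^{\mathrm{VF}}$: value function loss = $\left(V_\theta(s_t) - V_t^{\mathrm{target}}\right)^2$
$S[\pi_\theta]$: entropy bonus, which keeps the policy from collapsing too early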

import torch
import torch.nn.functional as F

def ppo_loss(old_log_probs, new_log_probs, advantages,
             values, returns, clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """
    Compute the PPO loss.
    Args:
        old_log_probs: [B, T] log probabilities under the old policy
        new_log_probs: [B, T] log probabilities under the new policy
        advantages: [B, T] advantages computed with GAE
        values: [B, T] value function predictions
        returns: [B, T] return targets (advantages + old_values)
    """
    # Policy loss with clipping
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    # Value function loss
    value_loss = F.mse_loss(values, returns)

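    # Entropy proxy: only the sampled tokens' log-probs are available here, so this
    # is a rough surrogate rather than the exact entropy over the full vocabulary
    entropy = -(torch.exp(new_log_probs) * new_log_probs).mean()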

    total_loss = policy_loss + vf_coef * value_loss - ent_coef * entropy
    return total_loss, {
        'policy_loss': policy_loss.item(),
        'value_loss': value_loss.item(),
        'entropy': entropy.item(),
        'approx_kl': ((ratio - 1) - (ratio.log())).mean().item()
    }

3. RLHF Pipeline

Three-Stage Pipeline

The standard RLHF pipeline has three training stages. [Interactive animation: data flowing through SFT → Reward Model → PPO → Aligned Model.]

Stage 1: SFT (Supervised Fine-Tuning)

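Fine-tune the pretrained model on high-quality human demonstrations so that it learns to answer in the instruction-following format.

Data: (prompt, ideal_response) pairs
Objective: standard language modeling loss (next token prediction)
Output: SFT model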

Stage 2: Reward Model Training

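Train a scalar reward model that learns the human preference ranking.

$$L_{\mathrm{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$

where $y_w$ is the response humans prefer and $y_l$ is the worse response (Bradley-Terry model).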
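The Bradley-Terry pairwise loss is short enough to write out directly. This is a NumPy sketch on precomputed scalar scores; `bradley_terry_loss` is a hypothetical name, and a real reward model would produce the scores from a network and backpropagate through them.

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    """Mean of -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    diff = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log(sigmoid(x)) == log(1 + exp(-x)) == logaddexp(0, -x)
    return float(np.mean(np.logaddexp(0.0, -diff)))

tie = bradley_terry_loss([0.0], [0.0])     # equal scores: loss is log(2)
good = bradley_terry_loss([2.0], [-1.0])   # preferred response scored higher: small loss
bad = bradley_terry_loss([-1.0], [2.0])    # preferred response scored lower: large loss
print(tie, good, bad)
```

Only the difference between the two scores matters, so the reward model's absolute scale is not pinned down by this loss.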

Stage 3: PPO Training

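Use the RM as the reward signal and optimize the policy with PPO. A KL penalty keeps the policy from drifting too far from the SFT model:

$$\max_\theta\ \mathbb{E}_{x \sim \mathcal{D}}\left[\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta\, D_{\mathrm{KL}}\left(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{SFT}}(\cdot \mid x)\right)\right]$$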
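One common way to apply the KL penalty, shown here as an assumption rather than as the article's exact formulation, is to fold it into per-token rewards: each token is charged $\beta$ times the sampled log-ratio against the reference model, and the reward model score is added on the last token. The function name `kl_shaped_rewards` is hypothetical.

```python
import numpy as np

def kl_shaped_rewards(policy_logp, ref_logp, rm_score, beta=0.1):
    """Per-token reward: -beta * (log pi - log pi_ref), plus the RM score on the last token."""
    policy_logp = np.asarray(policy_logp, dtype=float)
    ref_logp = np.asarray(ref_logp, dtype=float)
    rewards = -beta * (policy_logp - ref_logp)   # sampled estimate of the per-token KL penalty
    rewards[-1] += rm_score                      # the sequence-level score arrives only at the end
    return rewards

r = kl_shaped_rewards(policy_logp=[-1.0, -0.5, -2.0],
                      ref_logp=[-1.0, -1.5, -2.0],
                      rm_score=0.8, beta=0.1)
print(r)
```

In this example only the second token is penalized, because it is the only one where the policy assigns a higher log-probability than the reference.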

4. GRPO - Group Relative Policy Optimization

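Core Idea: Drop the Critic Network

The key innovation of GRPO (proposed by DeepSeek): estimate the advantage from relative comparisons within a group of rollouts for the same prompt, so no separate value function (critic) has to be trained.

For each prompt $q$, sample $G$ responses $\{o_1, \dots, o_G\}$, compute their rewards $r_1, \dots, r_G$, and normalize with the group mean and standard deviation:

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}$$

[Interactive animation: PPO (needs a critic network) compared with GRPO (relative comparison within a group).]

GRPO Advantages

Aspect	PPO	GRPO
Critic network	Required (same size as the policy)	Not required
GPU memory	High (4 models)	Lower (3 models)
Advantage estimation	GAE (temporal)	Group normalization
Best suited for	General	Outcome-level reward
Implementation complexity	High	Lower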

import torch

def grpo_step(policy, ref_policy, prompts, reward_fn,
              group_size=8, clip_eps=0.2, beta=0.01):
    """
    One GRPO training step.
    Samples group_size responses per prompt and uses group-relative advantages.
    policy / ref_policy are assumed to expose generate() and log_prob() helpers.
    """
    all_losses = []

    for prompt in prompts:
        # 1. Generate G responses for this prompt
        responses = policy.generate(prompt, num_samples=group_size)

        # 2. Score all responses
        rewards = torch.tensor([reward_fn(prompt, r) for r in responses],
                               dtype=torch.float32)

        # 3. Compute group-relative advantage (no critic needed!)
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # 4. Compute policy ratio and clipped loss
        for resp, adv in zip(responses, advantages):
            new_logp = policy.log_prob(prompt, resp)   # per-token, with grad
            with torch.no_grad():
                # Rollout-time log-probs; equal to new_logp on the first update
                old_logp = policy.log_prob(prompt, resp)
                ref_logp = ref_policy.log_prob(prompt, resp)

            ratio = torch.exp(new_logp - old_logp)
            surr1 = ratio * adv
            surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv

            # KL penalty against the reference. It must depend on new_logp so that
            # it contributes a gradient; this is the non-negative "k3" estimator.
            log_r = ref_logp - new_logp
            kl = (torch.exp(log_r) - log_r - 1).mean()

            loss = -torch.min(surr1, surr2).mean() + beta * kl
            all_losses.append(loss)

    total_loss = torch.stack(all_losses).mean()
    total_loss.backward()
    return total_loss.item()

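The GRPO algorithm: group-relative rewards serve as the advantage, so no critic model is needed.

5. System Design Challenges

Training + Inference Co-design

The biggest engineering challenge in an RLHF system: the same model weights must serve both training (backward pass) and inference (rollout generation).

The core tension:

Training needs: parameter sharding (FSDP/DeepSpeed), stored activations, gradient computation

Inference needs: Tensor Parallelism, KV cache, continuous batching

The two sides want completely different storage layouts and parallelism strategies.

Common solutions:

Strategy	Pros	Cons
Colocate (same GPUs)	No network overhead	Memory contention, low utilization
Separate deployment	Each side optimized independently	Requires weight sync
Hybrid Engine	Switches dynamically	Complex to implement

[Interactive demo: tabs for three weight sync strategies, with an NCCL broadcast animation.]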

# Weight synchronization via NCCL broadcast
import torch
import torch.distributed as dist

def sync_weights_nccl(train_model, infer_model, src_rank=0, group=None):
    """
    Sync weights with NCCL broadcast.
    src_rank: global rank that holds the training model
    group: process group containing the training and inference ranks
    Assumes both models expose full (unsharded) parameters with the same
    order, shapes, and dtypes; sharded layouts need a gather/reshard step first.
    """
    is_src = dist.get_rank() == src_rank
    # The source sends its training weights; every other rank receives
    # directly into its inference weights (broadcast writes in place).
    model = train_model if is_src else infer_model
    for _, param in model.named_parameters():
        dist.broadcast(param.data, src=src_rank, group=group)

    # Ensure all ranks are synchronized
    dist.barrier(group=group)


def sync_weights_disk(train_model, save_path, infer_model=None):
    """Disk-based weight sync (simpler but slower)"""
    # Training side: save checkpoint
    torch.save(train_model.state_dict(), save_path)

    # Inference side: load checkpoint
    if infer_model is not None:
        state_dict = torch.load(save_path, map_location='cuda')
        infer_model.load_state_dict(state_dict)
    return save_path

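Training-Inference Mismatch: rollout and training use different kernels, which leads to numerical differences.

6. veRL Architecture

Design Philosophy: Single Controller + Hybrid Engine

veRL (Volcano Engine RL) uses Ray as its distributed scheduling framework and coordinates all components through a single controller.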

Core components:

Single Controller: one Python process that manages the whole training flow
Actor Workers: run the policy model (FSDP sharded)
Rollout Workers: run the inference engine (vLLM/SGLang)
Critic Workers: run the value model
Reward Workers: run the reward model

veRL training loop:

1. The Controller sends prompts to the Rollout Workers
2. The Rollout Workers generate responses with the current policy
3. The Reward Workers compute reward scores
4. The Critic Workers compute value estimates
5. The Controller computes GAE advantages
6. The Actor Workers run the PPO update
7. The new weights are synced to the Rollout Workers → back to step 1

The key to the Hybrid Engine: veRL's Actor Workers and Rollout Workers can be colocated on the same set of GPUs:

Training phase: the GPUs run FSDP training
Rollout phase: optimizer states are offloaded and the GPUs switch to inference mode
This reduces the network overhead of weight sync

7. SLIME Architecture

Efficient RLHF with SGLang + Megatron

SLIME combines SGLang (a high-performance inference engine) with Megatron/FSDP (distributed training) to achieve high-throughput RLHF.

Component	Framework	Features
Rollout Engine	SGLang	RadixAttention, continuous batching, chunked prefill
Training Backend	FSDP / Megatron	ZeRO-3 / 3D parallelism
Orchestration	Ray	Flexible resource allocation
Weight Sync	NCCL / Shared Memory	Asynchronous updates

Async rollout design: SLIME's core optimization is to overlap training and rollout:

# SLIME async rollout pseudo-code
class SLIMEController:
    def train_loop(self, num_steps):
        # Start: generate the first rollout batch
        rollout_future = self.rollout_workers.generate_async(self.get_next_batch())

        for step in range(num_steps):
            # Wait for the current rollout to finish
            experiences = rollout_future.get()

            # Immediately launch the next rollout (overlap!)
            next_batch = self.get_next_batch()
            rollout_future = self.rollout_workers.generate_async(next_batch)

            # Train while the next rollout is being generated
            advantages = self.compute_advantages(experiences)
            self.train_step(experiences, advantages)

            # After training, sync weights to the rollout workers; the rollout
            # already in flight was started with the previous weights (one step stale)
            self.sync_weights_to_rollout()

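The full SLIME workflow: Ray orchestration + SGLang rollout + Megatron/FSDP training.

8. OpenRLHF

Open-Source RLHF Framework

OpenRLHF is a widely used open-source RLHF implementation in the community. It integrates vLLM/SGLang as the rollout engine and uses Ray for distributed scheduling.

Design highlights:

Modular design: the Actor, Critic, Reward, and Reference models are deployed independently
vLLM/SGLang integration: takes advantage of the high throughput of mature inference engines
Ray orchestration: flexibly assigns GPUs to different roles
Multiple algorithms: PPO, DPO, GRPO, REINFORCE++

A typical deployment topology:

Node 0-1: Actor (FSDP, 16 GPUs): handles policy training
Node 2:   Critic (FSDP, 8 GPUs): value function training
Node 3:   vLLM Rollout (TP=4, 2 instances): generates responses
Node 4:   Reward Model (TP=2, 4 instances): scoring
Node 5:   Reference Model (TP=2): computes the KL penalty

GPU utilization: OpenRLHF maximizes GPU utilization through packing and async scheduling:

Rollout and training can overlap (near on-policy)
Reward and reference computation can run in parallel
Dynamic batch size adapts to different sequence lengths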
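One way to picture dynamic batch sizing is a token budget: instead of a fixed number of sequences per batch, sequences are grouped so that each batch stays under a maximum token count. The sketch below is an illustrative greedy scheme, not OpenRLHF's actual scheduler; `token_budget_batches` and `max_tokens` are hypothetical names.

```python
def token_budget_batches(seq_lens, max_tokens=4096):
    """Group sequence indices so each batch's total token count stays within max_tokens."""
    # Visit longer sequences first so similar lengths end up together
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    batches, current, used = [], [], 0
    for i in order:
        if current and used + seq_lens[i] > max_tokens:
            batches.append(current)      # budget exceeded: close the current batch
            current, used = [], 0
        current.append(i)
        used += seq_lens[i]
    if current:
        batches.append(current)
    return batches

seq_lens = [3000, 500, 1200, 2500, 800]
batches = token_budget_batches(seq_lens, max_tokens=4096)
for b in batches:
    print(b, "tokens:", sum(seq_lens[i] for i in b))
```

Long responses end up in small batches and short ones in large batches, which keeps per-batch memory use roughly constant.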

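OpenRLHF training stage: vLLM/SGLang inference + DeepSpeed/Ray training orchestration.

9. Multi-turn RL

RL in the Agent Loop

Modern LLM agents need multi-turn interaction: call a tool, observe the result, continue reasoning. This poses new challenges for RL systems.

Why multi-turn rollout is complex:

Variable-length trajectories: different queries may need anywhere from 1 to 10+ rounds of tool calls
Asynchronous execution: tool call latency is unpredictable (API calls, code execution)
Credit assignment: how should the reward be distributed across the actions of each turn?
Batching is hard: samples progress at different rates, so they cannot simply be batched together

[Interactive animation: a full multi-turn interaction loop (Agent → Tool → Environment → Reward).]

Async rollout design pattern: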

import asyncio

class MultiTurnRolloutManager:
    """Manages multi-turn RL rollouts with asynchronous tool execution."""

    def __init__(self, policy, tool_executor, max_turns=10):
        self.policy = policy
        self.tool_executor = tool_executor
        self.max_turns = max_turns
        self.active_trajectories = {}

    # The helpers below are illustrative; the tag format is an assumption.
    def is_tool_call(self, response):
        return "<tool_call>" in response

    def format_observation(self, tool_result):
        return f"<observation>{tool_result}</observation>"

    async def run_episode(self, prompt):
        """Run one complete multi-turn episode."""
        trajectory = []
        context = prompt

        for turn in range(self.max_turns):
            # The agent generates an action (which may contain a tool call)
            response = await self.policy.generate_async(context)

            if not self.is_tool_call(response):
                # Anything that is not a tool call is treated as the final
                # answer, so the loop cannot spin on an unchanged context
                trajectory.append({'action': response, 'final': True, 'turn': turn})
                break

            # Execute the tool asynchronously
            tool_result = await self.tool_executor.execute(response)
            observation = self.format_observation(tool_result)
            trajectory.append({
                'action': response,
                'observation': observation,
                'turn': turn
            })
            context += response + observation

        return trajectory

    async def batch_rollout(self, prompts):
        """Run multiple episodes concurrently."""
        tasks = [self.run_episode(p) for p in prompts]
        trajectories = await asyncio.gather(*tasks)
        return trajectories

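veRL Agent Loop: an asynchronous multi-turn interaction architecture that supports tool calling and environment steps.

10. Training-Inference Mismatch

The On-Policy vs Off-Policy Trade-off

Ideally PPO is an on-policy algorithm: rollouts must be generated by the current policy. In a distributed system, however, strict on-policy training is very expensive.

Truly On-Policy	Near On-Policy (Importance Sampling)
Re-run rollout before every update	Rollout and training overlap
Data comes from the latest policy	Data may come from an older (stale) policy
Training and inference must alternate	Needs an importance sampling correction
Low GPU utilization (each side waits for the other)	High GPU utilization
KL divergence is exact	Staleness can introduce bias

Impact of staleness: suppose the rollout used at training step $t$ was generated at step $t-k$ (a delay of $k$ steps). The importance weight is then

$$w = \frac{\pi_{\theta_t}(a \mid s)}{\pi_{\theta_{t-k}}(a \mid s)}$$

When $k$ is large, the variance of $w$ grows sharply → training becomes unstable.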
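The growth of the importance weight variance with staleness can be computed exactly for a toy categorical policy. The sketch assumes, purely for illustration, that each training step shifts the logits by the same fixed `drift` vector, so a rollout that is $k$ steps stale was sampled from logits that differ by `k * drift`.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def is_weight_variance(old_logits, new_logits):
    """Exact Var over a ~ pi_old of pi_new(a) / pi_old(a) for a categorical policy."""
    p_old, p_new = softmax(old_logits), softmax(new_logits)
    # E[w] = 1, so Var[w] = E[w^2] - 1 = sum(p_new^2 / p_old) - 1
    return float(np.sum(p_new ** 2 / p_old) - 1.0)

old_logits = np.array([0.0, 0.5, 1.0, 1.5])
drift = np.array([0.3, -0.1, 0.2, -0.4])   # hypothetical per-step change in the logits
staleness = (0, 1, 3, 5, 10)
variances = [is_weight_variance(old_logits, old_logits + k * drift) for k in staleness]
for k, v in zip(staleness, variances):
    print(f"k={k:>2}: Var[w] = {v:.4f}")
```

Under this linear-drift assumption the variance is zero at $k = 0$ and increases with every additional step of delay.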

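Solutions used in practice:

Method	Idea	Trade-off
Strict sync	Train only after rollout has finished	Low utilization but stable
Bounded staleness	Allow at most $k$ steps of delay	Balances utilization and stability
IS clipping	Clip the importance weights	Biased, but variance stays under control
μ-PPO	Multi-step IS correction	Theoretically elegant but complex to implement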
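Bounded staleness and IS clipping combine naturally in a small filter applied before each update. This is a sketch under assumed data structures: the dictionary keys `policy_step`, `old_logp`, and `new_logp`, and the function name `select_and_weight`, are hypothetical.

```python
import numpy as np

def select_and_weight(batches, current_step, max_staleness=1, w_max=2.0):
    """Drop rollouts older than max_staleness and truncate importance weights at w_max."""
    kept = []
    for b in batches:
        if current_step - b["policy_step"] > max_staleness:
            continue                                   # bounded staleness: too old, discard
        w = np.exp(b["new_logp"] - b["old_logp"])      # importance weight pi_new / pi_old
        kept.append({"policy_step": b["policy_step"],
                     "weights": np.minimum(w, w_max)})  # IS clipping: biased, bounded variance
    return kept

batches = [
    {"policy_step": 9, "old_logp": np.array([-1.0, -2.0]), "new_logp": np.array([-1.0, -0.5])},
    {"policy_step": 5, "old_logp": np.array([-1.0]), "new_logp": np.array([-1.0])},
]
kept = select_and_weight(batches, current_step=10)
print(kept)
```

The second batch is five steps old and is dropped; in the first batch the raw weight exp(1.5) ≈ 4.48 is truncated to 2.0.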

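11. Exercises

Exercise 1: Compute GAE by Hand

Given the following sequence (episode length T=4): rewards $r = [0, 0, 0, 1]$, values $V = [0.5, 0.6, 0.7, 0.8]$, $\gamma = 0.99$, $\lambda = 0.95$, and a terminal state ($V(s_4) = 0$). Compute $\delta_t$ and $\hat{A}_t$ for every time step.

Full solution:

Step 1: compute the TD errors δ_t
δ_3 = r_3 + γ*V(s_4) - V(s_3) = 1 + 0.99*0 - 0.8 = 0.2
δ_2 = r_2 + γ*V(s_3) - V(s_2) = 0 + 0.99*0.8 - 0.7 = 0.092
δ_1 = r_1 + γ*V(s_2) - V(s_1) = 0 + 0.99*0.7 - 0.6 = 0.093
δ_0 = r_0 + γ*V(s_1) - V(s_0) = 0 + 0.99*0.6 - 0.5 = 0.094

Step 2: compute GAE recursively (back to front)
A_3 = δ_3 = 0.2
A_2 = δ_2 + γλ*A_3 = 0.092 + 0.9405*0.2 = 0.2801
A_1 = δ_1 + γλ*A_2 = 0.093 + 0.9405*0.2801 = 0.3564
A_0 = δ_0 + γλ*A_1 = 0.094 + 0.9405*0.3564 = 0.4292

Result: A = [0.4292, 0.3564, 0.2801, 0.2]

Intuition: in this example earlier steps have higher advantages. The final positive reward propagates back to earlier steps through GAE, and each step adds its own positive TD error on top.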

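Exercise 2: Computing the PPO Loss

Given (at the single-token level): old log prob $= -2.0$, new log prob $= -1.5$, advantage $A = 0.8$, clip $\epsilon = 0.2$. Compute (a) the probability ratio, (b) the unclipped objective, (c) the clipped objective, and (d) the final PPO loss.

Full solution:

(a) ratio = exp(-1.5 - (-2.0)) = exp(0.5) ≈ 1.6487
(b) unclipped = ratio * A = 1.6487 * 0.8 = 1.319
(c) clipped_ratio = clip(1.6487, 0.8, 1.2) = 1.2
    clipped = 1.2 * 0.8 = 0.96
(d) PPO objective = min(1.319, 0.96) = 0.96
    PPO loss = -0.96 (minimizing the loss = maximizing the objective)

Analysis: since $A > 0$ (a good action), the policy has increased the probability of this action (ratio > 1). But ratio = 1.65 falls outside the range [0.8, 1.2], so the clip kicks in and blocks an overly large update. This is the core mechanism behind PPO's conservative updates.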

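Exercise 3: Design a Weight Update Strategy

GPU topology: 2 nodes, each with 8 × H100 (80GB); NVLink within a node (900 GB/s), InfiniBand between nodes (400 Gb/s); a 70B-parameter model (FP16 ≈ 140GB); requirement: deploy training (FSDP) and inference (TP=4) at the same time. Design the best GPU allocation and weight sync strategy.

Recommended plan:

Node 0 (8 GPUs): Training - FSDP (all 8 GPUs)
Node 1 (8 GPUs):
  - GPU 0-3: Inference Instance 1 (TP=4)
  - GPU 4-7: Inference Instance 2 (TP=4)

Rationale:

Training gets Node 0 to itself: with 70B under FSDP on 8 GPUs, each GPU holds about 17.5GB of parameters plus optimizer states (~70GB/GPU). That sum is about 87.5GB, which already exceeds the 80GB of an H100 before gradients and activations are counted, so this layout only works if part of the optimizer state is offloaded (for example to CPU memory) or a more memory-efficient optimizer is used. NVLink within the node keeps AllGather efficient.
Inference on Node 1: two TP=4 instances process different rollout batches in parallel, which raises throughput. TP=4 communication stays on NVLink and is not a bottleneck.
Weight sync: use NCCL broadcast (across nodes over InfiniBand). 70B FP16 ≈ 140GB takes about 2.8 seconds at the ideal 400Gb/s (50 GB/s) line rate. With double-buffering, the transfer can run asynchronously while the next rollout batch starts.
Utilization: overlap training and inference. While the training update runs, inference keeps generating rollouts with the old weights; once the update finishes, the new weights are pushed with NCCL broadcast. Bounded staleness = 1.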
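The per-GPU memory figures can be reproduced with a few lines of arithmetic. The byte counts are assumptions: 2 bytes per FP16 weight and two FP32 Adam moments (4 + 4 bytes) per parameter, all sharded evenly across the 8 training GPUs.

```python
PARAMS = 70e9
GPUS = 8
GB = 1e9

weights_per_gpu = PARAMS * 2 / GPUS / GB        # FP16 weights, fully sharded
adam_per_gpu = PARAMS * (4 + 4) / GPUS / GB     # two FP32 Adam moments, fully sharded
subtotal = weights_per_gpu + adam_per_gpu

print(f"weights/GPU:       {weights_per_gpu:.1f} GB")
print(f"Adam moments/GPU:  {adam_per_gpu:.1f} GB")
print(f"weights + moments: {subtotal:.1f} GB vs 80 GB of HBM")
```

Gradients, FP32 master weights, and activations come on top of this subtotal, so the single-node training layout needs optimizer offload or a lighter optimizer to fit.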

Exercise 4: Analyze the On-Policy vs Off-Policy Trade-offs

While training a 7B model with RLHF, you observe the IS weight variance, KL divergence, reward improvement, and GPU utilization at different staleness levels.

Staleness	IS weight variance	KL divergence	Reward improvement	GPU utilization
0	0	0.02	+0.15/step	35%
1	0.08	0.05	+0.13/step	55%
3	0.35	0.12	+0.10/step	75%
5	1.2	0.25	+0.05/step	85%
10	5.8	0.8	-0.02/step	92%

Questions: (a) What is the optimal staleness? (b) Why does the reward actually drop at staleness=10? (c) How can the problem be mitigated without lowering staleness?

(a) Optimal staleness (wall-clock improvement ∝ reward_per_step × throughput, with base throughput = 1 at 35% utilization):

Staleness 0:  0.15 × (35/35) = 0.150
Staleness 1:  0.13 × (55/35) = 0.204
Staleness 3:  0.10 × (75/35) = 0.214 ← BEST
Staleness 5:  0.05 × (85/35) = 0.121
Staleness 10: -0.02 × (92/35) = -0.053

Optimal staleness = 3, which gives the best balance between per-step efficiency and throughput.
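The same ranking can be reproduced programmatically from the table above. The scoring rule (reward gain per step times utilization relative to the staleness-0 baseline) is the one stated in the solution; the variable names are illustrative.

```python
# Values copied from the table: staleness -> (reward gain per step, GPU utilization %)
table = {0: (0.15, 35), 1: (0.13, 55), 3: (0.10, 75), 5: (0.05, 85), 10: (-0.02, 92)}

base_util = table[0][1]
scores = {k: gain * util / base_util for k, (gain, util) in table.items()}
best = max(scores, key=scores.get)

for k, s in scores.items():
    print(f"staleness {k:>2}: {s:+.3f}")
print("best staleness:", best)
```

The gap between staleness 1 and 3 is small (about 0.204 vs 0.214), so the choice is sensitive to measurement noise in either column.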

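(b) Why the reward drops at staleness=10: an IS weight variance of 5.8 means the gradient estimate is extremely unstable; KL = 0.8 shows that the current policy has drifted far from the old policy that generated the rollouts; training on such data is close to training on unrelated data, and the high variance makes the optimizer take a "random walk" instead of converging.

(c) Mitigations (without lowering staleness):

IS clipping: clip the importance weights to [0.5, 2.0], trading unbiasedness for stability
Fewer PPO epochs: go from 4 down to 1-2 so the policy changes less per step
Larger KL penalty β: limits how fast the policy changes, which lowers the effective staleness
Early stopping: monitor approx KL and discard the batch when it exceeds a threshold
Partial refresh: train on a mixed batch of 50% old data + 50% fresh rollouts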

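Coding Exercise A: Implement GAE

Implement the full GAE computation and handle episode boundaries correctly. Formulas: $\delta_t = r_t + \gamma (1 - d_t) V(s_{t+1}) - V(s_t)$, $\hat{A}_t = \delta_t + \gamma\lambda (1 - d_t) \hat{A}_{t+1}$.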

import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lambda_=0.95):
    """
    Compute Generalized Advantage Estimation.
    Args:
        rewards: np.array, shape (T,)
        values: np.array, shape (T+1,), includes the bootstrap value
        dones: np.array, shape (T,)
    Returns:
        advantages: np.array, shape (T,)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0.0

    for t in reversed(range(T)):
        # If done, the next state's value is excluded (the episode has ended)
        next_non_terminal = 1.0 - dones[t]
        # TD error
        delta = rewards[t] + gamma * values[t + 1] * next_non_terminal - values[t]
        # GAE recursion
        last_gae = delta + gamma * lambda_ * next_non_terminal * last_gae
        advantages[t] = last_gae

    return advantages

# ======== Manual check ========
rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.6, 0.7, 0.8])  # T+1 values
dones = np.array([0.0, 0.0, 0.0])
gamma, lambda_ = 0.99, 0.95

advantages = compute_gae(rewards, values, dones, gamma, lambda_)
print(f"Computed advantages: {advantages}")
# Expected (rounded to 4 decimals): [2.1474 1.1200 1.0920]

# ======== Effect of lambda ========
for lam in [0.0, 0.5, 0.95, 1.0]:
    adv = compute_gae(rewards, values, dones, gamma, lam)
    print(f"  λ={lam:.2f}: advantages = {adv}")
# λ=0: pure TD(0), low variance, high bias
# λ=1: pure Monte Carlo, high variance, low bias
# λ=0.95: a commonly used middle ground in practice

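Coding Exercise B: Implement the PPO Clip Loss

Implement the full PPO clip loss. Formulas: $r_t = \exp\left(\log \pi_\theta(a_t \mid s_t) - \log \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)\right)$, $L^{\mathrm{CLIP}} = \mathbb{E}_t\left[\min\left(r_t \hat{A}_t,\ \mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$, $\mathrm{loss} = -L^{\mathrm{CLIP}}$.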

import numpy as np

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """
    Compute the PPO clipped surrogate loss.
    Returns:
        loss: scalar (negated for minimization)
        clip_fraction: fraction of samples that were clipped
        ratio: per-sample probability ratios
    """
    # Ratio in log space for numerical stability
    ratio = np.exp(log_probs - old_log_probs)

    # Unclipped objective
    surr1 = ratio * advantages

    # Clipped objective
    clipped_ratio = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    surr2 = clipped_ratio * advantages

    # PPO objective: take the min (pessimistic bound)
    objective = np.minimum(surr1, surr2)

    # Loss = negative objective (for gradient descent)
    loss = -objective.mean()

    # Clip fraction: how often was the ratio clipped?
    clip_fraction = np.mean(np.abs(ratio - clipped_ratio) > 1e-6)

    return loss, clip_fraction, ratio


# ======== Verify clip behavior ========
eps = 0.2
test_advantages = np.array([1.0, 1.0, -1.0, -1.0])
test_old_log = np.array([0.0, 0.0, 0.0, 0.0])
# ratio = 1.5 (> 1+eps) and ratio = 0.5 (< 1-eps)
test_log = np.log(np.array([1.5, 1.0, 0.5, 1.0]))

loss, clip_frac, ratios = ppo_clip_loss(test_log, test_old_log, test_advantages, eps=0.2)
print(f"Test ratios: {np.exp(test_log - test_old_log)}")
print(f"Clip fraction: {clip_frac:.2%} (expect 50% = 2/4 clipped)")

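Key property: when $A > 0$ and ratio > $1+\epsilon$, the objective is clipped (it stops growing), which prevents the policy from changing too much; when $A < 0$ and ratio < $1-\epsilon$, it is clipped in the same way, which prevents over-penalizing an action. Taking the min guarantees a conservative update whether the advantage is positive or negative.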

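Exercise C: PPO vs GRPO Memory Comparison

RL training of a 70B-parameter model. PPO has to hold Actor + Critic + Reference + Reward (4 models); GRPO needs Actor + Reference (2 models, assuming a rule-based reward so no reward model is loaded). Adam + BF16, H100 80GB per GPU.

(a) Total memory: a trainable model (Adam, BF16 mixed precision) needs about 16 bytes/param (2B BF16 weights + 2B gradients + 4B FP32 master weights + 4B+4B Adam states); a frozen model only runs inference and needs 2 bytes/param.

PPO: Actor 16B×70B=1.12TB + Critic 1.12TB + Ref 140GB + Reward 140GB ≈ 2.52 TB
GRPO: Actor 1.12TB + Ref 140GB ≈ 1.26 TB

(b) Minimum number of H100s (weights and optimizer state only, ignoring activations and KV cache): PPO 2520/80 = 31.5 → at least 32; GRPO 1260/80 = 15.75 → at least 16.

(c) GRPO sampling overhead: group_size=16 means 16 responses are sampled per prompt (PPO typically samples 1), i.e. 16× the inference work. But this is pure inference (no backward pass) and can be generated in batches with vLLM. Key insight: this inference cost is far lower than training a critic model, so cheap inference replaces expensive critic training.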
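The totals in (a) and (b) follow from a small calculator. The byte counts are the assumptions stated in the exercise plus FP32 master weights (16 bytes per trainable parameter, 2 bytes per frozen parameter); activations and KV cache are ignored, and the helper names are illustrative.

```python
PARAMS = 70e9
TRAIN_BYTES = 2 + 2 + 4 + 4 + 4   # BF16 weights + grads, FP32 master weights, two Adam moments
FROZEN_BYTES = 2                  # BF16 weights only
H100_GB = 80

def total_gb(n_trainable, n_frozen):
    return (n_trainable * TRAIN_BYTES + n_frozen * FROZEN_BYTES) * PARAMS / 1e9

def min_gpus(gb):
    return int(-(-gb // H100_GB))   # ceiling division

ppo_gb = total_gb(n_trainable=2, n_frozen=2)    # Actor + Critic trained; Ref + Reward frozen
grpo_gb = total_gb(n_trainable=1, n_frozen=1)   # Actor trained; Ref frozen

print(f"PPO:  {ppo_gb:.0f} GB -> at least {min_gpus(ppo_gb)} H100s")
print(f"GRPO: {grpo_gb:.0f} GB -> at least {min_gpus(grpo_gb)} H100s")
```

If GRPO also loads a frozen reward model, call `total_gb(n_trainable=1, n_frozen=2)` instead, which adds another 140 GB.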

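Exercise D: System Impact of On-Policy Training and Overlap

Online RL requires rollouts to come from the current policy (on-policy), which creates a "training-inference pipeline bubble". Suppose rollout takes 60% of the time and training takes 40%:

Without overlap (sequential execution): the training GPUs sit idle during rollout → effective utilization 40/(60+40) = 40%.

|--Rollout(gen N)--|--Train(on N)--|--Rollout(gen N+1)--|--Train--|
        60%               40%               60%             40%

With overlap (disaggregated + pipelined): split the training and inference clusters. While the training cluster trains on batch N, the inference cluster generates batch N+1 with (slightly stale) weights. If the GPU split is sized so that both stages take about the same time per batch, both clusters reach roughly 90-95% utilization (with only a small bubble during weight sync).

Training:  |--Train N--|--Train N+1--|--Train N+2--|
Inference: |--Rollout N+1--|--Rollout N+2--|--Rollout N+3--|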
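The utilization figures can be written as two small formulas. This is a simplified model: in the overlapped case the pipeline period is assumed to be the slower stage plus a fixed weight-sync bubble, and the example numbers (50 time units per stage, 3 units of sync) are illustrative.

```python
def sequential_utilization(rollout_time, train_time):
    """Fraction of wall-clock time the training GPUs are busy when the stages alternate."""
    return train_time / (rollout_time + train_time)

def overlapped_utilization(rollout_time, train_time, sync_time):
    """Utilization of each cluster when rollout N+1 overlaps training N.

    The pipeline period is set by the slower stage plus the weight-sync bubble.
    """
    period = max(rollout_time, train_time) + sync_time
    return {"training": train_time / period, "inference": rollout_time / period}

seq = sequential_utilization(60, 40)
# Assumption: the clusters are sized so both stages take about 50 time units per batch.
ovl = overlapped_utilization(rollout_time=50, train_time=50, sync_time=3)
print(f"sequential: {seq:.0%}")
print(f"overlapped: training {ovl['training']:.0%}, inference {ovl['inference']:.0%}")
```

With an unbalanced split (say rollout 60, training 40), the same formula gives the training cluster only about 63%, which is why sizing the two clusters matters.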

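Exercise E: Weight Update Propagation Latency

70B parameters in BF16 = 140GB; InfiniBand at 400 Gbps (= 50 GB/s); each training step takes about 10 seconds.

(a) Transfer time: $140 / 50 = 2.8$ seconds.

(b) Can it keep up? $2.8\,\mathrm{s} < 10\,\mathrm{s}$ → yes. The transfer finishes within one step with 7.2s to spare, so it can be pipelined: transfer the weights of step N asynchronously while training step N+1.

(c) Impact of stale weights: generating rollouts with outdated weights introduces off-policy error. With 1-step staleness the quality loss is very small (this is what most systems use); with N-step staleness the error grows roughly linearly, and quality drops noticeably once KL exceeds a threshold. Design: async weight update. Start the rollout with slightly older weights, transfer the new weights in the background, and switch over when the transfer completes, which hides the 2.8s latency behind computation.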
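Parts (a) and (b) reduce to a short calculation. The sketch assumes the ideal line rate of a single link and ignores protocol overhead; `sync_plan` and its dictionary keys are hypothetical names.

```python
def sync_plan(params=70e9, bytes_per_param=2, link_gbps=400, step_seconds=10.0):
    """Ideal transfer time for a full weight copy and the slack left within one step."""
    size_gb = params * bytes_per_param / 1e9
    bandwidth_gb_s = link_gbps / 8             # bits per second -> bytes per second
    transfer_s = size_gb / bandwidth_gb_s
    return {"transfer_s": transfer_s,
            "slack_s": step_seconds - transfer_s,
            "hidden": transfer_s < step_seconds}   # can the copy overlap with the next step?

plan = sync_plan()
print(plan)
```

Calling `sync_plan(link_gbps=100)` shows how quickly the slack disappears on a slower link: the transfer then takes 11.2 s and no longer fits inside a 10 s step.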

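Exercise F: System-Level Detection of Reward Hacking

During GRPO training the model may "hack" the reward model (high reward while actual quality drops).

(a) Metrics to monitor:

KL divergence trend: KL($\pi_\theta \| \pi_{\mathrm{ref}}$), which rises sharply under hacking
Reward-KL correlation: in healthy training the reward rises while KL stays flat; under hacking the reward rises while KL explodes
Output diversity: unique n-gram ratio, distinct-N; hacking often causes output collapse
Pattern-specific detectors: repetition, format gaming (excessive bullets/markdown), abnormal length
RM score vs ground truth: periodically sample for human eval / LLM-as-judge to detect divergence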
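Of the metrics above, output diversity is the easiest to compute online. Here is a minimal distinct-N sketch using whitespace tokenization, which is a simplifying assumption; a production monitor would more likely work on model token ids.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a batch (whitespace tokenization)."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

diverse = distinct_n(["the cat sat on the mat", "a dog ran in the park"])
collapsed = distinct_n(["great great great great", "great great great great"])
print(f"diverse batch:   {diverse:.2f}")
print(f"collapsed batch: {collapsed:.2f}")
```

Tracking this ratio against a baseline measured early in training gives the `output_diversity` style signal that a pause rule can compare against a threshold.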

(b) Pause triggers:

if (KL_current > 2.0 * KL_moving_average) and reward_still_increasing:
    trigger_pause()   # KL spike + rising reward = strong hacking signal
if output_diversity < 0.3 * baseline_diversity:
    trigger_pause()   # output collapse
if reward_model_score - human_eval_score > threshold:
    trigger_pause()   # RM diverges from human eval

(c) Recovery: roll back to the most recent "healthy" checkpoint; increase the KL penalty β; update or retrain the reward model; add a diversity bonus to the reward function.

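This article is Chapter 5 of the ML Systems series. The body is rendered from markdown; the 5 interactive animations are embedded as isolated iframes through a custom {% anim %} tag and come from the Arkive tutorial.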