[Interactive figure: Multi-turn RL Loop. Components: Agent (Policy), Environment (User/API), Tool Executor, Reward Signal.]