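Agentic Reinforcement Learning
Multi-turn interaction for complex decision-making scenarios such as games, multi-turn dialogue, and tool use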
Towards Async, Omni-Modal RL at Scale, Just Relax.
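Supports Megatron for training and the SGLang inference engine, with optimized distributed computation
Seamlessly integrates text, vision, and audio modalities for comprehensive AI training
Service-oriented design with control, service, and implementation layers that is easy to customize
Built-in Metrics service, health-check manager, and notification system for production needs
Quick-start examples and comprehensive documentation to get you up and running in minutes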
Layered service design: the Orchestration layer coordinates the training loop, the Components layer deploys Actor, Rollout, and GenRM as Ray Serve services, and the Engine & Backends layer runs Megatron and SGLang. Data flows through TransferQueue, and weights sync via Checkpoint Engine.
Rollout, Reference, Forward, and Training stages run as interleaved pipelines, so each stage works on a different batch at the same time and GPU idle time stays near zero.
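The interleaving idea can be illustrated with a toy pipeline. This is only a sketch under stated assumptions: it uses plain `asyncio` queues rather than Relax's own services, the `stage` and `run_pipeline` helpers are hypothetical names, and `asyncio.sleep` stands in for GPU work. It shows that the rollout stage can start on batch 1 while later stages are still busy with batch 0.

```python
import asyncio

STAGES = ["rollout", "reference", "forward", "train"]

async def stage(name, inbox, outbox, log):
    """Hypothetical worker: pull a batch, 'process' it, pass it downstream."""
    while True:
        batch = await inbox.get()
        if batch is None:              # sentinel: propagate shutdown downstream
            await outbox.put(None)
            return
        log.append((name, "start", batch))
        await asyncio.sleep(0.01)      # stand-in for real GPU work
        log.append((name, "end", batch))
        await outbox.put(batch)

async def run_pipeline(num_batches):
    log = []
    # One queue in front of each stage, plus one for finished batches.
    queues = [asyncio.Queue() for _ in range(len(STAGES) + 1)]
    workers = [
        asyncio.create_task(stage(name, queues[i], queues[i + 1], log))
        for i, name in enumerate(STAGES)
    ]
    for batch in range(num_batches):
        await queues[0].put(batch)
    await queues[0].put(None)

    done = []
    while (item := await queues[-1].get()) is not None:
        done.append(item)
    await asyncio.gather(*workers)
    return done, log

done, log = asyncio.run(run_pipeline(3))
```

In the recorded `log`, the rollout of batch 1 begins before training on batch 0 starts, which is the overlap the stage description refers to.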
Define environments and reward functions as simple Python classes. Relax orchestrates multi-turn agent interactions with tool use, vision, and custom scoring — all natively integrated into the training loop.
Scale inference engines up or down via REST API — without restarting training. Supports same-cluster scaling and cross-cluster federation with P2P weight sync.
```python
# has_answer, extract, parse, obs, tools, and should_stop are placeholders
# for user-supplied logic.
class CustomAgentEnv(BaseInteractionEnv):
    def reset(self):
        self.turn, self.memory, self.trajectory = 0, [], []

    # async because tool execution is awaited; callers must await env.step(...)
    async def step(self, response):
        # Final answer reached: end the episode.
        if has_answer(response):
            return obs("done"), 0.0, True, False, {"answer": extract(response)}

        # Unparseable tool call: small penalty, end the episode.
        tool_call = parse(response)
        if tool_call is None:
            return obs("invalid"), -0.1, True, False, {"error": True}

        result = await tools[tool_call.name].execute(tool_call.args)
        self.memory.append({"action": tool_call, "obs": result})
        self.trajectory.append(
            {"tool_call": tool_call, "result": result, "turn": self.turn}
        )
        if should_stop(self.memory):
            return obs(result), 0.0, True, False, {"stop": True}

        # Continue; flag truncation once the turn budget is used up.
        self.turn += 1
        truncated = self.turn >= self.max_turns
        return obs(result), 0.0, False, truncated, {"turn": self.turn}
```
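To see the five-value return convention (observation, reward, terminated, truncated, info) in action without the framework, here is a minimal self-contained sketch. Everything in it is an assumption made for illustration: `BaseInteractionEnv` is a local stand-in rather than the real base class, and `EchoToolEnv` and `run_episode` are hypothetical names. It contrasts an episode that ends with an answer (`terminated`) with one that runs out of turns (`truncated`).

```python
import asyncio

class BaseInteractionEnv:
    """Local stand-in for the framework base class (assumption for this sketch)."""
    max_turns = 3

class EchoToolEnv(BaseInteractionEnv):
    def reset(self):
        self.turn = 0

    async def step(self, response):
        # A response starting with "ANSWER:" terminates the episode.
        if response.startswith("ANSWER:"):
            answer = response[len("ANSWER:"):].strip()
            return "done", 0.0, True, False, {"answer": answer}
        # Otherwise echo the response back and count the turn.
        self.turn += 1
        truncated = self.turn >= self.max_turns
        return f"echo: {response}", 0.0, False, truncated, {"turn": self.turn}

async def run_episode(env, responses):
    """Feed scripted responses until the episode terminates or is truncated."""
    env.reset()
    for response in responses:
        obs, reward, terminated, truncated, info = await env.step(response)
        if terminated or truncated:
            break
    return obs, terminated, truncated, info

finished = asyncio.run(run_episode(EchoToolEnv(), ["look up x", "ANSWER: 42"]))
cut_off = asyncio.run(run_episode(EchoToolEnv(), ["a", "b", "c", "d"]))
```

Here `finished` ends with `terminated=True` and carries the extracted answer, while `cut_off` stops after the third turn with `truncated=True`, so the fourth scripted response is never consumed.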
```python
# --custom-generate-function-path your_module.generate
# engine, max_turns, build_env, tokenize, and finalize are placeholders
# for objects supplied by your setup.
async def generate(args, sample: Sample, sampling_params) -> Sample:
    env = build_env(sample, args)
    env.reset()
    tokens = list(sample.tokens)  # copy so the input sample is not mutated
    loss_mask, log_probs, rewards = [], [], []

    for _ in range(max_turns):
        # Model turn: generated tokens are trained on (mask = 1).
        out = await engine.generate(tokens)
        tokens += out.ids
        loss_mask += [1] * len(out.ids)
        log_probs += out.log_probs

        # step awaits tool calls internally, so it must be awaited here.
        obs, reward, terminated, truncated, info = await env.step(out.text)
        rewards.append(reward)
        if terminated or truncated:
            break

        # Environment turn: observation tokens are excluded from the loss (mask = 0).
        obs_ids = tokenize(env.format_observation(obs))
        tokens += obs_ids
        loss_mask += [0] * len(obs_ids)
        log_probs += [0.0] * len(obs_ids)

    return finalize(sample, tokens, loss_mask, log_probs, rewards)
```
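The masking convention in the rollout loop can be checked on a tiny example. The sketch below is illustrative only: `build_masks` is a hypothetical helper, and the token ids and log-probabilities are made up. It shows that model-generated tokens get mask 1 and keep their log-probabilities, while environment observation tokens get mask 0 and a placeholder log-probability of 0.0.

```python
def build_masks(turns):
    """turns: list of (role, token_ids, log_probs); role is "model" or "env"."""
    tokens, loss_mask, log_probs = [], [], []
    for role, ids, lps in turns:
        trained = role == "model"
        tokens += ids
        loss_mask += [1 if trained else 0] * len(ids)
        # Observation tokens carry no policy log-probs; pad with 0.0 to stay aligned.
        log_probs += lps if trained else [0.0] * len(ids)
    return tokens, loss_mask, log_probs

turns = [
    ("model", [11, 12, 13], [-0.1, -0.2, -0.3]),  # first model response
    ("env", [91, 92], None),                      # tool observation
    ("model", [14, 15], [-0.4, -0.5]),            # second model response
]
tokens, loss_mask, log_probs = build_masks(turns)
```

All three lists stay the same length, so a per-token loss can simply be multiplied by `loss_mask` to ignore the observation positions.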
```python
# --custom-rm-path your_module.reward
async def reward(sample: Sample) -> dict:
    # Rule-based accuracy: exact match against the label.
    answer = extract_answer(sample.response)
    acc = 1.0 if is_exact_match(answer, sample.label) else 0.0

    # Model-based judgment, including any multimodal inputs.
    judge_response = await judge_model.generate(
        prompt=sample.prompt,
        answer=answer,
        label=sample.label,
        multimodal_inputs=sample.multimodal_inputs,
    )
    judge_score = parse_verdict(judge_response)
    fmt = 1.0 if is_well_formatted(sample.response) else 0.0

    # Weighted combination of accuracy, judge score, and format.
    score = 0.5 * acc + 0.3 * judge_score + 0.2 * fmt
    return {"score": score, "acc": acc, "format": fmt,
            "judge_response": judge_response}
```
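As a worked example of the weighted score, the combination step can be isolated into a pure function. This is a sketch, not part of the framework: `combine_score` and `WEIGHTS` are hypothetical names, and it assumes the judge score is already normalized to the range 0 to 1.

```python
WEIGHTS = {"acc": 0.5, "judge": 0.3, "format": 0.2}  # same weights as the reward function

def combine_score(acc, judge_score, fmt):
    """Weighted sum of accuracy, judge score, and format score."""
    return (WEIGHTS["acc"] * acc
            + WEIGHTS["judge"] * judge_score
            + WEIGHTS["format"] * fmt)

# Correct, well-formatted answer with a lukewarm judge: 0.5 + 0.15 + 0.2, about 0.85.
good = combine_score(1.0, 0.5, 1.0)
# Wrong answer that the judge likes and that is well formatted: 0.0 + 0.3 + 0.2, about 0.5.
wrong = combine_score(0.0, 1.0, 1.0)
```

Because the weights sum to 1, the combined score stays between 0 and 1 whenever each component does, and exact-match accuracy alone contributes half of the total.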
