Relax

Towards Async, Omni-Modal RL at Scale, Just Relax.

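Agentic Reinforcement Learning

Multi-turn interaction for complex decision-making scenarios such as games, multi-turn dialogue, and tool use.

High-Performance Computing

Megatron training and SGLang inference engines, with optimized distributed computing.

Multimodal Support

Seamless integration of text, vision, and audio modalities for comprehensive AI training.

Flexible Architecture

Service-oriented design with control, service, and implementation layers that is easy to customize.

Advanced Monitoring

Built-in Metrics service, health check manager, and notification system for production environments.

Easy to Use

Quick-start examples and comprehensive documentation get you up and running in minutes.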

Technically deep, yet simple to use.

Server-Based Architecture

Layered service design: the Orchestration layer coordinates the training loop, the Components layer deploys Actor, Rollout, and GenRM as Ray Serve services, and the Engine & Backends layer runs Megatron and SGLang. Data flows through TransferQueue, and weights sync via Checkpoint Engine.

Fully Async Pipeline

Rollout, Reference, Forward, and Training stages run as interleaved pipelines, so work on successive batches overlaps and GPU idle time stays close to zero.
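The interleaving can be illustrated with a small, self-contained sketch. This is not Relax's implementation: it assumes each stage is a coroutine connected to the next by an `asyncio.Queue`, and `asyncio.sleep` stands in for GPU work. The names `stage` and `run_pipeline` are hypothetical.

```python
import asyncio

STAGES = ["rollout", "reference", "forward", "train"]


async def stage(name, inbox, outbox, log):
    """Pull a batch, do the stage's simulated work, then pass the batch on."""
    while True:
        batch = await inbox.get()
        if batch is None:  # sentinel: forward the shutdown signal and exit
            if outbox is not None:
                await outbox.put(None)
            return
        log.append((name, "start", batch))
        await asyncio.sleep(0.01)  # stand-in for GPU work
        log.append((name, "end", batch))
        if outbox is not None:
            await outbox.put(batch)


async def run_pipeline(num_batches):
    """Run all stages concurrently and return the ordered event log."""
    log = []
    queues = [asyncio.Queue(maxsize=1) for _ in STAGES]
    tasks = []
    for i, name in enumerate(STAGES):
        outbox = queues[i + 1] if i + 1 < len(STAGES) else None
        tasks.append(asyncio.create_task(stage(name, queues[i], outbox, log)))
    for batch in range(num_batches):
        await queues[0].put(batch)
    await queues[0].put(None)
    await asyncio.gather(*tasks)
    return log


log = asyncio.run(run_pipeline(3))
```

In the resulting log, the rollout stage starts on batch 1 while batch 0 is still moving through the later stages, which is the overlap the paragraph above describes.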

Agentic RL Training

Define environments and reward functions as simple Python classes. Relax orchestrates multi-turn agent interactions with tool use, vision, and custom scoring — all natively integrated into the training loop.

Elastic Rollout Scaling

Scale inference engines up or down via REST API — without restarting training. Supports same-cluster scaling and cross-cluster federation with P2P weight sync.
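As a rough mental model only, the sketch below shows a pool of engines that is resized between requests while routing continues. It is an in-process toy, not Relax's REST API: the `ElasticPool` class, its methods, and the engine names are hypothetical, and weight sync is not modeled.

```python
import itertools


class ElasticPool:
    """Toy registry of inference engines that can grow or shrink between requests."""

    def __init__(self, size):
        self.engines = []
        self._ticket = itertools.count()
        self.scale_to(size)

    def scale_to(self, size):
        """Add or drop engines until exactly `size` remain."""
        if size < 1:
            raise ValueError("pool needs at least one engine")
        while len(self.engines) < size:
            self.engines.append(f"engine-{len(self.engines)}")
        del self.engines[size:]

    def route(self):
        """Round-robin over whichever engines are registered right now."""
        return self.engines[next(self._ticket) % len(self.engines)]


pool = ElasticPool(2)
before = [pool.route() for _ in range(2)]
pool.scale_to(4)      # scale up without recreating the pool
after_up = [pool.route() for _ in range(2)]
pool.scale_to(1)      # scale down; later requests only see the survivor
after_down = [pool.route() for _ in range(2)]
```

The point of the sketch is that the caller keeps using the same `pool` object across scaling events, which mirrors the idea of scaling rollout capacity without restarting training.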

Relax Framework

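custom_env.py

# BaseInteractionEnv, obs, has_answer, extract, parse, tools, and should_stop
# are assumed to be provided by Relax or by your own module.
class CustomAgentEnv(BaseInteractionEnv):
    def reset(self):
        self.turn, self.memory, self.trajectory = 0, [], []

    def step(self, response):
        # Final answer: end the episode; scoring is left to the reward function.
        if has_answer(response):
            return obs("done"), 0.0, True, False, {"answer": extract(response)}

        # Unparseable tool call: small penalty, then terminate.
        tool_call = parse(response)
        if tool_call is None:
            return obs("invalid"), -0.1, True, False, {"error": True}

        # step() is synchronous, so the tool is called without `await`;
        # make step() async (and await it in the caller) if your tools are coroutines.
        result = tools[tool_call.name].execute(tool_call.args)
        self.memory.append({"action": tool_call, "obs": result})
        self.trajectory.append({"action": tool_call, "obs": result, "turn": self.turn})
        if should_stop(self.memory):
            return obs(result), 0.0, True, False, {"stop": True}

        # Truncate once the turn budget (self.max_turns) is used up.
        self.turn += 1
        truncated = self.turn >= self.max_turns
        return obs(result), 0.0, False, truncated, {"turn": self.turn}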
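The environment above depends on helpers that are not shown, so here is a minimal, runnable stand-in that follows the same `step` return convention: observation, reward, terminated, truncated, info. `EchoEnv` and `run_episode` are hypothetical names used only for illustration, and the class does not inherit from Relax's base class.

```python
class EchoEnv:
    """Stand-in environment: the episode ends when the agent replies 'ANSWER: ...'."""

    def __init__(self, max_turns=3):
        self.max_turns = max_turns
        self.turn = 0

    def reset(self):
        self.turn = 0

    def step(self, response):
        prefix = "ANSWER:"
        if response.startswith(prefix):
            answer = response[len(prefix):].strip()
            return "done", 0.0, True, False, {"answer": answer}
        self.turn += 1
        truncated = self.turn >= self.max_turns
        return "keep going", 0.0, False, truncated, {"turn": self.turn}


def run_episode(env, responses):
    """Feed scripted responses to the env until it terminates or truncates."""
    env.reset()
    history, info = [], {}
    for response in responses:
        obs, reward, terminated, truncated, info = env.step(response)
        history.append((obs, terminated, truncated))
        if terminated or truncated:
            break
    return history, info


history, info = run_episode(EchoEnv(), ["let me think", "ANSWER: 42"])
cut_short, _ = run_episode(EchoEnv(max_turns=2), ["a", "b", "c"])
```

Keeping `terminated` (the task ended) separate from `truncated` (the turn budget ran out) lets the caller tell a finished episode from one that was cut short.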

# --custom-generate-function-path your_module.generate
# build_env, engine, tokenize, and finalize are placeholders for your own helpers.
async def generate(args, sample: Sample, sampling_params) -> Sample:
    env = build_env(sample, args)
    env.reset()
    tokens = list(sample.tokens)  # copy so the sample's prompt tokens are not mutated
    loss_mask, log_probs, rewards = [], [], []

    for turn in range(env.max_turns):
        # Model turn: generated tokens are trained on (mask = 1).
        out = await engine.generate(tokens)
        tokens += out.ids
        loss_mask += [1] * len(out.ids)
        log_probs += out.log_probs

        obs, reward, terminated, truncated, info = env.step(out.text)
        rewards.append(reward)
        if terminated or truncated:
            break

        # Environment turn: observation tokens are context only (mask = 0).
        obs_ids = tokenize(env.format_observation(obs))
        tokens += obs_ids
        loss_mask += [0] * len(obs_ids)
        log_probs += [0.0] * len(obs_ids)

    return finalize(sample, tokens, loss_mask, log_probs, rewards)
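The loss mask built in the loop above marks which response tokens came from the model (1) and which came from the environment (0). How Relax consumes the mask downstream is not shown here; the sketch below assumes the common pattern of multiplying per-token terms by the mask so that observation tokens contribute nothing. `append_segment` and `masked_logprob_sum` are hypothetical helpers.

```python
def append_segment(loss_mask, log_probs, segment_log_probs, trainable):
    """Extend mask and log-probs by one segment (a model turn or an observation)."""
    loss_mask.extend([1 if trainable else 0] * len(segment_log_probs))
    log_probs.extend(segment_log_probs)


def masked_logprob_sum(log_probs, loss_mask):
    """Sum log-probs over trainable tokens only."""
    if len(log_probs) != len(loss_mask):
        raise ValueError("log_probs and loss_mask must have the same length")
    return sum(lp * m for lp, m in zip(log_probs, loss_mask))


loss_mask, log_probs = [], []
append_segment(loss_mask, log_probs, [-0.5, -0.25], trainable=True)      # model turn
append_segment(loss_mask, log_probs, [0.0, 0.0, 0.0], trainable=False)   # observation
append_segment(loss_mask, log_probs, [-1.0], trainable=True)             # model turn

total = masked_logprob_sum(log_probs, loss_mask)
num_trainable = sum(loss_mask)
```

Padding observation positions with `0.0` log-probs, as the original loop does, keeps the three lists aligned so that a single index addresses the same token in each.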

# --custom-rm-path your_module.reward
# extract_answer, is_exact_match, judge_model, parse_verdict, and is_well_formatted
# are placeholders for your own helpers.
async def reward(sample: Sample) -> dict:
    # Rule-based accuracy against the label.
    answer = extract_answer(sample.response)
    acc = 1.0 if is_exact_match(answer, sample.label) else 0.0

    # Model-based judgment, with multimodal inputs passed through.
    judge_response = await judge_model.generate(
        prompt=sample.prompt,
        answer=answer,
        label=sample.label,
        multimodal_inputs=sample.multimodal_inputs,
    )
    judge_score = parse_verdict(judge_response)
    fmt = 1.0 if is_well_formatted(sample.response) else 0.0

    # Weighted blend: 50% accuracy, 30% judge, 20% format.
    score = 0.5 * acc + 0.3 * judge_score + 0.2 * fmt
    return {"score": score, "acc": acc, "format": fmt,
            "judge_response": judge_response}
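The weighted blend in the reward function can be checked in isolation. The sketch below reuses the 0.5 / 0.3 / 0.2 weights from the snippet above and assumes that `judge_score` lies in [0, 1], which the original does not state; `combine` is a hypothetical helper name.

```python
import math

WEIGHTS = {"acc": 0.5, "judge": 0.3, "format": 0.2}


def combine(acc, judge_score, fmt):
    """Blend accuracy, judge score, and format score with fixed weights."""
    return (WEIGHTS["acc"] * acc
            + WEIGHTS["judge"] * judge_score
            + WEIGHTS["format"] * fmt)


# Correct, well-formatted answer that the judge rates at 0.5.
partial = combine(acc=1.0, judge_score=0.5, fmt=1.0)
# Wrong answer that is still well formatted and gets no credit from the judge.
format_only = combine(acc=0.0, judge_score=0.0, fmt=1.0)
```

Because the weights sum to 1, the blended score stays in [0, 1] whenever each component does, and a response that is only well formatted earns at most 0.2.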

Pipeline diagram: Rollout on GPU0 to GPU3, Reference on GPU4, Forward on GPU5, Train on GPU6 and GPU7.
