Skip to content

Architecture

Overview

Relax is a reinforcement learning training framework for large language models, built on Ray Serve. It supports the Megatron training backend, SGLang inference engine, and algorithm families including GRPO/GSPO/SAPO. The framework uses a layered architecture that decouples orchestration, components, engines, backends, and distributed capabilities into independent modules.

Layered Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    Entrypoints Layer                             │
│              train.py — signal handling, launch training         │
└────────────────────────────┬─────────────────────────────────────┘

┌────────────────────────────▼─────────────────────────────────────┐
│                   Orchestration Layer (Core)                     │
│  ┌──────────────────┐  ┌──────────┐  ┌────────────────────────┐  │
│  │   Controller     │  │ Service  │  │     Registry           │  │
│  │ (training loop,  │  │(lifecycle│  │  (roles/algo registry) │  │
│  │  deployment)     │  │ mgmt)    │  │                        │  │
│  └──────────────────┘  └──────────┘  └────────────────────────┘  │
└────────────────────────────┬─────────────────────────────────────┘

┌────────────────────────────▼─────────────────────────────────────┐
│               Components Layer (Ray Serve Deployments)           │
│  ┌────────┐ ┌──────────┐ ┌────────┐ ┌───────────┐ ┌──────────┐   │
│  │ Actor  │ │ Rollout  │ │ Critic │ │ ActorFwd  │ │Advantages│   │
│  └────────┘ └──────────┘ └────────┘ └───────────┘ └──────────┘   │
│  ┌────────┐                                                      │
│  │ GenRM  │                                                      │
│  └────────┘                                                      │
└────────────────────────────┬─────────────────────────────────────┘

┌────────────────────────────▼─────────────────────────────────────┐
│                   Engine Layer                                   │
│  ┌─────────────────────┐  ┌──────────────────────────────────┐   │
│  │  Rollout Engine     │  │  Reward Functions                │   │
│  │  (SGLang rollout,   │  │  (deepscaler, math, genrm, ...)  │   │
│  │   on-policy distill)│  │                                  │   │
│  └─────────────────────┘  └──────────────────────────────────┘   │
└────────────────────────────┬─────────────────────────────────────┘

┌────────────────────────────▼─────────────────────────────────────┐
│               Backends Layer            Distributed Layer        │
│  ┌──────────────────┐ ┌───────────┐  ┌──────────┐ ┌───────────┐  │
│  │ Megatron Backend │ │  SGLang   │  │ Ray Actor│ │   DCS     │  │
│  │ (Actor, weight   │ │  Inference│  │  Groups  │ │ Checkpoint│  │
│  │  update, HF conv)│ │  Engine   │  │  Mgmt    │ │  Service  │  │
│  └──────────────────┘ └───────────┘  └──────────┘ └───────────┘  │
└──────────────────────────────────────────────────────────────────┘

1. Entrypoints Layer

relax/entrypoints/train.py is the main training entry point:

  • Parses command-line arguments (extended from Megatron-LM argument parser)
  • Initializes Ray cluster connection
  • Registers SIGTERM/SIGINT signal handlers for SGLang process cleanup
  • Creates Controller and starts training loop

2. Orchestration Layer (Core)

The orchestration layer coordinates the entire training workflow:

  • Controller: Core orchestrator — deploys all RL service components, manages the training loop, handles health checking and global restart
  • Service: Lifecycle wrapper for Ray Serve deployments — manages placement groups, heartbeat reporting, service restart
  • Registry: Global constants and role registry — defines ROLES, ALGOS mappings

3. Components Layer

The components layer contains all RL service components, each deployed as a @serve.deployment:

ComponentFileResponsibility
Actoractor.pyPolicy training (Megatron backend)
Rolloutrollout.pyRollout service orchestration, manages RolloutManager
Criticcritic.pyValue estimation
ActorFwdactor_fwd.pyForward inference log-prob (fully-async mode)
Advantagesadvantages.pyAdvantage computation (GRPO/GSPO/SAPO etc.)
GenRMgenrm.pyGenerative reward model

Sync mode deploys three core components: Actor + Rollout + Critic. Fully-async mode (--fully-async) additionally deploys ActorFwd and Advantages components.

4. Engine Layer

The engine layer provides rollout data generation and reward computation:

  • Rollout Engine: SGLang rollout implementation, on-policy distillation, etc.
  • Reward Functions: Pluggable reward computation (deepscaler, math_utils, genrm, etc.)
  • Router: Request routing and load balancing
  • Filters: Data filtering strategies

5. Backends Layer

The backends layer wraps underlying training and inference engines:

  • Megatron Backend: Distributed training based on Megatron-LM, supporting TP/PP/CP/EP parallelism
  • SGLang Engine: High-performance inference engine management and process lifecycle control

6. Distributed Layer

  • Ray Actor Groups: Manages RolloutManager, GenRMManager, and other Ray Actor groups
  • DCS Checkpoint Service: Distributed weight synchronization service supporting NCCL/GLOO/TCP communication backends

Directory Structure

relax/
├── core/                    Orchestration layer
│   ├── controller.py        Controller: deployment, training loop, health mgmt
│   ├── service.py           Service: Ray Serve deployment lifecycle wrapper
│   └── registry.py          Global constants and role registry
├── components/              Components layer (Ray Serve Deployments)
│   ├── actor.py             Policy training
│   ├── actor_fwd.py         Forward inference log-prob
│   ├── critic.py            Value estimation
│   ├── advantages.py        Advantage computation
│   ├── genrm.py             Generative reward model
│   └── rollout.py           Rollout service orchestration
├── engine/                  Engine layer
│   ├── rollout/             Rollout engine implementations
│   ├── rewards/             Reward functions
│   ├── router/              Request routing
│   └── filters/             Data filtering
├── backends/                Backends layer
│   ├── megatron/            Megatron training backend
│   └── sglang/              SGLang inference engine
├── distributed/             Distributed layer
│   ├── ray/                 Ray Actor group management
│   └── checkpoint_service/  DCS distributed checkpoint service
├── entrypoints/             Entrypoints layer
│   ├── train.py             Main training entry point
│   └── deploy_metrics_service.py  Standalone metrics service deployment
└── utils/                   Infrastructure
    ├── arguments.py         Command-line argument parsing
    ├── data/                Dataset loading and streaming
    ├── metrics/             Metrics collection and reporting
    └── ...                  Logging, timers, health monitoring, etc.

Data Flow

Sync Mode

┌───────────────────────────────────────────────────────────────────────┐
│                       Sync Training Data Flow                         │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  ┌────────────┐   ┌──────────────────┐   ┌────────────────────────┐   │
│  │ JSONL/     │   │ Dataset /        │   │ RolloutDataSource      │   │
│  │ Parquet    │──>│ StreamingDataset │──>│ WithBuffer             │   │
│  │ Files      │   │                  │   │                        │   │
│  └────────────┘   └──────────────────┘   └───────────┬────────────┘   │
│                                                      │                │
│                                                      ▼                │
│  ┌──────────────────────────────────────────────────────────────────┐ │
│  │                  RolloutManager.generate()                       │ │
│  │  SGLang engine inference → reward computation → assemble data    │ │
│  └───────────────────────────┬──────────────────────────────────────┘ │
│                              │                                        │
│                              ▼                                        │
│  ┌───────────┐   ┌──────────────────┐   ┌──────────────────────────┐  │
│  │  Actor    │──>│  DCS Weight Sync │──>│  Rollout (update weights)│  │
│  │ (training)│   │  (NCCL AllGather)│   │                          │  │
│  └───────────┘   └──────────────────┘   └──────────────────────────┘  │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

Fully-Async Mode (--fully-async)

┌───────────────────────────────────────────────────────────────────────┐
│                    Fully-Async Training Data Flow                     │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  Rollout ──→ TransferQueue ──→ Advantages ──→ TransferQueue ──→ Actor │
│     ↑                              │                             │    │
│     │                              │                             │    │
│     └──── DCS Weight Sync ◄── ActorFwd ◄──── DCS Weight Sync ───┘     │
│                                                                       │
│  5 components run independently, passing data via TransferQueue       │
│  Actor updates weights to ActorFwd → Reference → Rollout every N steps│
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

Communication Patterns

Inter-Component Communication

Components communicate via Ray Serve HTTP interfaces and Ray Actor handles:

python
# Controller deploys components via Service wrapper
service = Service(role="actor", cls=Actor, num_gpus=8, ...)
service.run()  # Start training task

# Components sync weights via DCS Checkpoint Service
client = CheckpointEngineClient(coordinator_url="http://...", role="actor", ...)
await client.update_weights_for_rollout(rollout_only=True)

Network Port Allocation

Services use fixed port ranges to prevent conflicts during distributed process group initialization:

ServicePort Range
DCS weight sync (Actor → Rollout)11000 - 11999
Rollout (SGLang engine)15000 - 15999
GenRM (SGLang engine)16000 - 16999
OS ephemeral (Megatron NCCL)32768 - 50000 (recommended)

See Distributed Checkpoint - Network Port Allocation for details.

Controller Orchestration

The Controller manages the lifecycle of all components via Ray Serve:

python
# Controller registers all services
ctrl = Controller(args, runtime_env)
ctrl.register_all_serve()   # Deploy Actor, Rollout, Critic, ...
ctrl.training_loop()        # Start training loop
ctrl.shutdown()             # Clean up SGLang engines

Deployment

Deploy training jobs via launch scripts:

bash
# Sync training (8 GPUs)
bash scripts/training/text/run-qwen3-4B-8xgpu.sh

# Fully-async training (8 GPUs)
bash scripts/training/text/run-qwen3-4B-8xgpu-async.sh

# Or call entry point directly
python relax/entrypoints/train.py [ARGS...]

Scalability

The architecture supports multi-dimensional scaling:

  • Rollout Component: Configure SGLang inference engines via --rollout-num-gpus and --rollout-num-gpus-per-engine, with elastic scaling support
  • Actor Component: Supports Megatron TP/PP/CP/EP parallelism strategies
  • Reward Functions: Pluggable reward computation modules, specified via --rm-type (built-in) or --custom-rm-path (custom)

Monitoring and Observability

Built-in monitoring components:

  • Metrics Service: Collects and aggregates training metrics (WandB, Prometheus)
  • Health Check Manager: Monitors service health with automatic restart and global recovery
  • Notification System: Alerts on failures via Apprise

See:

Next Steps

Released under the Apache 2.0 License.