Configuration Reference
The parameters are divided into three categories:
- Megatron parameters: Relax will read all parameters defined in Megatron from the PYTHONPATH. You can configure them by passing arguments such as --tensor-model-parallel-size 2.
- SGLang parameters: All parameters supported by the installed SGLang environment are available. These parameters must be prefixed with --sglang. For example, --mem-fraction-static should be passed as --sglang-mem-fraction-static.
- Relax-specific parameters: Please refer to
relax/utils/arguments.py.
For common configuration usage and examples, see the Quick Start Guide.
Cluster and Resource Configuration
Ray Launch Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--rollout-num-gpus-per-engine | int | 1 | GPUs per SGLang inference engine, equivalent to SGLang's tp_size |
--num-gpus-per-node | int | 8 | GPUs per node. Set this if using fewer than 8 GPUs per node in colocate mode |
--resource | json | - | Ray resource configuration in JSON format. Example: '{"actor":[replicas, gpus], "rollout":[replicas, gpus]}' |
--colocate | flag | False | Whether to colocate inference engines and training Actors on the same GPUs. Automatically sets --offload to True |
--offload | flag | False | Equivalent to setting both --offload-train and --offload-rollout |
--offload-train | flag | None | Whether to offload training Actor to CPU during training. Always True when --colocate is enabled |
--offload-rollout | flag | None | Whether to offload Rollout generator to CPU during training. Always True when --colocate is enabled |
--distributed-backend | str | nccl | Distributed backend |
--distributed-timeout-minutes | int | 10 | Distributed timeout in minutes |
TransferQueue Data Queue
| Parameter | Type | Default | Description |
|---|---|---|---|
--num-data-storage-units | int | 1 | Number of TransferQueue SimpleStorageUnit actors |
--max-staleness | int | 0 | Maximum staleness for TransferQueue data system (0=on-policy) |
--polling-mode | bool | True | Whether to use polling mode when fetching metadata |
--num-iters-per-train-update | int | 1 | Number of iterations per global batch in fully async pipeline |
Training Backend and Mode
| Parameter | Type | Default | Options | Description |
|---|---|---|---|---|
--train-backend | str | megatron | megatron | Training backend selection |
--qkv-format | str | thd | thd, bshd | QKV layout for Megatron backend. Dynamic batching not supported in bshd mode; must specify --micro-batch-size |
--megatron-to-hf-mode | str | raw | raw, bridge | Megatron to HF weight conversion method. bridge uses megatron bridge for automatic conversion |
--true-on-policy-mode | flag | False | - | Whether to enable true on-policy mode |
--fully-async | flag | False | - | Whether to use fully asynchronous training pipeline |
Checkpoint Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
--hf-checkpoint | str | None | HuggingFace model checkpoint path. Used to initialize SGLang and provide tokenizer. Need not contain latest parameters, only consistent with training model architecture |
--ref-load | str | None | Reference model checkpoint path. Used as initial checkpoint for training when --load is not set |
--ref-ckpt-step | int | None | Reference model checkpoint step |
--load | str | None | Actor model checkpoint load path. Specify for resuming training |
--save | str | None | Path to save model during training |
--save-interval | int | None | Model save interval in steps |
--save-hf | str | None | Path to save HuggingFace format model for Megatron backend. Path can include {rollout_id} placeholder |
--async-save | flag | False | Asynchronous checkpoint saving |
--no-save-optim | flag | False | Do not save optimizer state in checkpoint. Reduces checkpoint size but prevents resuming training from that checkpoint |
--rotate-ckpt | flag | False | Whether to rotate checkpoints. Requires setting --save, --save-interval, and --async-save |
--max-actor-ckpt-to-keep | int | None | Maximum number of Actor checkpoints to keep |
--checkpoint-engine-backend | str | nccl | Checkpoint engine backend |
--critic-load | str | None | Critic model checkpoint path. When None, equals --load |
--critic-save | str | None | Critic model save path |
Data Configuration
Dataset
| Parameter | Type | Default | Description |
|---|---|---|---|
--prompt-data | str | None | Training prompt dataset path |
--input-key | str | input | Key for input field in dataset |
--label-key | str | None | Key for label field in dataset |
--metadata-key | str | metadata | Key for metadata field in dataset |
--tool-key | str | tools | Key for tools field when applying Chat Template |
--apply-chat-template | flag | False | Apply Chat Template to input as OpenAI message format |
--apply-chat-template-kwargs | json | {} | Additional parameters for Chat Template |
--system-prompt | str | None | Optional system prompt added before user input. Final message is <system_prompt> + <dataset_prompt> |
--rollout-shuffle | flag | False | Whether to shuffle prompt order during Rollout |
--rollout-seed | int | 42 | Random seed for Rollout, used for shuffling prompts and random sampling |
--use-streaming-dataset | flag | False | Use streaming dataset to save memory |
--streaming-buffer-size | int | 10000 | Buffer size for streaming dataset |
--prefetch-chunk-size | int | 32 | Number of samples to dispatch to the thread-pool in each prefetch round. Larger values increase throughput but also memory pressure. Only effective when --use-streaming-dataset is set and the dataset contains multimodal data |
--prefetch-max-cached | int | 256 | Maximum number of pre-loaded samples kept in the prefetch cache. When the cache is full the background prefetch thread pauses until consumers free space. Set to 0 to disable prefetching. Only effective when --use-streaming-dataset is set and the dataset contains multimodal data |
--prefetch-num-workers | int | 1 | Number of parallel worker threads inside the prefetch buffer for I/O-bound media decoding (video/image). Set to 1 to serialise all decoding (safest for FFmpeg which is not fully thread-safe). Higher values increase parallelism but may trigger EAGAIN errors on some platforms. Only effective when prefetching is enabled |
--custom-prompt-path | str | None | Dotted import path to a custom function that transforms the prompt before conversation/multimodal processing. Function signature: def custom_fn(prompt, data: dict) -> prompt. Example: my_package.prompt_utils.add_prefix |
--data-source-path | str | relax.engine.rollout.data_source.RolloutDataSourceWithBuffer | Rollout data source class path |
--start-rollout-id | int | None | Starting Rollout step. If not set, attempts to read from checkpoint specified by --load |
Multimodal Data
| Parameter | Type | Default | Description |
|---|---|---|---|
--multimodal-keys | json | None | Multimodal data field mapping. Example: '{"image": "image_key"}' |
--use-audio-in-video | flag | False | Whether to process audio in video |
--image-max-token-num | int | None | Maximum token count for image processing. Default is 16384 if not set |
--image-min-token-num | int | None | Minimum token count for image processing. Default is 4 if not set |
--video-min-token-num | int | None | Minimum token count for video frame processing. Default is 128 if not set |
--video-max-token-num | int | None | Maximum token count for video frame processing. Default is 768 if not set |
--video-fps | float | None | Target FPS for video processing. Default is 2.0 if not set |
--video-fps-min-frames | int | None | Minimum frames for video processing. Default is 4 if not set |
--video-fps-max-frames | int | None | Maximum frames for video processing. Default is 768 if not set |
--image-resize-scale-factor | int | None | Scale factor for image resize dimension alignment. Default uses patch_size * spatial_merge_size (typically 28). Set to 0 to disable alignment |
--audio-sample-rate | int | None | Sample rate for audio processing. Default is 16000 if not set |
--frame-factor | int | None | Frame alignment factor. Default is 2 if not set |
--mm-processor-pool-size | int | 0 | Size of the multimodal processor pool. 0 (default) disables the pool and uses ThreadPoolExecutor. When set to a positive integer, creates a ProcessPoolExecutor with the specified number of workers for true parallelism without GIL contention |
Rollout Configuration
Sampling Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--num-rollout | int | None | Total number of Rollout rounds. Choose either this or --num-epoch |
--num-epoch | int | None | Number of training epochs. Automatically calculates num_rollout based on dataset size. Ignored if --num-rollout is also set |
--rollout-batch-size | int | Required | Number of prompts per rollout round. Total data = rollout-batch-size * n-samples-per-prompt |
--n-samples-per-prompt | int | 1 | Number of responses generated per prompt |
--rollout-temperature | float | 1.0 | Sampling temperature for inference engine |
--rollout-top-p | float | 1.0 | Top-p sampling parameter for inference engine |
--rollout-top-k | int | -1 | Top-k sampling parameter for inference engine. -1 means not used |
--rollout-max-response-len | int | None | Maximum response length, equivalent to SGLang's max_tokens |
--rollout-max-prompt-len | int | None | Maximum prompt length. Filters long prompts during dataset initialization if set |
--rollout-max-context-len | int | None | Maximum context length for inference engine. Should not exceed max_position_embeddings in HuggingFace model config.json |
--rollout-stop | str (list) | None | Stop words for Rollout. Can be one or multiple strings |
--rollout-stop-token-ids | int (list) | None | Stop token IDs for Rollout |
--rollout-skip-special-tokens | flag | False | Whether to skip special tokens in responses |
Oversampling and Dynamic Filtering
| Parameter | Type | Default | Description |
|---|---|---|---|
--over-sampling-batch-size | int | None | Sampling batch granularity. When None, uses rollout-batch-size. Must be >= rollout-batch-size |
--dynamic-sampling-filter-path | str | None | Dynamic sampling filter function path. Implements filters like DAPO (e.g., excluding all-correct or all-wrong samples). Example: relax.engine.filters.dynamic_sampling_filters.check_reward_nonzero_std |
--buffer-filter-path | str | None | Buffer filter function path. Function signature: list[list[Sample]] -> list[list[Sample]] |
Partial Rollout
| Parameter | Type | Default | Description |
|---|---|---|---|
--partial-rollout | flag | False | Enable partial Rollout. Incomplete samples are recycled to data buffer, suitable for long response scenarios |
--partial-rollout-max-aborted-count | int | None | Maximum number of times a sample can be aborted. After reaching threshold, sample is guaranteed to complete |
--mask-offpolicy-in-partial-rollout | flag | False | Whether to mask previous generation in partial Rollout. When set, only on-policy generated tokens participate in training |
Weight Update
| Parameter | Type | Default | Description |
|---|---|---|---|
--update-weight-buffer-size | int | 512MB | Buffer size for weight updates in bytes. Updates weights in chunks, useful for MoE models |
--update-weights-interval | int | 1 | Weight update interval |
--keep-old-actor | flag | False | Whether to keep Rollout model during training |
External Inference Engine
| Parameter | Type | Default | Description |
|---|---|---|---|
--rollout-external | flag | False | Use external SGLang instance instead of framework-launched instance |
--rollout-external-engine-addrs | str (list) | None | List of external engine addresses and ports |
SGLang Engine Parameters
For more parameters, refer to SGLang official documentation.
| Parameter | Type | Default | Description |
|---|---|---|---|
--sglang-mem-fraction-static | float | - | SGLang static memory allocation ratio |
--sglang-profile | flag | False | Enable torch profiling on SGLang engines during rollout. Profile traces will be saved per rollout step |
--sglang-profile-steps | int (list) | None | List of absolute rollout step IDs (0-indexed) at which to enable SGLang profiling. Takes precedence over --sglang-profile-step-start/end. Example: --sglang-profile-steps 3 10 50 |
--sglang-profile-step-start | int | None | Start of the rollout step range for SGLang profiling (inclusive, 0-indexed). Used with --sglang-profile-step-end to specify a contiguous range. Ignored if --sglang-profile-steps is set |
--sglang-profile-step-end | int | None | End of the rollout step range for SGLang profiling (inclusive, 0-indexed). Used with --sglang-profile-step-start to specify a contiguous range. Ignored if --sglang-profile-steps is set. E.g. start=2, end=4 profiles steps 2, 3, 4 |
--sglang-profile-output-dir | str | None | Output directory for SGLang profile traces. Defaults to traces/<tb_experiment_name>/sglang_trace |
--sglang-profile-num-steps | int | 3 | Number of SGLang forward steps to profile per rollout. -1 profiles the entire rollout step until stop_profile is called |
--sglang-profile-activities | str (list) | ["CPU", "GPU"] | Activities to profile (e.g., CPU GPU) |
--sglang-profile-by-stage | flag | False | Profile by stage (prefill/decode) separately |
--sglang-profile-with-stack | flag | False | Record call stack in profile traces |
--sglang-profile-record-shapes | flag | False | Record tensor shapes in profile traces |
Custom Rollout Functions
| Parameter | Type | Default | Description |
|---|---|---|---|
--rollout-function-path | str | relax.engine.rollout.sglang_rollout.generate_rollout | Rollout generation function path |
--custom-generate-function-path | str | None | Custom generate function to replace default rollout generate function. Suitable for multi-turn dialogue, function calling, etc. |
--rollout-data-postprocess-path | str | None | Rollout data postprocessing function, called after all data (including log_probs) is fetched. Can be used to update loss mask |
--custom-rollout-log-function-path | str | None | Custom Rollout logging function |
--custom-eval-rollout-log-function-path | str | None | Custom evaluation Rollout logging function |
Batch Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
--global-batch-size | int | None | Global batch size. Defines sample count needed for one parameter update (optimizer.step) |
--micro-batch-size | int | 1 | Micro batch size. Ignored when --use-dynamic-batch-size is enabled |
--num-steps-per-rollout | int | None | Training steps per Rollout. Equivalent to setting GBS = rollout_batch_size * n_samples_per_prompt / num_steps_per_rollout |
--use-dynamic-batch-size | flag | False | Enable dynamic batching. Dynamically packs samples by length so each micro-batch's total tokens approach --max-tokens-per-gpu limit |
--max-tokens-per-gpu | int | None | Maximum tokens per GPU. Must be set when dynamic batching is enabled. Should be set to approximately max_response_len / cp_size when using CP |
--log-probs-max-tokens-per-gpu | int | None | Maximum tokens per GPU when computing log probs. When None, equals max-tokens-per-gpu |
--balance-data | flag | False | Use karmarkar_karp algorithm to balance token count across data parallel ranks. Only available in colocate mode; not supported with --fully-async. Note: different responses for the same prompt may be assigned to different training steps |
Parallelism Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
--tensor-model-parallel-size | int | 1 | Tensor parallelism size |
--pipeline-model-parallel-size | int | 1 | Pipeline parallelism size |
--sequence-parallel | flag | False | Enable sequence parallelism |
--context-parallel-size | int | 1 | Context parallelism size |
--expert-model-parallel-size | int | 1 | Expert parallelism size (for MoE models) |
--expert-tensor-parallel-size | int | 1 | Expert tensor parallelism size |
Recomputation
Recomputation parameters use native Megatron parameters. For details, refer to Megatron documentation.
| Parameter | Type | Default | Description |
|---|---|---|---|
--recompute-granularity | str | - | Recomputation granularity: full, selective |
--recompute-method | str | - | Recomputation method: uniform, block |
--recompute-num-layers | int | - | Number of layers to recompute |
Optimizer Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
--lr | float | 1e-6 | Learning rate |
--optimizer | str | - | Optimizer type (native Megatron parameter) |
--lr-decay-style | str | - | Learning rate decay style (native Megatron parameter) |
--weight-decay | float | - | Weight decay (native Megatron parameter) |
--adam-beta1 | float | - | Adam beta1 (native Megatron parameter) |
--adam-beta2 | float | - | Adam beta2 (native Megatron parameter) |
--clip-grad | float | 1.0 | Gradient clipping |
--seed | int | 1234 | Random seed |
--optimizer-cpu-offload | flag | - | Enable CPU offload for optimizer state (native Megatron parameter) |
--overlap-cpu-optimizer-d2h-h2d | flag | - | Overlap CPU optimizer D2H/H2D communication (native Megatron parameter) |
--use-precision-aware-optimizer | flag | - | Use precision-aware optimizer (native Megatron parameter) |
--use-distributed-optimizer | flag | - | Shard optimizer state, ZeRO-1 style (native Megatron parameter) |
--overlap-grad-reduce | flag | - | Overlap backward compute with grad reduce-scatter (native Megatron parameter) |
--overlap-param-gather | flag | - | Overlap reduce-scatter with next-step param all-gather; requires --overlap-grad-reduce (native Megatron parameter) |
--calculate-per-token-loss | flag | False | Calculate loss per token (native Megatron parameter) |
Optimizer Flag Compatibility
| Scenario | --use-distributed-optimizer | --overlap-grad-reduce / --overlap-param-gather |
|---|---|---|
| Text-only dense | ✅ | ✅ |
| Dense VL, CP = 1 | ✅ | ✅ |
| Dense VL, CP > 1 | ✅ | ❌ |
| MoE | ✅ | ❌ |
Algorithm Configuration
Advantage Estimation
| Parameter | Type | Default | Options | Description |
|---|---|---|---|---|
--advantage-estimator | str | grpo | grpo, gspo, on_policy_distillation, sapo | Advantage estimator. Note: OPD is now independent of advantage estimator; enable OPD on any estimator with --opd-kl-coef > 0 |
--normalize-advantages | flag | False | - | Whether to normalize advantages |
--disable-grpo-std-normalization | flag | - | - | Disable GRPO standard deviation normalization (from Dr.GRPO) |
--disable-rewards-normalization | flag | - | - | Disable reward normalization |
--disable-compute-advantages-and-returns | flag | - | - | Disable advantage and return computation. Used for SFT or custom loss functions |
Loss Function
| Parameter | Type | Default | Options | Description |
|---|---|---|---|---|
--loss-type | str | policy_loss | policy_loss, custom_loss | Loss type. When custom_loss, must set --custom-loss-function-path |
--custom-loss-function-path | str | None | - | Custom loss function path |
--eps-clip | float | 0.2 | - | PPO clipping range (lower bound) |
--eps-clip-high | float | None | - | PPO clipping upper bound. When None, equals --eps-clip |
--eps-clip-c | float | None | - | Dual-clip PPO value lower bound (paper) |
--value-clip | float | 0.2 | - | Value function clipping range |
--entropy-coef | float | 0.0 | - | Entropy loss coefficient |
KL Divergence Related
| Parameter | Type | Default | Description |
|---|---|---|---|
--kl-coef | float | 0.0 | KL penalty coefficient for reward shaping (applied to reward signal before advantage calculation). Cannot be non-zero simultaneously with --kl-loss-coef |
--use-kl-loss | flag | False | Whether to use KL loss in GRPO |
--kl-loss-coef | float | 0.0 | KL penalty coefficient added to final PPO loss. Cannot be non-zero simultaneously with --kl-coef |
--kl-loss-type | str | k1 | k1, k2, k3, low_var_kl |
--use-unbiased-kl | flag | False | Enable unbiased KL estimation |
--ref-update-interval | int | None | Reference model update interval in Rollout steps. None means no update |
SAPO Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--sapo-tau-pos | float | 1.0 | SAPO positive advantage temperature |
--sapo-tau-neg | float | 1.05 | SAPO negative advantage temperature |
Critic Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
--num-critic-only-steps | int | 0 | Number of steps to train Critic only |
--critic-train-only | flag | False | Train Critic model only |
--critic-lr | float | None | Critic learning rate. When None, equals --lr |
--critic-lr-warmup-iters | int | 0 | Number of iterations for linear warmup of Critic model |
Off-Policy Correction
| Parameter | Type | Default | Description |
|---|---|---|---|
--use-rollout-logprobs | flag | False | Use Rollout's logprobs when computing importance sampling ratio. When not set, uses Actor model's logprobs |
--use-tis | flag | False | Enable TIS (Truncated Importance Sampling) off-policy importance sampling |
--tis-clip | float | 2.0 | Upper clipping threshold for importance sampling ratio |
--tis-clip-low | float | 0 | Lower clipping threshold for importance sampling ratio |
--custom-tis-function-path | str | None | Custom TIS/RS function path |
--custom-pg-loss-reducer-function-path | str | None | Custom pg_loss reducer function path. When set, pg_loss uses custom reducer while other metrics use default sum_of_sample_mean |
Routing Replay and OPSM
| Parameter | Type | Default | Description |
|---|---|---|---|
--use-routing-replay | flag | False | Enable Routing Replay (paper) |
--use-rollout-routing-replay | flag | False | Enable Rollout Routing Replay (paper). Automatically enables --use-routing-replay |
--use-opsm | flag | False | Enable Off-Policy Sequence Masking (OPSM) |
--opsm-delta | float | 1e-4 | OPSM threshold |
Other Training Options
| Parameter | Type | Default | Description |
|---|---|---|---|
--reset-optimizer-states | flag | False | Whether to reset optimizer state after each Rollout |
--use-rollout-entropy | flag | False | Whether to compute entropy when calculating logprobs. Used for special loss mask |
--get-mismatch-metrics | flag | False | Whether to compute mismatch metrics. Requires setting --custom-tis-function-path |
Parameter Freezing and Selective Training
| Parameter | Type | Default | Description |
|---|---|---|---|
--only-train-params-name-list | str (list) | None | List of regex patterns for parameters to train. Other parameters are frozen. Cannot be used simultaneously with --freeze-params-name-list. Example: --only-train-params-name-list experts |
--freeze-params-name-list | str (list) | None | List of regex patterns for parameters to freeze. Other parameters remain trainable. Example: --freeze-params-name-list embedding output_layer |
Evaluation Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
--eval-interval | int | None | Evaluation interval in Rollout rounds |
--eval-prompt-data | str (list) | None | Evaluation datasets in format: dataset_name /path/to/data.jsonl. Can specify multiple pairs |
--eval-config | str | None | OmegaConf YAML/JSON evaluation config file path. When set, overrides --eval-prompt-data |
--eval-function-path | str | None | Evaluation generation function path. When None, uses --rollout-function-path |
--skip-eval-before-train | flag | False | Whether to skip evaluation before training |
--eval-input-key | str | None | Key for input field in evaluation data. When None, uses --input-key |
--eval-label-key | str | None | Key for label field in evaluation data |
--eval-tool-key | str | None | Key for tool field in evaluation data |
--n-samples-per-eval-prompt | int | 1 | Number of samples per evaluation prompt |
--eval-temperature | float | None | Sampling temperature for evaluation |
--eval-top-p | float | None | Top-p parameter for evaluation |
--eval-top-k | int | None | Top-k parameter for evaluation |
--eval-max-response-len | int | None | Maximum response length for evaluation |
--eval-max-prompt-len | int | None | Maximum prompt length for evaluation |
--eval-min-new-tokens | int | None | Minimum new tokens generated for evaluation |
--eval-max-context-len | int | None | Maximum context length for evaluation. When None, equals --rollout-max-context-len |
Reward Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
--rm-type | str | None | Built-in reward model type |
--custom-rm-path | str | None | Custom reward function path. Function signature: def custom_rm(args, sample) -> float |
--reward-key | str | None | Key to extract reward value when reward function returns dict |
--eval-reward-key | str | None | Reward key for evaluation. When None, equals --reward-key |
--group-rm | flag | False | Whether to compute reward for entire group |
--rm-url | str | None | Remote reward model service URL (for --rm-type remote_rm) |
--custom-reward-post-process-path | str | None | Custom reward postprocessing function path. Default is GRPO normalization |
--custom-convert-samples-to-train-data-path | str | None | Custom function to convert samples to training data. Signature: def convert_samples_to_train_data(args, samples) -> dict |
GenRM (Generative Reward Model)
| Parameter | Type | Default | Description |
|---|---|---|---|
--genrm-model-path | str | None | GenRM model path. Enables GenRM when set |
--genrm-num-gpus | int | 1 | Total GPUs for GenRM |
--genrm-num-gpus-per-engine | int | 1 | GPUs per GenRM engine instance |
--genrm-engine-config | json | None | GenRM engine initialization parameters. Example: '{"dp_size": 1, "pp_size": 1, "max_total_tokens": 8192}' |
--genrm-sampling-config | json | None | GenRM sampling parameters. Available keys: temperature (default 0.1), top_p (default 1.0), top_k (default -1), max_response_len (default 4096) |
On-Policy Distillation (OPD)
| Parameter | Type | Default | Options | Description |
|---|---|---|---|---|
--use-opd | flag | False | - | Enable On-Policy Distillation. Must also specify --opd-type |
--opd-type | str | None | sglang, megatron | OPD type. sglang: fetch teacher model logprobs from external SGLang server; megatron: load teacher model via --opd-teacher-load |
--opd-kl-coef | float | 1.0 | - | OPD KL penalty coefficient |
--opd-teacher-load | str | None | - | OPD teacher model checkpoint path. Required when --opd-type=megatron |
--opd-teacher-ckpt-step | int | None | - | OPD teacher model checkpoint step |
Fault Tolerance Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
--use-fault-tolerance | flag | False | Whether to enable fault tolerance during Rollout |
--use-health-check | flag | False | Whether to enable global health check system. Controller's HealthManager monitors all services and triggers automatic restart on failure |
--max-global-restart | int | 3 | Maximum number of global restarts allowed. Training terminates after exceeding. Only effective when --use-health-check is enabled |
--rollout-health-check-interval | float | 30.0 | Rollout engine health check interval in seconds |
--rollout-health-check-timeout | float | 30.0 | Rollout engine health check timeout in seconds |
--rollout-health-check-first-wait | float | 0 | Initial wait time before starting health check in seconds. Increase this value when using deepgemm |
Elastic Scaling Configuration
Autoscaler
| Parameter | Type | Default | Description |
|---|---|---|---|
--autoscaler-config | str | None | Path to autoscaler YAML configuration file. Enables autoscaling when set, disabled when not set. Example: --autoscaler-config relax/utils/autoscaler/autoscaler.yaml |
For autoscaler YAML configuration details, see relax/utils/autoscaler/autoscaler.yaml.
Scale-Out Operation Parameters
| Parameter | Type | Default | Options | Description |
|---|---|---|---|---|
--scale-out-timeout | float | 300.0 | - | Timeout for all scale-out operations (engine startup, connect, health check, weight sync) in seconds |
--scale-out-partial-success-policy | str | rollback_all | rollback_all, keep_partial | Policy for partial success during scale-out. rollback_all reverts all engines on any failure; keep_partial keeps successfully scaled engines |
Scale-In Operation Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--scale-in-drain-timeout | float | 30.0 | Timeout to wait for in-flight requests to drain before force-aborting (seconds) |
--scale-in-shutdown-timeout | float | 30.0 | Timeout for graceful SGLang engine shutdown; ray.kill is used if exceeded (seconds) |
Logging and Monitoring Configuration
TensorBoard
| Parameter | Type | Default | Description |
|---|---|---|---|
--use-tensorboard | flag | False | Enable TensorBoard logging |
--tb-project-name | str | None | TensorBoard log directory. Defaults to environment variable TENSORBOARD_DIR |
--tb-experiment-name | str | None | TensorBoard experiment name |
ClearML
| Parameter | Type | Default | Description |
|---|---|---|---|
--use-clearml | flag | False | Enable ClearML logging |
Metrics Service
| Parameter | Type | Default | Description |
|---|---|---|---|
--use-metrics-service | flag | False | Enable centralized metrics collection and reporting service |
--timeline-dump-dir | str | None | Timeline trace event export directory (Chrome Trace format). Timeline tracing disabled if not set |
Logging Options
| Parameter | Type | Default | Description |
|---|---|---|---|
--log-passrate | flag | False | Enable pass@n pass rate logging |
--log-multi-turn | flag | False | Enable multi-turn Rollout information logging |
--log-correct-samples | flag | False | Log correct samples |
--log-reward-category | str | None | Log reward category statistics. Specify key in reward dict |
Notifications
| Parameter | Type | Default | Description |
|---|---|---|---|
--notify-urls | str | None | Apprise notification URL list (comma-separated) |
Router Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
--use-slime-router | flag | False | Use SlimeRouter for text-based routing instead of SGLang's token-based routing |
--slime-router-middleware-paths | str (list) | "" | List of middleware paths |
--slime-router-timeout | float | None | SlimeRouter HTTP request timeout in seconds |
--slime-router-max-connections | int | None | SlimeRouter HTTP client maximum connections |
--slime-router-health-check-failure-threshold | int | 3 | Mark worker as unhealthy after this many consecutive failures |
Other Training Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--train-env-vars | json | {} | Additional environment variables for training process |
--train-memory-margin-bytes | int | 1GB (1024³) | Reserved space for training memory allocation |
--disable-weights-backuper | flag | - | Disable weight backup to save host memory |
--custom-model-provider-path | str | None | Custom model provider function path |
--recompute-loss-function | flag | False | Whether to recompute loss function to save VRAM |
--log-probs-chunk-size | int | -1 | Chunk size for computing log probs. Used to save VRAM |
--allgather-cp | flag | False | - |
--model-name | str | None | Model name for Megatron to HF weight conversion. Inferred from AutoConfig.from_pretrained(hf_checkpoint) if not set |
--only-load-weight | flag | False | Reference model and actor fwd only load weights |
--rlsp-server-port | int | 8234 | RLSP Server HTTP port |
--custom-config-path | str | None | Custom parameter YAML config file path. Key-value pairs in file override existing parameters |
Debug & Profiling Parameters
Debug
| Parameter | Type | Default | Description |
|---|---|---|---|
--debug-rollout-only | flag | False | Run Rollout only, no training |
--debug-train-only | flag | False | Run training only, no Rollout |
--load-debug-rollout-data | str | None | Load debug Rollout data. Automatically enables --debug-train-only when set |
--load-debug-rollout-data-subsample | float | None | Subsample debug Rollout data to accelerate debugging |
--save-debug-rollout-data | str | None | Save Rollout data. Path supports {rollout_id} placeholder |
--save-debug-train-data | str | None | Save training data. Path supports {rollout_id} placeholder |
--dump-details | str | None | Export all training details for post-hoc analysis |
--check-weight-update-equal | flag | False | Check if weight updates are equal |
--enable-cuda-memory-check | flag | False | Enable memory check around low-level NCCL communication calls. Logs available GPU memory before each collective and attaches memory info to exceptions on failure |
Training Performance Profiling
These parameters control the PyTorch Profiler for training steps. Trace files are saved to traces/<tb_experiment_name>/train_trace/ by default.
| Parameter | Type | Default | Description |
|---|---|---|---|
--use-pytorch-profiler | flag | False | Enable PyTorch's built-in profiler to record CUDA kernels, CPU ops, and communication during training (from Megatron) |
--profile-step-start | int | 10 | Step offset at which to start profiling (inclusive, from Megatron). Counts from 0 since the current training launch, not absolute rollout ID; resets on checkpoint resumption |
--profile-step-end | int | 12 | Step offset at which to stop profiling (inclusive, from Megatron). Same counting semantics as above. E.g. start=10, end=12 profiles steps 10, 11, 12 (3 steps) |
--profile-target | str (list) | train_overall | Profiling targets: train_overall, train_actor, train_log_probs |
--profile-with-stack | flag | False | Record stack information in profiler traces |
--profile-with-memory | flag | False | Record memory information in profiler traces |
--profile-with-flops | flag | False | Estimate FLOPs in profiler traces |
GPU Memory Profiling
These parameters control GPU memory snapshot collection for diagnosing memory leaks and OOM issues. Snapshot files can be viewed with PyTorch Memory Viz tools (torch.cuda.memory._viz).
| Parameter | Type | Default | Description |
|---|---|---|---|
--record-memory-history | flag | False | Enable CUDA memory allocation history recording (from Megatron). Records call stacks and tensor info for each allocation/deallocation, and auto-dumps a snapshot on OOM |
--memory-snapshot-path | str | snapshot.pickle | Memory snapshot filename (from Megatron) |
--memory-snapshot-dir | str | None | Memory snapshot output directory. Defaults to traces/<tb_experiment_name>/memory_snapshot |
--memory-snapshot-num-steps | int | None | Proactively dump a memory snapshot after the specified number of steps (0-indexed, i.e., setting 3 means dump after step 2) |
--memory-recorder | str | torch | Memory recorder backend: torch (PyTorch built-in), memray (requires pip install memray) |
Network
| Parameter | Type | Default | Description |
|---|---|---|---|
--http-proxy | str | None | HTTP proxy address |
--use-distributed-post | flag | False | Use distributed POST requests |
Environment Configuration
Relax uses configs/env.yaml to configure runtime environment variables:
env_vars:
TOKENIZERS_PARALLELISM: 'true'
NCCL_DEBUG: 'WARN'
CUDA_DEVICE_MAX_CONNECTIONS: '1'
GLOO_SOCKET_IFNAME: "eth0"
TP_SOCKET_IFNAME: "eth0"