Skip to content

Customize Training

Prerequisites

Make sure you have completed the Installation steps.

Model Preparation

Download

You can download models and datasets from platforms like Hugging Face and ModelScope. Below are example commands using huggingface_hub to download sample resources:

bash
# Download model weights (Qwen3-VL-4B)
hf download Qwen/Qwen3-VL-4B-Instruct --local-dir /root/Qwen3-VL-4B-Instruct

Megatron Weights to HF Weights

No Manual Conversion Needed with Megatron Bridge

Relax uses Megatron Bridge as the weight bridging layer for its training backend, automatically handling bidirectional HF ↔ Megatron weight conversion during training — no manual conversion steps required. Simply specify the following option in your launch script:

bash
--megatron-to-hf-mode bridge

HF Weights to Megatron Weights

See Quick Start — Export Model.

Adding New Models

Adding a new model requires two parts:

1. Model Configuration Script

Model configuration files are located in scripts/models/. Extract the corresponding Megatron architecture parameters from the HF config. For example, scripts/models/qwen3-4B.sh:

bash
MODEL_ARGS=(
   --swiglu
   --num-layers 36
   --hidden-size 2560
   --ffn-hidden-size 9728
   --num-attention-heads 32
   --group-query-attention
   --num-query-groups 8
   --use-rotary-position-embeddings
   --disable-bias-linear
   --normalization "RMSNorm"
   --norm-epsilon 1e-6
   --rotary-base 1000000
   --vocab-size 151936
   --kv-channels 128
   --qk-layernorm
)

After adding the file, source the corresponding model configuration in your training launch script.

2. Megatron Bridge Model Adaptation

Relax uses Megatron Bridge for automatic HF ↔ Megatron weight conversion. If your model is not yet supported by Megatron Bridge, you need to add support on the Megatron Bridge side first — see its project documentation for details.

AI-Assisted Integration

This project provides a Codewiz skill model-integration (located at .codewiz/skills/model-integration/), covering the complete integration workflow for Bridge / Raw / FSDP backends, weight converter specifications, TP sharding logic, and common pitfalls. Invoke it in Codewiz via invoke skill model-integration for step-by-step guidance.

Data Preparation

Relax supports loading .jsonl and .parquet format files. Using .jsonl as an example, each line is a JSON object:

json
{
  "prompt": [
    {
      "content": "<image><audio><video>What happened in the video?\nOptions:\nA. a sunny day\nB. It's Hailing\nC. a furious storm\nD. Flood",
      "role": "user"
    }
  ],
  "image_key": ["path to your image"],
  "audio_key": ["path to your audio"],
  "video_key": ["path to your video"],
  "label": "<answer>B</answer>"
}

For multimodal data, each modality should have a corresponding placeholder in the content field, such as <image><audio><video> above, for correct message formatting. Multimodal data supports local file paths, URLs, and binary files.

The corresponding configuration in the training script is:

bash
--input-key prompt
--label-key label
--apply-chat-template
# Each multimodal data type must be explicitly configured to be loaded
--multimodal-keys '{"image":"image_key","audio":"audio_key","video":"video_key"}'

We provide conversion scripts for OpenR1 and AVQA datasets in scripts/tools/:

bash
python scripts/tools/process_openr1.py \
  --input-dir /root/multimodal-open-r1-8k-verified/data/train-00000-of-00001.parquet \
  --output-dir /root/multimodal-open-r1-8k-verified/data/train-00000-of-00001-test.parquet

# --md-dir points to the directory containing image and audio files,
# used to join relative paths into absolute paths.
# If not provided, relative paths are used.
python scripts/tools/process_avqa.py \
  --input-dir /root/AVQA-R1-6K/AVQA_R1/train/omni_rl_format_train.json \
  --output-dir /root/AVQA-R1-6K/AVQA_R1/train/omni_rl_format_train_test.jsonl \
  --md-dir /root/AVQA-R1-6K/AVQA_R1/train

Custom Reward Methods

You can define reward_func(args, sample: Sample, **kwargs) -> float in your own .py file, then add it to your task launch script. See DeepEyes for a concrete example.

bash
--custom-rm-path examples.deepeyes.reward_deepeyes.reward_func
# Custom reward_func may return a dict; if so, specify which key corresponds to the actual reward score
--reward-key score

Custom Generate Function

For multi-turn dialogue, tool calling, or agentic rollout, define a custom generate function to replace the default single-turn logic. The function signature is:

python
from relax.utils.types import Sample
# Required signature
async def generate(args: Any, sample: Sample, sampling_params: dict) -> Sample: ...
# Optional: add evaluation param — framework auto-passes True during eval
async def generate(args: Any, sample: Sample, sampling_params: dict, evaluation: bool = False) -> Sample: ...

The function must populate these sample fields before returning: tokens (full prompt+response token IDs), response (decoded string), response_length, loss_mask (per-token: 1=trainable, 0=skip), rollout_log_probs, and status (Sample.Status.COMPLETED / TRUNCATED etc.).

Example — simplified from examples/deepeyes/rollout.py (multi-turn tool-use rollout):

python
from relax.engine.rollout.sglang_rollout import GenerateState
from relax.utils.http_utils import post

async def generate(args, sample: Sample, sampling_params) -> Sample:
    state = GenerateState(args)
    url = f"http://{args.sglang_router_ip}:{args.sglang_router_port}/generate"
    env = build_env(sample=sample, args=args); env.reset()
    prompt_ids = state.tokenizer.encode(sample.prompt, add_special_tokens=False)
    sample.tokens, sample.loss_mask, sample.rollout_log_probs, response_tokens = list(prompt_ids), [], [], []
    for turn in range(args.max_turns):
        output = await post(url, {"input_ids": sample.tokens, "sampling_params": sampling_params, "return_logprob": True})
        new_tokens = [t[1] for t in output["meta_info"]["output_token_logprobs"]]
        new_probs = [t[0] for t in output["meta_info"]["output_token_logprobs"]]
        sample.tokens.extend(new_tokens); response_tokens.extend(new_tokens)                 # model output
        sample.loss_mask.extend([1] * len(new_tokens)); sample.rollout_log_probs.extend(new_probs)
        observation, done, info = env.step(output["text"])
        if done: break
        obs_ids = state.tokenizer.encode(observation, add_special_tokens=False)
        sample.tokens.extend(obs_ids); response_tokens.extend(obs_ids)                       # env observation
        sample.loss_mask.extend([0] * len(obs_ids)); sample.rollout_log_probs.extend([0.0] * len(obs_ids))
    sample.response = state.tokenizer.decode(response_tokens, skip_special_tokens=False)
    sample.response_length = len(response_tokens)
    sample.status = Sample.Status.COMPLETED
    return sample

Specify via launch script (--custom-generate-function-path examples.deepeyes.rollout.generate), or per eval dataset via custom_generate_function_path in eval config.

Training Script and Key Parameters

For complete parameter reference, see Configuration.

After completing the preparation steps, you can run the training script. Using Qwen3 VL 4B as an example:

bash
cd /root/Relax && \
export MODEL_CONFIG_DIR=$(pwd)/scripts/models && \
bash scripts/training/multimodal/run-qwen3-vl-4B-8xgpu.sh

Model Configuration Parameters

bash
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
source "${MODEL_CONFIG_DIR}/qwen3-vl-4B.sh"

This section provides Megatron with the required hyperparameters. Since Megatron cannot read model configurations directly from checkpoints, they must be specified manually. We provide configuration examples for common models in the scripts/models/ directory. To add a new model, create a configuration file there and source it in your task launch script.

Checkpoint and Path Parameters

bash
CKPT_ARGS=(
  # Used to load tokenizer and other info; model weights from this HF path are not actually used
  --hf-checkpoint ${MODEL_DIR}/Qwen3-VL-4B-Instruct/
  # Reference model checkpoint
  # When --load is not set, this will be used as the initial checkpoint for training
  --ref-load ${MODEL_DIR}/Qwen3-VL-4B-Instruct/
  # Enable megatron bridge automatic weight conversion
  --megatron-to-hf-mode bridge
  # Actor model load path. If empty or no valid checkpoint exists, loads from --ref-load
  # For resuming training, point this to the checkpoint path
  --load /path/checkpoint/
  # Save path for model during training
  --save /path/checkpoint/
  # Model save interval (in steps)
  --save-interval 20
)

Path variable convention

The launcher scripts define three path variables at the top:

  • MODEL_DIR — model paths such as HF weights and --ref-load
  • DATA_DIR — dataset paths such as PROMPT_SET and --eval-prompt-data
  • EXP_DIR — training output paths such as --load / --save

Each one can be overridden independently via environment variable. MODEL_DIR and DATA_DIR fall back to EXP_DIR when unset, so a single export EXP_DIR=/root still drives all three paths.

Data Generation and Training Parameters

bash
# Dataset path
--prompt-data ${PROMPT_SET}
# Number of prompts to sample per round
--rollout-batch-size 32
# Number of responses to generate per prompt
# Multiplied with --rollout-batch-size to determine total samples per round
--n-samples-per-prompt 8
# Number of samples required for one parameter update (optimizer.step)
--global-batch-size 256
# Total number of "sample → train" loop iterations
--num-rollout ${NUM_ROLLOUT}

Message Processing Parameters

bash
# Dataset input key
--input-key prompt
# Dataset label key
--label-key label
# Apply Chat Template if the prompt's input_key is in OpenAI message format
--apply-chat-template
# Reward computation method; this option only supports built-in reward methods
# For custom reward, use --custom-rm-path
--rm-type openr1mm
# Multimodal data extraction keys
--multimodal-keys '{"image":"image"}'
# Custom SYSTEM_PROMPT; inserts a new message at the head of the prompt
--system-prompt ${SYSTEM_PROMPT}

Evaluation Parameters

You can add eval datasets for evaluation. Note that each eval call processes the entire dataset, so keep eval datasets small.

bash
VAL_ARGS=(
  # Evaluation interval (in rollout count)
  --eval-interval 5
  # Evaluation prompt dataset
  --eval-prompt-data aime /root/aime-2024/aime-2024.jsonl
  # Number of samples per evaluation prompt
  --n-samples-per-eval-prompt 16
  # Maximum response length during evaluation
  --eval-max-response-len 16384
  # Sampling parameters during evaluation
  --eval-top-p 0.7
)

Monitoring and Dump

bash
# Enable ClearML
--use-clearml
# Enable TensorBoard
--use-tensorboard
# Enable centralized metrics collection and reporting service
--use-metrics-service
# TensorBoard/ClearML storage path
--tb-project-name ${PROJECT_NAME}
# TensorBoard/ClearML storage name
--tb-experiment-name name
# Dump per-step rollout and training details to a specified directory
--dump-details /path

Parallelism and Performance Tuning

bash
# Training parallelism
--tensor-model-parallel-size 2
--sequence-parallel
--pipeline-model-parallel-size 1
--expert-model-parallel-size 8
# Recomputation
--recompute-granularity full
--recompute-method uniform
--recompute-num-layers 1
# CPU offload optimizer
--optimizer-cpu-offload
--overlap-cpu-optimizer-d2h-h2d
--use-precision-aware-optimizer
# Inference
--rollout-num-gpus-per-engine 2 # sglang tp
--sglang-mem-fraction-static 0.8
# Enables dynamic batching. When enabled, --micro-batch-size is ignored.
--use-dynamic-batch-size
# Maximum number of tokens processed per GPU. 
# When dynamic batching (use_dynamic_batch_size) is enabled, the system intelligently packs samples of varying lengths
# so that each micro-batch's total token count approaches this limit, improving training efficiency.
# If a single sample exceeds this value, it will form its own batch.
--max-tokens-per-gpu 9216

Ray Launch Command

bash
ray job submit --address="http://127.0.0.1:8265" \
  -- python3 relax/entrypoints/train.py \
  # [1, 8] represent replica count and total GPU count respectively; set replicas to 1
  # Resources are partitioned via _derive_cluster_args_from_resource
  --resource '{"actor": [1, 8], "rollout": [1, 8]}' \
  --max-staleness 0 \
  --num-data-storage-units 1 \
  --colocate \
  # Other parameters expanded below

Multi-Node Launch

Relax provides two multi-node launch methods: SPMD Multi-Node Mode (self-built Ray cluster) and Ray Job Mode (existing Ray cluster).

Method 1: SPMD Multi-Node Mode

Suitable for launching a Ray cluster from scratch on bare-metal or container environments and running training. The script automatically distinguishes between Head and Worker nodes, forms a cluster, and submits the training task on the Head node.

Required environment variables (must be set on each machine):

VariableDescriptionExample
MASTER_ADDRHostname of the Head nodenode-0
POD_NAMEHostname of the current nodenode-0 / node-1
HOST_IPIP address of the current node<node-ip>
WORLD_SIZETotal number of nodes (default 2)2
NUM_GPUSGPUs per node (default 8)8

Run the same command on every machine:

bash
bash scripts/entrypoint/spmd-multinode.sh scripts/training/multimodal/run-qwen3-30B-A3B-omni-16xgpu.sh

The script determines roles automatically based on MASTER_ADDR == POD_NAME:

  • Head node: Starts Ray Head → waits for all Workers to join → executes training script
  • Worker node: Joins the Ray cluster → blocks until training completes

Method 2: Ray Job Mode

Suitable when the Ray cluster is already managed by an external platform (e.g., KubeRay). The script does not start or stop Ray; it only cleans up residual processes and runs training directly.

Prerequisites:

  • Ray cluster is running and the current node can connect via ray status
  • The script automatically obtains MASTER_ADDR from the Ray cluster
bash
bash scripts/entrypoint/ray-job.sh scripts/training/multimodal/run-qwen35-9B-8xgpu-async.sh

Comparison of the Two Methods

SPMD Multi-NodeRay Job
Ray Cluster ManagementSelf-built by script (Head + Worker)Externally managed (KubeRay, etc.)
Must run on each machineYesNo (submit node only)
Use CaseBare-metal / container SPMD schedulingExisting Ray cluster
Entry Scriptscripts/entrypoint/spmd-multinode.shscripts/entrypoint/ray-job.sh

Next Steps

Custom Experiments

  1. Modify launch scripts: Edit shell scripts in examples/ or scripts/
  2. Switch models: Update the --hf-checkpoint parameter to point to your model
  3. Tune training: Modify optimizer, learning rate, and batch size parameters
  4. Custom rewards: Implement custom reward functions (see DeepEyes example)

Explore Examples

Learn Core Concepts

Getting Help

Released under the Apache 2.0 License.