Quick Start
This guide provides three end-to-end examples covering text, vision-language, and omni-modal training tasks. Each example includes data preparation, model download, and training launch commands.
Make sure you have completed the Installation steps before proceeding.
Task 1: DAPO Math (Text)
Train Qwen3-4B on the dapo-math-17k math reasoning dataset using GRPO with 8 GPUs.
Data Preparation
Download the training dataset and (optionally) the evaluation dataset:
# Download training dataset (dapo-math-17k)
hf download --repo-type dataset zhuzilin/dapo-math-17k \
--local-dir /root/dapo-math-17k
# Download evaluation dataset (aime-2024)
hf download --repo-type dataset zhuzilin/aime-2024 \
--local-dir /root/aime-2024
# Add math instruction prefix to the aime evaluation dataset (in-place)
python scripts/tools/process_aime.py --input /root/aime-2024/aime-2024.jsonlThe training dataset is in .jsonl format and can be used directly — no conversion required. The evaluation dataset needs the above script to prepend an instruction prefix that guides the model to produce answers in \boxed{} format.
Model Download
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4BLaunch Training
Since the model and dataset are both downloaded to the /root directory, you only need to set EXP_DIR=/root and the script will automatically locate them — no manual script editing required.
Persistent Storage
Make sure the /root directory is mounted to persistent storage on the host, otherwise data will be lost when the container is destroyed. See the Docker mount instructions in the Installation Guide.
cd /root/Relax
export EXP_DIR=/root
# Single node
bash scripts/training/text/run-qwen3-4B-8xgpu.sh
# Multi-node
bash -x scripts/entrypoint/spmd-multinode.sh scripts/training/text/run-qwen3-4B-8xgpu.shReward Method
This task uses the built-in dapo reward type (see relax/engine/rewards/math_dapo_utils.py). It employs rule-based answer extraction and symbolic math verification to evaluate correctness, assigning 1.0 for correct answers and 0.0 for incorrect ones.
Task 2: Open-R1 (Vision-Language)
Train Qwen3-VL-4B on the multimodal-open-r1-8k-verified image + text dataset using GRPO with 8 GPUs.
Data Preparation
Download the dataset and convert it to the Relax format:
# Download dataset
hf download --repo-type dataset lmms-lab/multimodal-open-r1-8k-verified \
--local-dir /root/multimodal-open-r1-8k-verified
# Convert to Relax format
python scripts/tools/process_openr1.py \
--input-dir /root/multimodal-open-r1-8k-verified/data/train-00000-of-00001.parquet \
--output-dir /root/multimodal-open-r1-8k-verified/data/train-00000-of-00001-converted.parquetThe conversion script reads the raw parquet, extracts problem, image, and solution fields, and produces a new parquet with prompt, image, and label columns in the format expected by Relax.
Model Download
hf download Qwen/Qwen3-VL-4B-Instruct --local-dir /root/Qwen3-VL-4B-InstructLaunch Training
Since the model and dataset are both downloaded to the /root directory, you only need to set EXP_DIR=/root and the script will automatically locate them — no manual script editing required.
cd /root/Relax
export EXP_DIR=/root
# Single node
bash scripts/training/multimodal/run-qwen3-vl-4B-8xgpu.sh
# Multi-node
bash -x scripts/entrypoint/spmd-multinode.sh scripts/training/multimodal/run-qwen3-vl-4B-8xgpu.shReward Method
This task uses the built-in openr1mm reward type (see relax/engine/rewards/openr1mm.py). It extracts the final answer from <answer>...</answer> tags in both the ground-truth and model output. Symbolic verification via math_verify is attempted first; string matching serves as a fallback.
Task 3: AVQA (Omni-Modal: Image + Audio)
Train Qwen3-Omni-30B-A3B on the AVQA-R1-6K image + audio question answering dataset using GRPO with 16 GPUs (2 nodes).
Data Preparation
Download the dataset and convert it to the Relax format:
# Download dataset
hf download --repo-type dataset harryhsing/AVQA-R1-6K \
--local-dir /root/AVQA-R1-6K
# Convert to Relax format
# --md-dir points to the directory containing image and audio files,
# used to join relative paths into absolute paths (Optional, default to use relative paths).
python scripts/tools/process_avqa.py \
--input-dir /root/AVQA-R1-6K/AVQA_R1/train/omni_rl_format_train.json \
--output-dir /root/AVQA-R1-6K/AVQA_R1/train/omni_rl_format_train_convert.jsonl \
--md-dir /root/AVQA-R1-6K/AVQA_R1/train
python scripts/tools/process_avqa.py \
--input-dir /root/AVQA-R1-6K/AVQA_R1/valid/omni_rl_format_valid.json \
--output-dir /root/AVQA-R1-6K/AVQA_R1/valid/small_valid.jsonl \
--md-dir /root/AVQA-R1-6K/AVQA_R1/validThe conversion script reads the raw JSON, extracts problem, options, image, and audio fields, and produces a .jsonl file with prompt, image, audio, and label columns.
Model Download
hf download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir /root/Qwen3-Omni-30B-A3B-Instruct
# Qwen3-Omni ships its chat_template in a standalone chat_template.json that
# AutoTokenizer does not auto-load. Merge it into tokenizer_config.json (skipped if already present).
python -c "import json,sys; m=sys.argv[1]; p=f'{m}/tokenizer_config.json'; tc=json.load(open(p)); ('chat_template' in tc) or (tc.update(chat_template=json.load(open(f'{m}/chat_template.json'))['chat_template']) or json.dump(tc, open(p,'w'), indent=2, ensure_ascii=False))" /root/Qwen3-Omni-30B-A3B-InstructLaunch Training
Since the model and dataset are both downloaded to the /root directory, you only need to set EXP_DIR=/root and the script will automatically locate them — no manual script editing required.
cd /root/Relax
export EXP_DIR=/root
# Single node (requires 16 GPUs on a single machine)
bash scripts/training/multimodal/run-qwen3-30B-A3B-omni-16xgpu.sh
# Multi-node (recommended: 2 nodes × 8 GPUs)
bash -x scripts/entrypoint/spmd-multinode.sh scripts/training/multimodal/run-qwen3-30B-A3B-omni-16xgpu.shReward Method
This task uses the built-in multiple_choice reward type (see relax/engine/rewards/multiple_choice.py). It extracts the answer from <answer>...</answer> tags and performs exact string matching against the ground truth, assigning 1.0 for correct and 0.0 for incorrect.
Task 4: NextQA (Video)
Train the Qwen3-Omni-30B-A3B model using the GRPO algorithm on the TinyLLaVA-Video-R1-NextQA video question-answering dataset with 16 GPUs (2 nodes).
Data Preparation
Download the dataset and convert it to Relax format:
# Download the dataset
hf download --repo-type dataset Zhang199/TinyLLaVA-Video-R1-training-data \
--local-dir /root/NextQA
# Unzip video files
unzip /root/NextQA/NextQA.zip -d /root/NextQA
# Convert to Relax format
python scripts/tools/process_nextqa.py \
--input-dir /root/NextQAThe conversion script reads the original JSON file, extracts the question, options, and video fields, and generates a .jsonl file containing prompt, video, and label columns.
Model Download
hf download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir /root/Qwen3-Omni-30B-A3B-Instruct
# Qwen3-Omni ships its chat_template in a standalone chat_template.json that
# AutoTokenizer does not auto-load. Merge it into tokenizer_config.json (skipped if already present).
python -c "import json,sys; m=sys.argv[1]; p=f'{m}/tokenizer_config.json'; tc=json.load(open(p)); ('chat_template' in tc) or (tc.update(chat_template=json.load(open(f'{m}/chat_template.json'))['chat_template']) or json.dump(tc, open(p,'w'), indent=2, ensure_ascii=False))" /root/Qwen3-Omni-30B-A3B-InstructLaunch Training
Since both the model and dataset are downloaded to the /root directory, simply set EXP_DIR=/root and the script will automatically locate the corresponding paths without manually editing the script.
cd /root/Relax
export EXP_DIR=/root
# Single node (requires 16 GPUs on a single machine)
bash scripts/training/multimodal/run-qwen3-30B-A3B-omni-16xgpu-video.sh
# Multi-node (recommended: 2 nodes × 8 GPUs)
bash -x scripts/entrypoint/spmd-multinode.sh scripts/training/multimodal/run-qwen3-30B-A3B-omni-16xgpu-video.shReward Method
This task uses the built-in multiple_choice reward type (see relax/engine/rewards/multiple_choice.py). It extracts the answer from <answer>...</answer> tags and performs exact string matching against the ground truth, assigning 1.0 for correct and 0.0 for incorrect.
Verifying Training Progress
After launching any of the above tasks, you should see logs like:
Finish rollout 0/200
training step 0/200This indicates training is running successfully.
Export Model
Checkpoints saved by Relax are in Megatron DCP format. To convert them to Hugging Face weight format, use the convert_torch_dist_to_hf_bridge.py script:
# Ensure PYTHONPATH includes Megatron-LM and Relax
export PYTHONPATH=${MEGATRON}:${RELAX}:$PYTHONPATH
python scripts/tools/convert_torch_dist_to_hf_bridge.py \
--input-dir /path/to/dcp_checkpoint \
--output-dir /path/to/hf_output \
--origin-hf-dir /path/to/original_hf_modelParameter descriptions:
| Parameter | Description |
|---|---|
--input-dir | Megatron DCP format checkpoint directory |
--output-dir | Output directory for converted HF weights |
--origin-hf-dir | Original HF model directory (used to read model config) |
--force | Optional, force overwrite if output directory already exists |
Next Steps
- Customize Training — Learn how to customize training scripts, parameters, reward functions, and multi-node launch
- Configuration — Full parameter reference
- Architecture — Understand the system design
