Hybrid Training Mode
Overview
Hybrid mode is a third execution mode in Relax that sits between Colocate (Sync) and Fully Async. It combines:
- the streaming data pipeline of Fully Async (TransferQueue +
max-stalenessfor off-policy tolerance), with - the in-process weight sharing of Colocate (TensorBackuper +
_switch_model, so ref / actor_fwd / advantages all run on the actor's own GPUs).
Concretely, Actor and Rollout still run on separate GPU placement groups (like Fully Async), but the actor no longer ships weights to standalone ActorFwd / Reference / Advantages services. Instead it cycles a single set of weights between actor, ref, old_actor, and teacher tags via a CPU/GPU TensorBackuper, computing every forward pass locally and pushing weights to rollout through the sync UpdateWeightFromTensor path.
Mode Comparison
| Dimension | Colocate (Sync) | Fully Async | Hybrid |
|---|---|---|---|
| GPU layout | Actor and Rollout time-share same GPUs | Actor / Rollout / ActorFwd / Reference each have own GPUs | Actor and Rollout on separate GPUs; ref / actor_fwd / adv share actor's GPUs |
| Data pipeline | TransferQueue, batch-synchronous | TransferQueue + StreamingDataLoader, fully streaming | TransferQueue + sub-batch streaming (num-iters-per-train-update) |
| Weight sync | In-process tensor copy | NCCL broadcast via DCS (Checkpoint Engine) | Sync UpdateWeightFromTensor to rollout; TensorBackuper for ref/actor_fwd |
| Staleness | max_staleness = 0 (strict on-policy) | Configurable max_staleness | Configurable max_staleness |
| Roles deployed | actor, critic, rollout | actor, critic, rollout, advantages, reference, actor_fwd | actor, critic, rollout (same as Colocate; ref/actor_fwd live inside actor) |
--balance-data | Supported | Not supported | Supported (one of hybrid's reasons to exist) |
When to Use Hybrid
Pick Hybrid when:
- You want the throughput benefits of dedicated rollout GPUs and pipelined data flow, but
- Your model is large enough that running independent ref / actor_fwd services would waste GPUs, or
- You need
--balance-data(load-balanced micro-batching across DP ranks), which pure Fully Async cannot provide.
Pick Fully Async when you have spare GPUs for separate ref / actor_fwd / advantages services and want true cross-step pipelining.
Pick Colocate when GPU count is tight and you can tolerate serial rollout → train cycles.
Architecture
Role Layout
Hybrid uses the same role set as Colocate — only actor, critic (optional), and rollout are deployed as Ray Serve services. The decision lives in relax/core/registry.py:
def process_role(config):
if config.hybrid:
# hybrid mode: actor handles ref/actor_fwd internally
# via _switch_model, only need actor + rollout services
return ROLES_COLOCATE
if config.fully_async:
...But unlike Colocate, the actor and rollout placement groups are disjoint, matching Fully Async semantics. From relax/core/controller.py:
if colocate and not self.config.hybrid:
# Sync colocate: actor and rollout share GPUs via time-sharing (offload/onload)
actor_rollout_pgs = create_placement_group(num_gpus=num_gpus)
else:
# fully_async (pure or hybrid): actor and rollout use separate GPUs
actor_rollout_pgs = NoneFlag Resolution
--hybrid is the only public switch. relax/utils/arguments.py resolves it into the two underlying flags downstream machinery already understands:
if args.hybrid:
args.fully_async = True
args.colocate = TruePassing --fully-async --colocate directly is rejected; use --hybrid instead. This single-switch design keeps args.hybrid as the canonical hybrid-only branch in the registry, controller dispatch, and train_hybrid call site.
Diagram
┌────────────────────────────────────────────────────────────────────────────┐
│ Controller (Orchestrator) │
│ relax/core/controller.py │
│ │
│ ┌───────────────────────────────────┐ ┌────────────────────┐ │
│ │ Actor Service │ │ Rollout Service │ │
│ │ (own placement group, N GPUs) │ │ (own PG, M GPUs) │ │
│ │ │ │ SGLang engines │ │
│ │ ┌────────────────────────────┐ │ └─────────┬──────────┘ │
│ │ │ TensorBackuper │ │ ▲ │
│ │ │ tags: actor / ref / │ │ │ │
│ │ │ old_actor / teacher │ │ │ │
│ │ │ _switch_model(tag) swaps │ │ │ │
│ │ │ weights on the same GPUs │ │ │ │
│ │ └────────────────────────────┘ │ │ │
│ │ train_hybrid(): │ │ │
│ │ ├─ ref forward (switch:ref) │ │ │
│ │ ├─ actor forward (switch:actor)│ │
│ │ ├─ advantages (in-process) │ │ │
│ │ └─ train (switch:actor)│ │
│ └──────────────┬────────────────────┘ │ │
│ │ UpdateWeightFromTensor (sync) ───┘ │
└──────────────────────┼─────────────────────────────────────────────────────┘
▼
┌───────────────────────────────────────────────────────────────────────────┐
│ TransferQueue (Data Plane) │
│ Rollout writes train_N partition incrementally ──► Actor consumes │
│ in sub-batches via get_meta(batch_size, batch_index) with max-staleness │
└───────────────────────────────────────────────────────────────────────────┘The train_hybrid Loop
relax/backends/megatron/actor.py:708 implements the hybrid training step in three phases:
Collect sub-batches and compute forward log-probs (small memory footprint)
The global batch is split into
num_iters_per_train_updatesub-batches. For each sub-batch the actor:- pulls data from TransferQueue (
_get_data_from_transfer_queue("train", rollout_id, fields, batch_size, batch_index)) - runs
_switch_model("ref")(if ref weights are backed up) and computes ref log-probs - runs
_switch_model("teacher")(if OPD teacher weights are backed up) and computes teacher log-probs - runs
_switch_model("old_actor" or "actor")and computes current actor log-probs - appends the enriched sub-batch to an in-memory list
- pulls data from TransferQueue (
Merge sub-batches and compute advantages globally
All sub-batch dicts are concatenated into one
rollout_data, thencompute_advantages_and_returns(self.args, rollout_data)runs once over the merged batch. This is the key correctness reason for the two-phase design — advantage normalization must see the full DP-group batch, not per-sub-batch slices.Train on the merged batch and push weights
A single
train(...)call runs the optimizer step on the merged batch. Afterwards the actor backs up the new weights to theactortag and (on the ref-update interval) refreshes thereftag, then callsself.update_weights()to push the updated weights to rollout viaUpdateWeightFromTensor.
The sub-batched forward keeps peak activation memory bounded — matching Fully Async behavior — while the merged training step preserves Colocate-style global statistics.
Configuration
Required Flags
| Flag | Purpose |
|---|---|
--hybrid | Enable hybrid mode (resolves to fully_async=True, colocate=True internally) |
--resource '{...}' | Declare actor and rollout placement groups separately, e.g. {"actor":[1,4],"rollout":[1,4]} |
--num-iters-per-train-update | Number of sub-batches per global batch (larger → smaller peak memory, more TQ polls) |
--max-staleness | Off-policy budget (0 = strict on-policy, >0 allows staleness) |
Optional but Common
| Flag | Notes |
|---|---|
--balance-data | Supported in hybrid (rejected in pure fully-async). Enable for DP load balancing. |
--num-data-storage-units | Number of TransferQueue storage actors. |
--use-streaming-dataset | Stream prompts from disk instead of loading into memory. |
--ref-update-interval | Periodically refresh the cached ref weights from the latest actor weights. |
Default Overrides
When --hybrid is set, relax/utils/arguments.py defaults the following (unless the user passes them explicitly):
offload_train = Falseandoffload_rollout = False— actor and rollout are on separate GPUs, so no offload neededcompute_advantages_and_returns = True— actor must compute advantages internallyfully_async = True,colocate = True— derived from--hybrid
WARNING
--balance-data requires --hybrid if you also want a streaming pipeline. The combination --fully-async --balance-data (without --hybrid) is rejected at argument parse time.
Quick Start
A reference launch script for an 8-GPU multimodal hybrid run lives at scripts/training/multimodal/run-qwen35-9B-8xgpu-openr1mm-hybrid-async.sh.
The hybrid invocation it builds:
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json="${RUNTIME_ENV_JSON}" \
-- python3 -m relax.entrypoints.train \
--resource '{"actor": [1, 4], "rollout": [1, 4]}' \
--max-staleness 2 \
--num-data-storage-units 1 \
--num-iters-per-train-update 8 \
--balance-data \
--hybrid \
"${MODEL_ARGS[@]}" \
"${CKPT_ARGS[@]}" \
"${ROLLOUT_ARGS[@]}" \
"${OPTIMIZER_ARGS[@]}" \
"${GRPO_ARGS[@]}" \
"${PERF_ARGS[@]}" \
"${SGLANG_ARGS[@]}" \
"${MISC_ARGS[@]}"Key points in this configuration:
- 8 total GPUs split 4 + 4 between actor and rollout
max-staleness 2— actor may consume rollout output up to 2 steps behind the freshest weightsnum-iters-per-train-update 8— each global batch is split into 8 sub-batches for forward passesbalance-data— DP load balancing enabled- GRPO algorithm with
--use-kl-lossand--use-tis(these are algorithm flags, orthogonal to hybrid)
Troubleshooting
train_hybrid(rollout_id=N) batch_index=K stalled for ... seconds
This warning fires in relax/backends/megatron/actor.py when the actor's TransferQueue poll for the next sub-batch keeps returning empty while the partition is not marked all_consumed. Typical causes:
- Rollout under-filled this partition (dropped samples without refilling).
- Rollout is paused on a health-check failure or restart.
- Staleness budget exhausted: rollout cannot produce new data because it is waiting for fresh weights.
Check rollout-side logs and partition status before assuming a code bug.
--balance-data is not supported in pure fully-async mode
You passed --fully-async --balance-data without --hybrid. Either drop --balance-data or switch to --hybrid, which supports DP-balanced data.
Rollout sees stale weights for a long time
Hybrid uses the sync UpdateWeightFromTensor path at the end of each train_hybrid call. If you see large weight-update gaps, check:
update_weights()timing in actor logs- Whether rollout health-checks are paging the actor (
_check_services_health()is called before weight sync)
Next Steps
Planned follow-ups for hybrid mode:
- Integrate DCS for weight sync — replace the current synchronous
UpdateWeightFromTensorpath with the Distributed Checkpoint Service so weight broadcast to rollout can overlap with the next training iteration, closing the remaining sync gap at the end of everytrain_hybridcall. - Split
train_actorintonum_iters_per_train_updateiterations — todaynum_iters_per_train_updateonly chunks the forward phase; the merged training step still runs once on the full global batch. Extend the actor train step to also iteratenum_iters_per_train_updatetimes so optimizer updates can be pipelined with TransferQueue consumption and peak training-side memory drops further.
Related docs:
- Fully Async Training Pipeline — the streaming-data engine hybrid borrows
- Architecture — overview of Relax's service layering
- Update Weights Pipeline — how
UpdateWeightFromTensorand DCS differ
