GenRM Service API

The GenRM (Generative Reward Model) service provides LLM-based response evaluation. It runs as a Ray Serve deployment with a FastAPI ingress.

Overview

Property	Value
Module	`relax.components.genrm`
Deployment	`@serve.deployment(logging_config=...)`
Ingress	FastAPI

Unlike Actor and Rollout, GenRM is a passive HTTP service: it does not run a background loop and only responds to incoming /generate requests.

The service uses SGLang engines to perform preference evaluation.

When colocated with the Actor (sharing GPU resources), GenRM supports offload/onload operations:

Offload: Releases GPU memory before Actor training
Onload: Loads model weights back to GPU before rollout
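
The ordering of the two operations within a training step can be sketched as follows. This is a minimal illustration only: `GenRMStub` and `run_training_step` are hypothetical names invented for this example, not part of `relax.components.genrm`, and the stub tracks a flag instead of touching GPU memory.

```python
class GenRMStub:
    """Hypothetical stand-in for a colocated GenRM handle (not the real API)."""

    def __init__(self):
        self.on_gpu = True   # assume the service starts resident on the GPU
        self.events = []     # records state transitions for inspection

    def offload(self):
        # Release GPU memory so the Actor can train; no-op if already offloaded.
        if self.on_gpu:
            self.on_gpu = False
            self.events.append("offload")

    def onload(self):
        # Bring GenRM back onto the GPU before rollout; no-op if already loaded.
        if not self.on_gpu:
            self.on_gpu = True
            self.events.append("onload")


def run_training_step(genrm, rollout, train_actor):
    """One step: onload before rollout, offload before Actor training."""
    genrm.onload()
    samples = rollout()          # GenRM is resident and can score responses
    genrm.offload()
    return train_actor(samples)  # Actor trains with GenRM's memory released
```

Making both operations idempotent keeps the loop safe to call on the first step, when GenRM is assumed to be resident already.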

Two colocate sub-modes are auto-detected from the GPU allocation:

Split (rollout_num_gpus + genrm_num_gpus == actor_total_gpus): GenRM and Rollout occupy disjoint bundles.
Shared (rollout_num_gpus == genrm_num_gpus == actor_total_gpus): GenRM and Rollout occupy the same bundles, splitting each GPU's memory via SGLang mem_fraction_static. GenRM reads its mem_fraction_static from --genrm-engine-config. GenRM never sources weights from the Actor; onload only resumes its KV cache and CUDA graphs.
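
The two conditions above can be expressed as a small classifier. This is a sketch of the stated rules, not the project's actual detection code: `ColocateMode` and `detect_colocate_mode` are hypothetical names, and raising an error for allocations that match neither rule is an assumption.

```python
from enum import Enum


class ColocateMode(Enum):
    SPLIT = "split"    # GenRM and Rollout occupy disjoint bundles
    SHARED = "shared"  # GenRM and Rollout occupy the same bundles


def detect_colocate_mode(rollout_num_gpus: int,
                         genrm_num_gpus: int,
                         actor_total_gpus: int) -> ColocateMode:
    """Classify a GPU allocation using the two rules described above."""
    counts = (rollout_num_gpus, genrm_num_gpus, actor_total_gpus)
    if any(n <= 0 for n in counts):
        raise ValueError("GPU counts must be positive")
    # Shared: both services span every GPU the Actor uses.
    if rollout_num_gpus == genrm_num_gpus == actor_total_gpus:
        return ColocateMode.SHARED
    # Split: the two services partition the Actor's GPUs between them.
    if rollout_num_gpus + genrm_num_gpus == actor_total_gpus:
        return ColocateMode.SPLIT
    # Assumption: anything else is treated as an invalid colocate allocation.
    raise ValueError(f"allocation {counts} matches neither split nor shared")
```

For example, with 8 Actor GPUs, an allocation of 4 Rollout and 4 GenRM GPUs is classified as split, while 8 and 8 is classified as shared. For positive counts the two rules cannot both hold, so the order of the checks does not affect the result.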

See the GenRM example for the full configuration.