What this is

Service-mode RLOR jobs provision GPU-backed trainer endpoints that your custom Python loop connects to via the Tinker SDK. In the current training SDK, job lifecycle is handled by TrainerJobManager.

Creating a service-mode trainer job

Use TrainerJobManager.create_and_wait(...) with TrainerJobConfig:
import os
from fireworks.training.sdk import TrainerJobManager, TrainerJobConfig

api_key = os.environ["FIREWORKS_API_KEY"]
account_id = os.environ.get("FIREWORKS_ACCOUNT_ID", "")
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")

rlor_mgr = TrainerJobManager(
    api_key=api_key,
    account_id=account_id,
    base_url=base_url,
)

endpoint = rlor_mgr.create_and_wait(TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    lora_rank=0,
    max_context_length=4096,
    learning_rate=1e-5,
    gradient_accumulation_steps=4,
    display_name="grpo-policy-trainer",
    hot_load_deployment_id="my-serving-deployment",
))

# Ready-to-use trainer endpoint
print(endpoint.base_url)  # https://<trainer-endpoint>
print(endpoint.job_id)    # <JOB_ID>
print(endpoint.job_name)  # accounts/<ACCOUNT_ID>/rlorTrainerJobs/<JOB_ID>

Create parameters reference

Parameter | Type | Description
--------- | ---- | -----------
base_model | str | Required. Base model for the trainer.
lora_rank | int | 0 for full-parameter training, >0 for LoRA.
max_context_length | int | Max sequence length.
learning_rate | float | Learning rate for trainer-side optimizer state.
gradient_accumulation_steps | int | Number of micro-batches before optimizer step.
node_count | int | Number of trainer nodes.
hot_load_deployment_id | str | Link this trainer to a deployment for checkpoint uploads and hotloading.
display_name | str | Human-readable name for the job.
region | str | Region override (optional).
custom_image_tag | str | Trainer image tag override (optional).
extra_args | list[str] | Extra trainer args (for example --forward-only).
accelerator_type / accelerator_count | str / int | Accelerator overrides (optional).
skip_validations | bool | Bypass control-plane validation checks.
forward_only | bool | Mark the job as forward-only.

Inspecting a job

status = rlor_mgr.get(job_id=endpoint.job_id)
print(status["state"])              # JOB_STATE_RUNNING
print(status["directRouteHandle"])  # Trainer endpoint URL

Job states

State | Meaning
----- | -------
JOB_STATE_CREATING | Resources being provisioned
JOB_STATE_PENDING | Queued, waiting for GPU availability
JOB_STATE_RUNNING | Trainer is ready; you can connect a Tinker client
JOB_STATE_IDLE | Service-mode job is idle (no active training)
JOB_STATE_COMPLETED | Job finished successfully
JOB_STATE_FAILED | Job failed
JOB_STATE_CANCELLED | Job was cancelled
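If you are managing readiness yourself rather than relying on create_and_wait, the state machine above can be polled by hand. The following is a minimal sketch, not part of the SDK: poll_job_state and its parameters are hypothetical, and get_status stands in for any zero-argument callable such as `lambda: rlor_mgr.get(job_id=endpoint.job_id)`.

```python
import time

# States after which further polling is pointless.
TERMINAL_STATES = {"JOB_STATE_COMPLETED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}


def poll_job_state(get_status, target="JOB_STATE_RUNNING",
                   poll_interval_s=5.0, timeout_s=600.0, sleep=time.sleep):
    """Poll `get_status()` until the job reaches `target` or a terminal state.

    `get_status` is any zero-arg callable returning the job status dict
    (with a "state" key, as returned by rlor_mgr.get).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_status()["state"]
        if state == target:
            return state
        if state in TERMINAL_STATES:
            raise RuntimeError(f"job entered terminal state {state}")
        sleep(poll_interval_s)
    raise TimeoutError(f"job did not reach {target} within {timeout_s}s")
```

The `sleep` parameter is injected only so the loop can be exercised without real delays; in practice the defaults suffice.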

Waiting for readiness

create_and_wait(...) and wait_for_existing(...) already block until the endpoint is healthy. For an existing job:
existing = rlor_mgr.wait_for_existing(job_id="<existing-job-id>")
print(existing.base_url)

Resuming a job

endpoint = rlor_mgr.resume_and_wait(job_id="<job-id>")
print(endpoint.base_url)

Reconnecting after preemption

reconnect_and_wait handles pod preemption and transient failures. It waits for the job to reach a resumable state (tolerating transitional states like CREATING or DELETING), resumes it, then polls until the endpoint is healthy:
endpoint = rlor_mgr.reconnect_and_wait(
    job_id="<job-id>",
    timeout_s=600,
    max_wait_for_resumable_s=120,
)
print(endpoint.base_url)
This is more robust than resume_and_wait() because it retries when the job is in a transitional state (e.g. the control plane is still processing the pod death).

Parameters

Parameter | Type | Default | Description
--------- | ---- | ------- | -----------
job_id | str | required | The RLOR job ID to reconnect
poll_interval_s | float | 5.0 | Seconds between health checks after resume
timeout_s | float | 600 | Overall timeout for the job to become RUNNING
max_wait_for_resumable_s | float | 120 | Max seconds to wait for a resumable state (FAILED/CANCELLED/PAUSED/COMPLETED)
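The two-phase flow described above can be sketched roughly as follows. This is a hypothetical illustration of the logic, not the SDK's actual implementation: get_state, resume, and is_healthy stand in for rlor_mgr.get, the resume call, and the endpoint health check.

```python
import time

# States from which a job can be resumed; anything else (CREATING,
# DELETING, ...) is treated as transitional and tolerated.
RESUMABLE = {"JOB_STATE_FAILED", "JOB_STATE_CANCELLED",
             "JOB_STATE_PAUSED", "JOB_STATE_COMPLETED"}


def reconnect(get_state, resume, is_healthy,
              poll_interval_s=5.0, timeout_s=600.0,
              max_wait_for_resumable_s=120.0, sleep=time.sleep):
    """Wait for a resumable state, resume the job, then poll until healthy."""
    # Phase 1: wait out transitional states until the job settles
    # into something resumable.
    deadline = time.monotonic() + max_wait_for_resumable_s
    while get_state() not in RESUMABLE:
        if time.monotonic() >= deadline:
            raise TimeoutError("job never reached a resumable state")
        sleep(poll_interval_s)

    # Phase 2: resume, then poll the endpoint health until it is ready.
    resume()
    deadline = time.monotonic() + timeout_s
    while not is_healthy():
        if time.monotonic() >= deadline:
            raise TimeoutError("endpoint did not become healthy in time")
        sleep(poll_interval_s)
```

Separating the two timeouts mirrors the parameters in the table: one bounds how long a pod death may take to surface as a resumable state, the other bounds the restart itself.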

Deleting a job

Always clean up trainer jobs when done to release GPU resources:
rlor_mgr.delete(job_id="<job-id>")
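To guarantee cleanup even when your training loop raises, wrap the job in a try/finally or a small context manager. A sketch, assuming the rlor_mgr and config objects shown earlier; managed_trainer_job itself is hypothetical, not part of the SDK:

```python
from contextlib import contextmanager


@contextmanager
def managed_trainer_job(mgr, config):
    """Create a trainer job and delete it on exit, even on error."""
    endpoint = mgr.create_and_wait(config)
    try:
        yield endpoint
    finally:
        mgr.delete(job_id=endpoint.job_id)


# Usage:
# with managed_trainer_job(rlor_mgr, TrainerJobConfig(...)) as endpoint:
#     run_training_loop(endpoint.base_url)  # your Tinker-driven loop
```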

Resolving training shapes

Training shapes bundle region, accelerator, image tag, node count, and sharding config into a single resolved profile:
profile = rlor_mgr.resolve_training_profile("ts-qwen3-8b-policy")
print(profile.accelerator_type)      # e.g. "NVIDIA_B200_192GB"
print(profile.trainer_image_tag)     # e.g. "0.0.0-dev-..."
print(profile.node_count)            # e.g. 1
print(profile.pipeline_parallelism)  # e.g. 1
The returned TrainingShapeProfile contains: training_shape_version, trainer_image_tag, max_supported_context_length, node_count, deployment_shape_version, deployment_image_tag, accelerator_type, accelerator_count, base_model_weight_precision, pipeline_parallelism.
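A resolved profile can then seed the job config so infra fields stay in sync with the shape. A sketch: config_kwargs_from_profile is a hypothetical helper, and the mapping from profile fields onto TrainerJobConfig parameters is an assumption based on the create-parameters table above.

```python
def config_kwargs_from_profile(profile, **overrides):
    """Map a resolved TrainingShapeProfile onto TrainerJobConfig kwargs."""
    kwargs = {
        "max_context_length": profile.max_supported_context_length,
        "node_count": profile.node_count,
        "accelerator_type": profile.accelerator_type,
        "accelerator_count": profile.accelerator_count,
        "custom_image_tag": profile.trainer_image_tag,
    }
    kwargs.update(overrides)  # e.g. base_model, learning_rate, lora_rank
    return kwargs


# endpoint = rlor_mgr.create_and_wait(TrainerJobConfig(
#     base_model="accounts/fireworks/models/qwen3-8b",
#     **config_kwargs_from_profile(profile, learning_rate=1e-5),
# ))
```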

Operational guidance

  • Service mode supports both full-parameter and LoRA tuning. Set lora_rank=0 for full-parameter or a positive integer for LoRA.
  • Set hot_load_deployment_id when you plan to hotload checkpoints onto a deployment. This configures the checkpoint upload path.
  • Clean up jobs when your experiment is done — trainer jobs hold GPU resources.
  • Use display_name to identify jobs in logs and in the Fireworks console.
  • Use reconnect_and_wait for long-running experiments where pod preemption is possible. It handles transitional states and auto-resumes.
  • Use training shapes (resolve_training_profile) to auto-populate infra config instead of manually setting region, accelerator, and image tag.