Overview

TrainerJobManager is a low-level compatibility API. New user code should not create trainer managers directly; use FiretitanServiceClient.from_firetitan_config(...) or cookbook recipes instead. This page remains for existing integrations, migration support, and advanced lifecycle debugging.
TrainerJobManager manages the lifecycle of service-mode trainer jobs — GPU-backed trainer endpoints that your Python loop connects to with a training client. TrainerJobManager extends FireworksClient, so all trainer-free operations (checkpoint promotion, training shape resolution) are also available here.
from fireworks.training.sdk import TrainerJobManager, TrainerJobConfig

Constructor

rlor_mgr = TrainerJobManager(
    api_key="<FIREWORKS_API_KEY>",
    base_url="https://api.fireworks.ai",  # optional, defaults to https://api.fireworks.ai
)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str | | Fireworks API key |
| base_url | str | "https://api.fireworks.ai" | Control-plane URL |
| additional_headers | dict \| None | None | Extra HTTP headers |
| verify_ssl | bool \| None | None | SSL verification override |

Methods

create(config)

Create a service-mode trainer job and return immediately (without waiting). Returns a CreatedTrainerJob:
created = rlor_mgr.create(TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    training_shape_ref="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    lora_rank=0,
    learning_rate=1e-5,
))

print(created.job_id)    # <JOB_ID>
print(created.job_name)  # accounts/<ACCOUNT_ID>/rlorTrainerJobs/<JOB_ID>

wait_for_ready(job_id, job_name=None, poll_interval_s=5.0, timeout_s=900)

Poll until a trainer job reaches RUNNING state and is healthy. Returns a TrainerServiceEndpoint:
endpoint = rlor_mgr.wait_for_ready(created.job_id)

create_and_wait(config, poll_interval_s=5.0, timeout_s=900)

Create a service-mode trainer and poll until the endpoint is healthy. Combines create() + wait_for_ready(). Returns a TrainerServiceEndpoint.
endpoint = rlor_mgr.create_and_wait(TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    training_shape_ref="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    lora_rank=0,
    learning_rate=1e-5,
    display_name="grpo-policy-trainer",
))

print(endpoint.base_url)  # https://<trainer-endpoint>
print(endpoint.job_id)    # <JOB_ID>
print(endpoint.job_name)  # accounts/<ACCOUNT_ID>/rlorTrainerJobs/<JOB_ID>

wait_for_existing(job_id, poll_interval_s=5.0, timeout_s=900)

Wait for an already-existing trainer job to reach RUNNING state:
existing = rlor_mgr.wait_for_existing(job_id="<existing-job-id>")
print(existing.base_url)

resume_and_wait(job_id, poll_interval_s=5.0, timeout_s=900)

Resume a failed/cancelled/paused job and wait until healthy:
endpoint = rlor_mgr.resume_and_wait(job_id="<job-id>")

reconnect_and_wait(job_id, ...)

Handle pod preemption and transient failures. Waits for the job to reach a resumable state, resumes it, then polls until healthy:
endpoint = rlor_mgr.reconnect_and_wait(
    job_id="<job-id>",
    timeout_s=600,
    max_wait_for_resumable_s=120,
)
This is more robust than resume_and_wait(): it retries when the job is in a transitional state (e.g. the control plane is still processing a pod death).
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| job_id | str | | The RLOR job ID to reconnect |
| poll_interval_s | float | 5.0 | Seconds between health checks after resume |
| timeout_s | float | 600 | Overall timeout for the job to become RUNNING |
| max_wait_for_resumable_s | float | 120 | Max seconds to wait for a resumable state |
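The recovery flow above can be wrapped around a training step. The following is a minimal sketch, not part of the SDK: `run_with_reconnect`, `step`, and `max_reconnects` are hypothetical names, and treating `ConnectionError` as the sign of a preempted trainer is an assumption, so substitute whatever exception your training client actually raises. Only the `reconnect_and_wait(job_id=..., timeout_s=..., max_wait_for_resumable_s=...)` call is taken from this page.

```python
from typing import Any, Callable


def run_with_reconnect(
    mgr: Any,
    job_id: str,
    step: Callable[[Any], Any],
    endpoint: Any,
    max_reconnects: int = 3,
) -> Any:
    """Run step(endpoint); on a connection failure, reconnect and retry.

    `mgr` is any object exposing reconnect_and_wait(...) as documented above.
    ConnectionError is an assumed failure signal, not a documented one.
    """
    attempts = 0
    while True:
        try:
            return step(endpoint)
        except ConnectionError:
            if attempts >= max_reconnects:
                raise  # give up after the configured number of reconnects
            attempts += 1
            # Resume the job and wait for a fresh, healthy endpoint.
            endpoint = mgr.reconnect_and_wait(
                job_id=job_id,
                timeout_s=600,
                max_wait_for_resumable_s=120,
            )
```

The step receives the endpoint as an argument so that a retry uses the endpoint returned by the reconnect rather than a stale one.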

get(job_id)

Inspect job status:
status = rlor_mgr.get(job_id=endpoint.job_id)
print(status["state"])  # JOB_STATE_RUNNING

delete(job_id)

Delete a trainer job and release GPU resources:
rlor_mgr.delete(job_id="<job-id>")
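Because a trainer holds GPU resources until it is deleted, it can help to tie deletion to a scope. The sketch below is one possible pattern, not an SDK feature: `trainer_session` is a hypothetical helper that relies only on `create_and_wait(config)`, `delete(job_id=...)`, and the `job_id` field of the returned endpoint, all as described on this page.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def trainer_session(mgr: Any, config: Any) -> Iterator[Any]:
    """Create a trainer, yield its endpoint, and always delete the job on exit."""
    endpoint = mgr.create_and_wait(config)
    try:
        yield endpoint
    finally:
        # Runs on normal exit and on exceptions, so GPUs are released either way.
        mgr.delete(job_id=endpoint.job_id)
```

With a real manager this would be used as `with trainer_session(rlor_mgr, config) as endpoint: ...`. Skip this pattern if you intend to resume the job later, since the job is deleted when the block exits.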

promote_checkpoint(*, name, output_model_id, base_model)

Inherited from FireworksClient. Promote a sampler checkpoint to a deployable Fireworks model. The trainer job does not need to be running — the checkpoint resource name resolves the storage location.
entry = rlor_mgr.list_checkpoints(endpoint.job_id)[0]
model = rlor_mgr.promote_checkpoint(
    name=entry["name"],
    output_model_id="my-fine-tuned-model",
    base_model="accounts/fireworks/models/qwen3-8b",
)
See FireworksClient.promote_checkpoint for full parameter docs.

resolve_training_profile(shape_id)

Inherited from FireworksClient. Resolve a training shape ID into a full configuration profile.
shape_id = "accounts/fireworks/trainingShapes/ts-qwen3-8b-policy"
profile = rlor_mgr.resolve_training_profile(shape_id)
See FireworksClient.resolve_training_profile for full parameter docs.

TrainerJobConfig

TrainerJobManager.create(...) and TrainerJobManager.create_and_wait(...) accept a TrainerJobConfig dataclass.
Launching through a training shape is the recommended path. In normal user code, you should not hand-author training_shape_ref; pass a training shape ID to resolve_training_profile(...) and use the returned versioned ref. Advanced manual launches can omit training_shape_ref and provide infra fields directly.
When training_shape_ref is set (the recommended shape path), the training shape owns the trainer’s hardware and image configuration. The fields below are what you set as a user:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| base_model | str | | Base model name (e.g. "accounts/fireworks/models/qwen3-8b") |
| training_shape_ref | str \| None | None | Full training-shape resource name (e.g. accounts/fireworks/trainingShapes/<shape> or .../versions/<ver>). Use mgr.resolve_training_profile(...) to get the pinned versioned ref. See Training Shapes. |
| lora_rank | int | 0 | LoRA rank. 0 for full-parameter tuning, or a positive integer (e.g. 16, 64) for LoRA |
| max_context_length | int \| None | None | Maximum sequence length. Usually inherited from the training shape on the shape path. |
| learning_rate | float | 1e-5 | Learning rate for the optimizer |
| display_name | str \| None | None | Human-readable trainer name |
| region | str \| None | None | Region for the job |
| extra_args | list[str] \| None | None | Extra trainer arguments |
| forward_only | bool | False | Create a forward-only trainer (reference model pattern) |
| inactivity_timeout | datetime.timedelta \| str \| None | None | Trainer inactivity timeout. The trainer reports tracked activity, including trainer API operations and active-session heartbeats. If no tracked activity is observed for this duration, the trainer is automatically stopped. When unset or 0, Fireworks uses the 60-minute default. String values must use protobuf JSON duration format, such as "1800s". |
| disable_inactivity_cleanup | bool | False | Disable trainer inactivity cleanup. GPU usage continues to accrue while the trainer is running. |
gradient_accumulation_steps is deprecated in TrainerJobConfig. Do not use it to request server-side accumulation. Accumulate gradients in client code by calling forward_backward... multiple times before one optim_step(...); see Loss Functions.
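Client-side accumulation as described in the note above amounts to several forward/backward calls followed by a single optimizer step. The sketch below illustrates only that control flow: `accumulate_then_step` is a hypothetical helper, and the two callables stand in for your training client's forward-backward and optimizer-step methods, whose exact names and signatures are not specified on this page.

```python
from typing import Any, Callable, Iterable


def accumulate_then_step(
    micro_batches: Iterable[Any],
    forward_backward: Callable[[Any], Any],
    optim_step: Callable[[], Any],
) -> int:
    """Call forward_backward once per micro-batch, then optim_step once.

    Returns the number of micro-batches processed. No optimizer step is
    taken when there are no micro-batches.
    """
    count = 0
    for batch in micro_batches:
        forward_backward(batch)  # gradients accumulate across these calls
        count += 1
    if count:
        optim_step()  # one optimizer update for the whole accumulated group
    return count
```

Passing the two operations in as callables keeps the sketch independent of the client API; in practice you would bind them to the methods of the training client connected to the trainer endpoint.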
On the recommended shape path, accelerator_type, accelerator_count, node_count, and custom_image_tag are automatically configured by the training shape and cannot be overridden. Advanced manual launches can omit training_shape_ref and set those fields directly.
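When inactivity_timeout has to be supplied as a string (for example, from a config file), it must follow the protobuf JSON duration format mentioned in the table: a decimal number of seconds followed by "s". The helper below is a small illustrative sketch, not part of the SDK; `to_proto_duration` is a hypothetical name, and rejecting negative values is a choice made here because a negative timeout has no meaning for this field.

```python
from datetime import timedelta


def to_proto_duration(td: timedelta) -> str:
    """Format a timedelta as a protobuf JSON duration string, e.g. "1800s"."""
    if td < timedelta(0):
        raise ValueError("inactivity timeout must not be negative")
    # Work in integer microseconds to avoid floating-point rounding.
    total_us = (td.days * 86400 + td.seconds) * 1_000_000 + td.microseconds
    secs, micros = divmod(total_us, 1_000_000)
    if micros == 0:
        return f"{secs}s"
    # Keep only the significant fractional digits.
    return f"{secs}.{micros:06d}".rstrip("0") + "s"


print(to_proto_duration(timedelta(minutes=30)))  # 1800s
```

Since the field also accepts a datetime.timedelta directly, this conversion is needed only where a string is required.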

CreatedTrainerJob

Returned by create():
| Field | Type | Description |
| --- | --- | --- |
| job_name | str | Full resource name (accounts/<id>/rlorTrainerJobs/<id>) |
| job_id | str | RLOR trainer job ID |

TrainerServiceEndpoint

Returned by create_and_wait, wait_for_ready, wait_for_existing, resume_and_wait, and reconnect_and_wait:
| Field | Type | Description |
| --- | --- | --- |
| base_url | str | Trainer endpoint URL for connecting a training client |
| job_id | str | RLOR trainer job ID |
| job_name | str | Full resource name (accounts/<id>/rlorTrainerJobs/<id>) |
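If only a job_name is at hand (for example, from logs), the account and job IDs can be recovered from the resource-name pattern shown above. This is an illustrative sketch: `parse_job_name` is a hypothetical helper, and it assumes the name has exactly the form accounts/<id>/rlorTrainerJobs/<id> with no further path segments.

```python
import re
from typing import Tuple

# Pattern taken from the resource-name format documented above.
_JOB_NAME_RE = re.compile(
    r"^accounts/(?P<account_id>[^/]+)/rlorTrainerJobs/(?P<job_id>[^/]+)$"
)


def parse_job_name(job_name: str) -> Tuple[str, str]:
    """Split a trainer job resource name into (account_id, job_id)."""
    match = _JOB_NAME_RE.match(job_name)
    if match is None:
        raise ValueError(f"not a trainer job resource name: {job_name!r}")
    return match.group("account_id"), match.group("job_id")
```

The extracted job ID is the value that methods such as get(job_id) and delete(job_id) expect.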

TrainingShapeProfile

See FireworksClient > TrainingShapeProfile for the full field reference.

Job states

| State | Meaning |
| --- | --- |
| JOB_STATE_CREATING | Resources being provisioned |
| JOB_STATE_PENDING | Queued, waiting for GPU availability |
| JOB_STATE_RUNNING | Trainer is ready; you can connect a training client |
| JOB_STATE_IDLE | Service-mode job is idle |
| JOB_STATE_COMPLETED | Job finished successfully |
| JOB_STATE_FAILED | Job failed |
| JOB_STATE_CANCELLED | Job was cancelled |
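To make the state table concrete, here is a rough sketch of a polling loop over these states. It is not how the SDK's wait methods are implemented (those also check endpoint health, which this sketch does not). `poll_until_running` and its `get_state` callable are hypothetical; with a real manager, `get_state` could be something like `lambda: rlor_mgr.get(job_id=job_id)["state"]`. Treating COMPLETED, FAILED, and CANCELLED as reasons to stop polling is an assumption made here; failed or cancelled jobs may still be resumable with resume_and_wait().

```python
import time
from typing import Callable

READY_STATE = "JOB_STATE_RUNNING"
# Assumed stop conditions for polling; not an SDK-defined set.
STOP_STATES = frozenset(
    {"JOB_STATE_COMPLETED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}
)


def poll_until_running(
    get_state: Callable[[], str],
    poll_interval_s: float = 5.0,
    timeout_s: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until the job is RUNNING, stops, or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == READY_STATE:
            return state
        if state in STOP_STATES:
            raise RuntimeError(f"job stopped in state {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still in state {state} after {timeout_s}s")
        # CREATING, PENDING, IDLE, or an unknown state: keep waiting.
        sleep(poll_interval_s)
```

The `sleep` parameter exists only so the loop can be exercised without real delays.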