
Overview

TrainerJobManager manages the lifecycle of service-mode RLOR trainer jobs — GPU-backed trainer endpoints that your custom Python loop connects to via FiretitanServiceClient.
from fireworks.training.sdk import TrainerJobManager, TrainerJobConfig
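
A typical session, end to end, looks roughly like the sketch below. It uses only the methods documented on this page; the point where your training loop connects via FiretitanServiceClient is left as a placeholder, since that client has its own reference.

rlor_mgr = TrainerJobManager(api_key="<FIREWORKS_API_KEY>", account_id="<ACCOUNT_ID>")

endpoint = rlor_mgr.create_and_wait(TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    training_shape_ref="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
))
try:
    # Connect your custom training loop to endpoint.base_url here,
    # e.g. via FiretitanServiceClient (see its own reference for usage).
    ...
finally:
    rlor_mgr.delete(job_id=endpoint.job_id)  # release GPU resources when done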

Constructor

rlor_mgr = TrainerJobManager(
    api_key="<FIREWORKS_API_KEY>",
    account_id="<ACCOUNT_ID>",
    base_url="https://api.fireworks.ai",  # optional, defaults to https://api.fireworks.ai
)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str | | Fireworks API key |
| account_id | str | | Fireworks account ID |
| base_url | str | "https://api.fireworks.ai" | Control-plane URL |
| additional_headers | dict \| None | None | Extra HTTP headers |
| verify_ssl | bool \| None | None | SSL verification override |
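
The optional transport parameters can be passed alongside the required credentials; a brief sketch with illustrative values:

rlor_mgr = TrainerJobManager(
    api_key="<FIREWORKS_API_KEY>",
    account_id="<ACCOUNT_ID>",
    additional_headers={"x-request-source": "my-rl-pipeline"},  # illustrative extra header
    verify_ssl=False,  # e.g. when a proxy intercepts TLS; leave as None otherwise
)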

Methods

create(config)

Create a service-mode trainer job and return immediately, without waiting for the endpoint to become ready. Returns a CreatedTrainerJob:
created = rlor_mgr.create(TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    training_shape_ref="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    lora_rank=0,
    learning_rate=1e-5,
))

print(created.job_id)    # <JOB_ID>
print(created.job_name)  # accounts/<ACCOUNT_ID>/rlorTrainerJobs/<JOB_ID>

wait_for_ready(job_id, job_name=None, poll_interval_s=5.0, timeout_s=900)

Poll until a trainer job reaches RUNNING state and is healthy. Returns a TrainerServiceEndpoint:
endpoint = rlor_mgr.wait_for_ready(created.job_id)
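
Both poll_interval_s and timeout_s can be raised for jobs that take longer to provision; a sketch with illustrative values:

endpoint = rlor_mgr.wait_for_ready(
    created.job_id,
    poll_interval_s=10.0,  # check health every 10 seconds
    timeout_s=1800,        # allow up to 30 minutes for the trainer to come up
)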

create_and_wait(config, poll_interval_s=5.0, timeout_s=900)

Create a service-mode trainer and poll until the endpoint is healthy. Combines create() + wait_for_ready(). Returns a TrainerServiceEndpoint.
endpoint = rlor_mgr.create_and_wait(TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    training_shape_ref="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    lora_rank=0,
    learning_rate=1e-5,
    gradient_accumulation_steps=4,
    display_name="grpo-policy-trainer",
    hot_load_deployment_id="my-serving-deployment",
))

print(endpoint.base_url)  # https://<trainer-endpoint>
print(endpoint.job_id)    # <JOB_ID>
print(endpoint.job_name)  # accounts/<ACCOUNT_ID>/rlorTrainerJobs/<JOB_ID>

wait_for_existing(job_id, poll_interval_s=5.0, timeout_s=900)

Wait for an already-existing trainer job to reach RUNNING state:
existing = rlor_mgr.wait_for_existing(job_id="<existing-job-id>")
print(existing.base_url)

resume_and_wait(job_id, poll_interval_s=5.0, timeout_s=900)

Resume a failed/cancelled/paused job and wait until healthy:
endpoint = rlor_mgr.resume_and_wait(job_id="<job-id>")

reconnect_and_wait(job_id, ...)

Handle pod preemption and transient failures. Waits for the job to reach a resumable state, resumes it, then polls until healthy:
endpoint = rlor_mgr.reconnect_and_wait(
    job_id="<job-id>",
    timeout_s=600,
    max_wait_for_resumable_s=120,
)
This is more robust than resume_and_wait(): it retries while the job is in a transitional state (e.g. the control plane is still processing a pod death).
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| job_id | str | | The RLOR job ID to reconnect |
| poll_interval_s | float | 5.0 | Seconds between health checks after resume |
| timeout_s | float | 600 | Overall timeout for the job to become RUNNING |
| max_wait_for_resumable_s | float | 120 | Max seconds to wait for a resumable state |
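
A common pattern is to wrap the training loop so that a dropped trainer connection triggers a reconnect. The sketch below is illustrative: train_one_step() and the ConnectionError caught here stand in for whatever your loop and client actually raise.

endpoint = rlor_mgr.create_and_wait(config)  # config: your TrainerJobConfig
for step in range(1000):  # illustrative step count
    try:
        train_one_step(endpoint)  # your loop, talking to endpoint.base_url
    except ConnectionError:
        # Trainer pod died or was preempted: wait for a resumable state,
        # resume the job, and continue from the restored endpoint.
        endpoint = rlor_mgr.reconnect_and_wait(job_id=endpoint.job_id, timeout_s=600)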

get(job_id)

Inspect job status:
status = rlor_mgr.get(job_id=endpoint.job_id)
print(status["state"])  # JOB_STATE_RUNNING

delete(job_id)

Delete a trainer job and release GPU resources:
rlor_mgr.delete(job_id="<job-id>")

promote_checkpoint(job_id, checkpoint_id, output_model_id)

Promote a sampler checkpoint to a deployable Fireworks model:
model = rlor_mgr.promote_checkpoint(
    job_id="<job-id>",
    checkpoint_id="<snapshot-name>",
    output_model_id="my-fine-tuned-model",
)
| Parameter | Type | Description |
| --- | --- | --- |
| job_id | str | RLOR trainer job ID that produced the checkpoint |
| checkpoint_id | str | The snapshot_name from save_weights_for_sampler_ext |
| output_model_id | str | Desired model ID (1-63 chars, lowercase a-z, 0-9, hyphen only) |
Returns the model dict from the API (includes state, kind, peftDetails). See Saving and Loading for details.
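
The returned dict can be inspected directly; the keys used below are the ones listed above (their exact values depend on your account and model):

print(model["state"])
print(model["kind"])
print(model.get("peftDetails"))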

validate_output_model_id(output_model_id)

Client-side validation helper for promote_checkpoint(..., output_model_id=...):
from fireworks.training.sdk import validate_output_model_id

errors = validate_output_model_id("my-fine-tuned-model")
if errors:
    raise ValueError("\n".join(errors))
Returns a list of formatted error strings. An empty list means the model ID is valid.

resolve_training_profile(shape_id)

Resolve a training shape ID into a full configuration profile:
shape_id = "accounts/fireworks/trainingShapes/ts-qwen3-8b-policy"
profile = rlor_mgr.resolve_training_profile(shape_id)
print(profile.accelerator_type)      # e.g. "NVIDIA_B200_192GB"
print(profile.trainer_image_tag)      # e.g. "0.0.0-dev-..."
print(profile.node_count)             # e.g. 1
print(profile.pipeline_parallelism)   # e.g. 1
Most users only need the training shape ID; in most cases this is the full shared path accounts/fireworks/trainingShapes/<shape>. The fireworks account is the public shared shape catalog, and the SDK resolves the versioned training_shape_ref for you. See Training Shapes for the user-facing shape workflow.
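
Putting this together with TrainerJobConfig below: resolve the shape once, then pass the pinned version as training_shape_ref. A sketch, reusing the placeholder shape and model IDs from earlier on this page:

shape_id = "accounts/fireworks/trainingShapes/qwen3-8b-128k-h200"
profile = rlor_mgr.resolve_training_profile(shape_id)

endpoint = rlor_mgr.create_and_wait(TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    training_shape_ref=profile.training_shape_version,  # pinned versioned ref
    learning_rate=1e-5,
))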

TrainerJobConfig

TrainerJobManager.create(...) and create_and_wait(...) accept a TrainerJobConfig dataclass. Launching a trainer requires training_shape_ref; in normal user code you should not hand-author that value. Instead, pass a training shape ID to resolve_training_profile(...) and use the returned versioned ref (see the sketch under resolve_training_profile above). When training_shape_ref is set, the training shape owns the trainer hardware/image configuration, so do not treat those shape-owned fields as knobs to set manually.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| base_model | str | | Base model name (e.g. "accounts/fireworks/models/qwen3-8b") |
| lora_rank | int | 0 | LoRA rank. 0 for full-parameter tuning, or a positive integer (e.g. 16, 64) for LoRA |
| max_context_length | int \| None | None | Maximum sequence length. Usually inherited from the selected training shape. |
| learning_rate | float | 1e-5 | Learning rate for the optimizer |
| gradient_accumulation_steps | int | 1 | Number of micro-batches before an optimizer step |
| node_count | int \| None | None | Number of trainer nodes. Shape-owned in normal shape-based launches; do not override it manually. |
| display_name | str \| None | None | Human-readable trainer name |
| hot_load_deployment_id | str \| None | None | Deployment ID for checkpoint weight sync |
| region | str \| None | None | Region for the job. Auto-resolved in normal shape-based launches. |
| custom_image_tag | str \| None | None | Override trainer image tag. Shape-owned in normal shape-based launches; do not override it manually. |
| extra_args | list[str] \| None | None | Extra trainer arguments |
| accelerator_type | str \| None | None | Accelerator type override. Shape-owned in normal shape-based launches; do not override it manually. |
| accelerator_count | int \| None | None | Accelerator count override. Shape-owned in normal shape-based launches; do not override it manually. |
| training_shape_ref | str \| None | None | Required launch-time full training-shape resource name (accounts/<acct>/trainingShapes/<shape> or .../versions/<ver>). In most cases this comes from the shared public shape account as accounts/fireworks/trainingShapes/<shape>. Use mgr.resolve_training_profile("accounts/<acct>/trainingShapes/<shape>").training_shape_version to get the pinned versioned ref. See Training Shapes. |
| forward_only | bool | False | Create a forward-only trainer (reference model pattern) |
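
For the reference-model pattern noted under forward_only, a second, forward-only trainer can be launched alongside the policy trainer. A sketch with an illustrative display name:

ref_endpoint = rlor_mgr.create_and_wait(TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    training_shape_ref="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    forward_only=True,  # forward passes only; no optimizer step
    display_name="grpo-reference-model",  # illustrative
))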

CreatedTrainerJob

Returned by create():
| Field | Type | Description |
| --- | --- | --- |
| job_name | str | Full resource name (accounts/<id>/rlorTrainerJobs/<id>) |
| job_id | str | RLOR trainer job ID |

TrainerServiceEndpoint

Returned by create_and_wait, wait_for_ready, wait_for_existing, resume_and_wait, and reconnect_and_wait:
| Field | Type | Description |
| --- | --- | --- |
| base_url | str | Trainer endpoint URL for FiretitanServiceClient |
| job_id | str | RLOR trainer job ID |
| job_name | str | Full resource name (accounts/<id>/rlorTrainerJobs/<id>) |

TrainingShapeProfile

Returned by resolve_training_profile:
| Field | Type | Description |
| --- | --- | --- |
| training_shape_version | str | Resolved shape version |
| trainer_image_tag | str | Docker image tag for the trainer |
| max_supported_context_length | int | Maximum supported context length |
| node_count | int | Number of trainer nodes |
| deployment_shape_version | str | Linked deployment shape |
| deployment_image_tag | str | Docker image tag for deployment |
| accelerator_type | str | GPU type |
| accelerator_count | int | Number of GPUs per node |
| base_model_weight_precision | str | Model weight precision |
| pipeline_parallelism | int | Pipeline parallelism degree |
| training_shape | str | Training shape name (without /versions/... suffix) |
| deployment_shape | str | Deployment shape name (without /versions/... suffix) |

Job states

| State | Meaning |
| --- | --- |
| JOB_STATE_CREATING | Resources being provisioned |
| JOB_STATE_PENDING | Queued, waiting for GPU availability |
| JOB_STATE_RUNNING | Trainer is ready; you can connect a training client |
| JOB_STATE_IDLE | Service-mode job is idle |
| JOB_STATE_COMPLETED | Job finished successfully |
| JOB_STATE_FAILED | Job failed |
| JOB_STATE_CANCELLED | Job was cancelled |
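
These are the values returned in the state field of get(). A sketch that polls until a terminal state, using only the states listed above:

import time

TERMINAL_STATES = {"JOB_STATE_COMPLETED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}

while True:
    state = rlor_mgr.get(job_id="<job-id>")["state"]
    print(state)
    if state in TERMINAL_STATES:
        break
    time.sleep(10)  # illustrative polling interval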