Overview
TrainerJobManager manages the lifecycle of service-mode RLOR trainer jobs — GPU-backed trainer endpoints that your custom Python loop connects to via FiretitanServiceClient.
Constructor
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | — | Fireworks API key |
| account_id | str | — | Fireworks account ID |
| base_url | str | "https://api.fireworks.ai" | Control-plane URL |
| additional_headers | dict \| None | None | Extra HTTP headers |
| verify_ssl | bool \| None | None | SSL verification override |
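A minimal construction sketch. The import path and the `FIREWORKS_API_KEY` environment variable are assumptions; adapt both to your SDK install:

```python
import os

# Import path is an assumption; use the module your SDK actually exposes.
from fireworks.rlor import TrainerJobManager

mgr = TrainerJobManager(
    api_key=os.environ["FIREWORKS_API_KEY"],  # assumed env var
    account_id="my-account",                  # placeholder account ID
)
```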
Methods
create(config)
Create a service-mode trainer job and return immediately (without waiting). Returns a CreatedTrainerJob:
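A hedged sketch, assuming `mgr` is a constructed TrainerJobManager and `config` is a TrainerJobConfig (see below):

```python
created = mgr.create(config)
print(created.job_name)  # accounts/<id>/rlorTrainerJobs/<id>
print(created.job_id)
```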
wait_for_ready(job_id, job_name=None, poll_interval_s=5.0, timeout_s=900)
Poll until a trainer job reaches RUNNING state and is healthy. Returns a TrainerServiceEndpoint:
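A sketch, assuming `mgr` and a job previously created with create():

```python
endpoint = mgr.wait_for_ready(created.job_id, poll_interval_s=5.0, timeout_s=900)
print(endpoint.base_url)  # pass this to FiretitanServiceClient
```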
create_and_wait(config, poll_interval_s=5.0, timeout_s=900)
Create a service-mode trainer and poll until the endpoint is healthy. Combines create() + wait_for_ready(). Returns a TrainerServiceEndpoint.
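A sketch, assuming `mgr` and `config` as above; the FiretitanServiceClient constructor signature is an assumption (see Related guides):

```python
endpoint = mgr.create_and_wait(config, timeout_s=900)
client = FiretitanServiceClient(endpoint.base_url)  # connect your training loop
```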
wait_for_existing(job_id, poll_interval_s=5.0, timeout_s=900)
Wait for an already-existing trainer job to reach RUNNING state:
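A sketch with a placeholder job ID:

```python
endpoint = mgr.wait_for_existing("my-job-id", timeout_s=900)
```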
resume_and_wait(job_id, poll_interval_s=5.0, timeout_s=900)
Resume a failed/cancelled/paused job and wait until healthy:
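A sketch with a placeholder job ID:

```python
endpoint = mgr.resume_and_wait("my-job-id", timeout_s=900)
```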
reconnect_and_wait(job_id, ...)
Handle pod preemption and transient failures. Waits for the job to reach a resumable state, resumes it, then polls until healthy:
Unlike resume_and_wait(), it retries when the job is in a transitional state (e.g. the control plane is still processing a pod death).
| Parameter | Type | Default | Description |
|---|---|---|---|
| job_id | str | — | The RLOR job ID to reconnect |
| poll_interval_s | float | 5.0 | Seconds between health checks after resume |
| timeout_s | float | 600 | Overall timeout for the job to become RUNNING |
| max_wait_for_resumable_s | float | 120 | Max seconds to wait for a resumable state |
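A recovery sketch, assuming `mgr`, a `job_id`, and a connected client whose calls can fail on pod preemption; the exception type and the training call shown are assumptions:

```python
try:
    client.some_training_call()  # hypothetical call that fails on preemption
except ConnectionError:
    endpoint = mgr.reconnect_and_wait(job_id, max_wait_for_resumable_s=120)
    client = FiretitanServiceClient(endpoint.base_url)  # reconnect the client
```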
get(job_id)
Inspect job status:
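A sketch with a placeholder job ID; the exact shape of the returned job object is an assumption beyond the states listed under Job states below:

```python
job = mgr.get("my-job-id")
print(job)  # inspect status, e.g. JOB_STATE_RUNNING
```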
delete(job_id)
Delete a trainer job and release GPU resources:
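A sketch with a placeholder job ID:

```python
mgr.delete("my-job-id")  # releases the job's GPU resources
```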
promote_checkpoint(job_id, checkpoint_id, output_model_id)
Promote a sampler checkpoint to a deployable Fireworks model:
| Parameter | Type | Description |
|---|---|---|
| job_id | str | RLOR trainer job ID that produced the checkpoint |
| checkpoint_id | str | The snapshot_name from save_weights_for_sampler_ext |
| output_model_id | str | Desired model ID (1-63 chars, lowercase a-z, 0-9, hyphen only) |
Returns the promoted Fireworks model record (including fields such as state, kind, and peftDetails). See Saving and Loading for details.
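A sketch, assuming `mgr` and a checkpoint saved earlier via save_weights_for_sampler_ext (the IDs are placeholders):

```python
mgr.promote_checkpoint(
    job_id="my-job-id",
    checkpoint_id="my-snapshot",         # snapshot_name from save_weights_for_sampler_ext
    output_model_id="my-rlhf-model-v2",  # 1-63 chars: lowercase a-z, 0-9, hyphens
)
```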
validate_output_model_id(output_model_id)
Client-side validation helper for promote_checkpoint(..., output_model_id=...):
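The documented constraint can be sketched as a standalone check. This is a hypothetical re-implementation for illustration, not the SDK's actual code; it encodes only the rule stated above (1-63 chars, lowercase a-z, 0-9, hyphen):

```python
import re

# Hypothetical re-implementation of the documented rule, for illustration only.
_MODEL_ID_RE = re.compile(r"[a-z0-9-]{1,63}")

def looks_like_valid_model_id(output_model_id: str) -> bool:
    """Return True if the ID satisfies the documented character/length rule."""
    return bool(_MODEL_ID_RE.fullmatch(output_model_id))

print(looks_like_valid_model_id("my-rlhf-model-v2"))  # True
print(looks_like_valid_model_id("My_Model"))          # False (uppercase, underscore)
```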
resolve_training_profile(shape_id)
Resolve a training shape ID into a full configuration profile:
Shape IDs are full resource names of the form accounts/<acct>/trainingShapes/<shape>, e.g. accounts/fireworks/trainingShapes/<shape>. The fireworks account is the public shared shape catalog, and the SDK resolves the versioned training_shape_ref for you.
See Training Shapes for the user-facing shape workflow.
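A sketch, assuming `mgr` and a placeholder shape ID:

```python
profile = mgr.resolve_training_profile("accounts/fireworks/trainingShapes/example-shape")
print(profile.training_shape_version)  # pinned versioned ref for TrainerJobConfig
print(profile.accelerator_type, profile.accelerator_count)
```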
TrainerJobConfig
TrainerJobManager.create_and_wait(...) accepts a TrainerJobConfig dataclass:
Launching a trainer requires training_shape_ref. In normal user code, you should not hand-author that value. Instead, pass a training shape ID to resolve_training_profile(...) and use the returned versioned ref.
When training_shape_ref is set, the training shape owns the trainer hardware/image configuration. In normal user-facing flows, do not treat those shape-owned fields as knobs you should set manually.
| Field | Type | Default | Description |
|---|---|---|---|
| base_model | str | — | Base model name (e.g. "accounts/fireworks/models/qwen3-8b") |
| lora_rank | int | 0 | LoRA rank. 0 for full-parameter tuning, or a positive integer (e.g. 16, 64) for LoRA |
| max_context_length | int \| None | None | Maximum sequence length. Usually inherited from the selected training shape. |
| learning_rate | float | 1e-5 | Learning rate for the optimizer |
| gradient_accumulation_steps | int | 1 | Number of micro-batches before an optimizer step |
| node_count | int \| None | None | Number of trainer nodes. Shape-owned in normal shape-based launches; do not override it manually. |
| display_name | str \| None | None | Human-readable trainer name |
| hot_load_deployment_id | str \| None | None | Deployment ID for checkpoint weight sync |
| region | str \| None | None | Region for the job. Auto-resolved in normal shape-based launches. |
| custom_image_tag | str \| None | None | Override trainer image tag. Shape-owned in normal shape-based launches; do not override it manually. |
| extra_args | list[str] \| None | None | Extra trainer arguments |
| accelerator_type | str \| None | None | Accelerator type override. Shape-owned in normal shape-based launches; do not override it manually. |
| accelerator_count | int \| None | None | Accelerator count override. Shape-owned in normal shape-based launches; do not override it manually. |
| training_shape_ref | str \| None | None | Required at launch: full training-shape resource name (accounts/<acct>/trainingShapes/<shape> or .../versions/<ver>). In most cases this comes from the shared public shape account as accounts/fireworks/trainingShapes/<shape>. Use mgr.resolve_training_profile("accounts/<acct>/trainingShapes/<shape>").training_shape_version to get the pinned versioned ref. See Training Shapes. |
| forward_only | bool | False | Create a forward-only trainer (reference model pattern) |
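Putting the table together, a hedged end-to-end launch sketch (the shape ID is a placeholder, and `mgr` is assumed to be a constructed TrainerJobManager):

```python
profile = mgr.resolve_training_profile("accounts/fireworks/trainingShapes/example-shape")

config = TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    lora_rank=16,                                       # LoRA; use 0 for full-parameter
    learning_rate=1e-5,
    training_shape_ref=profile.training_shape_version,  # pinned versioned ref
)

endpoint = mgr.create_and_wait(config)
```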
CreatedTrainerJob
Returned by create():
| Field | Type | Description |
|---|---|---|
| job_name | str | Full resource name (accounts/<id>/rlorTrainerJobs/<id>) |
| job_id | str | RLOR trainer job ID |
TrainerServiceEndpoint
Returned by create_and_wait, wait_for_ready, wait_for_existing, resume_and_wait, and reconnect_and_wait:
| Field | Type | Description |
|---|---|---|
| base_url | str | Trainer endpoint URL for FiretitanServiceClient |
| job_id | str | RLOR trainer job ID |
| job_name | str | Full resource name (accounts/<id>/rlorTrainerJobs/<id>) |
TrainingShapeProfile
Returned by resolve_training_profile:
| Field | Type | Description |
|---|---|---|
| training_shape_version | str | Resolved shape version |
| trainer_image_tag | str | Docker image tag for the trainer |
| max_supported_context_length | int | Maximum supported context length |
| node_count | int | Number of trainer nodes |
| deployment_shape_version | str | Linked deployment shape |
| deployment_image_tag | str | Docker image tag for deployment |
| accelerator_type | str | GPU type |
| accelerator_count | int | Number of GPUs per node |
| base_model_weight_precision | str | Model weight precision |
| pipeline_parallelism | int | Pipeline parallelism degree |
| training_shape | str | Training shape name (without /versions/... suffix) |
| deployment_shape | str | Deployment shape name (without /versions/... suffix) |
Job states
| State | Meaning |
|---|---|
| JOB_STATE_CREATING | Resources being provisioned |
| JOB_STATE_PENDING | Queued, waiting for GPU availability |
| JOB_STATE_RUNNING | Trainer is ready — you can connect a training client |
| JOB_STATE_IDLE | Service-mode job is idle |
| JOB_STATE_COMPLETED | Job finished successfully |
| JOB_STATE_FAILED | Job failed |
| JOB_STATE_CANCELLED | Job was cancelled |
Related guides
- FiretitanServiceClient — connect a training client to this trainer
- Training Shapes — available shapes and deployment linkage
- Cleanup — resource cleanup