What this is
Service-mode RLOR jobs provision GPU-backed trainer endpoints that your custom Python loop connects to via the Tinker SDK. In the current training SDK, job lifecycle is handled by `TrainerJobManager`.
Creating a service-mode trainer job
Use `TrainerJobManager.create_and_wait(...)` with a `TrainerJobConfig`:
Create parameters reference
| Parameter | Type | Description |
|---|---|---|
| `base_model` | `str` | Required. Base model for the trainer. |
| `lora_rank` | `int` | `0` for full-parameter training, `>0` for LoRA. |
| `max_context_length` | `int` | Max sequence length. |
| `learning_rate` | `float` | Learning rate for trainer-side optimizer state. |
| `gradient_accumulation_steps` | `int` | Number of micro-batches before each optimizer step. |
| `node_count` | `int` | Number of trainer nodes. |
| `hot_load_deployment_id` | `str` | Link this trainer to a deployment for checkpoint uploads and hotloading. |
| `display_name` | `str` | Human-readable name for the job. |
| `region` | `str` | Region override (optional). |
| `custom_image_tag` | `str` | Trainer image tag override (optional). |
| `extra_args` | `list[str]` | Extra trainer args (for example `--forward-only`). |
| `accelerator_type` / `accelerator_count` | `str` / `int` | Accelerator overrides (optional). |
| `skip_validations` | `bool` | Bypass control-plane validation checks. |
| `forward_only` | `bool` | Mark the job as forward-only. |
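As a concrete illustration of the parameters above, here is a minimal sketch. The `TrainerJobConfig` stand-in below is a hypothetical dataclass mirroring the table (the real SDK class may differ in import path, defaults, and field set), and the `create_and_wait` call is shown only in a comment:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stand-in mirroring the TrainerJobConfig fields from the
# table above; the real SDK class may differ.
@dataclass
class TrainerJobConfig:
    base_model: str                       # required
    lora_rank: int = 0                    # 0 = full-parameter, >0 = LoRA
    max_context_length: int = 8192
    learning_rate: float = 1e-5
    gradient_accumulation_steps: int = 1
    node_count: int = 1
    hot_load_deployment_id: Optional[str] = None
    display_name: Optional[str] = None
    extra_args: list = field(default_factory=list)

# Example: a LoRA job linked to a deployment for checkpoint hotloading
# (model and deployment names are hypothetical).
config = TrainerJobConfig(
    base_model="my-org/my-base-model",
    lora_rank=32,
    display_name="rlor-exp-001",
    hot_load_deployment_id="my-deployment",
)

# The real call would then be roughly:
#   job = TrainerJobManager.create_and_wait(config)
# which blocks until the endpoint is healthy (JOB_STATE_RUNNING).
```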
Inspecting a job
Job states
| State | Meaning |
|---|---|
| `JOB_STATE_CREATING` | Resources being provisioned |
| `JOB_STATE_PENDING` | Queued, waiting for GPU availability |
| `JOB_STATE_RUNNING` | Trainer is ready; you can connect a Tinker client |
| `JOB_STATE_IDLE` | Service-mode job is idle (no active training) |
| `JOB_STATE_COMPLETED` | Job finished successfully |
| `JOB_STATE_FAILED` | Job failed |
| `JOB_STATE_CANCELLED` | Job was cancelled |
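One way to read the table: only `JOB_STATE_RUNNING` is connectable, `CREATING`/`PENDING` are transitional, and the last three are terminal. The small helper below is an illustration of that grouping, not part of the SDK:

```python
# Illustrative grouping of the job states above; treat this as
# documentation of the table, not the real SDK API.
TRANSITIONAL = {"JOB_STATE_CREATING", "JOB_STATE_PENDING"}
READY = {"JOB_STATE_RUNNING", "JOB_STATE_IDLE"}
TERMINAL = {"JOB_STATE_COMPLETED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}

def can_connect(state: str) -> bool:
    """A Tinker client should only connect once the trainer is RUNNING."""
    return state == "JOB_STATE_RUNNING"
```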
Waiting for readiness
`create_and_wait(...)` and `wait_for_existing(...)` already block until the endpoint is healthy. For an existing job:
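For intuition, the blocking behavior amounts to a poll loop like the sketch below. Here `get_state` is a hypothetical callable standing in for the control-plane query, not a real SDK function:

```python
import time

def wait_for_running(get_state, poll_interval_s=5.0, timeout_s=600.0):
    """Sketch of the wait loop performed by the blocking helpers: poll the
    job state until RUNNING, fail fast on terminal states, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_state()
        if state == "JOB_STATE_RUNNING":
            return state
        if state in ("JOB_STATE_FAILED", "JOB_STATE_CANCELLED"):
            raise RuntimeError(f"job ended in terminal state {state}")
        time.sleep(poll_interval_s)
    raise TimeoutError("job did not become RUNNING in time")

# Simulated control plane: PENDING twice, then RUNNING.
states = iter(["JOB_STATE_PENDING", "JOB_STATE_PENDING", "JOB_STATE_RUNNING"])
final_state = wait_for_running(lambda: next(states), poll_interval_s=0.0)
```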
Resuming a job
Reconnecting after preemption
`reconnect_and_wait` handles pod preemption and transient failures. It waits for the job to reach a resumable state (tolerating transitional states like `CREATING` or `DELETING`), resumes it, then polls until the endpoint is healthy:
Prefer it over calling `resume_and_wait()` directly because it retries when the job is in a transitional state (e.g. the control plane is still processing the pod death).
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `job_id` | `str` | — | The RLOR job ID to reconnect to |
| `poll_interval_s` | `float` | `5.0` | Seconds between health checks after resume |
| `timeout_s` | `float` | `600` | Overall timeout for the job to become `RUNNING` |
| `max_wait_for_resumable_s` | `float` | `120` | Max seconds to wait for a resumable state (`FAILED`/`CANCELLED`/`PAUSED`/`COMPLETED`) |
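Under the hood, the flow can be pictured as two phases: wait for a resumable state (tolerating transitional ones), then resume and poll for health. The sketch below is an illustration with hypothetical `get_state`/`resume` callables, not the SDK implementation:

```python
import time

RESUMABLE = {"JOB_STATE_FAILED", "JOB_STATE_CANCELLED",
             "JOB_STATE_PAUSED", "JOB_STATE_COMPLETED"}

def reconnect_sketch(get_state, resume, poll_interval_s=5.0,
                     timeout_s=600.0, max_wait_for_resumable_s=120.0):
    """Illustration of the reconnect-and-wait flow; get_state/resume are
    hypothetical callables standing in for control-plane operations."""
    # Phase 1: tolerate transitional states (e.g. CREATING, DELETING)
    # until the job settles into a resumable state.
    deadline = time.monotonic() + max_wait_for_resumable_s
    while get_state() not in RESUMABLE:
        if time.monotonic() > deadline:
            raise TimeoutError("job never reached a resumable state")
        time.sleep(poll_interval_s)
    # Phase 2: resume the job, then poll until the endpoint is healthy.
    resume()
    deadline = time.monotonic() + timeout_s
    while get_state() != "JOB_STATE_RUNNING":
        if time.monotonic() > deadline:
            raise TimeoutError("job did not become RUNNING after resume")
        time.sleep(poll_interval_s)
    return "JOB_STATE_RUNNING"

# Simulated control plane: briefly transitional after pod death, then
# FAILED; resume() brings the job back to RUNNING.
box = {"state": "JOB_STATE_CREATING", "calls": 0}

def get_state():
    box["calls"] += 1
    if box["state"] == "JOB_STATE_CREATING" and box["calls"] >= 2:
        box["state"] = "JOB_STATE_FAILED"
    return box["state"]

def resume():
    box["state"] = "JOB_STATE_RUNNING"

result = reconnect_sketch(get_state, resume, poll_interval_s=0.0)
```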
Deleting a job
Always clean up trainer jobs when done to release GPU resources.
Resolving training shapes
Training shapes bundle region, accelerator, image tag, node count, and sharding config into a single resolved profile. `TrainingShapeProfile` contains: `training_shape_version`, `trainer_image_tag`, `max_supported_context_length`, `node_count`, `deployment_shape_version`, `deployment_image_tag`, `accelerator_type`, `accelerator_count`, `base_model_weight_precision`, `pipeline_parallelism`.
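A stand-in sketch of the resolved profile using the field names listed above; the real `TrainingShapeProfile` lives in the SDK, and every value below is hypothetical:

```python
from dataclasses import dataclass

# Stand-in with the fields listed in the text; the real SDK class may
# carry additional metadata.
@dataclass(frozen=True)
class TrainingShapeProfile:
    training_shape_version: str
    trainer_image_tag: str
    max_supported_context_length: int
    node_count: int
    deployment_shape_version: str
    deployment_image_tag: str
    accelerator_type: str
    accelerator_count: int
    base_model_weight_precision: str
    pipeline_parallelism: int

# A resolved profile can then seed the job config instead of setting
# region, accelerator, and image tag by hand (all values hypothetical):
profile = TrainingShapeProfile(
    training_shape_version="v1",
    trainer_image_tag="trainer:abc123",
    max_supported_context_length=32768,
    node_count=2,
    deployment_shape_version="v1",
    deployment_image_tag="deploy:abc123",
    accelerator_type="H100",
    accelerator_count=8,
    base_model_weight_precision="bf16",
    pipeline_parallelism=1,
)
```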
Operational guidance
- Service mode supports both full-parameter and LoRA tuning. Set `lora_rank=0` for full-parameter training or a positive integer for LoRA.
- Set `hot_load_deployment_id` when you plan to hotload checkpoints onto a deployment. This configures the checkpoint upload path.
- Clean up jobs when your experiment is done; trainer jobs hold GPU resources.
- Use `display_name` to identify jobs in logs and in the Fireworks console.
- Use `reconnect_and_wait` for long-running experiments where pod preemption is possible. It handles transitional states and auto-resumes.
- Use training shapes (`resolve_training_profile`) to auto-populate infra config instead of manually setting region, accelerator, and image tag.
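The cleanup guidance can be captured in a try/finally pattern so GPU resources are released even when the training loop raises. Everything below is hypothetical scaffolding (including the fake manager) to show the shape; substitute the real `TrainerJobManager`:

```python
# Hypothetical lifecycle sketch: always delete trainer jobs, even on
# failure.  `manager` stands in for the real job manager; only
# create_and_wait/delete are assumed here.
def run_experiment(manager, config, train_fn):
    job = manager.create_and_wait(config)
    try:
        return train_fn(job)        # your custom Tinker training loop
    finally:
        manager.delete(job.job_id)  # always release GPU resources

# Minimal fake manager to demonstrate the pattern:
class FakeJob:
    job_id = "job-123"

class FakeManager:
    def __init__(self):
        self.deleted = []
    def create_and_wait(self, config):
        return FakeJob()
    def delete(self, job_id):
        self.deleted.append(job_id)

mgr = FakeManager()
outcome = run_experiment(mgr, config={}, train_fn=lambda job: "done")
```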