What this is

Once a service-mode RLOR trainer job reaches RUNNING state, connect a FiretitanServiceClient to the trainer endpoint and create a FiretitanTrainingClient for train steps, checkpointing, and state management.

Setup

pip install --pre "fireworks-ai[training]"
Use FiretitanServiceClient from the Training SDK instead of tinker.ServiceClient. Its create_training_client() returns a FiretitanTrainingClient, which adds save_weights_for_sampler_ext() (with checkpoint_type support) and session-scoped snapshot naming:
import tinker
from fireworks.training.sdk import FiretitanServiceClient

Creating the training client

# endpoint.base_url comes from TrainerJobManager.create_and_wait(...)
service = FiretitanServiceClient(
    base_url=endpoint.base_url,
    api_key="<FIREWORKS_API_KEY>",
)
training_client = service.create_training_client(
    base_model="accounts/fireworks/models/qwen3-8b",
    lora_rank=0,  # Must match lora_rank from job creation
)

Parameters

Parameter      Description
base_url       Trainer endpoint URL (endpoint.base_url or TrainerJobManager.get(job_id)["directRouteHandle"])
api_key        Your Fireworks API key
base_model     Must match the trainer job’s base_model (from TrainerJobConfig)
lora_rank      Must match trainer creation config (0 for full-parameter tuning)
user_metadata  Optional dict[str, str] of run metadata
A ValueError is raised if you attempt to create a second training client with the same (base_model, lora_rank) on the same FiretitanServiceClient instance. Create a new FiretitanServiceClient for a separate trainer.

What you can do with the client

Forward pass (get logprobs without training)

result = training_client.forward(datums, "cross_entropy").result()
logprobs = result.loss_fn_outputs[0]["logprobs"].data
This is useful for computing reference logprobs (frozen model) in GRPO/DPO.
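For GRPO/DPO you typically want one reference score per sequence rather than per-token values. Assuming the logprobs come back as one list of per-token floats per datum (verify the exact layout against your SDK version; this helper is purely illustrative):

```python
def sum_reference_logprobs(per_datum_logprobs):
    """Collapse per-token logprobs into one reference total per sequence.

    per_datum_logprobs is assumed to be a list (one entry per datum)
    of per-token logprob floats, as extracted from
    result.loss_fn_outputs[...]["logprobs"].data.
    """
    return [sum(token_lps) for token_lps in per_datum_logprobs]
```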

Custom forward-backward (train step)

def my_loss(data, logprobs_list):
    loss = compute_loss(data, logprobs_list)
    return loss, {"loss": float(loss.item())}

result = training_client.forward_backward_custom(datums, my_loss).result()
print(result.metrics)  # {"loss": 0.42}
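The compute_loss call above is a placeholder. As a concrete sketch, a mean negative log-likelihood works if logprobs_list arrives as plain per-token float sequences (the actual tensor type depends on the SDK; adapt accordingly):

```python
def nll_loss(data, logprobs_list):
    """Hypothetical loss callback: mean negative log-likelihood over
    all tokens, assuming logprobs_list is a list of per-token float
    sequences. Returns (loss, metrics) in the shape the docs show."""
    total, count = 0.0, 0
    for token_logprobs in logprobs_list:
        for lp in token_logprobs:
            total += -lp
            count += 1
    loss = total / max(count, 1)
    return loss, {"loss": loss}
```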

Optimizer step

training_client.optim_step(
    tinker.AdamParams(
        learning_rate=1e-5,
        beta1=0.9,
        beta2=0.999,
        eps=1e-8,
        weight_decay=0.01,
    )
).result()
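The call above uses a fixed learning rate. If you want warmup and decay, one pattern is to recompute the rate each step and construct fresh AdamParams; the schedule below is an illustration, not part of the SDK:

```python
def linear_warmup_lr(step, base_lr=1e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup to base_lr over warmup_steps, then linear decay
    to zero by total_steps. Illustrative schedule only."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)

# Per step:
# training_client.optim_step(
#     tinker.AdamParams(learning_rate=linear_warmup_lr(step))
# ).result()
```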

Save checkpoint for serving

# "base" = full checkpoint, "delta" = incremental (smaller, faster)
# save_weights_for_sampler_ext adds checkpoint_type support and session-scoped naming
result = training_client.save_weights_for_sampler_ext(
    "step-100",
    checkpoint_type="base",
)
print(result.snapshot_name)  # Session-qualified name for hotloading

List available checkpoints

checkpoint_names, _ = training_client.list_checkpoints()
print(checkpoint_names)  # e.g. ["step-2", "step-4"]

Save and restore train state

training_client.save_state("train_state_step_100")
training_client.load_state_with_optimizer("train_state_step_100")
save_state also accepts an optional ttl_seconds parameter for auto-expiring checkpoints.
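When checkpointing on a step cadence, resuming usually means loading the highest-numbered snapshot. A small helper, assuming the step-numbered naming used in these examples (the parsing is an illustration, not an SDK guarantee):

```python
def latest_step_checkpoint(names, prefix="step-"):
    """Return the name with the highest step number, e.g.
    ["step-2", "step-4"] -> "step-4". Returns None when no name
    matches the assumed <prefix><N> pattern."""
    steps = []
    for name in names:
        suffix = name[len(prefix):]
        if name.startswith(prefix) and suffix.isdigit():
            steps.append((int(suffix), name))
    return max(steps)[1] if steps else None
```

For example, after checkpoint_names, _ = training_client.list_checkpoints(), you could pass latest_step_checkpoint(checkpoint_names, prefix="train_state_step_") to load_state_with_optimizer.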

Resolve cross-job checkpoint path

checkpoint_ref = training_client.resolve_checkpoint_path(
    "step-4",
    source_job_id="previous-job-id",
)
training_client.load_state_with_optimizer(checkpoint_ref)
Cookbook users: If you are using cookbook recipes, prefer checkpoint_utils.save_checkpoint and checkpoint_utils.resolve_resume which wrap these methods with structured persistence. See Checkpointing and Hotload.

Connecting to an existing trainer

If you already have a running trainer (e.g. from a previous session), connect directly by URL:
service = FiretitanServiceClient(
    base_url="https://<existing-trainer-url>",
    api_key="<FIREWORKS_API_KEY>",
)
training_client = service.create_training_client(
    base_model="accounts/fireworks/models/qwen3-8b",
    lora_rank=0,
)
You can fetch the trainer URL with TrainerJobManager.get(job_id)["directRouteHandle"].
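A freshly created trainer may not be routable yet, so reconnects benefit from a retry wrapper. A minimal sketch, where get_url is any callable that returns the endpoint URL or raises while the trainer is warming up (e.g. lambda: TrainerJobManager.get(job_id)["directRouteHandle"]):

```python
import time

def wait_for_url(get_url, attempts=10, delay=5.0):
    """Poll get_url until it returns a non-empty URL, sleeping
    delay seconds between attempts. Illustrative retry logic."""
    for i in range(attempts):
        try:
            url = get_url()
            if url:
                return url
        except Exception:
            pass
        if i < attempts - 1:
            time.sleep(delay)
    raise TimeoutError("trainer endpoint not ready")
```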

Operational guidance

  • Service mode supports both full-parameter and LoRA tuning. Set lora_rank=0 for full-parameter or a positive integer for LoRA.
  • Use FiretitanServiceClient instead of tinker.ServiceClient to get FiretitanTrainingClient with save_weights_for_sampler_ext().
  • Retry client creation if the trainer is still warming up — poll the job state first.
  • All Tinker API calls return futures. Call .result() to wait for completion.
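Putting the pieces together, a minimal sketch of a service-mode training loop. The method names are those documented above; batches, loss_fn, adam_params, and the checkpoint cadence are caller-supplied and illustrative:

```python
def train_n_steps(training_client, batches, loss_fn, adam_params,
                  save_every=100, checkpoint_type="delta"):
    """Run forward_backward_custom then optim_step per batch,
    saving a sampler checkpoint every save_every steps. Returns the
    session-qualified snapshot names produced along the way."""
    snapshot_names = []
    for step, datums in enumerate(batches, start=1):
        # Both calls return futures; .result() blocks until done.
        training_client.forward_backward_custom(datums, loss_fn).result()
        training_client.optim_step(adam_params).result()
        if step % save_every == 0:
            ckpt = training_client.save_weights_for_sampler_ext(
                f"step-{step}", checkpoint_type=checkpoint_type)
            snapshot_names.append(ckpt.snapshot_name)
    return snapshot_names
```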