> ## Documentation Index
> Fetch the complete documentation index at: https://docs.fireworks.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Training and Sampling

> End-to-end API walkthrough: bootstrap resources, train, checkpoint, and sample through a serving deployment.

## What this is

This is the default lifecycle for research loops: bootstrap a trainer and deployment, run iterative updates, export checkpoints, sync weights to the deployment, then sample through it for realistic evaluation.

## Workflow

1. **Request resources**: create a service-mode trainer ([`TrainerJobManager`](/fine-tuning/training-api/reference/trainer-job-manager)) first and capture its `job_id`/`job_name`, then create or attach a deployment ([`DeploymentManager`](/fine-tuning/training-api/reference/deployment-manager)) linked to that trainer's weight-sync bucket.
2. **Connect a training client** from your Python loop.
3. **Run train steps**: `forward_backward_custom` + `optim_step` in a loop.
4. **Save checkpoints** at regular intervals using base/delta pattern.
5. **Weight-sync** the checkpoint to your serving deployment.
6. **Sample and evaluate** through the deployment endpoint.
7. **Record metrics** and decide whether to continue or branch experiments.

## End-to-end example

The only training-shape input you choose below is the shape ID. The API resolves the versioned reference for you before launch.

### 1. Bootstrap

```python theme={null}
import os
import tinker
from concurrent.futures import ThreadPoolExecutor
from fireworks.training.sdk import (
    FiretitanServiceClient,
    TrainerJobManager,
    TrainerJobConfig,
    DeploymentManager,
    DeploymentConfig,
    WeightSyncer,
)

api_key = os.environ["FIREWORKS_API_KEY"]
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")
shape_id = "accounts/fireworks/trainingShapes/qwen3-8b-128k-h200"

rlor_mgr = TrainerJobManager(api_key=api_key, base_url=base_url)
deploy_mgr = DeploymentManager(api_key=api_key, base_url=base_url)

# This is the only shape-specific value you choose
profile = rlor_mgr.resolve_training_profile(shape_id)

# Request the trainer first, then wait separately.
created = rlor_mgr.create(TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    training_shape_ref=profile.training_shape_version,
    lora_rank=0,
    learning_rate=1e-5,
    gradient_accumulation_steps=4,
))
print(f"Trainer requested: {created.job_id}")

# Create deployment linked to the trainer.
deploy_info = deploy_mgr.create_or_get(DeploymentConfig(
    deployment_id="research-serving",
    base_model="accounts/fireworks/models/qwen3-8b",
    hot_load_trainer_job=created.job_name,
    min_replica_count=0,
    max_replica_count=1,
))

# Wait for trainer and deployment readiness in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    trainer_future = pool.submit(rlor_mgr.wait_for_ready, created.job_id)
    deploy_future = pool.submit(deploy_mgr.wait_for_ready, deploy_info.deployment_id)
    endpoint = trainer_future.result()
    deploy_info = deploy_future.result()

# Connect a training client to the live trainer endpoint.
service = FiretitanServiceClient(base_url=endpoint.base_url, api_key=api_key)
training_client = service.create_training_client(
    base_model="accounts/fireworks/models/qwen3-8b", lora_rank=0,
)
```

### 2. Train step with custom objective

```python theme={null}
def objective(data, logprobs_list):
    loss = compute_objective(data=data, logprobs_list=logprobs_list)
    return loss, {"loss": float(loss.item())}

for step in range(total_steps):
    batch = build_batch(step)
    training_client.forward_backward_custom(batch, objective).result()
    training_client.optim_step(
        tinker.AdamParams(learning_rate=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01)
    ).result()
```

### 3. Checkpoint, weight sync, and evaluate

```python theme={null}
import asyncio

from transformers import AutoTokenizer
from fireworks.training.sdk import DeploymentSampler, AdaptiveConcurrencyController

# Set up WeightSyncer for automatic delta-chain management
tracker = WeightSyncer(
    policy_client=training_client,
    deploy_mgr=deploy_mgr,
    deployment_id="research-serving",
    base_model="accounts/fireworks/models/qwen3-8b",
    hotload_timeout=600,
    first_checkpoint_type="base",
)

if step % eval_interval == 0:
    # WeightSyncer auto-selects base (first) or delta (subsequent)
    tracker.save_and_hotload(f"step_{step:05d}")

    # Sample via deployment for evaluation
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)
    sampler = DeploymentSampler(
        inference_url=deploy_mgr.inference_url,
        model=f"accounts/{deploy_mgr.account_id}/deployments/research-serving",
        api_key=api_key,
        tokenizer=tokenizer,
        concurrency_controller=AdaptiveConcurrencyController(initial_window=16),
    )
    completions = asyncio.run(
        sampler.sample_with_tokens(messages=eval_prompts, n=1)
    )
    score = evaluate_responses(completions)
    print({"step": step, "eval_score": score})
```

## Concurrency control

`sample_with_tokens(n=K)` fans out K concurrent requests. A concurrency controller prevents overloading the deployment:

* **`AdaptiveConcurrencyController`** (recommended) — automatically adjusts the concurrency window based on the server's prefill queue latency. Starts at `initial_window` and grows or shrinks between steps using AIMD.
* **`FixedConcurrencyController`** — a static semaphore with a fixed maximum. Use when you already know the right concurrency for your deployment.

See [DeploymentSampler — Concurrency Control](/fine-tuning/training-api/reference/deployment-sampler#concurrency-control) for full details and configuration options.

## Reconnecting to a running trainer

If your client disconnects (script crash, notebook restart, network interruption), the trainer job keeps running on the server. Reconnect without restarting:

```python theme={null}
# Reconnect to existing job (handles preemption, transitional states)
endpoint = rlor_mgr.reconnect_and_wait(job_id, timeout_s=300)

# Create a new client on the same trainer
service = FiretitanServiceClient(base_url=endpoint.base_url, api_key=api_key)
training_client = service.create_training_client(
    base_model="accounts/fireworks/models/qwen3-8b", lora_rank=0,
)

# Continue training — step_id and checkpoints are preserved
training_client.forward_backward_custom(batch, objective).result()
training_client.optim_step(adam_params).result()
```

## Operational guidance

* **Service mode supports both full-parameter and LoRA tuning.** Set `lora_rank=0` for full-parameter or a positive integer (e.g. `16`, `64`) for LoRA, and match `create_training_client(lora_rank=...)` accordingly.
* **Use `checkpoint_type="base"` for the first checkpoint**, then `"delta"` for subsequent ones to reduce save/transfer time. Note: on full-parameter training, only `base` checkpoints are promotable — see [Checkpoint kinds](/fine-tuning/training-api/cookbook/checkpoints#checkpoint-kinds).
* **`DeploymentSampler.sample_with_tokens()` is async** — use `await` in async code or `asyncio.run(...)` from synchronous scripts.
* **Keep checkpoint intervals predictable** so evaluation comparisons are stable.
* **Store the exact prompt set** used for each evaluation sweep for reproducibility.

## Common pitfalls

* **Sampling from trainer internals** instead of deployment endpoints can skew results — always evaluate through the serving path.
* **Missing checkpoint-to-deployment traceability** makes rollback risky — log checkpoint names alongside metrics.
* **Stale deployments**: Always verify the weight-synced checkpoint identity matches what you expect before sampling.

## Related guides

* [Loss Functions](/fine-tuning/training-api/loss-functions) — built-in and custom loss function patterns
* [Vision Inputs](/fine-tuning/training-api/vision-inputs) — fine-tune VLMs with image and text data
* [Saving and Loading](/fine-tuning/training-api/saving-and-loading) — checkpoint types and weight sync details
* [DeploymentSampler reference](/fine-tuning/training-api/reference/deployment-sampler) — sampling API details
* [WeightSyncer reference](/fine-tuning/training-api/reference/weight-syncer) — weight sync lifecycle
