What this is

The Training SDK supports two ways to compute loss:
  1. Built-in losses via forward_backward with a string identifier (e.g. "cross_entropy") — fastest, no extra forward pass needed.
  2. Custom losses via forward_backward_custom with an arbitrary Python function — flexible, supports any differentiable objective at the cost of an additional forward pass.

Built-in loss: cross_entropy

For supervised fine-tuning, use the built-in cross_entropy loss via forward_backward:
result = training_client.forward_backward(datums, "cross_entropy").result()
This computes standard next-token prediction loss on the server side — no extra forward pass or local loss computation needed.
Built-in cross_entropy requires datums with target_tokens in loss_fn_inputs. Datums built with datum_from_tokens_weights (weight-based) will fail with "missing required field 'target_tokens'". Use the target-token datum format shown in Building datums, or use forward_backward_custom with weight-based datums instead.
For a forward-only pass (e.g. to compute reference logprobs without updating weights):
result = training_client.forward(datums, "cross_entropy").result()
ref_logprobs = [result.loss_fn_outputs[i]["logprobs"].data for i in range(len(datums))]

Custom losses: forward_backward_custom

forward_backward_custom lets you implement any objective function in Python. You provide the loss computation; the SDK handles the forward pass on remote GPUs, passes logprobs back to your function, then sends the computed gradients back for the backward pass.

How it works

  1. You call training_client.forward_backward_custom(datums, loss_fn).
  2. The trainer runs a forward pass on the GPU and returns per-token logprobs.
  3. The logprobs are converted to PyTorch tensors with requires_grad=True.
  4. Your loss_fn is called with the datums and logprobs.
  5. The SDK calls loss.backward() to compute d_loss/d_logprob gradients.
  6. Gradients are sent back to the trainer GPU for the model backward pass.
Your loss function runs locally (on your machine), while the forward and backward passes run on remote GPUs.
forward_backward_custom does an extra forward pass compared to forward_backward, requiring ~1.5x FLOPs and up to ~3x wall time per step.
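Steps 3-5 of this loop can be sketched with plain PyTorch. The remote forward pass is replaced here by a hypothetical logprob tensor; the gradient that would be shipped back to the trainer is just logprobs.grad:

```python
import torch

# Step 3 (sketch): logprobs from the remote forward pass, made differentiable
logprobs = torch.tensor([-0.5, -1.2, -2.0], requires_grad=True)
weights = torch.tensor([0.0, 1.0, 1.0])  # hypothetical prompt/response mask

# Step 4 (sketch): a user loss_fn, here a weighted NLL over response tokens
loss = -torch.dot(logprobs, weights)

# Step 5: d_loss/d_logprob gradients, computed locally
loss.backward()
grad = logprobs.grad  # equals -weights; this is what gets sent back to the GPU
```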

Loss function signature

def loss_fn(
    data: list[tinker.Datum],
    logprobs_list: list[torch.Tensor],
) -> tuple[torch.Tensor, dict[str, float]]:
    """
    Args:
        data: The same datums you passed to forward_backward_custom.
              Access token weights via data[i].loss_fn_inputs["weights"].data
        logprobs_list: Per-token log-probabilities from the forward pass.
              Each tensor has requires_grad=True. Shape: (seq_len,) per sequence.

    Returns:
        loss: A scalar tensor. Must be differentiable w.r.t. logprobs_list entries.
        metrics: A dict of float values for logging (not used for training).
    """

Key rules

  • logprobs_list[i] has requires_grad=True — your loss must be differentiable through it.
  • Build the loss from differentiable ops (e.g. torch.dot() for weighted sums) so gradients propagate through the logprobs.
  • Return a scalar tensor as the loss, and a dict[str, float] as metrics.
  • Access token weights via data[i].loss_fn_inputs["weights"].data — these are 0 for prompt tokens and 1 for response tokens.

Building datums

Using tinker_cookbook (weight-based)

datum_from_tokens_weights constructs datums with explicit token weights:
import torch
from tinker_cookbook.supervised.common import datum_from_tokens_weights

tokens = torch.tensor([101, 2054, 2003, ...], dtype=torch.long)
weights = torch.zeros(len(tokens), dtype=torch.float32)
weights[prompt_len:] = 1.0  # Only train on response tokens

datum = datum_from_tokens_weights(tokens, weights, max_length=8192)
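As a concrete sketch (plain Python, with hypothetical token IDs), the weight mask zeroes prompt positions so only response tokens contribute to the loss:

```python
prompt_tokens = [101, 2054, 2003]     # hypothetical prompt token IDs
response_tokens = [2023, 1037, 3231]  # hypothetical response token IDs

tokens = prompt_tokens + response_tokens
weights = [0.0] * len(prompt_tokens) + [1.0] * len(response_tokens)

# Only positions with weight 1.0 contribute to the training loss
contributing = [t for t, w in zip(tokens, weights) if w > 0]
```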

Using tinker.Datum directly (target-token-based)

For RL-style objectives where you need per-completion control (e.g. routing matrices, custom loss_fn_inputs), construct datums directly:
import tinker

model_input_len = len(tokens) - 1
datum = tinker.Datum(
    model_input=tinker.ModelInput.from_ints(tokens[:-1]),
    loss_fn_inputs={
        "target_tokens": tinker.TensorData(
            data=tokens[1:], dtype="int64", shape=[model_input_len],
        ),
    },
)
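The [:-1] / [1:] slicing implements the next-token shift: the model reads position t and is scored on the token at position t+1, so inputs and targets are each one shorter than the raw sequence. A plain-Python sketch with hypothetical IDs:

```python
tokens = [101, 2054, 2003, 2023, 102]  # hypothetical token IDs

model_input = tokens[:-1]    # what the model sees: all but the last token
target_tokens = tokens[1:]   # what it must predict: all but the first token

# Both are len(tokens) - 1, matching model_input_len above
assert len(model_input) == len(target_tokens) == len(tokens) - 1
```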

Example: simple cross-entropy

def cross_entropy_loss(data, logprobs_list):
    total_loss = torch.tensor(0.0)
    for i, logprobs in enumerate(logprobs_list):
        weights = torch.tensor(data[i].loss_fn_inputs["weights"].data, dtype=torch.float32)
        min_len = min(len(logprobs), len(weights))
        weighted_sum = torch.dot(logprobs[:min_len].float(), weights[:min_len])
        total_loss = total_loss - weighted_sum  # Negative log-likelihood
    loss = total_loss / len(logprobs_list)
    return loss, {"cross_entropy": loss.item()}

result = training_client.forward_backward_custom(datums, cross_entropy_loss).result()
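This loss can be sanity-checked locally without the SDK. The sketch below mocks the datum structure with SimpleNamespace (an assumption for illustration; real code passes tinker.Datum objects) and verifies both the loss value and the d_loss/d_logprob gradients:

```python
import torch
from types import SimpleNamespace

def cross_entropy_loss(data, logprobs_list):
    # Same loss as above: negative weighted sum of logprobs, averaged over sequences
    total_loss = torch.tensor(0.0)
    for i, logprobs in enumerate(logprobs_list):
        weights = torch.tensor(data[i].loss_fn_inputs["weights"].data, dtype=torch.float32)
        min_len = min(len(logprobs), len(weights))
        weighted_sum = torch.dot(logprobs[:min_len].float(), weights[:min_len])
        total_loss = total_loss - weighted_sum
    loss = total_loss / len(logprobs_list)
    return loss, {"cross_entropy": loss.item()}

# Stand-in for tinker.Datum / tinker.TensorData (attribute access only)
datum = SimpleNamespace(loss_fn_inputs={"weights": SimpleNamespace(data=[0.0, 1.0, 1.0])})
logprobs = torch.tensor([-1.0, -2.0, -0.5], requires_grad=True)

loss, metrics = cross_entropy_loss([datum], [logprobs])
loss.backward()
# loss = -(-2.0 + -0.5) = 2.5; d_loss/d_logprob = -weights
```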

Example: GRPO with KL penalty

def make_grpo_loss(rewards, ref_logprobs, kl_beta=0.001):
    advantages = compute_advantages(rewards)
    ref_tensors = [torch.tensor(lp, dtype=torch.float32) for lp in ref_logprobs]

    def loss_fn(data, logprobs_list):
        total_loss = torch.tensor(0.0)
        for i in range(len(logprobs_list)):
            weights = torch.tensor(data[i].loss_fn_inputs["weights"].data, dtype=torch.float32)
            pi = logprobs_list[i][:len(weights)]
            ref = ref_tensors[i][:len(weights)]

            pg_loss = -advantages[i] * torch.dot(pi.float(), weights)
            kl_term = torch.dot((pi - ref).float(), weights)
            total_loss = total_loss + pg_loss + kl_beta * kl_term

        mean_loss = total_loss / len(logprobs_list)
        return mean_loss, {"loss": mean_loss.item()}

    return loss_fn
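compute_advantages is left abstract above. One common group-relative choice (an assumption here, not an SDK-provided function) centers and scales rewards within the sampled group:

```python
import torch

def compute_advantages(rewards, eps=1e-6):
    # Group-relative advantages: subtract the group mean and divide by the
    # group standard deviation, so advantages sum to (approximately) zero.
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

adv = compute_advantages([1.0, 0.0, 1.0, 0.0])
```

Whether to divide by the standard deviation (versus only centering) varies across GRPO implementations; pick one and keep it consistent across steps.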

Example: DPO margin loss

import torch.nn.functional as F

def make_dpo_loss(ref_chosen, ref_rejected, beta=0.1):
    ref_c = torch.tensor(ref_chosen, dtype=torch.float32)
    ref_r = torch.tensor(ref_rejected, dtype=torch.float32)

    def loss_fn(data, logprobs_list):
        pi_c, pi_r = logprobs_list[0], logprobs_list[1]
        w_c = torch.tensor(data[0].loss_fn_inputs["weights"].data, dtype=torch.float32)
        w_r = torch.tensor(data[1].loss_fn_inputs["weights"].data, dtype=torch.float32)

        margin = (torch.dot(pi_c.float(), w_c) - torch.dot(ref_c, w_c)) - \
                 (torch.dot(pi_r.float(), w_r) - torch.dot(ref_r, w_r))

        return -F.logsigmoid(beta * margin), {"margin": margin.item()}

    return loss_fn
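A quick numeric check of the loss shape (plain PyTorch, hypothetical margins): at zero margin the loss is log 2, and a positive margin, meaning the policy prefers the chosen completion, drives the loss lower:

```python
import torch
import torch.nn.functional as F

beta = 0.1

# Zero margin: chosen and rejected equally likely, loss = log 2
loss_zero = -F.logsigmoid(beta * torch.tensor(0.0))

# Positive margin: chosen preferred, loss drops below log 2
loss_pos = -F.logsigmoid(beta * torch.tensor(10.0))
```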

Applying the optimizer step

After forward_backward_custom, call optim_step to update weights:
training_client.forward_backward_custom(datums, loss_fn).result()
training_client.optim_step(
    tinker.AdamParams(
        learning_rate=1e-5,
        beta1=0.9,
        beta2=0.999,
        eps=1e-8,
        weight_decay=0.01,
    )
).result()
For gradient accumulation, call forward_backward_custom multiple times before calling optim_step:
for micro_batch in micro_batches:
    training_client.forward_backward_custom(micro_batch, loss_fn).result()

# One optimizer step after accumulating gradients
training_client.optim_step(tinker.AdamParams(learning_rate=1e-5, ...)).result()

Gradient accumulation normalization

When you accumulate multiple micro-batches before optim_step, you have two places where normalization can happen:
  1. Inside your loss function
  2. Server-side inside optim_step(..., grad_accumulation_normalization=...)
Use only one normalization path. If your loss already returns a mean, leave server-side normalization unset. If your loss returns a raw sum, choose the matching server-side normalization mode:
training_client.forward_backward_custom(datums, loss_fn).result()
training_client.optim_step(
    tinker.AdamParams(learning_rate=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01),
    grad_accumulation_normalization="num_loss_tokens",
).result()
  • "num_loss_tokens": divides by the total number of non-zero-grad tokens across the accumulated micro-batches. Best for raw-sum token-level losses, such as RL / GRPO-style objectives.
  • "num_sequences": divides by the total number of sequences with at least one non-zero-grad token. Best for raw-sum sequence-level objectives.
  • None: no server-side division. Best for losses that already return per-token or per-sequence means, such as SFT, DPO, and ORPO.

Choosing the right mode

  • If your loss function returns a raw sum over tokens, use "num_loss_tokens".
  • If your loss function returns a raw sum over sequences, use "num_sequences".
  • If your loss function already returns a mean, leave grad_accumulation_normalization unset.
Do not normalize in both places. If your loss function already divides by tokens or sequences, adding server-side normalization will double-normalize the gradients.
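The effect of double normalization can be seen with plain arithmetic (hypothetical numbers):

```python
# Two micro-batches with raw-sum losses over different token counts
sums = [12.0, 8.0]   # raw per-micro-batch loss sums (hypothetical)
tokens = [6, 4]      # non-zero-weight tokens per micro-batch

# Correct: normalize once, by total tokens (what "num_loss_tokens" does)
once = sum(sums) / sum(tokens)  # 20 / 10 = 2.0

# Wrong: mean inside the loss AND server-side normalization
client_means = [s / t for s, t in zip(sums, tokens)]  # [2.0, 2.0]
twice = sum(client_means) / sum(tokens)  # 4 / 10 = 0.4, gradients shrunk 5x
```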

Recipe defaults

  • SFT: None. The SFT loss is already normalized client-side.
  • GRPO / RL: "num_loss_tokens". RL losses use server-side per-token normalization by default.
  • DPO: None. The DPO loss is already normalized client-side.
  • ORPO: None. The ORPO loss is already normalized client-side.

Common pitfalls

  • Token-weight misalignment can silently break objective semantics — always verify that len(logprobs) and len(weights) are compatible (truncate to min_len).
  • Ignoring per-step diagnostics makes instability hard to attribute — log metrics from every train step.
  • Forgetting .result() — all Tinker API calls return futures. Without .result(), errors are silently swallowed.
  • Non-differentiable loss: If your loss doesn’t depend on logprobs_list entries through differentiable ops, gradients will be zero.
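The last pitfall is easy to reproduce: any op that detaches from the autograd graph (.detach(), .item(), converting to a Python list) silently yields a loss with no gradient path. A minimal check:

```python
import torch

logprobs = torch.tensor([-1.0, -2.0], requires_grad=True)
weights = torch.tensor([1.0, 1.0])

# Correct: the loss stays connected to logprobs through differentiable ops
good = -torch.dot(logprobs, weights)
assert good.requires_grad  # gradients will flow

# Broken: .detach() cuts the graph, so no gradients reach the model
bad = -torch.dot(logprobs.detach(), weights)
assert not bad.requires_grad  # backward() through this yields no gradients
```

Checking loss.requires_grad before returning from your loss function is a cheap guard against this class of bug.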