Batch Inference lets you perform Chat Completions in bulk and asynchronously on our 1000+ models (see the Model Library) or your own fine-tuned models, reducing costs by up to 50%.

Overview

The Chat Completions API allows synchronous inference for a single request. However, if you need to process a large number of requests, our Batch Inference API is a more efficient alternative.

Use Cases

  • ETL Pipelines – Construct production pipelines around large-scale inference workloads
  • Evaluations – Automate large-scale testing and benchmarking
  • Distillation – Teach a smaller model using a larger model

Cost Optimization

Batch API Advantages

  • 💸 Volume Discounts – Batch Inference is priced at 50% of our serverless rates (explore pricing details)
  • Higher throughput – Process more data in less time
  • 🔄 Automatic Prompt Caching – Additional stacked 50% discount on cached prompt tokens

Prompt Caching Discount

Batch Inference automatically benefits from prompt caching. When tokens hit the cache (we try to cache as much as possible), an additional 50% discount is applied. This stacks with the base 50% batch discount. For best results, structure your prompts with static content first and variable content last.
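For example, when many requests share the same instructions or few-shot examples, putting that shared prefix at the start of every request maximizes cache hits. A minimal sketch of this ordering in Python (the prompt text and helper are illustrative, not part of any Fireworks SDK):

# Static, shared prefix: identical across requests, so it can be cached.
STATIC_SYSTEM_PROMPT = "You are a helpful assistant. Answer concisely."

def build_messages(user_question: str) -> list[dict]:
    """Static content first, per-request (variable) content last."""
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # cacheable
        {"role": "user", "content": user_question},           # varies per request
    ]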

Step-by-Step Guide to Batch Inference with Fireworks AI

1. Preparing the Dataset

Datasets must adhere strictly to the JSONL format, where each line represents a complete JSON-formatted inference request. Requirements:
  • File format: JSONL (each line is a valid JSON object)
  • Total size limit: Under 500MB
  • Format: OpenAI Batch API compatible format with custom_id (unique id) and body fields
Here’s an example input dataset:
{"custom_id": "request-1", "body": {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}], "max_tokens": 100}}
{"custom_id": "request-2", "body": {"messages": [{"role": "user", "content": "Explain quantum computing"}], "temperature": 0.7}}
{"custom_id": "request-3", "body": {"messages": [{"role": "user", "content": "Tell me a joke"}]}}
Save this dataset locally as a JSONL file, for example batch_input_data.jsonl, making sure each custom_id is unique across rows.
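If you generate the input file programmatically, a short script can write the JSONL and catch duplicate custom_id values before upload. A minimal sketch using only the Python standard library (the file name and requests are illustrative):

import json

requests = [
    {"custom_id": "request-1",
     "body": {"messages": [{"role": "system", "content": "You are a helpful assistant."},
                           {"role": "user", "content": "What is the capital of France?"}],
              "max_tokens": 100}},
    {"custom_id": "request-2",
     "body": {"messages": [{"role": "user", "content": "Explain quantum computing"}],
              "temperature": 0.7}},
]

# Write one JSON object per line (JSONL) and enforce custom_id uniqueness.
seen_ids = set()
with open("batch_input_data.jsonl", "w") as f:
    for req in requests:
        if req["custom_id"] in seen_ids:
            raise ValueError(f"Duplicate custom_id: {req['custom_id']}")
        seen_ids.add(req["custom_id"])
        f.write(json.dumps(req) + "\n")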

2. Uploading the Dataset to Fireworks AI

There are a few ways to upload the dataset to the Fireworks platform for batch inference: the UI, firectl, or the HTTP API. Using the UI, simply navigate to the dataset tab, click Create Dataset, and follow the wizard.
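Whichever upload method you use, it can save a round trip to confirm the file is under the 500MB dataset size limit before uploading. A quick local check (file name illustrative):

import os

MAX_BYTES = 500 * 1024 * 1024  # 500MB dataset size limit
size = os.path.getsize("batch_input_data.jsonl")
assert size < MAX_BYTES, f"Dataset is {size / 1e6:.1f} MB; split it into smaller batches"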

3. Creating a Batch Inference Job

You can create the job via the UI, firectl, or the HTTP API. Using the UI, navigate to the Batch Inference tab and click “Create Batch Inference Job”. Select your input dataset, choose your model, and configure any optional settings.

4. Monitoring and Managing Batch Inference Jobs

Batch Job States

Batch Inference Jobs progress through several states during their lifecycle:
  • VALIDATING – The input dataset is being validated to ensure it meets format requirements and constraints
  • PENDING – The job is queued and waiting for available resources to begin processing
  • RUNNING – The batch job is actively processing requests from the input dataset
  • COMPLETED – All requests have been successfully processed and results are available in the output dataset
  • FAILED – The job encountered an unrecoverable error; check the job status message for details
  • EXPIRED – The job exceeded the 24-hour time limit; any completed requests up to that point are saved to the output dataset
You can monitor jobs via the UI, firectl, or the HTTP API. Using the UI, view all your batch inference jobs in the dashboard.
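If you are scripting around long-running jobs, a simple polling loop can wait for one of the terminal states listed above. In the sketch below, fetch_job_state is a placeholder you supply yourself (for example, by wrapping firectl or the HTTP API); it is not a Fireworks-provided function:

import time

TERMINAL_STATES = {"COMPLETED", "FAILED", "EXPIRED"}

def wait_for_job(job_id: str, fetch_job_state, poll_seconds: int = 60) -> str:
    """Poll until the batch job reaches a terminal state.

    fetch_job_state(job_id) -> str is a caller-supplied placeholder returning one of:
    VALIDATING, PENDING, RUNNING, COMPLETED, FAILED, EXPIRED.
    """
    while True:
        state = fetch_job_state(job_id)
        print(f"Job {job_id} is {state}")
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)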

5. Downloading the Results

After the batch inference job is complete, download the output dataset containing the results.
You can download the results via the UI, firectl, or the HTTP API. Using the UI, navigate to the output dataset and download the results.

Output Files

The output dataset contains two types of files:
  • Results File – Contains successful inference responses in JSONL format, with each line matching the custom_id from your input
  • Error File – Contains error details for requests that failed processing, along with the original custom_id for debugging
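Once downloaded, both files are JSONL keyed by custom_id, so you can join results and errors back to your original requests. A minimal sketch; the file names are illustrative and the exact response field layout should be verified against your own downloaded files:

import json, os

def load_by_custom_id(path: str) -> dict:
    """Load a JSONL file into a dict keyed by custom_id (empty if the file is absent)."""
    rows = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                if line.strip():
                    row = json.loads(line)
                    rows[row["custom_id"]] = row
    return rows

results = load_by_custom_id("results.jsonl")  # illustrative file name
errors = load_by_custom_id("errors.jsonl")    # illustrative file name

print(f"{len(results)} succeeded, {len(errors)} failed")
for custom_id in errors:
    print("failed:", custom_id)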

6. Best Practices and Considerations

  • Validate your dataset thoroughly before uploading.
  • Use appropriate inference parameters for your use case.
  • Monitor job progress for long-running batches.
  • Set reasonable max_tokens limits to optimize processing time.
  • Use descriptive custom_id values for easier result tracking.

Models

  • Base Models – Any Base Model in our Model Library
  • Account Models – Any model you have uploaded/trained, including fine-tuned models
Note: Newly added models may have a delay before being supported. For information about model precisions and how to check them, see Default Precisions.

Limits

  • Each individual request (row in the dataset) follows the same constraints as the Chat Completion Limits.
  • The Input Dataset must adhere to the Dataset Limits and be under 500MB total.
  • The Output Dataset is capped at 8GB; the job may expire early if this limit is reached.

Expired Jobs

A Batch Job will expire if it runs for more than 24 hours. Any rows completed before expiration are billed and written to the output dataset.

Resuming Expired Jobs

You can continue processing from where an expired job left off using the --continue-from flag:
firectl create batch-inference-job \
  --continue-from original-job-id \
  --model accounts/fireworks/models/llama-v3p1-8b-instruct \
  --output-dataset-id new-output-dataset
This will only process the unfinished/failed requests from the original input dataset (you may also use this for COMPLETED jobs where some of the requests unexpectedly failed).
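Before resuming, you can also check locally which rows still need processing by diffing the custom_id values in the original input against those in the downloaded results. A minimal sketch (file names are illustrative):

import json

def custom_ids(path: str) -> set:
    with open(path) as f:
        return {json.loads(line)["custom_id"] for line in f if line.strip()}

# Rows present in the input but missing from the results still need processing.
remaining = custom_ids("batch_input_data.jsonl") - custom_ids("results.jsonl")
print(f"{len(remaining)} requests still need processing")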

Downloading Results with Lineage

To download the complete chain of datasets (including any continuation jobs):
firectl download dataset output-dataset-id --download-lineage
This downloads the target dataset along with all datasets in the chain of continuation jobs (if multiple).
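If you end up with several downloaded datasets from a continuation chain, you can merge their result files into a single view keyed by custom_id. A minimal sketch; the directory layout below is an assumption, so adjust the glob pattern to wherever your downloads land:

import glob, json

merged = {}
# Assumed layout: each downloaded dataset directory contains one or more .jsonl result files.
for path in sorted(glob.glob("downloaded_datasets/**/*.jsonl", recursive=True)):
    with open(path) as f:
        for line in f:
            if line.strip():
                row = json.loads(line)
                merged[row["custom_id"]] = row  # each custom_id should appear in only one file

print(f"{len(merged)} unique results across the lineage chain")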

Appendix

  • Python builder SDK references
  • HTTP API references
  • firectl references