Overview
The Chat Completions API allows synchronous inference for a single request. However, if you need to process a large number of requests, our Batch Inference API is a more efficient alternative.
Use Cases
- ETL Pipelines – Construct production pipelines around large-scale inference workloads
- Evaluations – Automate large-scale testing and benchmarking
- Distillation – Teach a smaller model using a larger model
Cost Optimization
Batch API Advantages
- 💸 Volume Discounts – See the pricing details for batch rates
- ⚡ Higher throughput – Process more data in less time
- 🔄 Automatic Prompt Caching – Additional stacked 50% discount on cached prompt tokens
Prompt Caching Discount
Batch Inference automatically benefits from prompt caching. When tokens hit the cache (we try to cache as much as possible), an additional 50% discount is applied. This stacks with the base 50% batch discount. For best results, structure your prompts with static content first and variable content last.
Step-by-Step Guide to Batch Inference with Fireworks AI
1. Preparing the Dataset
Datasets must adhere strictly to the JSONL format, where each line represents a complete JSON-formatted inference request. Requirements:
- File format: JSONL (each line is a valid JSON object)
- Total size limit: Under 500MB
- Format: OpenAI Batch API compatible format with `custom_id` (unique id) and `body` fields

Create your input file, e.g. `batch_input_data.jsonl`, making sure `custom_id` is unique across rows.
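As a sketch, such a file can be produced with a short script. The model name and prompts below are placeholders, not prescribed values:

```python
import json

# Placeholder prompts and model name -- substitute your own. Each line of the
# JSONL file is one request with a unique custom_id and an OpenAI-style body.
prompts = ["What is JSONL?", "Summarize batch inference in one sentence."]

with open("batch_input_data.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",  # must be unique across rows
            "body": {
                "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 256,
            },
        }
        f.write(json.dumps(request) + "\n")
```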
2. Uploading the Dataset to Fireworks AI
There are a few ways to upload the dataset to the Fireworks platform for batch inference: the UI, `firectl`, or the HTTP API.
In the UI, simply navigate to the dataset tab, click `Create Dataset`, and follow the wizard.
3. Creating a Batch Inference Job
A job can be created via the UI, `firectl`, or the HTTP API. In the UI, navigate to the Batch Inference tab and click "Create Batch Inference Job". Select your input dataset:
Choose your model:
Configure optional settings:
4. Monitoring and Managing Batch Inference Jobs
Batch Job States
Batch Inference Jobs progress through several states during their lifecycle:

| State | Description |
|---|---|
| VALIDATING | The input dataset is being validated to ensure it meets format requirements and constraints |
| PENDING | The job is queued and waiting for available resources to begin processing |
| RUNNING | The batch job is actively processing requests from the input dataset |
| COMPLETED | All requests have been successfully processed and results are available in the output dataset |
| FAILED | The job encountered an unrecoverable error. Check the job status message for details |
| EXPIRED | The job exceeded the 24-hour time limit. Any completed requests up to that point are saved to the output dataset |
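A client can watch these states with a simple polling loop. The helper below is a generic sketch: `get_state` stands in for whatever call you use to fetch the job's current state (e.g. a wrapper around `firectl` or the HTTP API), since this guide doesn't prescribe one.

```python
import time

# Terminal states from the lifecycle table above.
TERMINAL_STATES = {"COMPLETED", "FAILED", "EXPIRED"}

def wait_for_batch_job(get_state, poll_interval_s=30.0, timeout_s=24 * 3600):
    """Poll `get_state()` until the job reaches a terminal state.

    Returns the last observed state; if the client-side timeout is hit
    first, the returned state may still be non-terminal.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            return state  # caller decides how to handle a stuck job
        time.sleep(poll_interval_s)
```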
Jobs can be monitored via the UI, `firectl`, or the HTTP API. In the UI, view all your batch inference jobs in the dashboard:

5. Downloading the Results
After the batch inference job is complete, download the output dataset containing the results via the UI, `firectl`, or the HTTP API. In the UI, navigate to the output dataset and download the results:

Output Files
The output dataset contains two types of files:

| File Type | Description |
|---|---|
| Results File | Contains successful inference responses in JSONL format, with each line matching the custom_id from your input |
| Error File | Contains any error details for requests that failed processing, and the original custom_id for debugging |
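Since both files key their lines on `custom_id`, they can be merged back into a single lookup. A minimal sketch (the file names are illustrative, and the error file may be absent if every request succeeded):

```python
import json

def index_output_files(results_path, errors_path):
    """Map each custom_id to a ("result" | "error", record) pair."""
    outcome = {}
    for path, kind in [(results_path, "result"), (errors_path, "error")]:
        try:
            with open(path) as f:
                for line in f:
                    if line.strip():
                        record = json.loads(line)
                        outcome[record["custom_id"]] = (kind, record)
        except FileNotFoundError:
            pass  # e.g. no error file when every request succeeded
    return outcome
```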
6. Best Practices and Considerations
- Validate your dataset thoroughly before uploading.
- Use appropriate inference parameters for your use case.
- Monitor job progress for long-running batches.
- Set reasonable `max_tokens` limits to optimize processing time.
- Use descriptive `custom_id` values for easier result tracking.
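The first of these practices can be automated. Below is a rough pre-upload check, assuming only the constraints documented on this page (valid JSON per line, a unique `custom_id`, a `body` field, and the 500MB cap):

```python
import json
import os

MAX_BYTES = 500 * 1024 * 1024  # 500MB input dataset limit

def validate_batch_input(path):
    """Return a list of problems found in a JSONL batch input file."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file exceeds 500MB limit")
    seen = set()
    with open(path) as f:
        for n, line in enumerate(f, start=1):
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {n}: not valid JSON")
                continue
            cid = row.get("custom_id")
            if not cid:
                problems.append(f"line {n}: missing custom_id")
            elif cid in seen:
                problems.append(f"line {n}: duplicate custom_id {cid!r}")
            else:
                seen.add(cid)
            if "body" not in row:
                problems.append(f"line {n}: missing body")
    return problems
```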
Models
- Base Models – Any Base Model in our Model Library
- Account Models – Any model you have uploaded/trained, including fine-tuned models
Limits
- Each individual request (row in the dataset) will follow the same constraints as Chat Completion Limits.
- The Input Dataset must adhere to Dataset Limits and be under 500MB total.
- The Output Dataset will be capped at 8GB, and the job may expire early if the limit is reached.
Expired Jobs
A Batch Job will expire if it runs for 24 hours. Any completed rows will be billed for and written to the output dataset.
Resuming Expired Jobs
You can continue processing from where an expired job left off using the `--continue-from` flag (this also works for COMPLETED jobs where some of the requests unexpectedly failed).
Downloading Results with Lineage
To download the complete chain of datasets (including any continuation jobs):
Appendix
- Python builder SDK references
- HTTP API references
- firectl references