Math Evaluation

This guide demonstrates how to evaluate mathematical answers in LLM responses using the math reward functions.

Overview

The math_reward function allows you to:

  1. Extract numerical answers from LLM responses
  2. Compare them with expected answers or reference solutions
  3. Handle various formats including fractions, decimals, and scientific notation
  4. Support LaTeX formatted answers in markdown

Prerequisites

Before using the math evaluation rewards, ensure you have:

  1. Python 3.8+ installed on your system
  2. Reward Kit installed: pip install reward-kit

Basic Usage

Here’s a simple example of how to use the math reward function:

from reward_kit.rewards.math import math_reward

# Example conversation with a math problem
messages = [
    {
        "role": "user", 
        "content": "Calculate 15% of 80."
    },
    {
        "role": "assistant", 
        "content": "To calculate 15% of 80, I'll multiply 80 by 0.15:\n\n80 × 0.15 = 12\n\nTherefore, 15% of 80 is 12."
    }
]

# Expected answer
expected_answer = "12"

# Evaluate the response
result = math_reward(
    messages=messages,
    expected_answer=expected_answer
)

# Print the results
print(f"Score: {result.score}")
print("Metrics:")
for name, metric in result.metrics.items():
    print(f"  {name}: {metric.score}")
    print(f"    {metric.reason}")

How It Works

The math reward function:

  1. Extracts potential answer values from the last assistant message
  2. Extracts the expected answer value from the provided string
  3. Compares them with tolerance for floating-point values
  4. Returns a score of 1.0 for correct answers and 0.0 for incorrect answers
  5. Provides detailed metrics about the extraction and comparison process (illustrated in the sketch after this list)
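
The sketch below illustrates this flow on an intentionally wrong answer. It is a minimal example that relies only on the math_reward signature and the result fields shown in Basic Usage above.

from reward_kit.rewards.math import math_reward

# Sketch: an incorrect final answer should score 0.0, and the metrics
# explain how the answer was extracted and why the comparison failed.
messages = [
    {"role": "user", "content": "What is 12 × 12?"},
    {"role": "assistant", "content": "12 × 12 = 142, so the answer is 142."},
]

result = math_reward(messages=messages, expected_answer="144")

print(result.score)  # expected to be 0.0 for the wrong answer
for name, metric in result.metrics.items():
    print(f"  {name}: {metric.score} ({metric.reason})")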

Supported Answer Formats

The math reward function can extract and compare answers in various formats:

Integer and Decimal Numbers

42
-27
3.14159
0.5

Fractions

3/4
-5/8
1 2/3 (mixed fractions)

Scientific Notation

1.23e4
6.022 × 10^23
5.67 × 10⁻⁸

LaTeX Formatting

\boxed{42}
\frac{3}{4}
\frac{22}{7} \approx 3.14
\pi \approx 3.14159
2.998 \times 10^8 \text{ m/s}

Units

42 kg
3.14 m/s²
5 \text{ meters}
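
As a quick illustration, the sketch below evaluates a response that states its answer in "× 10^x" form against an expected value written in e-notation, both formats listed above. It uses only the math_reward call from Basic Usage plus the ignore_units parameter introduced later under Advanced Usage.

from reward_kit.rewards.math import math_reward

# Sketch: the response uses "× 10^8 m/s", the expected answer uses e-notation.
messages = [
    {"role": "user", "content": "What is the speed of light in m/s?"},
    {"role": "assistant", "content": "The speed of light is approximately 2.998 × 10^8 m/s."},
]

result = math_reward(
    messages=messages,
    expected_answer="2.998e8",
    ignore_units=True,  # parameter described under Advanced Usage below
)
print(result.score)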

Advanced Usage

Customizing Extraction

You can customize the extraction process to look for answers in particular formats or locations:

from reward_kit.rewards.math import math_reward

# Messages with LaTeX formatted answer
messages = [
    {
        "role": "user", 
        "content": "What is the area of a circle with radius 3 cm?"
    },
    {
        "role": "assistant", 
        "content": "To find the area of a circle, I'll use the formula:\n\nArea = πr²\n\nSubstituting r = 3 cm:\n\nArea = π × 3² = 9π cm²\n\nCalculating with π ≈ 3.14159:\n\nArea ≈ 28.27 cm²\n\nTherefore, the area of a circle with radius 3 cm is \n\n$$\\boxed{28.27 \\text{ cm}^2}$$"
    }
]

# Evaluate with custom extraction patterns
result = math_reward(
    messages=messages,
    expected_answer="28.27 cm^2",
    extract_boxed_only=True,  # Only look for answers in \boxed{} environments
    ignore_units=False,       # Consider units in the comparison
    tolerance=0.01            # Allow for slight differences in rounding
)

Multiple Valid Answers

Sometimes multiple forms of the same answer are acceptable. You can pass a list of acceptable answers to evaluate against:

from reward_kit.rewards.math import math_reward

# Message with fraction answer
messages = [
    {
        "role": "user", 
        "content": "What is 1/4 + 1/6?"
    },
    {
        "role": "assistant", 
        "content": "To add fractions with different denominators, I need to find a common denominator.\n\n1/4 + 1/6\n\nLCD = 12\n\n1/4 = 3/12\n1/6 = 2/12\n\n3/12 + 2/12 = 5/12\n\nTherefore, 1/4 + 1/6 = 5/12"
    }
]

# Accept either fraction or decimal form
result = math_reward(
    messages=messages,
    expected_answer=["5/12", "0.41666"], # Accept either form
    tolerance=0.001  # Small tolerance for decimal approximation
)

Original Messages as Reference

If the correct answer appears in the original conversation, the function can extract it automatically:

from reward_kit.rewards.math import math_reward

# Original conversation with correct answer
original_messages = [
    {
        "role": "user", 
        "content": "Solve the equation 2x + 5 = 15. The answer is x = 5."
    }
]

# Generated response to evaluate
generated_messages = [
    {
        "role": "user", 
        "content": "Solve the equation 2x + 5 = 15."
    },
    {
        "role": "assistant", 
        "content": "To solve the equation 2x + 5 = 15, I'll isolate the variable x.\n\n2x + 5 = 15\n2x = 15 - 5\n2x = 10\nx = 10/2\nx = 5\n\nTherefore, the solution is x = 5."
    }
]

# Extract expected answer from original messages
result = math_reward(
    messages=generated_messages,
    original_messages=original_messages,
    extract_answer_from_original=True  # Extract answer from original messages
)

Use Cases

Evaluating Math Problem Solving

The math reward function is well suited to evaluating responses to:

  • Basic arithmetic problems
  • Algebra equations
  • Calculus problems
  • Physics calculations
  • Economics computations
  • Statistics problems

Educational Applications

Use the math reward function to:

  • Automatically grade math homework (see the batch-grading sketch after this list)
  • Provide instant feedback on practice problems
  • Evaluate mathematical reasoning in tutoring systems
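
A minimal batch-grading sketch is shown below. The homework list and its field names are hypothetical examples; only math_reward itself comes from Reward Kit.

from reward_kit.rewards.math import math_reward

# Hypothetical homework items: each has a question, the student's worked
# answer, and the expected final value.
homework = [
    {"question": "What is 6 × 7?", "student_answer": "6 × 7 = 42, so the answer is 42.", "expected": "42"},
    {"question": "What is 9 - 4?", "student_answer": "9 - 4 = 6, so the answer is 6.", "expected": "5"},
]

for item in homework:
    messages = [
        {"role": "user", "content": item["question"]},
        {"role": "assistant", "content": item["student_answer"]},
    ]
    result = math_reward(messages=messages, expected_answer=item["expected"])
    print(f"{item['question']} -> score {result.score}")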

Best Practices

  1. Be Explicit About Units: Specify whether units should be considered in the comparison
  2. Consider Fractions vs. Decimals: Decide if approximate decimal answers are acceptable for fraction problems
  3. Set Appropriate Tolerance: Use a tolerance appropriate for the problem (e.g., higher for complex calculations)
  4. Look for Final Answers: Set up extraction patterns to focus on the final answer rather than intermediate steps
  5. Multiple Representations: Consider all valid forms of an answer (fraction, decimal, scientific notation), as shown in the sketch after this list
  6. LaTeX Handling: Take advantage of the LaTeX support for nicely formatted answers
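
The sketch below combines several of these practices, using only parameters introduced earlier in this guide (a list of expected answers, ignore_units, and tolerance).

from reward_kit.rewards.math import math_reward

# Sketch: accept both the exact fraction and its decimal form, ignore units,
# and allow a small tolerance for rounding.
messages = [
    {"role": "user", "content": "Convert 3/8 to a decimal."},
    {"role": "assistant", "content": "3/8 = 0.375"},
]

result = math_reward(
    messages=messages,
    expected_answer=["3/8", "0.375"],  # accept either representation
    ignore_units=True,                 # no units involved in this problem
    tolerance=0.001,                   # allow minor rounding differences
)
print(result.score)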

Limitations

  • Cannot evaluate the correctness of the solution method, only the final answer
  • May have difficulty with extremely complex LaTeX expressions
  • Cannot evaluate mathematical proofs or abstract reasoning
  • Works best with numerical answers rather than symbolic expressions

Next Steps