Welcome to Fireworks AI

What we offer

The Fireworks platform empowers developers to create generative AI systems with the best quality, cost, and speed. All publicly available services are pay-as-you-go with developer-friendly pricing. See the list below for offerings and docs links; scroll further for more detailed descriptions and blog links.

  • Inference: Run generative AI models on Fireworks-hosted infrastructure with our optimized FireAttention inference engine. Multiple inference options ensure there’s always a fit for your use case.
  • Modalities and Models: Use 100s of models (or bring your own) across text, image, audio, and embedding modalities.
  • Adaptation: Tune and optimize your model and deployment for the best quality, speed, and cost. Serve and experiment with hundreds of fine-tuned models with our multi-LoRA capabilities.
  • Compound AI Development Framework: Use JSON mode, grammar mode, function calling, or our Flumina framework to build compound AI systems with reliable and performant outputs.

Inference

Fireworks offers 3 options for running generative AI models with unparalleled speed and cost efficiency.

  • Serverless - The easiest way to get started. Use the most popular models on pre-configured GPUs. Pay per token and avoid cold boots (see the quickstart sketch after the comparison table below).
  • On-demand - The most flexible option for scaling. Use private GPUs to support your specific needs and pay only while they run. GPUs running Fireworks software offer ~250% higher throughput and 50% lower latency compared to vLLM. Excels for:
    • Production volume - Per-token costs decrease with more volume and there are no set rate limits
    • Custom needs and reliability - On-demand GPUs are private to you. This enables complete control to tailor deployments for speed/throughput/reliability or to run more specialized models
  • Enterprise Reserved GPUs - Use private GPUs with hardware and software set-up personally tailored by the Fireworks team for your use case. Enjoy SLAs, dedicated support, bring-your-own-cloud (BYOC) deployment options, and enterprise-only optimizations.
| Property | Serverless | On-demand | Enterprise reserved |
| --- | --- | --- | --- |
| Performance | Industry-leading speed on Fireworks-curated set-up. Performance may vary with others’ usage. | Speed depends on user-specified GPU configuration and private usage. Per-GPU latency should be significantly faster than vLLM. | Tailor-made set-up by Fireworks AI experts for the best possible latency |
| Getting Started | Self-serve: immediately use serverless with 1 line of code | Self-serve: configure GPUs, then use them with 1 line of code | Chat with Fireworks |
| Scaling and management | Scale up and down freely within rate limits | Option for auto-scaling GPUs with traffic. GPUs scale to zero automatically, so there is no charge for unused GPUs or boot-ups. | Chat with Fireworks |
| Pricing | Pay a fixed price per token | Pay per GPU-second with no commitments. Per-GPU throughput should be significantly greater than options like vLLM. | Customized price based on reserved GPU capacity |
| Commitment | None | None | Arrange plan length with Fireworks |
| Rate limits | Yes, see quotas | No rate limits; quotas on the number of GPUs | None |
| Model Selection | Collection of popular models, curated by Fireworks | Use 100s of pre-uploaded models or upload your own custom model within supported architectures | Use 100s of pre-uploaded models or upload any model |
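
For example, serverless inference is exposed through Fireworks’ OpenAI-compatible chat completions API. Below is a minimal sketch assuming the `openai` Python package and a `FIREWORKS_API_KEY` environment variable; the model ID is illustrative, and any serverless model from the catalog works the same way.

```python
# Minimal serverless inference sketch using Fireworks' OpenAI-compatible API.
# Assumes `pip install openai` and a FIREWORKS_API_KEY environment variable.
# The model ID below is illustrative; substitute any serverless model.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # Fireworks endpoint
    api_key=os.environ["FIREWORKS_API_KEY"],
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```

Because serverless is priced per token, this call is billed by prompt and completion tokens; on-demand deployments are instead billed per GPU-second.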

FireOptimizer

Fireworks optimizes inference for your workload and your use case through FireOptimizer. FireOptimizer includes several optimization techniques; its publicly available features are:

  • Fine-tuning - Quickly fine-tune models with LoRA for the best quality on your use case
    • Upload data and choose your model to start tuning
    • Pay per token of training data.
    • Serve and evaluate models immediately on Fireworks
    • Download model weights to use anywhere
  • Multi-LoRA serving - Deploy 100s of fine-tuned models at no extra cost (see the sketch after this list).
    • Zero extra cost for serving LoRAs: 1 million requests across 50 models costs the same as 1 million requests to 1 model.
    • Use models fine-tuned on Fireworks or upload your own fine-tuned adapter
    • Host hundreds of models on the same deployment on either serverless or dedicated deployments
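
To make the multi-LoRA point concrete: a fine-tuned adapter is addressed exactly like any other model, via its own model ID, so serving many LoRAs requires no client-side changes. A minimal sketch, assuming the `openai` package and hypothetical account and adapter names:

```python
# Sketch of querying multiple LoRA fine-tunes served behind one deployment.
# Assumes `pip install openai` and FIREWORKS_API_KEY; the account and model
# IDs below are hypothetical placeholders, not real catalog entries.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

# Each adapter has its own model ID but shares the same deployment, so
# 1 million requests across 50 models costs the same as against 1 model.
for model_id in (
    "accounts/my-account/models/support-bot-lora",  # hypothetical
    "accounts/my-account/models/sql-helper-lora",   # hypothetical
):
    out = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": "Ping"}],
    )
    print(model_id, "->", out.choices[0].message.content)
```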

Compound AI

Fireworks makes it easy to use multiple models and modalities together in one compound AI system. Features include:

  • JSON mode and grammar mode - Provide structure to any LLM on Fireworks with either (a) a JSON schema or (b) a context-free grammar to guarantee that LLM output follows your desired format. These structured output modes are particularly useful for ensuring LLMs can reliably call and pipe outputs to other models, APIs, and components (a minimal sketch follows this list).
  • Function calling - Fireworks offers function calling support via our proprietary Firefunction models or Llama 3.1 70B.
  • Flumina - With the Flumina server apps framework, multimedia apps can be easily packaged together, deployed, and scaled with low latency. Contact us to get Flumina access.
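
As a concrete example of structured output, JSON mode is requested through the `response_format` parameter of the chat completions API. A minimal sketch, assuming the `openai` package, a `FIREWORKS_API_KEY` environment variable, and an illustrative model ID and schema:

```python
# Sketch of JSON mode: constrain LLM output to a JSON schema so it can be
# piped reliably into downstream components. Model ID and schema are
# illustrative; assumes `pip install openai` and FIREWORKS_API_KEY.
import json
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

# Illustrative schema: the model must return a city name and its population.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    response_format={"type": "json_object", "schema": schema},
    messages=[{"role": "user", "content": "Name a large city and its population."}],
)
print(json.loads(resp.choices[0].message.content))  # parses cleanly by construction
```

Function calling works through the same OpenAI-compatible interface via the `tools` parameter, so structured outputs and tool invocations can be combined in one compound system.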