Step 1: Create and export an API key
Before you begin, create an API key in the Fireworks dashboard. Click Create API key and store it in a safe location. Once you have your API key, export it as an environment variable in your terminal.
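A minimal sketch for macOS / Linux, assuming the variable name `FIREWORKS_API_KEY` (on Windows PowerShell, set `$env:FIREWORKS_API_KEY` instead):

```bash
# Make the key available to the CLI and SDKs in this shell session.
# Replace the placeholder with the key you created in the dashboard.
export FIREWORKS_API_KEY="<YOUR_API_KEY>"
```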
Step 2: Install the CLI
To create and manage on-demand deployments, you'll need the firectl CLI tool. Install it using the method for your platform.
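For example, on macOS it can be installed with Homebrew. This is a sketch only; the tap/formula name is an assumption, so check the Fireworks installation docs for the exact commands for your platform, including Linux and Windows:

```bash
# macOS (Homebrew) -- formula name assumed; see the Fireworks install docs
brew install fw-ai/firectl/firectl

# Confirm the binary is on your PATH
firectl --help
```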
Step 3: Create a deployment
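A sketch of the create command, assuming `accounts/fireworks/models/gpt-oss-120b` is the model ID for GPT OSS 120B and using the `--deployment-shape` flag described below:

```bash
# Create an on-demand deployment optimized for speed (the "fast" shape).
firectl create deployment accounts/fireworks/models/gpt-oss-120b \
  --deployment-shape fast
```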
This command will create a deployment of GPT OSS 120B optimized for speed. It will take a few minutes to complete, and the resulting deployment will scale up to 1 replica.

`fast` is a deployment shape: a pre-configured deployment template created by the Fireworks team that sets sensible defaults for most deployment options (such as hardware type). You can also pass `throughput` or `cost` to `--deployment-shape`:

- `throughput` creates a deployment that trades off latency for lower cost-per-token at scale
- `cost` creates a deployment that trades off latency and throughput for the lowest cost-per-token at small scale, usually for early experimentation and prototyping

Note the `Name:` field in the response, as it will be used in the next step to query your deployment.
Learn more about deployment options →
Learn more about autoscaling options →
Step 4: Query your deployment
Now you can query your on-demand deployment using the same API as serverless models, but served by your dedicated deployment. Replace `<DEPLOYMENT_NAME>` in the snippet below with the value from the `Name:` field in the previous step.
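A minimal Python sketch, assuming Fireworks' OpenAI-compatible endpoint at https://api.fireworks.ai/inference/v1, the `openai` client package, the `gpt-oss-120b` model ID, and the `<model>#<deployment>` convention for addressing a dedicated deployment:

```python
import os

from openai import OpenAI  # pip install openai

# Point the OpenAI-compatible client at Fireworks; the API key was
# exported as FIREWORKS_API_KEY in Step 1.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

# "<model>#<deployment>" routes the request to your dedicated deployment.
# Replace <DEPLOYMENT_NAME> with the Name: value from Step 3.
response = client.chat.completions.create(
    model="accounts/fireworks/models/gpt-oss-120b#<DEPLOYMENT_NAME>",
    messages=[{"role": "user", "content": "Say hello from my on-demand deployment."}],
)
print(response.choices[0].message.content)
```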
Common use cases
Autoscale based on requests per second
Autoscale based on concurrent requests
Next steps
Ready to scale to production, explore other modalities, or customize your models?

Upload a custom model
Bring your own model and deploy it on Fireworks
Fine-tune Models
Improve model quality with supervised and reinforcement learning
Speech to Text
Real-time or batch audio transcription
Embeddings & Reranking
Use embeddings & reranking in search & context retrieval
Batch Inference
Run async inference jobs at scale, faster and cheaper
Browse 100+ Models
Explore all available models across modalities
API Reference
Complete API documentation