Client-side performance optimization

When using a dedicated deployment, optimize client-side HTTP connection pooling to get maximum performance. We recommend our Python SDK: it has good defaults for connection pooling, uses aiohttp for efficient operation with Python’s asyncio library, and includes retry logic for the HTTP 429 errors that Fireworks returns when the server is overloaded. Our benchmarks demonstrate the performance benefits.
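For clients that cannot use the SDK, retrying on 429 responses can be approximated with exponential backoff and jitter. The sketch below is not the SDK's actual retry implementation. RateLimitedError, with_backoff, and all parameter values are hypothetical names chosen for illustration, and the caller is assumed to raise RateLimitedError whenever the server answers with HTTP 429.

```python
import asyncio
import random


class RateLimitedError(Exception):
    """Hypothetical marker raised by the caller's code on an HTTP 429 response."""


async def with_backoff(call, max_retries=5, base_delay=0.5, max_delay=30.0):
    """Await call() and retry on RateLimitedError with exponential backoff.

    `call` is a zero-argument function that returns a fresh awaitable each
    time it is invoked. All defaults are illustrative, not recommended values.
    """
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except RateLimitedError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            # The upper bound doubles on every attempt, capped at max_delay.
            upper = min(max_delay, base_delay * 2 ** attempt)
            # Sleep a random amount up to the bound so that many clients do
            # not all retry at the same instant.
            await asyncio.sleep(random.uniform(0, upper))
```

A request would be wrapped as `await with_backoff(lambda: send_request(payload))`, where send_request stands in for whatever coroutine function performs the HTTP call.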

General optimization recommendations

Based on our benchmarks, we recommend the following:

Use a client library optimized for high concurrency, such as aiohttp in Python or, in Node.js, the built-in http module with a shared http.Agent.
Keep the connection pool size high (1000+).
Increase concurrency until performance stops improving or you observe too many 429 errors.
Use direct routing to avoid the global API load balancer and route requests directly to your deployment.
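The advice to raise concurrency until performance stops improving can be turned into a simple sweep. The sketch below assumes a hypothetical benchmark coroutine, supplied by you, that runs a fixed workload at a given concurrency and returns the measured requests per second together with the fraction of requests that received a 429. The name pick_concurrency and the thresholds min_gain and max_429_ratio are illustrative assumptions, not values taken from our benchmarks.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical benchmark signature: takes a concurrency level and returns
# (requests_per_second, fraction_of_requests_answered_with_429).
Benchmark = Callable[[int], Awaitable[tuple[float, float]]]


async def pick_concurrency(
    benchmark: Benchmark,
    levels: list[int],
    min_gain: float = 0.05,
    max_429_ratio: float = 0.01,
) -> int:
    """Return the last level that still gave a meaningful throughput gain.

    `levels` must be non-empty and sorted in increasing order. The first
    level is taken as the baseline and is assumed to be safe.
    """
    best_level = levels[0]
    best_rps, _ = await benchmark(best_level)
    for level in levels[1:]:
        rps, ratio_429 = await benchmark(level)
        # Stop once 429s become too frequent or throughput has plateaued.
        if ratio_429 > max_429_ratio or rps < best_rps * (1 + min_gain):
            break
        best_level, best_rps = level, rps
    return best_level
```

A call such as `await pick_concurrency(my_benchmark, [50, 100, 200, 400, 800])` returns the level just before the sweep stopped, which is a reasonable starting value for max_workers in your client.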

Code example: Optimal concurrent requests (Python)

The following example issues many requests concurrently using asyncio and the SDK’s LLM class:

main.py

import asyncio
from fireworks import LLM

async def make_concurrent_requests(
    messages: list[str],
    max_workers: int = 1000,
    max_connections: int = 1000, # this is the default value in the SDK
):
    """Make concurrent requests with optimized connection pooling"""
    
    llm = LLM(
        model="your-model-name",
        deployment_type="on-demand", 
        id="your-deployment-id",
        max_connections=max_connections
    )
    
    # Semaphore to limit concurrent requests
    semaphore = asyncio.Semaphore(max_workers)
    
    async def single_request(message: str):
        """Make a single request with semaphore control"""
        async with semaphore:
            response = await llm.chat.completions.acreate(
                messages=[{"role": "user", "content": message}],
                max_tokens=100
            )
            return response.choices[0].message.content
    
    # Create all request tasks
    tasks = [
        single_request(message) 
        for message in messages
    ]
    
    # Execute all requests concurrently
    results = await asyncio.gather(*tasks)
    return results

# Usage example
async def main():
    messages = ["Hello!"] * 1000  # 1000 requests
    
    results = await make_concurrent_requests(
        messages=messages,
    )
    
    print(f"Completed {len(results)} requests")

if __name__ == "__main__":
    asyncio.run(main())

This implementation:

Uses asyncio.Semaphore to cap the number of in-flight requests at max_workers so the server is not overwhelmed
Exposes max_connections so you can configure the maximum number of concurrent connections to the Fireworks API
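One design point to be aware of: asyncio.gather raises the first exception it encounters by default, so a single failed request discards the results of the whole batch. A possible variant, sketched below, passes return_exceptions=True and separates successes from failures. The helper gather_bounded and its worker parameter are hypothetical. The worker stands in for any coroutine function that sends one request, such as single_request in the example above.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def gather_bounded(
    items: list[str],
    worker: Callable[[str], Awaitable[Any]],
    max_workers: int = 1000,
) -> tuple[list[Any], list[BaseException]]:
    """Run worker(item) for every item with at most max_workers in flight.

    Returns (results, errors). Successful results keep the input order, and
    failures are collected instead of aborting the batch.
    """
    semaphore = asyncio.Semaphore(max_workers)

    async def guarded(item: str) -> Any:
        # Only max_workers coroutines can be inside this block at once.
        async with semaphore:
            return await worker(item)

    outcomes = await asyncio.gather(
        *(guarded(item) for item in items),
        return_exceptions=True,  # return exceptions as values, do not raise
    )
    results = [o for o in outcomes if not isinstance(o, BaseException)]
    errors = [o for o in outcomes if isinstance(o, BaseException)]
    return results, errors
```

The caller can then log or retry the entries in errors while still using every response that succeeded.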
