Prompt caching is a performance optimization feature that allows Fireworks to respond faster to requests with prompts that share common prefixes. In many situations, it can reduce time to first token (TTFT) by as much as 80%.

Prompt caching is enabled by default for all Fireworks models and deployments.

For dedicated deployments, prompt caching frees up resources, leading to higher throughput on the same hardware. Dedicated deployments on the Enterprise plan allow additional configuration options to further optimize cache performance.

Using prompt caching

Common use cases

Requests to LLMs often share a large portion of their prompt. For example:

  • Long system prompts with detailed instructions
  • Descriptions of available tools for function calling
  • Previous conversation history in chat use cases, which grows with each turn
  • Shared per-user context, like a current file for a coding assistant

Prompt caching avoids re-processing the cached prefix of the prompt and starts output generation much sooner.

Structuring prompts for caching

Prompt caching works only for exact prefix matches within a prompt. To realize caching benefits, place static content like instructions and examples at the beginning of your prompt, and put variable content, such as user-specific information, at the end.

For function calling models, tools are considered part of the prompt.

For vision-language models, images currently aren’t cached (but this might be improved in the future).
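A minimal sketch of this structure, assuming an OpenAI-style messages array (the system prompt and helper function are illustrative, not part of the Fireworks API): static instructions go first, the append-only history next, and per-request user content last, so each request shares the longest possible prefix with earlier ones.

```python
# Illustrative placeholders: the system prompt and function name are not
# part of any Fireworks API, just one way to order content for caching.
SYSTEM_PROMPT = "You are a helpful coding assistant. <detailed instructions>"

def build_messages(history, user_turn, shared_context=None):
    """Place static content first and variable content last."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    # History grows append-only, so earlier turns remain a cached prefix.
    messages.extend(history)
    if shared_context:
        # Per-user context (e.g. a current file) goes near the end.
        user_turn = f"{shared_context}\n\n{user_turn}"
    messages.append({"role": "user", "content": user_turn})
    return messages
```

Keeping the front of the prompt byte-for-byte stable matters: inserting a timestamp or user name into the system prompt invalidates the shared prefix for every request.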

How it works

Fireworks will automatically find the longest prefix of the request that is present in the cache and reuse it. The remaining portion of the prompt will be processed as usual.
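The matching rule can be illustrated with a toy sketch. The real cache reuses server-side KV-cache state rather than token lists, but the lookup behaves like a longest-prefix match:

```python
def longest_cached_prefix(prompt_tokens, cache):
    """Toy illustration only: find the longest entry in `cache` that is a
    prefix of the new prompt. Tokens matched here would be skipped; the
    remainder is processed as usual."""
    best = 0
    for cached in cache:
        n = len(cached)
        if n <= len(prompt_tokens) and prompt_tokens[:n] == cached:
            best = max(best, n)
    return best  # number of prompt tokens that need no re-processing
```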

The entire prompt is stored in the cache for future reuse. Cached prompts usually stay in the cache for at least several minutes. Depending on the model, load level, and deployment configuration, it can be up to several hours. The oldest prompts are evicted from the cache first.

Prompt caching doesn’t alter the result generated by the model. The response you receive is identical to what you would get if prompt caching were not used: only prompt processing is reused. Each generation is sampled from the model independently on every request, and generated outputs are never cached.

Monitoring

For dedicated deployments, information about prompt caching is returned in the response headers. The fireworks-prompt-tokens header contains the total number of tokens in the prompt, of which fireworks-cached-prompt-tokens were served from the cache.
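For example, a small helper can turn these two headers into a cache hit rate (the helper itself is an illustration, not part of any SDK):

```python
def cache_stats(headers):
    """Compute prompt-cache usage from Fireworks response headers.
    These headers are returned on dedicated deployments."""
    total = int(headers.get("fireworks-prompt-tokens", 0))
    cached = int(headers.get("fireworks-cached-prompt-tokens", 0))
    hit_rate = cached / total if total else 0.0
    return total, cached, hit_rate
```

With the `requests` library, `headers` would be `response.headers` from a chat-completions call; logging the hit rate over time shows whether prompts are structured for good prefix reuse.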

Aggregated metrics are also available in the usage dashboard.

Data privacy

Serverless deployments maintain separate caches for each Fireworks account to prevent data leakage and timing attacks.

Dedicated deployments by default share a single cache across all requests. Because prompt caching doesn’t change the outputs, privacy is preserved even if the deployment powers a multi-tenant application. It does open a minor timing-attack risk: an adversary could infer that a particular prompt is cached by observing response times. To ensure full isolation, you can pass the x-prompt-cache-isolation-key header or the prompt_cache_isolation_key field in the body of the request. It can contain an arbitrary string that acts as an additional cache key, so no sharing occurs between requests with different keys.
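A sketch of per-tenant isolation, assuming the standard chat-completions request shape (the endpoint, model name, and helper function are illustrative; only the header and body field names come from the docs above):

```python
import os

def build_isolated_request(prompt, tenant_id):
    """Build headers and body for a per-tenant isolated cache.
    The model name below is a placeholder."""
    headers = {
        "Authorization": f"Bearer {os.environ.get('FIREWORKS_API_KEY', '')}",
        # Arbitrary string; requests with different keys never share cache.
        "x-prompt-cache-isolation-key": tenant_id,
    }
    body = {
        "model": "accounts/fireworks/models/<your-model>",  # placeholder
        "messages": [{"role": "user", "content": prompt}],
        # Equivalent body field, instead of the header:
        # "prompt_cache_isolation_key": tenant_id,
    }
    return headers, body
```

Using a stable per-tenant string (e.g. an account ID) keeps each tenant’s own requests cache-friendly while preventing cross-tenant sharing.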

Limiting or turning off caching

You can pass the prompt_cache_max_len field in the request body to limit the maximum prefix of the prompt (in tokens) that is considered for caching. It’s rarely needed in real applications but can come in handy for benchmarking the performance of dedicated deployments by passing "prompt_cache_max_len": 0.
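For instance, a benchmarking harness might toggle the field like this (the helper and model placeholder are illustrative; only the prompt_cache_max_len field is from the docs above):

```python
def benchmark_body(messages, disable_cache=True):
    """Build a request body for latency benchmarking.
    Setting prompt_cache_max_len to 0 disables prefix reuse entirely,
    measuring uncached TTFT."""
    body = {
        "model": "accounts/fireworks/models/<your-model>",  # placeholder
        "messages": messages,
    }
    if disable_cache:
        body["prompt_cache_max_len"] = 0
    return body
```

Comparing TTFT with and without the field isolates the speedup attributable to prompt caching.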

Advanced: cache locality for Enterprise deployments

Dedicated deployments on an Enterprise plan allow you to pass an additional hint in the request to improve cache hit rates.

First, the deployment needs to be created or updated with an additional flag:

firectl create deployment ... --enable-session-affinity

Then the client can pass an opaque identifier representing a single user or session in the user field of the body or in the x-session-affinity header. Fireworks will try to route requests with the identifier to the same server, further reducing response times.

It’s best to choose an identifier that groups requests with long shared prompt prefixes. For example, it can be a chat session with the same user or an assistant working with the same shared context.
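A sketch of both ways to pass the identifier (the helper and model placeholder are illustrative; the user field and x-session-affinity header are from the docs above — either one is sufficient):

```python
def session_request(messages, session_id):
    """Hint session affinity on an Enterprise deployment created with
    --enable-session-affinity. The identifier is opaque to Fireworks."""
    headers = {"x-session-affinity": session_id}
    body = {
        "model": "accounts/fireworks/models/<your-model>",  # placeholder
        "messages": messages,
        "user": session_id,  # body-field alternative to the header
    }
    return headers, body
```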