fireworks-prompt-tokens
contains the number of tokens
in the prompt, out of which fireworks-cached-prompt-tokens
are cached.
Aggregated metrics are also available in the usage dashboard.
x-prompt-cache-isolation-key
header or the prompt_cache_isolation_key
field in the body of the request. It can contain an arbitrary string that acts
as an additional cache key, i.e., no sharing will occur between requests with
different IDs.
prompt_cache_max_len
field in the request body to
limit the maximum prefix of the prompt (in tokens) that is considered for
caching. It’s rarely needed in real applications but can come in handy for
benchmarking the performance of dedicated deployments by passing
"prompt_cache_max_len": 0
.
user
field of the body or in the x-session-affinity
header. Fireworks
will try to route requests with the identifier to the same server, further reducing response times.
It’s best to choose an identifier that groups requests with long shared prompt
prefixes. For example, it can be a chat session with the same user or an
assistant working with the same shared context.