We’re introducing an upgraded tuning service with improved speed, usability, and reliability! The new service uses different commands and has different model coverage, and is offered for free while we’re in public preview.

See these docs to use our legacy service instead.

Introduction

Fireworks offers a LoRA-based fine-tuning method designed for usability, reliability, and efficiency. LoRA is used for fine-tuning all models except our 70B models, which use qLoRA (quantized LoRA) to improve training speeds.

The fine-tuning service provides hassle-free quality improvements through intelligent defaults and minimal configuration. Models fine-tuned with our service can be seamlessly deployed for inference on Fireworks or downloaded for local use.

Fine-tuning a model with a dataset can be useful for several reasons:

  1. Enhanced Precision: It allows the model to adapt to the unique attributes and trends within the dataset, leading to significantly improved precision and effectiveness.

  2. Domain Adaptation: While many models are developed with general data, fine-tuning them with specialized, domain-specific datasets ensures they are finely attuned to the specific requirements of that field.

  3. Bias Reduction: General models may carry inherent biases. Fine-tuning with a well-curated, diverse dataset aids in reducing these biases, fostering fairer and more balanced outcomes.

  4. Contemporary Relevance: Information evolves rapidly, and fine-tuning with the latest data keeps the model current and relevant.

  5. Customization for Specific Applications: This process allows for the tailoring of the model to meet unique objectives and needs, an aspect not achievable with standard models.

In essence, fine-tuning a model with a specific dataset is a pivotal step in ensuring its accuracy, relevance, and suitability for specific applications. Let’s walk through fine-tuning a model!

Fine-tuned model inference on Serverless is slower than base model inference on Serverless. For use cases that need low latency, we recommend using on-demand deployments. With on-demand deployments, fine-tuned model inference speeds are significantly closer to base model speeds (but still slightly slower). If you are only using one LoRA on-demand, merging the fine-tuned weights into the base model will provide speed identical to base model inference. If you have an enterprise use case that needs fast fine-tuned models, please contact us!

Pricing

Our new tuning service is currently free, but it will eventually be priced based on the total number of tokens processed (dataset tokens × number of epochs). For example, a dataset of 1 million tokens trained for 2 epochs processes 2 million tokens. Running inference on fine-tuned models incurs no extra costs beyond base inference fees.

See our Pricing page for pricing details on our legacy tuning service.

Installing firectl

firectl is the command-line interface (CLI) for managing and deploying resources on the Fireworks AI Platform. Use firectl to manage fine-tuning jobs and their resulting models.

Please visit the Firectl Getting Started Guide for instructions on installing and using firectl.

Preparing your dataset

To fine-tune a model, we first need to upload a dataset. Once uploaded, this dataset can be used to create one or more fine-tuning jobs. A dataset consists of a single JSONL file, where each line is a separate training example.

Limits:

  • Minimum number of examples is 3.

  • Maximum number of examples is 3,000,000.

Format:

  • Each line of the file must be a valid JSON object.

Each dataset must conform to the schema expected by our OpenAI-compatible Chat Completions API. Each JSON object of the dataset must contain a single array field called messages. Each message is an object containing two fields:

  • role - one of “system”, “user”, or “assistant”.

  • content - the content of the message.

A message with the “system” role is optional, but if specified, it must be the first message of the conversation. Subsequent messages start with “user” and alternate between “user” and “assistant”. Below are two sample training examples:

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "blue"}]}
{"messages": [{"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "2"}, {"role": "user", "content": "Now what is 2+2?"}, {"role": "assistant", "content": "4"}]}

Creating your dataset

To create a dataset, run:

firectl create dataset <DATASET_ID> path/to/dataset.jsonl

and you can check the dataset with:

firectl get dataset <DATASET_ID>
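
For example, to upload a local file and confirm it was created (using the same dataset ID as the fine-tuning example below):

firectl create dataset my_dataset path/to/dataset.jsonl
firectl get dataset my_dataset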

Starting your tuning job

To start a supervised fine-tuning job (sftj), run:

firectl create sftj --base-model <BASE_MODEL_ID> --dataset <DATASET_ID> --output-model <OUTPUT_MODEL_ID>

For example:

firectl create sftj --base-model llama-v3p1-8b-instruct --dataset my_dataset --output-model my_model

firectl will return the fine-tuning job ID.

When creating a fine-tuning job, you can start tuning from a base model, or from a model you tuned earlier (LoRA add-on):

  1. Base model: Use the base-model parameter to start from a pre-trained base model.

  2. Existing LoRA add-on: Use the warm-start-from parameter to start from an existing LoRA addon model, where the LoRA is specified with the format “accounts/<account-id>/models/<addon-model-id>”

You must specify either base-model or warm-start-from in your command-line flags.
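
For example, a warm-start job might look like the following; this assumes the flag mirrors the parameter name warm-start-from, and the account and model IDs are placeholders:

firectl create sftj --warm-start-from accounts/my-account/models/my-addon --dataset my_dataset --output-model my_model_v2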

Checking the job status

You can monitor the progress of the tuning job by running:

firectl get fine-tuning-job <JOB_ID>

Once the job successfully completes, a model will be created in your account. You can see a list of models by running:

firectl list models

Or if you specified a model ID when creating the fine-tuning job, you can get the model directly:

firectl get model <MODEL_ID>

Deploying and using a model

Before using your fine-tuned model for inference, you must deploy it. Please refer to our guides on Deploying a model and Querying text models for detailed instructions.
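
Once your model is deployed, a quick way to smoke-test it is through Fireworks’ OpenAI-compatible Chat Completions API. Below is a minimal sketch using the openai Python client; the model name is a placeholder, and the base URL assumes Fireworks’ standard inference endpoint:

import os
from openai import OpenAI

# Point the OpenAI client at Fireworks' OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

response = client.chat.completions.create(
    model="accounts/my-account/models/my-model",  # placeholder fine-tuned model ID
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What color is the sky?"},
    ],
)
print(response.choices[0].message.content)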

Some base models may not support serverless addons. To check:

  1. Run firectl -a fireworks get model <base-model-id>

  2. Look under Deployed Model Refs to see if a fireworks-owned deployment exists, e.g. accounts/fireworks/deployments/3c7a68b0

  3. If so, then it is supported

If the base model doesn’t support serverless addons, you will need to use an on-demand deployment to deploy it.

Additional tuning options

Tuning settings are specified when starting a fine-tuning job. All of the settings below are optional and have reasonable defaults if not specified. For settings that affect tuning quality, like epochs and learning rate, we recommend keeping the defaults and only changing hyperparameters if results are not as desired. All tuning options must be specified via command-line flags, as shown in the example command below with multiple flags.

firectl create sftj \
--base-model llama-v3p1-8b-instruct \
--dataset cancerset \
--output-model my-tuned-model \
--job-id my-fine-tuning-job \
--learning-rate 0.0001 \
--epochs 2 \
--early-stop \
--evaluation-dataset my-eval-set

Evaluation

By default, the fine-tuning job runs evaluation by testing the fine-tuned model against an evaluation set that is automatically carved out of your training set. You can optionally specify a separate evaluation dataset to use instead of carving out training data.

  • evaluation-dataset: The ID of a separate dataset to use for evaluation. It must be pre-uploaded via firectl:
firectl create sftj \
 ...
  --evaluation-dataset my-eval-set \
  ...

Early stopping

Early stopping stops training early if the validation loss does not improve. It is off by default.

firectl create sftj \
 ...
  --early-stop \
  ...

Max Context Length

By default, fine-tuned models support a max context length of 8k. Increase the max context length if your use case requires more than 8k of context. The max context length can be increased up to the default context length of your selected model. For models with over 70B parameters, we only support up to a 32k max context length.

firectl create sftj \
 ...
  --max-context-length 16000 \
  ...

Epochs

Epochs are the number of passes over the training data. Our default value is 1. If the model does not follow the training data as closely as expected, increase the number of epochs by 1 or 2. Non-integer values are supported.

Note: we cap the total number of training examples processed at 3 million (dataset examples × epochs). For example, a dataset with 1.5 million examples can be trained for at most 2 epochs.

firectl create sftj \
 ...
  --epochs 2.0 \
  ...

Learning rate

The learning rate controls how quickly the model updates from the data. We generally do not recommend changing it; the default value is set automatically based on your selected model.

firectl create sftj \
  ...
  --learning-rate 0.0001 \
  ...

LoRA rank

LoRA rank controls the number of parameters that will be tuned in your LoRA add-on. A higher LoRA rank increases the amount of information that can be captured while tuning. LoRA rank must be a power of 2, up to 64 (i.e. 1, 2, 4, 8, 16, 32, or 64). Our default value is 8.

firectl create sftj \
...
  --lora-rank 16 \
  ...

Training progress and monitoring

The fine-tuning service integrates with Weights & Biases to provide observability into the tuning process. To use this feature, you must have a Weights & Biases account and have provisioned an API key.

firectl create sftj \
 ...
  --wandb-entity my-org \
  --wandb-api-key xxx \
  --wandb-project "My Project" \
  ...

Model ID

By default, the fine-tuning job will generate a random unique ID for the model. This ID is used to refer to the model at inference time. You can optionally specify a custom ID, subject to the [ID constraints](https://docs.fireworks.ai/getting-started/concepts#resource-names-and-ids).

firectl create sftj \
  ...
  --output-model my-model \
  ...

Job ID

By default, the fine-tuning job will generate a random unique ID for the fine-tuning job. You can optionally choose a custom ID.

firectl create sftj \
  ...
  --job-id my-fine-tuning-job \
  ...

Downloading model weights

To download model weights, run:

firectl download model <model-id> <target local filepath>
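
For example (the model ID and destination directory are placeholders):

firectl download model my-model ./my-model-weights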

Appendix

Supported base models - tuning

The Fireworks tuning service is limited to select models where we’re confident in providing intelligent defaults for a hassle-free experience. Currently, we only support tuning models with the following architectures:

  • Llama 1, 2, and 3.x architectures are supported. Llama vision models and Llama 405B are currently not supported.

  • Qwen2 architectures are supported.

Supported base models - LoRAs on dedicated deployment

LoRAs can be deployed for inference on dedicated deployments (on-demand or enterprise reserved) for the following models:

  • All models supported for tuning

  • accounts/fireworks/models/mixtral-8x7b-instruct-hf

  • accounts/fireworks/models/mixtral-8x22b-instruct-hf

  • accounts/fireworks/models/mixtral-8x22b-hf

  • accounts/fireworks/models/mixtral-8x7b

  • accounts/fireworks/models/mistral-7b-instruct-v0p2

  • accounts/fireworks/models/mistral-7b

  • accounts/fireworks/models/code-qwen-1p5-7b

  • accounts/fireworks/models/deepseek-coder-v2-lite-base

  • accounts/fireworks/models/deepseek-coder-7b-base

  • accounts/fireworks/models/deepseek-coder-1b-base

  • accounts/fireworks/models/codegemma-7b

  • accounts/fireworks/models/codegemma-2b

  • accounts/fireworks/models/starcoder2-15b

  • accounts/fireworks/models/starcoder2-7b

  • accounts/fireworks/models/starcoder2-3b

  • accounts/fireworks/models/stablecode-3b

Up to 100 LoRAs can be deployed to a dedicated instance at no extra cost beyond the base deployment costs.

Supported base models - LoRAs on serverless

The following base models are supported for low-rank adaptation (LoRA) and can be deployed as LoRA add-ons on Fireworks serverless and on-demand deployments, using the default parameters below. Serverless deployment is only available for a subset of fine-tuned models: run “get <model id>” (see the [models overview](https://docs.fireworks.ai/models/overview#introduction)) or check the [models page](https://fireworks.ai/models) to see if there’s an active serverless deployment.

Only a limited number of models are available for serverless LoRA deployment; up to 100 LoRAs can be deployed to serverless and are always available on a pay-per-token basis.

  • accounts/fireworks/models/llama-v3p1-8b-instruct

  • accounts/fireworks/models/llama-v3p1-70b-instruct

  • accounts/fireworks/models/llama-v3p2-3b-instruct

Support

We’d love to hear what you think! Please connect with the team, ask questions, and share your feedback in the #fine-tuning Discord channel.