
# Cleanup and Teardown

> Delete trainer jobs and deployments after experiments to avoid leaked resources.

## What this is

RLOR trainer jobs and hotload-enabled deployments hold GPU resources. Always clean up after experiments — especially if jobs terminate unexpectedly.

## Cleaning up RLOR trainer jobs

```python theme={null}
import os
from fireworks.training.sdk import TrainerJobManager, DeploymentManager

api_key = os.environ["FIREWORKS_API_KEY"]
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")

rlor_mgr = TrainerJobManager(api_key=api_key, base_url=base_url)
deploy_mgr = DeploymentManager(api_key=api_key, base_url=base_url)

# Delete known trainer jobs from this run
for job_id in ["<policy-job-id>", "<reference-job-id>"]:
    rlor_mgr.delete(job_id=job_id)
```

## Cleaning up deployments

```python theme={null}
deploy_mgr.delete(deployment_id="<deployment-id>")
```

To keep the deployment resource but release its GPUs (a lighter alternative to deleting it), scale it to zero:

```python theme={null}
deploy_mgr.scale_to_zero(deployment_id="<deployment-id>")
```

This sets both `minReplicaCount` and `maxReplicaCount` to `0`, releasing all accelerators while keeping the deployment available for future scale-up.

## Automatic cleanup with ResourceCleanup

The cookbook provides `ResourceCleanup`, a context manager that automatically deletes registered trainers and deployments on scope exit — including on exceptions and Ctrl+C:

```python theme={null}
from training.utils.infra import ResourceCleanup

with ResourceCleanup(rlor_mgr, deploy_mgr) as cleanup:
    # Create trainer first (trainer owns the hot-load bucket)
    endpoint = rlor_mgr.create_and_wait(config)
    cleanup.trainer(endpoint.job_id)

    # Create deployment linked to the trainer's bucket
    deploy_config.hot_load_trainer_job = endpoint.job_name
    deploy_mgr.create_or_get(deploy_config)
    cleanup.deployment("research-loop-serving")

    run_training_loop()
```

Resources are deleted in reverse creation order. To keep a pre-existing resource alive after the block exits, don't register it.
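
The reverse-order behavior can be illustrated with the standard library's `contextlib.ExitStack`, which runs registered callbacks last-in, first-out when the block exits, whether normally or through an exception. This is a minimal sketch of the idea, not the cookbook's `ResourceCleanup` implementation; `fake_delete` and the resource names are hypothetical stand-ins for real manager calls.

```python
from contextlib import ExitStack

deleted = []  # records the order in which cleanup callbacks fire


def fake_delete(name):
    # Hypothetical stand-in for a call such as rlor_mgr.delete(job_id=...)
    deleted.append(name)


try:
    with ExitStack() as stack:
        # Register each resource right after "creating" it.
        stack.callback(fake_delete, "trainer")
        stack.callback(fake_delete, "deployment")
        raise RuntimeError("training loop failed")
except RuntimeError:
    pass  # the callbacks have already run by the time the exception lands here

print(deleted)  # ['deployment', 'trainer']
```

The deployment is torn down before the trainer it depends on, which is the ordering a last-in, first-out stack gives for free.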

Deployments can be scaled to zero instead of deleted:

```python theme={null}
cleanup.deployment("research-loop-serving", action="scale_to_zero")
```

### Manual cleanup with try/finally

If you're not using the cookbook, use `try/finally` directly:

```python theme={null}
policy_job_id = "<policy-job-id>"
reference_job_id = "<reference-job-id>"
deployment_id = "research-loop-serving"

try:
    run_training_loop()
finally:
    rlor_mgr.delete(job_id=policy_job_id)
    rlor_mgr.delete(job_id=reference_job_id)
    deploy_mgr.delete(deployment_id=deployment_id)
```
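
In a `finally` block like the one above, an exception raised by the first `delete` call would skip the remaining ones. One way to guard against that is to attempt every step and collect the failures. The sketch below is a hypothetical helper (`run_all` is not part of the Fireworks SDK) and uses stand-in functions so that it runs on its own; in practice the steps would be calls such as `rlor_mgr.delete(job_id=...)`.

```python
def run_all(steps):
    """Run every (label, callable) step; return a list of (label, exception) failures."""
    failures = []
    for label, step in steps:
        try:
            step()
        except Exception as exc:  # keep going so later resources still get cleaned up
            failures.append((label, exc))
    return failures


# Stand-ins for real delete calls; the second one simulates an API error.
calls = []


def delete_policy():
    calls.append("policy")


def delete_reference():
    raise RuntimeError("simulated API error")


def delete_deployment():
    calls.append("deployment")


failures = run_all([
    ("policy trainer", delete_policy),
    ("reference trainer", delete_reference),
    ("deployment", delete_deployment),
])

for label, exc in failures:
    print(f"cleanup failed for {label}: {exc}")
```

Anything left in `failures` is a candidate leaked resource and is worth checking by hand.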

## Checking for leaked resources

Track the IDs you create (trainer job IDs and the deployment ID) and delete those explicitly. For account-wide discovery, use the Fireworks console or the managed `fw.*.list()` APIs.
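
One way to keep track of IDs so that they survive a crashed process is to append them to a small local file as soon as each resource is created. The sketch below uses only the standard library; the file name, record layout, and helper names are assumptions for illustration, not part of the Fireworks SDK.

```python
import json
import tempfile
from pathlib import Path


def record_resource(ledger: Path, kind: str, resource_id: str) -> None:
    """Append one created resource to the ledger as a JSON line."""
    with ledger.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"kind": kind, "id": resource_id}) + "\n")


def load_resources(ledger: Path) -> list:
    """Return all recorded resources, oldest first; empty if no ledger exists."""
    if not ledger.exists():
        return []
    with ledger.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo with a temporary file and placeholder IDs.
ledger = Path(tempfile.mkdtemp()) / "created_resources.jsonl"
record_resource(ledger, "trainer", "<policy-job-id>")
record_resource(ledger, "deployment", "<deployment-id>")

# A later cleanup script can walk the ledger in reverse creation order.
for entry in reversed(load_resources(ledger)):
    print(entry["kind"], entry["id"])
```

Appending one line per resource means a crash between two creations still leaves every earlier ID on disk.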

## Operational guidance

* **Delete both policy and reference trainers** when running GRPO (which uses 2 RLOR jobs).
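* **Register cleanup with `atexit`** in your training scripts so it runs on normal exit, unhandled exceptions, and Ctrl+C. `atexit` handlers do not run if the process is killed by a signal that Python does not handle (for example `SIGKILL`, or `SIGTERM` with the default handler), so check for leaked resources after a hard kill.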
* **Don't delete a trainer** while a `save_weights_for_sampler_ext` operation is in progress — wait for it to complete first.
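
A minimal sketch of `atexit`-based cleanup, using only the standard library. The `CleanupRegistry` class and `fake_delete` function are hypothetical and not part of the Fireworks SDK; real code would register calls such as `rlor_mgr.delete`. The registry is idempotent, so running it explicitly and then again at interpreter exit does no harm.

```python
import atexit
import sys


class CleanupRegistry:
    """Collects cleanup callables and runs them once, newest first."""

    def __init__(self):
        self._steps = []
        self._done = False

    def add(self, fn, *args, **kwargs):
        self._steps.append((fn, args, kwargs))

    def run(self):
        if self._done:  # safe to call more than once
            return
        self._done = True
        for fn, args, kwargs in reversed(self._steps):
            try:
                fn(*args, **kwargs)
            except Exception as exc:
                # Report and continue so one failure does not block the rest.
                print(f"cleanup step failed: {exc}", file=sys.stderr)


registry = CleanupRegistry()
atexit.register(registry.run)  # called when the interpreter shuts down normally

deleted = []


def fake_delete(job_id):
    # Hypothetical stand-in for rlor_mgr.delete(job_id=job_id)
    deleted.append(job_id)


registry.add(fake_delete, job_id="<policy-job-id>")
registry.add(fake_delete, job_id="<reference-job-id>")
```

Because `atexit` handlers are skipped when the process is killed by a signal Python does not handle, one option for covering `SIGTERM` is to install a signal handler that calls `sys.exit()`, which turns the signal into a normal interpreter exit.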

## Related Guides

* [TrainerJobManager](/fine-tuning/training-api/reference/trainer-job-manager)
* [DeploymentManager](/fine-tuning/training-api/reference/deployment-manager)
