cjm-substrate-torch-utils

PyTorch helpers for cjm-substrate capabilities: GPU memory release, typed CUDA-OOM handling, and device selection.

Install

pip install cjm_substrate_torch_utils

Project Structure

nbs/
├── device.ipynb # Resolve a device spec ("auto" / "cpu" / "cuda" / "cuda:N") to a concrete torch device string.
├── memory.ipynb # Robust move-to-CPU + drop-references + gc + CUDA-cache cleanup for releasing models, factored out of the per-capability reimplementations.
└── oom.ipynb    # Convert torch CUDA out-of-memory exceptions into the substrate's typed `CapabilityResourceError` (SG-47 Track B) so CR-7 reactive retry can evict and reload.

Total: 3 notebooks

Module Dependencies

graph LR
    device["device<br/>Device resolution"]
    memory["memory<br/>GPU model release"]
    oom["oom<br/>CUDA OOM handling"]

No cross-module dependencies detected.

CLI Reference

No CLI commands found in this project.

Module Overview

Detailed documentation for each module in the project:

Device resolution (`device.ipynb`)

Resolve a device spec ("auto" / "cpu" / "cuda" / "cuda:N") to a concrete torch device string.

Import

from cjm_substrate_torch_utils.device import (
    resolve_torch_device
)

Functions

def resolve_torch_device(
    spec: str = "auto",  # Requested device: "auto", "cpu", "cuda", or "cuda:N"
) -> str:                # Concrete device string
    """
    Resolve a device spec to a concrete torch device string.
    
    `"auto"` resolves to `"cuda"` when CUDA is available, else `"cpu"`. Any
    explicit spec (`"cpu"`, `"cuda"`, `"cuda:0"`, ...) is returned unchanged.
    """

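The resolution rule described for `resolve_torch_device` can be sketched as follows. This is a minimal illustration, not the package's source: `resolve_device_sketch` and its `cuda_available` parameter are hypothetical names introduced here so the logic can be exercised without a GPU, and falling back to `torch.cuda.is_available()` is an assumption about how availability is checked.

```python
from typing import Optional


def resolve_device_sketch(
    spec: str = "auto",
    cuda_available: Optional[bool] = None,  # hypothetical test hook, not part of the real API
) -> str:
    """Illustrative restatement of the documented rule for `resolve_torch_device`."""
    if spec != "auto":
        # Explicit specs ("cpu", "cuda", "cuda:0", ...) pass through unchanged.
        return spec
    if cuda_available is None:
        try:
            import torch  # imported lazily so the sketch runs without torch installed

            cuda_available = torch.cuda.is_available()
        except ImportError:
            cuda_available = False
    return "cuda" if cuda_available else "cpu"


print(resolve_device_sketch("auto", cuda_available=True))   # cuda
print(resolve_device_sketch("auto", cuda_available=False))  # cpu
print(resolve_device_sketch("cuda:1"))                      # cuda:1
```

Note that, per the docstring, an explicit spec is returned as given, so a request for `"cuda"` on a CPU-only machine is not downgraded; only `"auto"` consults availability.
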
GPU model release (`memory.ipynb`)

Robust move-to-CPU + drop-references + gc + CUDA-cache cleanup for releasing models, factored out of the per-capability reimplementations.

Import

from cjm_substrate_torch_utils.memory import (
    release_model
)

Functions

def release_model(
    obj: Any,                     # The capability instance holding the model attribute(s)
    model_attr_names: List[str],  # Names of the attributes to release, in release order
    device: str = "cuda",         # Device the model is on; gates the CUDA-specific cleanup
    *,
    logger: logging.Logger,       # Logger for best-effort failure reporting
) -> None:
    """
    Release one or more model objects: move to CPU, drop references, gc, free CUDA cache.
    
    For each name in `model_attr_names`, if `obj` has a non-None attribute:
      1. when on CUDA, best-effort `.to('cpu')` (frees GPU tensors; skipped for
         objects without a `.to` method, e.g. processors/tokenizers),
      2. `setattr(obj, name, None)` and drop the local reference.
    Then a single `gc.collect()` and — on CUDA — `empty_cache()` + `synchronize()`.
    
    Best-effort throughout: failures are logged and swallowed. Missing or
    already-None attributes are skipped, so the call is idempotent.
    """

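The release steps listed in the `release_model` docstring can be sketched roughly as below. This is an approximation under stated assumptions, not the package's implementation: `release_model_sketch`, `FakeModel`, and `FakeCapability` are hypothetical names, the `device.startswith("cuda")` check is a guess at how the CUDA gate works, and torch is imported lazily so the example runs on a machine without it.

```python
import gc
import logging
from typing import Any, List


def release_model_sketch(
    obj: Any,
    model_attr_names: List[str],
    device: str = "cuda",
    *,
    logger: logging.Logger,
) -> None:
    """Illustrative version of the documented release steps (not the package source)."""
    on_cuda = device.startswith("cuda")  # assumption: how the CUDA gate is decided
    for name in model_attr_names:
        model = getattr(obj, name, None)
        if model is None:
            continue  # missing or already released: keeps the call idempotent
        if on_cuda and hasattr(model, "to"):
            try:
                model.to("cpu")  # step 1: move weights off the GPU
            except Exception as exc:
                logger.warning("could not move %s to CPU: %s", name, exc)
        setattr(obj, name, None)  # step 2: drop the attribute reference
        del model  # and the local reference
    gc.collect()
    if on_cuda:
        try:
            import torch  # lazy import so the sketch runs without torch

            if torch.cuda.is_available():
                torch.cuda.empty_cache()
                torch.cuda.synchronize()
        except Exception as exc:
            logger.warning("CUDA cache cleanup failed: %s", exc)


class FakeModel:
    """Stand-in for a torch module; records the devices it was moved to."""

    def __init__(self) -> None:
        self.moves: List[str] = []

    def to(self, device: str) -> "FakeModel":
        self.moves.append(device)
        return self


class FakeCapability:
    def __init__(self) -> None:
        self.model = FakeModel()
        self.processor = object()  # no `.to` method, like a tokenizer


cap = FakeCapability()
model_ref = cap.model  # kept only so the recorded moves can be inspected
log = logging.getLogger("release_sketch")
release_model_sketch(cap, ["model", "processor", "absent"], logger=log)
release_model_sketch(cap, ["model", "processor"], logger=log)  # second call is a no-op
print(cap.model, cap.processor, model_ref.moves)  # None None ['cpu']
```

The second call demonstrates the idempotence the docstring promises: attributes that are already `None` or absent are skipped, so the model is moved only once.
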
CUDA OOM handling (`oom.ipynb`)

Convert torch CUDA out-of-memory exceptions into the substrate's typed `CapabilityResourceError` (SG-47 Track B) so CR-7 reactive retry can evict and reload.

Import

from cjm_substrate_torch_utils.oom import (
    cuda_oom_to_capability_resource_error
)

Functions

def cuda_oom_to_capability_resource_error(
    exc: BaseException,          # The caught CUDA OOM exception (e.g. torch.cuda.OutOfMemoryError)
    *,
    label: str,                  # Context for the message, e.g. "loading model 'X'" or "inference"
    headroom_mb: float = 100.0,  # Best-effort margin added to `available` to estimate `needed`
) -> CapabilityResourceError:        # Typed error for the substrate's CR-7 reactive-retry path
    """
    Convert a CUDA out-of-memory exception into a substrate-typed `CapabilityResourceError`.
    
    SG-47 Track B: a capability's GPU inference / model-load site catches
    `torch.cuda.OutOfMemoryError` and re-raises the result of this helper so the
    substrate sees a typed resource error (evict + reload + retry via CR-7)
    instead of an opaque crash.
    
    `needed` is a best-effort estimate (`available + headroom_mb`): the true
    required VRAM is unknowable from the exception, and CR-7 triggers eviction
    regardless of magnitude, so an approximation above `available` is sufficient.
    
    The caller raises the returned error, preserving the original cause:
    
        try:
            model = Model.from_pretrained(repo_id, ...)
        except torch.cuda.OutOfMemoryError as e:
            raise cuda_oom_to_capability_resource_error(e, label=f"loading {repo_id!r}") from e
    """

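The `needed = available + headroom_mb` estimate and the `raise ... from e` pattern from the docstring can be illustrated with a self-contained sketch. Everything named here is hypothetical: `ResourceErrorSketch` stands in for the substrate's `CapabilityResourceError` (whose constructor is not shown in this document), `oom_to_resource_error_sketch` is not the real helper, and `available_mb` is passed in explicitly because how the real helper measures free VRAM is not documented here. A plain `RuntimeError` simulates the OOM so the example runs without a GPU.

```python
class ResourceErrorSketch(RuntimeError):
    """Hypothetical stand-in for the substrate's CapabilityResourceError."""

    def __init__(self, message: str, *, needed_mb: float, available_mb: float) -> None:
        super().__init__(message)
        self.needed_mb = needed_mb
        self.available_mb = available_mb


def oom_to_resource_error_sketch(
    exc: BaseException,
    *,
    label: str,
    headroom_mb: float = 100.0,
    available_mb: float = 0.0,  # hypothetical parameter; the real helper's source for this is not documented
) -> ResourceErrorSketch:
    """Build a typed error whose `needed_mb` sits just above `available_mb`."""
    needed_mb = available_mb + headroom_mb  # best-effort estimate, above `available` by design
    return ResourceErrorSketch(
        f"CUDA out of memory while {label}: {exc}",
        needed_mb=needed_mb,
        available_mb=available_mb,
    )


def load_model() -> None:
    # Simulated load site; a real one would catch torch.cuda.OutOfMemoryError.
    try:
        raise RuntimeError("CUDA out of memory (simulated)")
    except RuntimeError as e:
        raise oom_to_resource_error_sketch(
            e, label="loading 'demo-model'", available_mb=512.0
        ) from e


try:
    load_model()
except ResourceErrorSketch as err:
    print(err.needed_mb, err.available_mb)  # 612.0 512.0
    print(type(err.__cause__).__name__)     # RuntimeError
```

Because the helper returns the error rather than raising it, the `from e` clause stays at the call site, which keeps the original exception attached as `__cause__` for debugging.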