cjm-substrate-torch-utils
PyTorch helpers for cjm-substrate capabilities: GPU memory release, typed CUDA-OOM handling, and device selection.
Install
pip install cjm_substrate_torch_utils
Project Structure
nbs/
├── device.ipynb # Resolve a device spec ("auto" / "cpu" / "cuda" / "cuda:N") to a concrete torch device string.
├── memory.ipynb # Robust move-to-CPU + drop-references + gc + CUDA-cache cleanup for releasing models, factored out of the per-capability reimplementations.
└── oom.ipynb # Convert torch CUDA out-of-memory exceptions into the substrate's typed `CapabilityResourceError` (SG-47 Track B) so CR-7 reactive retry can evict and reload.
Total: 3 notebooks
Module Dependencies
graph LR
device["device<br/>Device resolution"]
memory["memory<br/>GPU model release"]
oom["oom<br/>CUDA OOM handling"]
No cross-module dependencies detected.
CLI Reference
No CLI commands found in this project.
Module Overview
Detailed documentation for each module in the project:
Device resolution (device.ipynb)
Resolve a device spec (“auto” / “cpu” / “cuda” / “cuda:N”) to a concrete torch device string.
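The resolution rule can be sketched in a few lines. This is an illustrative stand-in, not the package's implementation: `resolve_device_sketch` is a hypothetical name, and CUDA availability is passed in as a plain boolean (the real helper presumably queries torch itself) so the example runs without torch installed.

```python
def resolve_device_sketch(spec: str = "auto", cuda_available: bool = False) -> str:
    """Hypothetical stand-in mirroring the documented rule for resolve_torch_device."""
    if spec == "auto":
        # "auto" picks CUDA when it is available, otherwise falls back to CPU.
        return "cuda" if cuda_available else "cpu"
    # Any explicit spec ("cpu", "cuda", "cuda:0", ...) is returned unchanged.
    return spec


print(resolve_device_sketch("auto", cuda_available=True))     # cuda
print(resolve_device_sketch("auto", cuda_available=False))    # cpu
print(resolve_device_sketch("cuda:1", cuda_available=False))  # cuda:1
```

Note that, under this rule, an explicit spec is passed through as-is, so a `"cuda:N"` request is not checked against the devices actually present.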
Import
from cjm_substrate_torch_utils.device import (
    resolve_torch_device
)
Functions
def resolve_torch_device(
    spec: str = "auto",  # Requested device: "auto", "cpu", "cuda", or "cuda:N"
) -> str:  # Concrete device string
    """
    Resolve a device spec to a concrete torch device string.

    `"auto"` resolves to `"cuda"` when CUDA is available, else `"cpu"`. Any
    explicit spec (`"cpu"`, `"cuda"`, `"cuda:0"`, ...) is returned unchanged.
    """
GPU model release (memory.ipynb)
Robust move-to-CPU + drop-references + gc + CUDA-cache cleanup for releasing models, factored out of the per-capability reimplementations.
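The release sequence (move to CPU, drop references, collect garbage, clear the CUDA cache) might look roughly like the sketch below. Everything here is an assumption for illustration: `release_model_sketch`, `FakeModel`, and `FakeCapability` are hypothetical names, the fake model only records which device it was moved to, and the torch import is guarded so the example also runs on machines without torch.

```python
import gc
import logging
from typing import Any, List


def release_model_sketch(
    obj: Any, model_attr_names: List[str], device: str = "cuda", *, logger: logging.Logger
) -> None:
    """Hypothetical stand-in for the documented release steps; not the package's code."""
    on_cuda = device.startswith("cuda")
    for name in model_attr_names:
        model = getattr(obj, name, None)
        if model is None:
            continue  # missing or already released: nothing to do
        if on_cuda and hasattr(model, "to"):
            try:
                model.to("cpu")  # best-effort move off the GPU
            except Exception as exc:
                logger.warning("Could not move %s to CPU: %s", name, exc)
        setattr(obj, name, None)  # drop the attribute reference
        del model                 # drop the local reference
    gc.collect()
    if on_cuda:
        try:
            import torch  # guarded: the sketch still works without torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
                torch.cuda.synchronize()
        except Exception as exc:
            logger.warning("CUDA cache cleanup skipped: %s", exc)


class FakeModel:
    """Stand-in for a torch module; only tracks its device."""
    def __init__(self) -> None:
        self.device = "cuda"

    def to(self, device: str) -> "FakeModel":
        self.device = device
        return self


class FakeCapability:
    """Stand-in for a capability instance holding a model and a processor."""
    def __init__(self) -> None:
        self.model = FakeModel()
        self.processor = object()  # no .to method, so the move step is skipped


log = logging.getLogger("release-sketch")
cap = FakeCapability()
release_model_sketch(cap, ["model", "processor"], "cuda", logger=log)
release_model_sketch(cap, ["model", "processor"], "cuda", logger=log)  # second call is a no-op
```

Clearing the cache once after the loop, rather than once per attribute, keeps the comparatively expensive CUDA calls to a single pass.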
Import
from cjm_substrate_torch_utils.memory import (
    release_model
)
Functions
def release_model(
    obj: Any,  # The capability instance holding the model attribute(s)
    model_attr_names: List[str],  # Names of the attributes to release, in release order
    device: str = "cuda",  # Device the model is on; gates the CUDA-specific cleanup
    *,
    logger: logging.Logger,  # Logger for best-effort failure reporting
) -> None:
    """
    Release one or more model objects: move to CPU, drop references, gc, free CUDA cache.

    For each name in `model_attr_names`, if `obj` has a non-None attribute:

    1. when on CUDA, best-effort `.to('cpu')` (frees GPU tensors; skipped for
       objects without a `.to` method, e.g. processors/tokenizers),
    2. `setattr(obj, name, None)` and drop the local reference.

    Then a single `gc.collect()` and, on CUDA, `empty_cache()` + `synchronize()`.

    Best-effort throughout: failures are logged and swallowed. Missing or
    already-None attributes are skipped, so the call is idempotent.
    """
CUDA OOM handling (oom.ipynb)
Convert torch CUDA out-of-memory exceptions into the substrate’s typed `CapabilityResourceError` (SG-47 Track B) so CR-7 reactive retry can evict and reload.
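The catch-convert-reraise pattern can be sketched without torch or the substrate installed. All names below are hypothetical stand-ins: `CapabilityResourceErrorSketch` only imitates the substrate's error type (its real constructor and field names are not shown in this document), `MemoryError` plays the role of `torch.cuda.OutOfMemoryError`, and the available-VRAM figure is passed in by hand rather than queried from the GPU.

```python
class CapabilityResourceErrorSketch(RuntimeError):
    """Hypothetical stand-in for the substrate's typed resource error."""
    def __init__(self, message: str, *, needed_mb: float, available_mb: float) -> None:
        super().__init__(message)
        self.needed_mb = needed_mb
        self.available_mb = available_mb


def oom_to_resource_error_sketch(
    exc: BaseException, *, label: str, available_mb: float, headroom_mb: float = 100.0
) -> CapabilityResourceErrorSketch:
    """Build (not raise) a typed error; `needed` is only an estimate above `available`."""
    needed_mb = available_mb + headroom_mb
    return CapabilityResourceErrorSketch(
        f"CUDA out of memory while {label}: {exc}",
        needed_mb=needed_mb,
        available_mb=available_mb,
    )


def load_model() -> None:
    # Stand-in for a model load that exhausts GPU memory.
    raise MemoryError("simulated CUDA OOM")


def load_with_typed_error(available_mb: float) -> None:
    try:
        load_model()
    except MemoryError as e:
        # The caller raises the returned error and chains the original cause.
        raise oom_to_resource_error_sketch(
            e, label="loading model", available_mb=available_mb
        ) from e


try:
    load_with_typed_error(512.0)
except CapabilityResourceErrorSketch as err:
    print(err.needed_mb, err.available_mb, type(err.__cause__).__name__)
```

Returning the error instead of raising it inside the helper keeps the `raise ... from e` at the call site, so the traceback points at the load or inference code that actually ran out of memory.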
Import
from cjm_substrate_torch_utils.oom import (
    cuda_oom_to_capability_resource_error
)
Functions
def cuda_oom_to_capability_resource_error(
    exc: BaseException,  # The caught CUDA OOM exception (e.g. torch.cuda.OutOfMemoryError)
    *,
    label: str,  # Context for the message, e.g. "loading model 'X'" or "inference"
    headroom_mb: float = 100.0,  # Best-effort margin added to `available` to estimate `needed`
) -> CapabilityResourceError:  # Typed error for the substrate's CR-7 reactive-retry path
    """
    Convert a CUDA out-of-memory exception into a substrate-typed `CapabilityResourceError`.

    SG-47 Track B: a capability's GPU inference / model-load site catches
    `torch.cuda.OutOfMemoryError` and re-raises the result of this helper so the
    substrate sees a typed resource error (evict + reload + retry via CR-7)
    instead of an opaque crash.

    `needed` is a best-effort estimate (`available + headroom_mb`): the true
    required VRAM is unknowable from the exception, and CR-7 triggers eviction
    regardless of magnitude, so an approximation above `available` is sufficient.

    The caller raises the returned error, preserving the original cause:

        try:
            model = Model.from_pretrained(repo_id, ...)
        except torch.cuda.OutOfMemoryError as e:
            raise cuda_oom_to_capability_resource_error(e, label=f"loading {repo_id!r}") from e
    """