CUDA OOM handling

Convert torch CUDA out-of-memory exceptions into the substrate’s typed CapabilityResourceError (SG-47 Track B) so CR-7 reactive retry can evict and reload.

A torch GPU capability wraps its risky inference / model-load call in a try/except torch.cuda.OutOfMemoryError and converts the caught error with this helper, then re-raises it preserving the original cause. The substrate’s CR-7 reactive-retry path (always-reload-then-retry on CapabilityResourceError) consumes the typed error and its ResourceShortfall.

This is the single source of truth for the ~8-line OOM-conversion block that Voxtral-HF (3 sites), Qwen3-FA, LavaSR, Whisper, and Demucs each reimplement.


cuda_oom_to_capability_resource_error


def cuda_oom_to_capability_resource_error(
    exc:BaseException, # The caught CUDA OOM exception (e.g. torch.cuda.OutOfMemoryError)
    label:str, # Context for the message, e.g. "loading model 'X'" or "inference"
    headroom_mb:float=100.0, # Best-effort margin added to `available` to estimate `needed`
)->CapabilityResourceError: # Typed error for the substrate's CR-7 reactive-retry path

Convert a CUDA out-of-memory exception into a substrate-typed CapabilityResourceError.

SG-47 Track B: a capability’s GPU inference / model-load site catches torch.cuda.OutOfMemoryError and re-raises the result of this helper so the substrate sees a typed resource error (evict + reload + retry via CR-7) instead of an opaque crash.

needed is a best-effort estimate (available + headroom_mb): the true required VRAM is unknowable from the exception, and CR-7 triggers eviction regardless of magnitude, so an approximation above available is sufficient.

The caller raises the returned error, preserving the original cause:

try:
    model = Model.from_pretrained(repo_id, ...)
except torch.cuda.OutOfMemoryError as e:
    raise cuda_oom_to_capability_resource_error(e, label=f"loading {repo_id!r}") from e

Tests.

_fake_oom = RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")

# Returns a typed CapabilityResourceError with a populated gpu_vram_mb shortfall.
_err = cuda_oom_to_capability_resource_error(_fake_oom, label="loading test model")
assert isinstance(_err, CapabilityResourceError)
assert _err.resource_shortfall is not None
assert _err.resource_shortfall.resource == "gpu_vram_mb"
assert _err.resource_shortfall.available >= 0.0
# needed == available + default headroom (100.0), regardless of GPU presence.
assert _err.resource_shortfall.needed == _err.resource_shortfall.available + 100.0
assert "loading test model" in str(_err)

# Custom headroom is honored.
_err2 = cuda_oom_to_capability_resource_error(_fake_oom, label="x", headroom_mb=250.0)
assert _err2.resource_shortfall.needed == _err2.resource_shortfall.available + 250.0

# The raise-from idiom works and preserves the cause.
try:
    raise cuda_oom_to_capability_resource_error(_fake_oom, label="infer") from _fake_oom
except CapabilityResourceError as _e:
    assert _e.__cause__ is _fake_oom