# Capability Error Taxonomy


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Category model

Every substrate-recognized exception carries a `category` ClassVar that
tells the JobQueue / scheduler / operator UI which retry-or-not-retry
treatment is appropriate. Four categories:

<table>
<colgroup>
<col style="width: 24%" />
<col style="width: 53%" />
<col style="width: 21%" />
</colgroup>
<thead>
<tr>
<th>Category</th>
<th>Retriable by default</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>user_input</code></td>
<td>Yes (the user can fix and resubmit)</td>
<td>Bad config, missing file, invalid argument</td>
</tr>
<tr>
<td><code>transient</code></td>
<td>Yes (retry may succeed)</td>
<td>Network blip, timeout, temporary resource lock</td>
</tr>
<tr>
<td><code>resource</code></td>
<td>Yes (after eviction)</td>
<td>Out of GPU VRAM, disk, system RAM</td>
</tr>
<tr>
<td><code>fatal</code></td>
<td>No</td>
<td>Bug, broken capability install, irrecoverable state</td>
</tr>
</tbody>
</table>

**MRO discipline**: `CapabilityInputError` is the only category that
multiply inherits `ValueError`. The semantic argument:
`except ValueError:` expresses intent to catch invalid-argument errors.
Letting transient / resource / fatal errors be caught by a bare
`except ValueError:` would silently broaden that intent, which we
specifically do not want.

------------------------------------------------------------------------

### CapabilityFatalError

``` python

def CapabilityFatalError(
    args:VAR_POSITIONAL, kwargs:VAR_KEYWORD
):

```

*Bug / irrecoverable state. The capability cannot complete this job;
retrying won’t help.*

Capability authors raise this when they know the failure is permanent
for the given inputs. The substrate does NOT retry fatal errors.

------------------------------------------------------------------------

### CapabilityResourceError

``` python

def CapabilityResourceError(
    message:str, # Human-readable description
    resource_shortfall:Optional=None, # Quantitative gap
):

```

*Resource exhaustion: GPU VRAM, system RAM, disk full.*

JobQueue’s reactive-eviction flow (CR-7) routes resource errors to retry
after attempting to free the named resource. Capability authors set
`resource_shortfall` so the substrate knows what to evict.

------------------------------------------------------------------------

### CapabilityTransientError

``` python

def CapabilityTransientError(
    message:str, # Human-readable description
    retry_after_seconds:Optional=None, # Hint for backoff strategies
):

```

*Temporary failure: timeout, network blip, brief resource contention.*

Substrate / JobQueue may retry on its own initiative. Capability authors
raise this when they know the failure is recoverable.

------------------------------------------------------------------------

### CapabilityInputError

``` python

def CapabilityInputError(
    message:str, # Human-readable description
    fields_invalid:Optional=None, # Names of inputs that failed validation
):

```

*User-fixable error: bad config, invalid argument, missing file.*

Like the other category bases (`CapabilityTransientError`,
`CapabilityResourceError`, `CapabilityFatalError`), it extends only
`CapabilityError`; the right reader intent is
`except CapabilityInputError:` (or the broader
`except CapabilityError:`).

------------------------------------------------------------------------

### CapabilityError

``` python

def CapabilityError(
    args:VAR_POSITIONAL, kwargs:VAR_KEYWORD
):

```

*Base for substrate-recognized capability exceptions.*

Subclasses declare a `category` and `default_retriable` ClassVar so the
JobQueue + scheduler can route the failure without sniffing exception
text. Bare Python exceptions raised by capability code go through
`map_bare_exception_to_job_error` to acquire a default category.

## Substrate-raised typed exceptions

These concrete exception types are *defined* in CR-5 (this file) so CR-2
/ CR-6 / SG-14 can raise them when they land. Each anchors under the
appropriate category base so its catch behavior is correct from day one.

------------------------------------------------------------------------

### CapabilityDisabledError

``` python

def CapabilityDisabledError(
    capability_name:str
):

```

*JobQueue / execute_capability rejected: the capability is currently
disabled.*

User-fixable (re-enable the capability). Raised by CR-2’s enable/disable
wiring once that lands.

------------------------------------------------------------------------

### CapabilityNotLoadedError

``` python

def CapabilityNotLoadedError(
    capability_name:str
):

```

*Caller submitted to a capability that was never loaded.*

Fatal category because this is a programmer / orchestration bug, not a
user-fixable condition. The right reader intent is
`except CapabilityNotLoadedError:` (or the broader
`except CapabilityError:`).

------------------------------------------------------------------------

### CapabilityTimeoutError

``` python

def CapabilityTimeoutError(
    capability_name:str, timeout_seconds:float, retry_after_seconds:Optional=None
):

```

*A per-job timeout fired before the capability finished.*

Transient category — retry may succeed if the slow operation completes
faster next time. Carries `retry_after_seconds` from
`CapabilityTransientError`. Raised by SG-14’s per-job timeout primitive
when that lands.

------------------------------------------------------------------------

### CapabilityCancelledError

``` python

def CapabilityCancelledError(
    capability_name:str
):

```

*Cooperative cancellation signal raised from
`ToolCapability.check_cancel()`.*

Anchors under `CapabilityTransientError` because cancellation is
in-principle re-runnable — a future attempt with the same inputs won’t
auto-fail if the cancel flag isn’t set. But `default_retriable` is
False: cancellation was a deliberate operator action, so the substrate
should NOT auto-retry. Job-monitor / JobQueue render cancelled jobs with
their own state (separate from “failed”); the JobError category remains
`transient` so consumers reading the typed taxonomy can group
recoverable signals.

Capability authors raise this implicitly via `self.check_cancel()`
inside `execute()`; substrate sets the underlying `_cancel_requested`
flag via `cancel()`. See CR-4’s cancellation primitives for the
cooperative-cancel protocol.

------------------------------------------------------------------------

### WorkerOOMError

``` python

def WorkerOOMError(
    capability_name:str, process_returncode:Optional=None, message:Optional=None
):

```

*The worker subprocess died with a kill-signal during an active execute
call.*

CR-7 Track A — substrate-side OOM detection: when an HTTP call to the
worker faults and the subprocess has died with
`returncode == -signal.SIGKILL` (or the platform equivalent), the
substrate raises this. The kernel OOM-killer is the most common cause of
SIGKILL during normal execute paths, so the substrate treats
SIGKILL-during-call as “assume OOM” and surfaces a typed resource error
for the reactive retry path.

`resource_shortfall` is `None` for Track A — the substrate only saw
“worker died from kill-signal” and has no per-resource needed/available
numbers. Track B (per SG-47’s sub-task: capability-side wrapping of
`torch.cuda.OutOfMemoryError` et al.) raises `CapabilityResourceError`
directly with a populated `ResourceShortfall` because the capability had
the context. Both land at the same `except CapabilityResourceError` site
in CR-7’s reactive retry loop.

`process_returncode` carries the observed exit code for debugging /
classification (e.g. operators can distinguish kernel-OOM SIGKILL from
other signals if they read it). Defaults to `None` for callers that
don’t have it on hand.

## CapabilityConfigError (reparented from `utils.validation` per CR-5)

Originally defined in `utils/validation.py` by SG-8 as a `ValueError`
subclass. CR-5 reparents it under `CapabilityInputError`, preserving
`except ValueError:` compat (via `CapabilityInputError`’s ValueError
MRO) and using `fields_invalid` as the canonical attribute name,
consistent with the rest of the input-error hierarchy.

------------------------------------------------------------------------

### CapabilityConfigError

``` python

def CapabilityConfigError(
    message:str, # Human-readable description
    fields_invalid:Optional=None, # Canonical: list of bad config keys
    config_class_name:str='', # Dataclass / capability name for the schema
):

```

*Unknown / invalid keys in a config dict against a capability’s config
schema.*

Reparented from `cjm_substrate.utils.validation` (Wave 2 / SG-8) under
CR-5. Inherits `CapabilityInputError`’s ValueError MRO automatically.
`config_class_name` is the dataclass / capability name whose schema was
violated.

## JobError + ResourceShortfall + TracebackPolicy

When a capability job fails, the JobQueue (CR-6) records a `JobError`
summary on the completed `Job`. The summary captures everything a
frontend / operator needs to understand and (optionally) retry the
failure without re-running the capability:

- `category` lets UI decide retry button affordances.
- `retriable` carries the substrate’s policy on whether to auto-retry.
- `original_exc_repr` + optional `traceback` give post-mortem context.
- `fields_invalid` / `resource_shortfall` are category-specific
  structured data.

`TracebackPolicy` controls how much detail the substrate records.
Default `FULL` is what dev mode wants; `REPR_ONLY` and `NONE` are future
opt-outs for security-sensitive multi-user deployments.

------------------------------------------------------------------------

### JobError

``` python

def JobError(
    category:Literal, message:str, retriable:bool, original_exc_repr:str, traceback:Optional=None,
    retry_after_seconds:Optional=None, fields_invalid:Optional=None, resource_shortfall:Optional=None,
    capability_name:Optional=None, capability_instance_id:Optional=None, occurred_at:Optional=None
)->None:

```

*Structured failure summary recorded on a completed Job.*

Populated by the JobQueue when a capability execution fails (CR-6 owns
the population logic; CR-5 owns the shape). Sufficient for UI to render
a failure card + retry affordance without re-running the capability.

------------------------------------------------------------------------

### TracebackPolicy

``` python

def TracebackPolicy(
    args:VAR_POSITIONAL, kwds:VAR_KEYWORD
):

```

*How much exception detail the substrate records on a JobError.*

------------------------------------------------------------------------

### ResourceShortfall

``` python

def ResourceShortfall(
    resource:Literal, needed:float, available:float
)->None:

```

*Quantitative gap between what a capability needed and what was
available.*

## Default classification of bare Python exceptions

Capability authors will gradually migrate to raising `CapabilityError`
subclasses (SG-47 cascade). Until then, the JobQueue still needs to
classify bare `ValueError` / `TimeoutError` / etc. into one of the four
categories so retry policy is correct from day one.

The mapping walks the exception’s `__mro__` against a substrate-provided
lookup. First MRO ancestor that matches wins. Default for everything
else is `fatal` — conservative: don’t auto-retry an exception we can’t
classify.

------------------------------------------------------------------------

### classify_exception

``` python

def classify_exception(
    exc:BaseException, # The exception to classify
)->Literal['user_input', 'transient', 'resource', 'fatal']: # Category

```

*Return the substrate category for any exception.*

CapabilityError subclasses report their own declared `category`. Bare
Python exceptions are mapped via `__mro__` walk against
`_BARE_EXCEPTION_CATEGORY_MAP`; the first ancestor in the table wins.
Unrecognized exceptions classify as `fatal` (don’t auto-retry the
unknown).

------------------------------------------------------------------------

### map_bare_exception_to_job_error

``` python

def map_bare_exception_to_job_error(
    exc:BaseException, # The raised exception
    capability_name:Optional=None, # Name of the capability that raised
    capability_instance_id:Optional=None, # Per CR-10
    traceback_policy:TracebackPolicy=<TracebackPolicy.FULL: 'full'>, # How much detail to record
    occurred_at:Optional=None, # Override; defaults to datetime.now(timezone.utc)
)->JobError:

```

*Convert any exception into a structured `JobError`.*

CapabilityError subclasses contribute their category-specific structured
data (`fields_invalid` for input errors, `resource_shortfall` for
resource errors, `retry_after_seconds` for transient errors). Bare
exceptions get the default category-based retriable flag and no
structured side-channel.

## Regression tests

These exercises pin the MRO discipline, the backward-compat shim
behavior, and the default classification. The MRO assertions are
particularly load-bearing — future refactors that accidentally broaden
`CapabilityError(ValueError)` would catch transient/resource/fatal
errors via `except ValueError:`, which we explicitly do not want.

``` python
# MRO discipline: capability errors do NOT extend ValueError (SG-48 dropped
# the CapabilityInputError ValueError base).
input_err = CapabilityInputError("bad", fields_invalid=["foo"])
assert not isinstance(input_err, ValueError), "SG-48 dropped the ValueError base"
assert isinstance(input_err, CapabilityError)
assert isinstance(input_err, Exception)
assert input_err.category == 'user_input'
assert input_err.default_retriable is True
assert input_err.fields_invalid == ["foo"]

transient_err = CapabilityTransientError("slow", retry_after_seconds=5.0)
assert not isinstance(transient_err, ValueError), \
    "CapabilityTransientError must NOT inherit ValueError (semantic discipline)"
assert isinstance(transient_err, CapabilityError)
assert transient_err.category == 'transient'
assert transient_err.retry_after_seconds == 5.0

resource_err = CapabilityResourceError(
    "oom",
    resource_shortfall=ResourceShortfall(resource='gpu_vram_mb', needed=8000, available=4000),
)
assert not isinstance(resource_err, ValueError)
assert resource_err.category == 'resource'
assert resource_err.resource_shortfall.needed == 8000

fatal_err = CapabilityFatalError("crashed")
assert not isinstance(fatal_err, ValueError)
assert fatal_err.category == 'fatal'
assert fatal_err.default_retriable is False

print("✓ MRO discipline: no capability error extends ValueError")
```

    ✓ MRO discipline: only PluginInputError tree extends ValueError

``` python
# Substrate-side typed exceptions anchor under the correct category.
disabled = CapabilityDisabledError("whisper")
assert isinstance(disabled, CapabilityInputError)
assert not isinstance(disabled, ValueError), "SG-48 dropped the ValueError base"
assert disabled.category == 'user_input'
assert disabled.capability_name == "whisper"

not_loaded = CapabilityNotLoadedError("whisper")
assert isinstance(not_loaded, CapabilityFatalError)
assert not isinstance(not_loaded, ValueError), \
    "CapabilityNotLoadedError must NOT be catchable as ValueError (it's a fatal bug)"
assert not_loaded.category == 'fatal'

timeout = CapabilityTimeoutError("whisper", timeout_seconds=30.0, retry_after_seconds=60.0)
assert isinstance(timeout, CapabilityTransientError)
assert not isinstance(timeout, ValueError)
assert timeout.category == 'transient'
assert timeout.timeout_seconds == 30.0
assert timeout.retry_after_seconds == 60.0

# CR-4: CapabilityCancelledError extends CapabilityTransientError but is non-retriable
# (deliberate operator action — substrate should not auto-retry cancelled jobs).
cancelled = CapabilityCancelledError("whisper")
assert isinstance(cancelled, CapabilityTransientError)
assert isinstance(cancelled, CapabilityError)
assert not isinstance(cancelled, ValueError), \
    "CapabilityCancelledError must NOT be catchable as ValueError (it's a control-flow signal, not a value error)"
assert cancelled.category == 'transient', "category=transient: cancellation is in-principle re-runnable"
assert cancelled.default_retriable is False, \
    "default_retriable=False: substrate must not auto-retry operator-cancelled jobs"
assert cancelled.capability_name == "whisper"
assert "cancelled by operator" in str(cancelled)

# CR-7 Track A: WorkerOOMError extends CapabilityResourceError with default_retriable=True
# inherited; carries process_returncode for operator debugging; no ResourceShortfall.
oom = WorkerOOMError("whisper", process_returncode=-9)
assert isinstance(oom, CapabilityResourceError), "must catch under CapabilityResourceError"
assert isinstance(oom, CapabilityError)
assert not isinstance(oom, ValueError), "resource errors are not ValueErrors"
assert oom.category == 'resource', "CR-7 reactive retry dispatches on category=resource"
assert oom.default_retriable is True, \
    "default_retriable=True: OOM is retriable after eviction (the whole point of CR-7)"
assert oom.capability_name == "whisper"
assert oom.process_returncode == -9
assert oom.resource_shortfall is None, \
    "Track A: substrate doesn't know needed/available; only Track B (capability-side raise) does"
assert "whisper" in str(oom)
assert "returncode=-9" in str(oom)

# CR-7: WorkerOOMError catches at the CapabilityResourceError site (the shared
# CR-7 reactive retry catch-point). Track A + Track B converge here.
def fake_track_a_raise():
    raise WorkerOOMError("voxtral", process_returncode=-9)

def fake_track_b_raise():
    raise CapabilityResourceError(
        "voxtral: CUDA OOM",
        resource_shortfall=ResourceShortfall(
            resource='gpu_vram_mb', needed=24000, available=8000,
        ),
    )

for raiser in (fake_track_a_raise, fake_track_b_raise):
    caught = False
    try:
        raiser()
    except CapabilityResourceError:
        caught = True
    assert caught, f"{raiser.__name__} must catch under CapabilityResourceError"

# Custom message override path
oom_custom = WorkerOOMError("whisper", message="custom diagnostic")
assert str(oom_custom) == "custom diagnostic"
assert oom_custom.capability_name == "whisper"
assert oom_custom.process_returncode is None

print("✓ Substrate-side typed exceptions anchor under the right category")
```

``` python
# CapabilityConfigError reparenting under CapabilityInputError; fields_invalid canonical.
err = CapabilityConfigError(
    "unknown keys",
    fields_invalid=["foo", "bar"],
    config_class_name="WhisperConfig",
)
assert isinstance(err, CapabilityInputError)
assert not isinstance(err, ValueError), "SG-48 dropped the ValueError base"
assert err.fields_invalid == ["foo", "bar"]
assert err.config_class_name == "WhisperConfig"

print("✓ CapabilityConfigError reparenting works")
```

    ✓ PluginConfigError reparenting + backward-compat shims work

``` python
# Default classification of bare Python exceptions.
assert classify_exception(ValueError("bad")) == 'user_input'
assert classify_exception(TypeError("bad")) == 'user_input'
assert classify_exception(FileNotFoundError("missing")) == 'user_input'
assert classify_exception(TimeoutError("slow")) == 'transient'
assert classify_exception(ConnectionError("net")) == 'transient'
assert classify_exception(MemoryError("oom")) == 'resource'
assert classify_exception(RuntimeError("unknown")) == 'fatal'

# CapabilityError subclasses report their own declared category, not the
# inherited-builtin's category. CapabilityInputError extends ValueError but its
# category is 'user_input' (the declared value), not derived from ValueError.
assert classify_exception(CapabilityInputError("x")) == 'user_input'
assert classify_exception(CapabilityTransientError("x")) == 'transient'
assert classify_exception(CapabilityResourceError("x")) == 'resource'
assert classify_exception(CapabilityFatalError("x")) == 'fatal'

# CapabilityNotLoadedError is fatal even though no built-in maps to fatal by default.
assert classify_exception(CapabilityNotLoadedError("whisper")) == 'fatal'

print("✓ Default exception classification correct")
```

    ✓ Default exception classification correct

``` python
# map_bare_exception_to_job_error captures category + retriable + structured data.
try:
    raise CapabilityConfigError("bad config", fields_invalid=["model"])
except Exception as e:
    err = map_bare_exception_to_job_error(e, capability_name="whisper")

assert err.category == 'user_input'
assert err.retriable is True
assert err.fields_invalid == ["model"]
assert err.capability_name == "whisper"
assert err.traceback is not None and "CapabilityConfigError" in err.traceback
assert err.occurred_at is not None
# Python 3.12+ compat: occurred_at must be timezone-aware (datetime.utcnow()
# is deprecated and returns naive datetime; we use datetime.now(timezone.utc)).
assert err.occurred_at.tzinfo is not None, \
    "occurred_at should be timezone-aware (CR-5 Python 3.12+ future-proof form)"

# Resource error: resource_shortfall propagates.
try:
    raise CapabilityResourceError(
        "oom",
        resource_shortfall=ResourceShortfall(resource='gpu_vram_mb', needed=16000, available=8000),
    )
except Exception as e:
    err = map_bare_exception_to_job_error(e)

assert err.category == 'resource'
assert err.retriable is True
assert err.resource_shortfall.needed == 16000

# Bare ValueError gets default user_input + retriable=True.
try:
    raise ValueError("unmapped bare")
except Exception as e:
    err = map_bare_exception_to_job_error(e)

assert err.category == 'user_input'
assert err.retriable is True
assert err.fields_invalid is None  # bare ValueError has no fields_invalid attribute

# Bare RuntimeError gets default fatal + retriable=False.
try:
    raise RuntimeError("unknown")
except Exception as e:
    err = map_bare_exception_to_job_error(e)

assert err.category == 'fatal'
assert err.retriable is False

# TracebackPolicy.NONE suppresses traceback + message.
try:
    raise ValueError("secret")
except Exception as e:
    err = map_bare_exception_to_job_error(e, traceback_policy=TracebackPolicy.NONE)

assert err.traceback is None
assert err.message == ""
assert err.original_exc_repr  # repr is always kept (debug breadcrumb)

print("✓ map_bare_exception_to_job_error preserves structured data + honors TracebackPolicy")
```

    ✓ map_bare_exception_to_job_error preserves structured data + honors TracebackPolicy
