# Plugin Error Taxonomy


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Category model

Every substrate-recognized exception carries a `category` ClassVar that
tells the JobQueue / scheduler / operator UI which retry treatment is
appropriate. There are four categories:

<table>
<colgroup>
<col style="width: 24%" />
<col style="width: 53%" />
<col style="width: 21%" />
</colgroup>
<thead>
<tr>
<th>Category</th>
<th>Retriable by default</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>user_input</code></td>
<td>Yes (the user can fix and resubmit)</td>
<td>Bad config, missing file, invalid argument</td>
</tr>
<tr>
<td><code>transient</code></td>
<td>Yes (retry may succeed)</td>
<td>Network blip, timeout, temporary resource lock</td>
</tr>
<tr>
<td><code>resource</code></td>
<td>Yes (after eviction)</td>
<td>Out of GPU VRAM, disk, system RAM</td>
</tr>
<tr>
<td><code>fatal</code></td>
<td>No</td>
<td>Bug, broken plugin install, irrecoverable state</td>
</tr>
</tbody>
</table>

**MRO discipline**: `PluginInputError` is the only category base that
also inherits from `ValueError`. The semantic argument:
`except ValueError:` expresses intent to catch invalid-argument errors.
Letting transient / resource / fatal errors be caught by a plain
`except ValueError:` would silently broaden that intent, which we
specifically do not want.
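The inheritance shape this implies can be sketched with stand-in classes. The class bodies and the `caught_by_value_error` helper are assumptions for illustration, not the substrate’s real definitions:

```python
from typing import ClassVar


# Stand-in classes: the names mirror the taxonomy, the bodies are illustrative only.
class PluginError(Exception):
    """Base for substrate-recognized plugin exceptions (sketch)."""


class PluginInputError(PluginError, ValueError):
    # The only category base that also inherits from ValueError.
    category: ClassVar[str] = "user_input"
    default_retriable: ClassVar[bool] = True


class PluginTransientError(PluginError):
    # Deliberately NOT a ValueError.
    category: ClassVar[str] = "transient"
    default_retriable: ClassVar[bool] = True


def caught_by_value_error(exc: Exception) -> bool:
    """Return True if a plain `except ValueError:` would catch `exc`."""
    try:
        raise exc
    except ValueError:
        return True
    except Exception:
        return False


print(caught_by_value_error(PluginInputError("bad arg")))     # True
print(caught_by_value_error(PluginTransientError("timeout")))  # False
print([cls.__name__ for cls in PluginInputError.__mro__][:4])
```

Listing `PluginError` before `ValueError` in the bases is what produces the `PluginInputError → PluginError → ValueError → Exception` ordering.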

------------------------------------------------------------------------

### PluginFatalError

``` python

def PluginFatalError(
    args:VAR_POSITIONAL, kwargs:VAR_KEYWORD
):

```

*Bug / irrecoverable state. The plugin cannot complete this job;
retrying won’t help.*

Plugin authors raise this when they know the failure is permanent for
the given inputs. The substrate does NOT retry fatal errors.

------------------------------------------------------------------------

### PluginResourceError

``` python

def PluginResourceError(
    message:str, # Human-readable description
    resource_shortfall:Optional=None, # Quantitative gap
):

```

*Resource exhaustion: GPU VRAM, system RAM, disk full.*

JobQueue’s reactive-eviction flow (CR-7) routes resource errors to retry
after attempting to free the named resource. Plugin authors set
`resource_shortfall` so the substrate knows what to evict.

------------------------------------------------------------------------

### PluginTransientError

``` python

def PluginTransientError(
    message:str, # Human-readable description
    retry_after_seconds:Optional=None, # Hint for backoff strategies
):

```

*Temporary failure: timeout, network blip, brief resource contention.*

Substrate / JobQueue may retry on its own initiative. Plugin authors
raise this when they know the failure is recoverable.
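As a rough sketch of how a caller might honor the `retry_after_seconds` hint, the loop below prefers the hint over a default delay. `run_with_retry`, its parameters, and the stand-in exception class are hypothetical; the substrate’s actual retry policy is not described here:

```python
import time
from typing import Callable, List, Optional


class PluginTransientError(Exception):
    """Stand-in for the substrate class; only the hint attribute is modeled."""

    def __init__(self, message: str, retry_after_seconds: Optional[float] = None):
        super().__init__(message)
        self.retry_after_seconds = retry_after_seconds


def run_with_retry(
    job: Callable[[], str],
    max_attempts: int = 3,
    default_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Hypothetical retry loop: use the exception's hint when it provides one."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except PluginTransientError as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last transient error
            hint = exc.retry_after_seconds
            sleep(default_delay if hint is None else hint)
    raise RuntimeError("unreachable")


calls = {"n": 0}
delays: List[float] = []


def flaky() -> str:
    # Fails twice with a backoff hint, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise PluginTransientError("network blip", retry_after_seconds=0.5)
    return "done"


result = run_with_retry(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # done [0.5, 0.5]
```

Injecting `sleep` keeps the sketch testable without real waiting.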

------------------------------------------------------------------------

### PluginInputError

``` python

def PluginInputError(
    message:str, # Human-readable description
    fields_invalid:Optional=None, # Names of inputs that failed validation
):

```

*User-fixable error: bad config, invalid argument, missing file.*

Inherits from both `PluginError` and `ValueError` so SG-8-era
`except ValueError:` catch sites that legitimately want input errors
keep working through the SG-47 migration window. The MRO is
`PluginInputError → PluginError → ValueError → Exception`; the other
category bases (`PluginTransientError`, `PluginResourceError`,
`PluginFatalError`) deliberately do NOT extend `ValueError` because
their failure modes are not semantically value errors.

------------------------------------------------------------------------

### PluginError

``` python

def PluginError(
    args:VAR_POSITIONAL, kwargs:VAR_KEYWORD
):

```

*Base for substrate-recognized plugin exceptions.*

Subclasses declare a `category` and `default_retriable` ClassVar so the
JobQueue + scheduler can route the failure without sniffing exception
text. Bare Python exceptions raised by plugin code go through
`map_bare_exception_to_job_error` to acquire a default category.

## Substrate-raised typed exceptions

These concrete exception types are *defined* in CR-5 (this file) so CR-2
/ CR-6 / SG-14 can raise them when they land. Each anchors under the
appropriate category base so its catch behavior is correct from day one.

------------------------------------------------------------------------

### PluginDisabledError

``` python

def PluginDisabledError(
    plugin_name:str
):

```

*JobQueue / execute_plugin rejected: the plugin is currently disabled.*

User-fixable (re-enable the plugin). Inherits `PluginInputError`’s
ValueError MRO so existing `except ValueError:` callers see it as an
input error. Raised by CR-2’s enable/disable wiring once that lands.

------------------------------------------------------------------------

### PluginNotLoadedError

``` python

def PluginNotLoadedError(
    plugin_name:str
):

```

*Caller submitted to a plugin that was never loaded.*

Fatal category because this is a programmer / orchestration bug, not a
user-fixable condition. It is NOT a `ValueError`: callers should catch
it with `except PluginNotLoadedError:` (or the broader
`except PluginError:`), not a blanket `except ValueError:`.

------------------------------------------------------------------------

### PluginTimeoutError

``` python

def PluginTimeoutError(
    plugin_name:str, timeout_seconds:float, retry_after_seconds:Optional=None
):

```

*A per-job timeout fired before the plugin finished.*

Transient category — retry may succeed if the slow operation completes
faster next time. Carries `retry_after_seconds` from
`PluginTransientError`. Raised by SG-14’s per-job timeout primitive when
that lands.

------------------------------------------------------------------------

### PluginCancelledError

``` python

def PluginCancelledError(
    plugin_name:str
):

```

*Cooperative cancellation signal raised from
`PluginInterface.check_cancel()`.*

Anchors under `PluginTransientError` because cancellation is
in-principle re-runnable — a future attempt with the same inputs won’t
auto-fail if the cancel flag isn’t set. But `default_retriable` is
False: cancellation was a deliberate operator action, so the substrate
should NOT auto-retry. Job-monitor / JobQueue render cancelled jobs with
their own state (separate from “failed”); the JobError category remains
`transient` so consumers reading the typed taxonomy can group
recoverable signals.

Plugin authors raise this implicitly via `self.check_cancel()` inside
`execute()`; substrate sets the underlying `_cancel_requested` flag via
`cancel()`. See CR-4’s cancellation primitives for the
cooperative-cancel protocol.
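The cooperative pattern can be sketched as follows. Only the names `check_cancel`, `cancel`, `_cancel_requested`, and `execute` come from the text above; the `SketchPlugin` class, the use of `threading.Event` for the flag, and the stand-in exception are assumptions:

```python
import threading
from typing import List


class PluginCancelledError(Exception):
    """Stand-in: the real class anchors under PluginTransientError."""

    def __init__(self, plugin_name: str):
        super().__init__(f"{plugin_name}: cancelled by operator")
        self.plugin_name = plugin_name


class SketchPlugin:
    """Hypothetical plugin illustrating the polling pattern."""

    name = "sketch"

    def __init__(self) -> None:
        # Assumption: the flag is modeled as a thread-safe Event.
        self._cancel_requested = threading.Event()

    def cancel(self) -> None:
        # Substrate side: request cancellation.
        self._cancel_requested.set()

    def check_cancel(self) -> None:
        # Plugin side: raise if a cancel has been requested.
        if self._cancel_requested.is_set():
            raise PluginCancelledError(self.name)

    def execute(self, chunks: List[str]) -> List[str]:
        done = []
        for chunk in chunks:
            self.check_cancel()  # poll between units of work
            done.append(chunk.upper())
        return done


plugin = SketchPlugin()
print(plugin.execute(["a", "b"]))  # ['A', 'B']

plugin.cancel()
try:
    plugin.execute(["c"])
except PluginCancelledError as exc:
    print(f"cancelled: {exc.plugin_name}")  # cancelled: sketch
```

Because the check is a poll, cancellation latency is bounded by the longest unit of work between two `check_cancel()` calls.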

------------------------------------------------------------------------

### WorkerOOMError

``` python

def WorkerOOMError(
    plugin_name:str, process_returncode:Optional=None, message:Optional=None
):

```

*The worker subprocess died with a kill-signal during an active execute
call.*

CR-7 Track A — substrate-side OOM detection: when an HTTP call to the
worker faults and the subprocess has died with
`returncode == -signal.SIGKILL` (or the platform equivalent), the
substrate raises this. The kernel OOM-killer is the most common cause of
SIGKILL during normal execute paths, so the substrate treats
SIGKILL-during-call as “assume OOM” and surfaces a typed resource error
for the reactive retry path.

`resource_shortfall` is `None` for Track A — the substrate only saw
“worker died from kill-signal” and has no per-resource needed/available
numbers. Track B (per SG-47’s sub-task: plugin-side wrapping of
`torch.cuda.OutOfMemoryError` et al.) raises `PluginResourceError`
directly with a populated `ResourceShortfall` because the plugin had the
context. Both land at the same `except PluginResourceError` site in
CR-7’s reactive retry loop.

`process_returncode` carries the observed exit code for debugging /
classification (e.g. operators can distinguish kernel-OOM SIGKILL from
other signals if they read it). Defaults to `None` for callers that
don’t have it on hand.
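A minimal sketch of the Track A return-code check, assuming a POSIX platform where `signal.SIGKILL` exists. The helper name `looks_like_oom_kill` is hypothetical and not part of the substrate:

```python
import signal
from typing import Optional


def looks_like_oom_kill(returncode: Optional[int]) -> bool:
    """Hypothetical helper: True when an exit code indicates death by SIGKILL.

    On POSIX, subprocess.Popen.returncode is -N when the child was
    terminated by signal N, so a SIGKILL shows up as -signal.SIGKILL.
    """
    if returncode is None:  # still running, or exit code not available
        return False
    return returncode == -signal.SIGKILL


print(looks_like_oom_kill(-signal.SIGKILL))  # True
print(looks_like_oom_kill(1))                # False (ordinary non-zero exit)
print(looks_like_oom_kill(None))             # False
```

A SIGKILL can also come from an operator or a supervisor, which is why the text describes this as “assume OOM” rather than proof of it.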

## PluginConfigError (reparented from `utils.validation` per CR-5)

Originally defined in `utils/validation.py` by SG-8 as a `ValueError`
subclass. CR-5 reparents it under `PluginInputError`. The reparenting
preserves `except ValueError:` compat (via `PluginInputError`’s
ValueError MRO) and unifies the field-validation attribute name with the
rest of the input-error hierarchy: SG-8’s `unknown_keys` becomes
`fields_invalid` (canonical).

Backward-compat handling for the SG-8-era kwarg + attribute:

- `unknown_keys=` keyword in `__init__` is accepted but emits
  `DeprecationWarning`.
- `unknown_keys` is a read-only property aliasing `fields_invalid`.
- Both are tagged `# REMOVE-AFTER-OVERHAUL` for SG-48 sweep.

Why two REMOVE-AFTER-OVERHAUL tags rather than one: the kwarg and the
property address different migration paths. The kwarg shim helps code
that constructs the exception; the property shim helps code that
inspects the exception after catching it. Either can be removed
independently once SG-47 cascades.
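One plausible shape for the two shims is sketched below. This stand-in extends `ValueError` directly instead of `PluginInputError`, and the warning text is an assumption; only the documented behavior (the kwarg warns, the property aliases `fields_invalid`, and `fields_invalid` wins when both are given) is taken from this page:

```python
import warnings
from typing import List, Optional


class PluginConfigError(ValueError):
    """Stand-in sketch of the backward-compat shims."""

    def __init__(
        self,
        message: str,
        fields_invalid: Optional[List[str]] = None,
        config_class_name: str = "",
        unknown_keys: Optional[List[str]] = None,  # deprecated alias (kwarg shim)
    ):
        super().__init__(message)
        if unknown_keys is not None:
            warnings.warn(
                "unknown_keys= is deprecated; use fields_invalid=",
                DeprecationWarning,
                stacklevel=2,
            )
            if fields_invalid is None:  # the canonical kwarg wins when both are given
                fields_invalid = unknown_keys
        self.fields_invalid = fields_invalid
        self.config_class_name = config_class_name

    @property
    def unknown_keys(self) -> Optional[List[str]]:
        # Property shim: read-only alias for code that inspects a caught exception.
        return self.fields_invalid


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    err = PluginConfigError("legacy call", unknown_keys=["x"])

print(err.fields_invalid, err.unknown_keys, len(caught))
```

The two shims touch different code paths (`__init__` versus attribute access), which is why each can carry its own removal tag.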

------------------------------------------------------------------------

### PluginConfigError

``` python

def PluginConfigError(
    message:str, # Human-readable description
    fields_invalid:Optional=None, # Canonical: list of bad config keys
    config_class_name:str='', # Dataclass / plugin name for the schema
    unknown_keys:Optional=None, # REMOVE-AFTER-OVERHAUL: drop unknown_keys kwarg after SG-47 cascade completes
):

```

*Unknown / invalid keys in a config dict against a plugin’s config
schema.*

Reparented from `cjm_plugin_system.utils.validation` (Wave 2 / SG-8)
under CR-5. Inherits `PluginInputError`’s ValueError MRO automatically.
`config_class_name` is the dataclass / plugin name whose schema was
violated.

## JobError + ResourceShortfall + TracebackPolicy

When a plugin job fails, the JobQueue (CR-6) records a `JobError`
summary on the completed `Job`. The summary captures everything a
frontend / operator needs to understand and (optionally) retry the
failure without re-running the plugin:

- `category` lets UI decide retry button affordances.
- `retriable` carries the substrate’s policy on whether to auto-retry.
- `original_exc_repr` + optional `traceback` give post-mortem context.
- `fields_invalid` / `resource_shortfall` are category-specific
  structured data.

`TracebackPolicy` controls how much detail the substrate records.
Default `FULL` is what dev mode wants; `REPR_ONLY` and `NONE` are future
opt-outs for security-sensitive multi-user deployments.
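To illustrate how a policy could gate the recorded detail, here is a hedged sketch. The enum values for `REPR_ONLY` and `NONE`, the `render_detail` helper, and the `REPR_ONLY` behavior are assumptions; the `NONE` behavior (empty message, no traceback, repr kept) follows the regression tests on this page:

```python
import traceback
from enum import Enum
from typing import Optional, Tuple


class TracebackPolicy(Enum):
    FULL = "full"            # matches the default shown for traceback_policy
    REPR_ONLY = "repr_only"  # assumed value
    NONE = "none"            # assumed value


def render_detail(
    exc: BaseException, policy: TracebackPolicy
) -> Tuple[str, str, Optional[str]]:
    """Hypothetical helper returning (message, original_exc_repr, traceback)."""
    exc_repr = repr(exc)  # always kept as a debugging breadcrumb
    if policy is TracebackPolicy.NONE:
        return "", exc_repr, None
    if policy is TracebackPolicy.REPR_ONLY:
        # Assumption: keep the message but drop the traceback text.
        return str(exc), exc_repr, None
    tb = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return str(exc), exc_repr, tb


try:
    raise ValueError("secret")
except ValueError as e:
    print(render_detail(e, TracebackPolicy.NONE))
    full_message, full_repr, full_tb = render_detail(e, TracebackPolicy.FULL)

print("ValueError" in full_tb)  # True
```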

------------------------------------------------------------------------

### JobError

``` python

def JobError(
    category:Literal, message:str, retriable:bool, original_exc_repr:str, traceback:Optional=None,
    retry_after_seconds:Optional=None, fields_invalid:Optional=None, resource_shortfall:Optional=None,
    plugin_name:Optional=None, plugin_instance_id:Optional=None, occurred_at:Optional=None
)->None:

```

*Structured failure summary recorded on a completed Job.*

Populated by the JobQueue when a plugin execution fails (CR-6 owns the
population logic; CR-5 owns the shape). Sufficient for UI to render a
failure card + retry affordance without re-running the plugin.

------------------------------------------------------------------------

### TracebackPolicy

``` python

def TracebackPolicy(
    args:VAR_POSITIONAL, kwds:VAR_KEYWORD
):

```

*How much exception detail the substrate records on a JobError.*

------------------------------------------------------------------------

### ResourceShortfall

``` python

def ResourceShortfall(
    resource:Literal, needed:float, available:float
)->None:

```

*Quantitative gap between what a plugin needed and what was available.*
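A small sketch of how the three fields can be read as a gap. The dataclass below is a stand-in with the documented field names, and `deficit` is a hypothetical helper, not a substrate function:

```python
from dataclasses import dataclass


@dataclass
class ResourceShortfall:
    """Stand-in carrying the three documented fields."""

    resource: str     # the real field is a Literal; 'gpu_vram_mb' is the value used in the tests
    needed: float
    available: float


def deficit(shortfall: ResourceShortfall) -> float:
    """Hypothetical helper: how much must be freed before a retry could fit."""
    return float(max(0.0, shortfall.needed - shortfall.available))


gap = deficit(ResourceShortfall(resource="gpu_vram_mb", needed=8000.0, available=4000.0))
print(gap)  # 4000.0
```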

## Default classification of bare Python exceptions

Plugin authors will gradually migrate to raising `PluginError`
subclasses (SG-47 cascade). Until then, the JobQueue still needs to
classify bare `ValueError` / `TimeoutError` / etc. into one of the four
categories so retry policy is correct from day one.

The mapping walks the `__mro__` of the exception’s class against a
substrate-provided lookup. The first MRO ancestor that matches wins. The
default for everything else is `fatal`, the conservative choice: don’t
auto-retry an exception we can’t classify.
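The walk can be sketched as below. The table `_SKETCH_CATEGORY_MAP` and the function `classify_sketch` are hypothetical stand-ins; the entries are inferred from the regression tests on this page, and the real `_BARE_EXCEPTION_CATEGORY_MAP` may contain more:

```python
from typing import Dict

# Hypothetical lookup; entries inferred from the regression tests.
_SKETCH_CATEGORY_MAP: Dict[type, str] = {
    ValueError: "user_input",
    TypeError: "user_input",
    FileNotFoundError: "user_input",
    TimeoutError: "transient",
    ConnectionError: "transient",
    MemoryError: "resource",
}


def classify_sketch(exc: BaseException) -> str:
    # Stand-in for "PluginError subclasses report their own category":
    # any class-level string `category` attribute wins.
    declared = getattr(type(exc), "category", None)
    if isinstance(declared, str):
        return declared
    # Otherwise walk the class MRO, most specific ancestor first.
    for ancestor in type(exc).__mro__:
        if ancestor in _SKETCH_CATEGORY_MAP:
            return _SKETCH_CATEGORY_MAP[ancestor]
    return "fatal"  # conservative default: never auto-retry the unknown


print(classify_sketch(ConnectionResetError("reset")))  # transient (via ConnectionError)
print(classify_sketch(PermissionError("denied")))      # fatal (OSError is not in the table)
print(classify_sketch(KeyError("k")))                  # fatal (no ancestor in the table)
```

Because the walk starts at the most specific class, a subclass such as `ConnectionResetError` inherits its parent’s category without needing its own entry.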

------------------------------------------------------------------------

### classify_exception

``` python

def classify_exception(
    exc:BaseException, # The exception to classify
)->Literal['user_input', 'transient', 'resource', 'fatal']: # Category

```

*Return the substrate category for any exception.*

PluginError subclasses report their own declared `category`. Bare Python
exceptions are mapped via `__mro__` walk against
`_BARE_EXCEPTION_CATEGORY_MAP`; the first ancestor in the table wins.
Unrecognized exceptions classify as `fatal` (don’t auto-retry the
unknown).

------------------------------------------------------------------------

### map_bare_exception_to_job_error

``` python

def map_bare_exception_to_job_error(
    exc:BaseException, # The raised exception
    plugin_name:Optional=None, # Name of the plugin that raised
    plugin_instance_id:Optional=None, # Per CR-10
    traceback_policy:TracebackPolicy=<TracebackPolicy.FULL: 'full'>, # How much detail to record
    occurred_at:Optional=None, # Override; defaults to datetime.now(timezone.utc)
)->JobError:

```

*Convert any exception into a structured `JobError`.*

PluginError subclasses contribute their category-specific structured
data (`fields_invalid` for input errors, `resource_shortfall` for
resource errors, `retry_after_seconds` for transient errors). Bare
exceptions get the default category-based retriable flag and no
structured side-channel.

## Regression tests

These tests pin the MRO discipline, the backward-compat shim behavior,
and the default classification. The MRO assertions are particularly
load-bearing: a future refactor that accidentally made `PluginError`
itself inherit from `ValueError` would let `except ValueError:` catch
transient / resource / fatal errors, which we explicitly do not want.

``` python
# MRO discipline: only PluginInputError tree is catchable as ValueError.
input_err = PluginInputError("bad", fields_invalid=["foo"])
assert isinstance(input_err, ValueError)
assert isinstance(input_err, PluginError)
assert isinstance(input_err, Exception)
assert input_err.category == 'user_input'
assert input_err.default_retriable is True
assert input_err.fields_invalid == ["foo"]

transient_err = PluginTransientError("slow", retry_after_seconds=5.0)
assert not isinstance(transient_err, ValueError), \
    "PluginTransientError must NOT inherit ValueError (semantic discipline)"
assert isinstance(transient_err, PluginError)
assert transient_err.category == 'transient'
assert transient_err.retry_after_seconds == 5.0

resource_err = PluginResourceError(
    "oom",
    resource_shortfall=ResourceShortfall(resource='gpu_vram_mb', needed=8000, available=4000),
)
assert not isinstance(resource_err, ValueError)
assert resource_err.category == 'resource'
assert resource_err.resource_shortfall.needed == 8000

fatal_err = PluginFatalError("crashed")
assert not isinstance(fatal_err, ValueError)
assert fatal_err.category == 'fatal'
assert fatal_err.default_retriable is False

print("✓ MRO discipline: only PluginInputError tree extends ValueError")
```

    ✓ MRO discipline: only PluginInputError tree extends ValueError

``` python
# Substrate-side typed exceptions anchor under the correct category.
disabled = PluginDisabledError("whisper")
assert isinstance(disabled, PluginInputError)
assert isinstance(disabled, ValueError), "PluginDisabledError must be catchable as ValueError"
assert disabled.category == 'user_input'
assert disabled.plugin_name == "whisper"

not_loaded = PluginNotLoadedError("whisper")
assert isinstance(not_loaded, PluginFatalError)
assert not isinstance(not_loaded, ValueError), \
    "PluginNotLoadedError must NOT be catchable as ValueError (it's a fatal bug)"
assert not_loaded.category == 'fatal'

timeout = PluginTimeoutError("whisper", timeout_seconds=30.0, retry_after_seconds=60.0)
assert isinstance(timeout, PluginTransientError)
assert not isinstance(timeout, ValueError)
assert timeout.category == 'transient'
assert timeout.timeout_seconds == 30.0
assert timeout.retry_after_seconds == 60.0

# CR-4: PluginCancelledError extends PluginTransientError but is non-retriable
# (deliberate operator action — substrate should not auto-retry cancelled jobs).
cancelled = PluginCancelledError("whisper")
assert isinstance(cancelled, PluginTransientError)
assert isinstance(cancelled, PluginError)
assert not isinstance(cancelled, ValueError), \
    "PluginCancelledError must NOT be catchable as ValueError (it's a control-flow signal, not a value error)"
assert cancelled.category == 'transient', "category=transient: cancellation is in-principle re-runnable"
assert cancelled.default_retriable is False, \
    "default_retriable=False: substrate must not auto-retry operator-cancelled jobs"
assert cancelled.plugin_name == "whisper"
assert "cancelled by operator" in str(cancelled)

# CR-7 Track A: WorkerOOMError extends PluginResourceError with default_retriable=True
# inherited; carries process_returncode for operator debugging; no ResourceShortfall.
oom = WorkerOOMError("whisper", process_returncode=-9)
assert isinstance(oom, PluginResourceError), "must catch under PluginResourceError"
assert isinstance(oom, PluginError)
assert not isinstance(oom, ValueError), "resource errors are not ValueErrors"
assert oom.category == 'resource', "CR-7 reactive retry dispatches on category=resource"
assert oom.default_retriable is True, \
    "default_retriable=True: OOM is retriable after eviction (the whole point of CR-7)"
assert oom.plugin_name == "whisper"
assert oom.process_returncode == -9
assert oom.resource_shortfall is None, \
    "Track A: substrate doesn't know needed/available; only Track B (plugin-side raise) does"
assert "whisper" in str(oom)
assert "returncode=-9" in str(oom)

# CR-7: WorkerOOMError catches at the PluginResourceError site (the shared
# CR-7 reactive retry catch-point). Track A + Track B converge here.
def fake_track_a_raise():
    raise WorkerOOMError("voxtral", process_returncode=-9)

def fake_track_b_raise():
    raise PluginResourceError(
        "voxtral: CUDA OOM",
        resource_shortfall=ResourceShortfall(
            resource='gpu_vram_mb', needed=24000, available=8000,
        ),
    )

for raiser in (fake_track_a_raise, fake_track_b_raise):
    caught = False
    try:
        raiser()
    except PluginResourceError:
        caught = True
    assert caught, f"{raiser.__name__} must catch under PluginResourceError"

# Custom message override path
oom_custom = WorkerOOMError("whisper", message="custom diagnostic")
assert str(oom_custom) == "custom diagnostic"
assert oom_custom.plugin_name == "whisper"
assert oom_custom.process_returncode is None

print("✓ Substrate-side typed exceptions anchor under the right category")
```

``` python
import warnings  # needed for the catch_warnings blocks below

# PluginConfigError reparenting: ValueError MRO preserved, fields_invalid canonical.
err = PluginConfigError(
    "unknown keys",
    fields_invalid=["foo", "bar"],
    config_class_name="WhisperConfig",
)
assert isinstance(err, PluginInputError)
assert isinstance(err, ValueError), "SG-8 era except ValueError: must still catch this"
assert err.fields_invalid == ["foo", "bar"]
assert err.config_class_name == "WhisperConfig"
assert err.unknown_keys == ["foo", "bar"], "property alias must mirror fields_invalid"

# Deprecated unknown_keys kwarg still works but warns.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_err = PluginConfigError(
        "legacy call",
        unknown_keys=["x"],
        config_class_name="WhisperConfig",
    )
    assert any(
        issubclass(w.category, DeprecationWarning) and "unknown_keys" in str(w.message)
        for w in caught
    ), "deprecated unknown_keys= kwarg must emit DeprecationWarning"
assert legacy_err.fields_invalid == ["x"]
assert legacy_err.unknown_keys == ["x"]

# Both kwargs together: fields_invalid wins, still warns.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    both_err = PluginConfigError(
        "both",
        fields_invalid=["a"],
        unknown_keys=["b"],
    )
    assert any(issubclass(w.category, DeprecationWarning) for w in caught)
assert both_err.fields_invalid == ["a"], "fields_invalid wins when both kwargs provided"

print("✓ PluginConfigError reparenting + backward-compat shims work")
```

    ✓ PluginConfigError reparenting + backward-compat shims work

``` python
# Default classification of bare Python exceptions.
assert classify_exception(ValueError("bad")) == 'user_input'
assert classify_exception(TypeError("bad")) == 'user_input'
assert classify_exception(FileNotFoundError("missing")) == 'user_input'
assert classify_exception(TimeoutError("slow")) == 'transient'
assert classify_exception(ConnectionError("net")) == 'transient'
assert classify_exception(MemoryError("oom")) == 'resource'
assert classify_exception(RuntimeError("unknown")) == 'fatal'

# PluginError subclasses report their own declared category, not the
# inherited-builtin's category. PluginInputError extends ValueError but its
# category is 'user_input' (the declared value), not derived from ValueError.
assert classify_exception(PluginInputError("x")) == 'user_input'
assert classify_exception(PluginTransientError("x")) == 'transient'
assert classify_exception(PluginResourceError("x")) == 'resource'
assert classify_exception(PluginFatalError("x")) == 'fatal'

# PluginNotLoadedError is fatal via its declared category; no built-in is
# explicitly mapped to fatal (fatal is only the fallback for unrecognized types).
assert classify_exception(PluginNotLoadedError("whisper")) == 'fatal'

print("✓ Default exception classification correct")
```

    ✓ Default exception classification correct

``` python
# map_bare_exception_to_job_error captures category + retriable + structured data.
try:
    raise PluginConfigError("bad config", fields_invalid=["model"])
except Exception as e:
    err = map_bare_exception_to_job_error(e, plugin_name="whisper")

assert err.category == 'user_input'
assert err.retriable is True
assert err.fields_invalid == ["model"]
assert err.plugin_name == "whisper"
assert err.traceback is not None and "PluginConfigError" in err.traceback
assert err.occurred_at is not None
# Python 3.12+ compat: occurred_at must be timezone-aware (datetime.utcnow()
# is deprecated and returns naive datetime; we use datetime.now(timezone.utc)).
assert err.occurred_at.tzinfo is not None, \
    "occurred_at should be timezone-aware (CR-5 Python 3.12+ future-proof form)"

# Resource error: resource_shortfall propagates.
try:
    raise PluginResourceError(
        "oom",
        resource_shortfall=ResourceShortfall(resource='gpu_vram_mb', needed=16000, available=8000),
    )
except Exception as e:
    err = map_bare_exception_to_job_error(e)

assert err.category == 'resource'
assert err.retriable is True
assert err.resource_shortfall.needed == 16000

# Bare ValueError gets default user_input + retriable=True.
try:
    raise ValueError("unmapped bare")
except Exception as e:
    err = map_bare_exception_to_job_error(e)

assert err.category == 'user_input'
assert err.retriable is True
assert err.fields_invalid is None  # bare ValueError has no fields_invalid attribute

# Bare RuntimeError gets default fatal + retriable=False.
try:
    raise RuntimeError("unknown")
except Exception as e:
    err = map_bare_exception_to_job_error(e)

assert err.category == 'fatal'
assert err.retriable is False

# TracebackPolicy.NONE suppresses traceback + message.
try:
    raise ValueError("secret")
except Exception as e:
    err = map_bare_exception_to_job_error(e, traceback_policy=TracebackPolicy.NONE)

assert err.traceback is None
assert err.message == ""
assert err.original_exc_repr  # repr is always kept (debug breadcrumb)

print("✓ map_bare_exception_to_job_error preserves structured data + honors TracebackPolicy")
```

    ✓ map_bare_exception_to_job_error preserves structured data + honors TracebackPolicy
