Plugin Error Taxonomy

A typed exception hierarchy, the JobError dataclass, and a default classification for bare Python exceptions. This is the substrate’s CR-5 implementation, per the 2026-05-19 substrate audit.

Category model

Every substrate-recognized exception carries a category ClassVar that tells the JobQueue, scheduler, and operator UI whether and how the failure should be retried. There are four categories:

| Category   | Retriable by default                | Meaning                                         |
|------------|-------------------------------------|-------------------------------------------------|
| user_input | Yes (the user can fix and resubmit) | Bad config, missing file, invalid argument      |
| transient  | Yes (retry may succeed)             | Network blip, timeout, temporary resource lock  |
| resource   | Yes (after eviction)                | Out of GPU VRAM, disk, system RAM               |
| fatal      | No                                  | Bug, broken plugin install, irrecoverable state |

MRO discipline: PluginInputError is the only category base that also inherits from ValueError. The rationale is semantic: except ValueError: expresses intent to catch invalid-argument errors. If transient, resource, or fatal errors were ValueErrors too, those handlers would silently catch failures they were never written for, which we specifically do not want.
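The effect of this rule on catch sites can be sketched with stand-in classes. The class bodies and the describe helper below are assumptions made for illustration; only the class names, the category values, and the placement of ValueError come from this page.

```python
from typing import ClassVar


class PluginError(Exception):
    """Stand-in base class (not the substrate's actual definition)."""
    category: ClassVar[str]
    default_retriable: ClassVar[bool]


class PluginInputError(PluginError, ValueError):
    # The only category base that also inherits from ValueError.
    category = "user_input"
    default_retriable = True


class PluginTransientError(PluginError):
    # Deliberately NOT a ValueError.
    category = "transient"
    default_retriable = True


def describe(exc: Exception) -> str:
    """Hypothetical catch site showing which handler each error reaches."""
    try:
        raise exc
    except ValueError:
        return "handled as invalid input"
    except PluginError as err:
        return f"routed by category: {err.category}"
```

With these stand-ins, describe(PluginInputError("bad")) takes the ValueError branch, while describe(PluginTransientError("slow")) falls through to the PluginError branch.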


PluginFatalError


def PluginFatalError(
    args:VAR_POSITIONAL, kwargs:VAR_KEYWORD
):

Bug / irrecoverable state. The plugin cannot complete this job; retrying won’t help.

Plugin authors raise this when they know the failure is permanent for the given inputs. The substrate does NOT retry fatal errors.


PluginResourceError


def PluginResourceError(
    message:str, # Human-readable description
    resource_shortfall:Optional=None, # Quantitative gap
):

Resource exhaustion: GPU VRAM, system RAM, disk full.

JobQueue’s reactive-eviction flow (CR-7) routes resource errors to retry after attempting to free the named resource. Plugin authors set resource_shortfall so the substrate knows what to evict.
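A rough sketch of how an evict-then-retry flow could consume resource_shortfall. The run_with_eviction helper, its evict callback, and the single-retry policy are hypothetical; CR-7’s actual loop is not shown on this page, and the classes here are simplified stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


@dataclass
class ResourceShortfall:
    """Stand-in with the three documented fields."""
    resource: str
    needed: float
    available: float


class PluginResourceError(Exception):
    """Stand-in with the documented constructor shape."""

    def __init__(self, message: str,
                 resource_shortfall: Optional[ResourceShortfall] = None):
        super().__init__(message)
        self.resource_shortfall = resource_shortfall


def run_with_eviction(
    job: Callable[[], T],
    evict: Callable[[Optional[ResourceShortfall]], None],  # hypothetical hook
) -> T:
    """Run `job`; on a resource error, evict once and retry once."""
    try:
        return job()
    except PluginResourceError as exc:
        # The shortfall (possibly None) tells the evictor what to free.
        evict(exc.resource_shortfall)
    # A second resource error propagates to the caller unchanged.
    return job()
```

Passing the shortfall through unchanged, including None, lets the evictor choose its own fallback when the plugin could not quantify the gap.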


PluginTransientError


def PluginTransientError(
    message:str, # Human-readable description
    retry_after_seconds:Optional=None, # Hint for backoff strategies
):

Temporary failure: timeout, network blip, brief resource contention.

Substrate / JobQueue may retry on its own initiative. Plugin authors raise this when they know the failure is recoverable.
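One way a caller might honor retry_after_seconds is sketched below. The run_with_retry helper, max_attempts, and default_delay are hypothetical names, not part of the substrate, and the exception class is a stand-in with the documented constructor shape.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class PluginTransientError(Exception):
    """Stand-in with the documented constructor shape."""

    def __init__(self, message: str,
                 retry_after_seconds: Optional[float] = None):
        super().__init__(message)
        self.retry_after_seconds = retry_after_seconds


def run_with_retry(
    job: Callable[[], T],
    max_attempts: int = 3,        # hypothetical policy knob
    default_delay: float = 0.5,   # hypothetical fallback when no hint is given
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `job` on transient errors, preferring the plugin's backoff hint."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except PluginTransientError as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            hint = exc.retry_after_seconds
            sleep(default_delay if hint is None else hint)
    raise AssertionError("unreachable")
```

The sleep parameter is injectable so the policy can be exercised without real delays.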


PluginInputError


def PluginInputError(
    message:str, # Human-readable description
    fields_invalid:Optional=None, # Names of inputs that failed validation
):

User-fixable error: bad config, invalid argument, missing file.

Also inherits from ValueError, so SG-8-era except ValueError: catch sites that legitimately want input errors keep working through the SG-47 migration window. The MRO is PluginInputError → PluginError → ValueError → Exception; the other category bases (PluginTransientError, PluginResourceError, PluginFatalError) deliberately do NOT extend ValueError because their failure modes are not semantically value errors.


PluginError


def PluginError(
    args:VAR_POSITIONAL, kwargs:VAR_KEYWORD
):

Base for substrate-recognized plugin exceptions.

Subclasses declare a category and default_retriable ClassVar so the JobQueue + scheduler can route the failure without sniffing exception text. Bare Python exceptions raised by plugin code go through map_bare_exception_to_job_error to acquire a default category.

Substrate-raised typed exceptions

These concrete exception types are defined in CR-5 (this file) so CR-2 / CR-6 / SG-14 can raise them when they land. Each anchors under the appropriate category base so its catch behavior is correct from day one.


PluginDisabledError


def PluginDisabledError(
    plugin_name:str
):

Raised when JobQueue / execute_plugin rejects a job because the plugin is currently disabled.

User-fixable (re-enable the plugin). Inherits PluginInputError’s ValueError MRO so existing except ValueError: callers see it as an input error. Raised by CR-2’s enable/disable wiring once that lands.


PluginNotLoadedError


def PluginNotLoadedError(
    plugin_name:str
):

Caller submitted to a plugin that was never loaded.

Fatal category because this is a programmer / orchestration bug, not a user-fixable condition. It is NOT a ValueError: callers should catch it with except PluginNotLoadedError: (or the broader except PluginError:), not with a blanket except ValueError:.


PluginTimeoutError


def PluginTimeoutError(
    plugin_name:str, timeout_seconds:float, retry_after_seconds:Optional=None
):

A per-job timeout fired before the plugin finished.

Transient category — retry may succeed if the slow operation completes faster next time. Carries retry_after_seconds from PluginTransientError. Raised by SG-14’s per-job timeout primitive when that lands.


PluginCancelledError


def PluginCancelledError(
    plugin_name:str
):

Cooperative cancellation signal raised from PluginInterface.check_cancel().

Anchors under PluginTransientError because cancellation is in-principle re-runnable — a future attempt with the same inputs won’t auto-fail if the cancel flag isn’t set. But default_retriable is False: cancellation was a deliberate operator action, so the substrate should NOT auto-retry. Job-monitor / JobQueue render cancelled jobs with their own state (separate from “failed”); the JobError category remains transient so consumers reading the typed taxonomy can group recoverable signals.

Plugin authors raise this implicitly via self.check_cancel() inside execute(); substrate sets the underlying _cancel_requested flag via cancel(). See CR-4’s cancellation primitives for the cooperative-cancel protocol.
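The cooperative protocol can be illustrated with a toy plugin. ChunkedPlugin, its threading.Event flag, and the exact message text are assumptions made for this sketch; only check_cancel(), cancel(), execute(), and the _cancel_requested name come from this page, and CR-4 defines the real behavior.

```python
import threading
from typing import List


class PluginCancelledError(Exception):
    """Stand-in; the real class anchors under PluginTransientError."""
    category = "transient"
    default_retriable = False  # deliberate operator action: no auto-retry

    def __init__(self, plugin_name: str):
        super().__init__(f"{plugin_name}: cancelled by operator")
        self.plugin_name = plugin_name


class ChunkedPlugin:
    """Hypothetical plugin that checks for cancellation between chunks."""
    name = "demo"

    def __init__(self) -> None:
        # Assumed representation of the flag; an Event is thread-safe.
        self._cancel_requested = threading.Event()
        self.processed: List[int] = []

    def cancel(self) -> None:
        # Called by the substrate, possibly from another thread.
        self._cancel_requested.set()

    def check_cancel(self) -> None:
        if self._cancel_requested.is_set():
            raise PluginCancelledError(self.name)

    def execute(self, chunks: List[int]) -> List[int]:
        for chunk in chunks:
            self.check_cancel()  # safe point before each unit of work
            self.processed.append(chunk)
        return self.processed
```

Cancellation only takes effect at check_cancel() calls, so long-running plugins need to call it often enough for the operator’s request to be noticed promptly.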


WorkerOOMError


def WorkerOOMError(
    plugin_name:str, process_returncode:Optional=None, message:Optional=None
):

The worker subprocess died with a kill-signal during an active execute call.

CR-7 Track A — substrate-side OOM detection: when an HTTP call to the worker faults and the subprocess has died with returncode == -signal.SIGKILL (or the platform equivalent), the substrate raises this. The kernel OOM-killer is the most common cause of SIGKILL during normal execute paths, so the substrate treats SIGKILL-during-call as “assume OOM” and surfaces a typed resource error for the reactive retry path.

resource_shortfall is None for Track A — the substrate only saw “worker died from kill-signal” and has no per-resource needed/available numbers. Track B (per SG-47’s sub-task: plugin-side wrapping of torch.cuda.OutOfMemoryError et al.) raises PluginResourceError directly with a populated ResourceShortfall because the plugin had the context. Both land at the same except PluginResourceError site in CR-7’s reactive retry loop.

process_returncode carries the observed exit code for debugging / classification (e.g. operators can distinguish kernel-OOM SIGKILL from other signals if they read it). Defaults to None for callers that don’t have it on hand.
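The returncode check described above relies on the way Python’s subprocess module reports death by signal on POSIX: a child terminated by signal N has returncode -N. A minimal sketch of that check follows; died_from_sigkill is a hypothetical helper name, not the substrate’s function.

```python
import signal
from typing import Optional


def died_from_sigkill(returncode: Optional[int]) -> bool:
    """Return True if a subprocess return code indicates death by SIGKILL.

    On POSIX, subprocess reports termination by signal N as returncode -N.
    signal.SIGKILL does not exist on Windows, so this returns False there.
    """
    sigkill = getattr(signal, "SIGKILL", None)
    if sigkill is None or returncode is None:
        return False
    return returncode == -int(sigkill)
```

A clean exit (0), an ordinary failure (positive code), or a still-running process (None) all return False, so only the kill-signal case would be routed to the resource category.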

PluginConfigError (reparented from utils.validation per CR-5)

Originally defined in utils/validation.py by SG-8 as a ValueError subclass. CR-5 reparents it under PluginInputError. The reparenting preserves except ValueError: compat (via PluginInputError’s ValueError MRO) and unifies the field-validation attribute name with the rest of the input-error hierarchy: SG-8’s unknown_keys becomes fields_invalid (canonical).

Backward-compat handling for the SG-8-era kwarg + attribute:

  • unknown_keys= keyword in __init__ is accepted but emits DeprecationWarning.
  • unknown_keys is a read-only property aliasing fields_invalid.
  • Both are tagged # REMOVE-AFTER-OVERHAUL for SG-48 sweep.

Why two REMOVE-AFTER-OVERHAUL tags rather than one: the kwarg and the property address different migration paths. The kwarg shim helps code that constructs the exception; the property shim helps code that inspects the exception after catching it. Either can be removed independently once the SG-47 cascade completes.
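The two shims can be sketched as follows. This is an illustration consistent with the behavior pinned by the regression tests on this page, not the actual source; the PluginInputError stand-in is simplified (the real base also inherits PluginError) and the warning text is an assumption.

```python
import warnings
from typing import List, Optional


class PluginInputError(ValueError):
    """Simplified stand-in for the real input-error base."""

    def __init__(self, message: str,
                 fields_invalid: Optional[List[str]] = None):
        super().__init__(message)
        self.fields_invalid = fields_invalid


class PluginConfigError(PluginInputError):
    def __init__(
        self,
        message: str,
        fields_invalid: Optional[List[str]] = None,
        config_class_name: str = "",
        unknown_keys: Optional[List[str]] = None,  # REMOVE-AFTER-OVERHAUL
    ):
        # Shim 1: accept the legacy kwarg for code that constructs the error.
        if unknown_keys is not None:
            warnings.warn(
                "unknown_keys= is deprecated; pass fields_invalid= instead",
                DeprecationWarning,
                stacklevel=2,
            )
            if fields_invalid is None:  # fields_invalid wins when both given
                fields_invalid = unknown_keys
        super().__init__(message, fields_invalid=fields_invalid)
        self.config_class_name = config_class_name

    # Shim 2: read-only alias for code that inspects the caught error.
    @property
    def unknown_keys(self) -> Optional[List[str]]:  # REMOVE-AFTER-OVERHAUL
        return self.fields_invalid
```

Because the property has no setter, assigning to unknown_keys raises AttributeError, which keeps fields_invalid the single source of truth.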


PluginConfigError


def PluginConfigError(
    message:str, # Human-readable description
    fields_invalid:Optional=None, # Canonical: list of bad config keys
    config_class_name:str='', # Dataclass / plugin name for the schema
    unknown_keys:Optional=None, # REMOVE-AFTER-OVERHAUL: drop unknown_keys kwarg after SG-47 cascade completes
):

Unknown / invalid keys in a config dict against a plugin’s config schema.

Reparented from cjm_plugin_system.utils.validation (Wave 2 / SG-8) under CR-5. Inherits PluginInputError’s ValueError MRO automatically. config_class_name is the dataclass / plugin name whose schema was violated.

JobError + ResourceShortfall + TracebackPolicy

When a plugin job fails, the JobQueue (CR-6) records a JobError summary on the completed Job. The summary captures everything a frontend / operator needs to understand and (optionally) retry the failure without re-running the plugin:

  • category lets UI decide retry button affordances.
  • retriable carries the substrate’s policy on whether to auto-retry.
  • original_exc_repr + optional traceback give post-mortem context.
  • fields_invalid / resource_shortfall are category-specific structured data.

TracebackPolicy controls how much detail the substrate records. Default FULL is what dev mode wants; REPR_ONLY and NONE are future opt-outs for security-sensitive multi-user deployments.
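The three policy levels might be applied roughly as sketched below. Only the FULL member and its 'full' value appear on this page; the other enum values, the render_failure helper, and the REPR_ONLY behavior (message kept, traceback dropped) are assumptions. The NONE behavior follows the regression test further down: traceback and message are suppressed while the repr is kept.

```python
import traceback
from enum import Enum
from typing import Optional, Tuple


class TracebackPolicy(Enum):
    FULL = "full"            # documented default value
    REPR_ONLY = "repr_only"  # assumed value
    NONE = "none"            # assumed value


def render_failure(
    exc: BaseException, policy: TracebackPolicy
) -> Tuple[str, str, Optional[str]]:
    """Return (message, original_exc_repr, traceback) under `policy`."""
    exc_repr = repr(exc)  # kept under every policy as a debug breadcrumb
    if policy is TracebackPolicy.NONE:
        return "", exc_repr, None
    if policy is TracebackPolicy.REPR_ONLY:
        # Assumed semantics: keep the message, drop the traceback.
        return str(exc), exc_repr, None
    tb = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    return str(exc), exc_repr, tb
```

Note that the repr of most built-in exceptions embeds the message, so under this sketch NONE hides the message field but not the text inside original_exc_repr.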


JobError


def JobError(
    category:Literal, message:str, retriable:bool, original_exc_repr:str, traceback:Optional=None,
    retry_after_seconds:Optional=None, fields_invalid:Optional=None, resource_shortfall:Optional=None,
    plugin_name:Optional=None, plugin_instance_id:Optional=None, occurred_at:Optional=None
)->None:

Structured failure summary recorded on a completed Job.

Populated by the JobQueue when a plugin execution fails (CR-6 owns the population logic; CR-5 owns the shape). Sufficient for UI to render a failure card + retry affordance without re-running the plugin.


TracebackPolicy


def TracebackPolicy(
    args:VAR_POSITIONAL, kwds:VAR_KEYWORD
):

How much exception detail the substrate records on a JobError.


ResourceShortfall


def ResourceShortfall(
    resource:Literal, needed:float, available:float
)->None:

Quantitative gap between what a plugin needed and what was available.

Default classification of bare Python exceptions

Plugin authors will gradually migrate to raising PluginError subclasses (SG-47 cascade). Until then, the JobQueue still needs to classify bare ValueError / TimeoutError / etc. into one of the four categories so retry policy is correct from day one.

The mapping walks the exception’s __mro__ against a substrate-provided lookup. First MRO ancestor that matches wins. Default for everything else is fatal — conservative: don’t auto-retry an exception we can’t classify.
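The walk can be sketched in a few lines. BARE_CATEGORY_MAP and classify_bare are hypothetical names, and the table entries are inferred from the defaults exercised by the regression tests on this page; the substrate’s actual table may contain more entries.

```python
from typing import Dict, Type

# Hypothetical lookup table; entries inferred from the documented defaults.
BARE_CATEGORY_MAP: Dict[Type[BaseException], str] = {
    ValueError: "user_input",
    TypeError: "user_input",
    FileNotFoundError: "user_input",
    TimeoutError: "transient",
    ConnectionError: "transient",
    MemoryError: "resource",
}


def classify_bare(exc: BaseException) -> str:
    """Return the category of the first MRO ancestor found in the table."""
    for ancestor in type(exc).__mro__:
        category = BARE_CATEGORY_MAP.get(ancestor)
        if category is not None:
            return category
    return "fatal"  # conservative default: never auto-retry the unknown
```

Walking the MRO means subclasses inherit their parent’s category for free: ConnectionResetError resolves through ConnectionError to transient, and UnicodeDecodeError resolves through ValueError to user_input. A plain OSError is in no row of this sketch’s table and falls through to fatal.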


classify_exception


def classify_exception(
    exc:BaseException, # The exception to classify
)->Literal['user_input', 'transient', 'resource', 'fatal']: # Category

Return the substrate category for any exception.

PluginError subclasses report their own declared category. Bare Python exceptions are mapped via __mro__ walk against _BARE_EXCEPTION_CATEGORY_MAP; the first ancestor in the table wins. Unrecognized exceptions classify as fatal (don’t auto-retry the unknown).


map_bare_exception_to_job_error


def map_bare_exception_to_job_error(
    exc:BaseException, # The raised exception
    plugin_name:Optional=None, # Name of the plugin that raised
    plugin_instance_id:Optional=None, # Per CR-10
    traceback_policy:TracebackPolicy=<TracebackPolicy.FULL: 'full'>, # How much detail to record
    occurred_at:Optional=None, # Override; defaults to datetime.now(timezone.utc)
)->JobError:

Convert any exception into a structured JobError.

PluginError subclasses contribute their category-specific structured data (fields_invalid for input errors, resource_shortfall for resource errors, retry_after_seconds for transient errors). Bare exceptions get the default category-based retriable flag and no structured side-channel.
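A simplified picture of how the structured side-channel could be gathered is sketched below. JobErrorSketch, to_job_error, and DEFAULT_RETRIABLE are hypothetical; the real JobError has more fields, and reading attributes with getattr is an assumption about the approach rather than the documented implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

# Per-category defaults taken from the category table on this page.
DEFAULT_RETRIABLE = {
    "user_input": True,
    "transient": True,
    "resource": True,
    "fatal": False,
}


@dataclass
class JobErrorSketch:
    """Reduced stand-in for JobError (subset of the documented fields)."""
    category: str
    message: str
    retriable: bool
    original_exc_repr: str
    retry_after_seconds: Optional[float] = None
    fields_invalid: Optional[List[str]] = None
    occurred_at: Optional[datetime] = None


def to_job_error(exc: BaseException, category: str) -> JobErrorSketch:
    """Build a summary; typed errors contribute attributes, bare ones do not."""
    return JobErrorSketch(
        category=category,
        message=str(exc),
        # A typed error's own flag wins; bare errors use the category default.
        retriable=getattr(exc, "default_retriable", DEFAULT_RETRIABLE[category]),
        original_exc_repr=repr(exc),
        retry_after_seconds=getattr(exc, "retry_after_seconds", None),
        fields_invalid=getattr(exc, "fields_invalid", None),
        occurred_at=datetime.now(timezone.utc),  # timezone-aware timestamp
    )
```

Letting the exception’s own default_retriable override the category default is what would allow a case like PluginCancelledError (category transient, not retriable) to be summarized correctly.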

Regression tests

These tests pin the MRO discipline, the backward-compat shim behavior, and the default classification. The MRO assertions are particularly load-bearing: a future refactor that accidentally made PluginError itself inherit from ValueError would let except ValueError: catch transient, resource, and fatal errors, which we explicitly do not want.

# MRO discipline: only PluginInputError tree is catchable as ValueError.
input_err = PluginInputError("bad", fields_invalid=["foo"])
assert isinstance(input_err, ValueError)
assert isinstance(input_err, PluginError)
assert isinstance(input_err, Exception)
assert input_err.category == 'user_input'
assert input_err.default_retriable is True
assert input_err.fields_invalid == ["foo"]

transient_err = PluginTransientError("slow", retry_after_seconds=5.0)
assert not isinstance(transient_err, ValueError), \
    "PluginTransientError must NOT inherit ValueError (semantic discipline)"
assert isinstance(transient_err, PluginError)
assert transient_err.category == 'transient'
assert transient_err.retry_after_seconds == 5.0

resource_err = PluginResourceError(
    "oom",
    resource_shortfall=ResourceShortfall(resource='gpu_vram_mb', needed=8000, available=4000),
)
assert not isinstance(resource_err, ValueError)
assert resource_err.category == 'resource'
assert resource_err.resource_shortfall.needed == 8000

fatal_err = PluginFatalError("crashed")
assert not isinstance(fatal_err, ValueError)
assert fatal_err.category == 'fatal'
assert fatal_err.default_retriable is False

print("✓ MRO discipline: only PluginInputError tree extends ValueError")
✓ MRO discipline: only PluginInputError tree extends ValueError
# Substrate-side typed exceptions anchor under the correct category.
disabled = PluginDisabledError("whisper")
assert isinstance(disabled, PluginInputError)
assert isinstance(disabled, ValueError), "PluginDisabledError must be catchable as ValueError"
assert disabled.category == 'user_input'
assert disabled.plugin_name == "whisper"

not_loaded = PluginNotLoadedError("whisper")
assert isinstance(not_loaded, PluginFatalError)
assert not isinstance(not_loaded, ValueError), \
    "PluginNotLoadedError must NOT be catchable as ValueError (it's a fatal bug)"
assert not_loaded.category == 'fatal'

timeout = PluginTimeoutError("whisper", timeout_seconds=30.0, retry_after_seconds=60.0)
assert isinstance(timeout, PluginTransientError)
assert not isinstance(timeout, ValueError)
assert timeout.category == 'transient'
assert timeout.timeout_seconds == 30.0
assert timeout.retry_after_seconds == 60.0

# CR-4: PluginCancelledError extends PluginTransientError but is non-retriable
# (deliberate operator action — substrate should not auto-retry cancelled jobs).
cancelled = PluginCancelledError("whisper")
assert isinstance(cancelled, PluginTransientError)
assert isinstance(cancelled, PluginError)
assert not isinstance(cancelled, ValueError), \
    "PluginCancelledError must NOT be catchable as ValueError (it's a control-flow signal, not a value error)"
assert cancelled.category == 'transient', "category=transient: cancellation is in-principle re-runnable"
assert cancelled.default_retriable is False, \
    "default_retriable=False: substrate must not auto-retry operator-cancelled jobs"
assert cancelled.plugin_name == "whisper"
assert "cancelled by operator" in str(cancelled)

# CR-7 Track A: WorkerOOMError extends PluginResourceError with default_retriable=True
# inherited; carries process_returncode for operator debugging; no ResourceShortfall.
oom = WorkerOOMError("whisper", process_returncode=-9)
assert isinstance(oom, PluginResourceError), "must catch under PluginResourceError"
assert isinstance(oom, PluginError)
assert not isinstance(oom, ValueError), "resource errors are not ValueErrors"
assert oom.category == 'resource', "CR-7 reactive retry dispatches on category=resource"
assert oom.default_retriable is True, \
    "default_retriable=True: OOM is retriable after eviction (the whole point of CR-7)"
assert oom.plugin_name == "whisper"
assert oom.process_returncode == -9
assert oom.resource_shortfall is None, \
    "Track A: substrate doesn't know needed/available; only Track B (plugin-side raise) does"
assert "whisper" in str(oom)
assert "returncode=-9" in str(oom)

# CR-7: WorkerOOMError catches at the PluginResourceError site (the shared
# CR-7 reactive retry catch-point). Track A + Track B converge here.
def fake_track_a_raise():
    raise WorkerOOMError("voxtral", process_returncode=-9)

def fake_track_b_raise():
    raise PluginResourceError(
        "voxtral: CUDA OOM",
        resource_shortfall=ResourceShortfall(
            resource='gpu_vram_mb', needed=24000, available=8000,
        ),
    )

for raiser in (fake_track_a_raise, fake_track_b_raise):
    caught = False
    try:
        raiser()
    except PluginResourceError:
        caught = True
    assert caught, f"{raiser.__name__} must catch under PluginResourceError"

# Custom message override path
oom_custom = WorkerOOMError("whisper", message="custom diagnostic")
assert str(oom_custom) == "custom diagnostic"
assert oom_custom.plugin_name == "whisper"
assert oom_custom.process_returncode is None

print("✓ Substrate-side typed exceptions anchor under the right category")
import warnings  # needed by the catch_warnings blocks below

# PluginConfigError reparenting: ValueError MRO preserved, fields_invalid canonical.
err = PluginConfigError(
    "unknown keys",
    fields_invalid=["foo", "bar"],
    config_class_name="WhisperConfig",
)
assert isinstance(err, PluginInputError)
assert isinstance(err, ValueError), "SG-8 era except ValueError: must still catch this"
assert err.fields_invalid == ["foo", "bar"]
assert err.config_class_name == "WhisperConfig"
assert err.unknown_keys == ["foo", "bar"], "property alias must mirror fields_invalid"

# Deprecated unknown_keys kwarg still works but warns.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_err = PluginConfigError(
        "legacy call",
        unknown_keys=["x"],
        config_class_name="WhisperConfig",
    )
    assert any(
        issubclass(w.category, DeprecationWarning) and "unknown_keys" in str(w.message)
        for w in caught
    ), "deprecated unknown_keys= kwarg must emit DeprecationWarning"
assert legacy_err.fields_invalid == ["x"]
assert legacy_err.unknown_keys == ["x"]

# Both kwargs together: fields_invalid wins, still warns.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    both_err = PluginConfigError(
        "both",
        fields_invalid=["a"],
        unknown_keys=["b"],
    )
    assert any(issubclass(w.category, DeprecationWarning) for w in caught)
assert both_err.fields_invalid == ["a"], "fields_invalid wins when both kwargs provided"

print("✓ PluginConfigError reparenting + backward-compat shims work")
✓ PluginConfigError reparenting + backward-compat shims work
# Default classification of bare Python exceptions.
assert classify_exception(ValueError("bad")) == 'user_input'
assert classify_exception(TypeError("bad")) == 'user_input'
assert classify_exception(FileNotFoundError("missing")) == 'user_input'
assert classify_exception(TimeoutError("slow")) == 'transient'
assert classify_exception(ConnectionError("net")) == 'transient'
assert classify_exception(MemoryError("oom")) == 'resource'
assert classify_exception(RuntimeError("unknown")) == 'fatal'

# PluginError subclasses report their own declared category, not the
# inherited-builtin's category. PluginInputError extends ValueError but its
# category is 'user_input' (the declared value), not derived from ValueError.
assert classify_exception(PluginInputError("x")) == 'user_input'
assert classify_exception(PluginTransientError("x")) == 'transient'
assert classify_exception(PluginResourceError("x")) == 'resource'
assert classify_exception(PluginFatalError("x")) == 'fatal'

# PluginNotLoadedError declares 'fatal' explicitly; bare built-ins only reach
# 'fatal' through the fallback for unrecognized exceptions.
assert classify_exception(PluginNotLoadedError("whisper")) == 'fatal'

print("✓ Default exception classification correct")
✓ Default exception classification correct
# map_bare_exception_to_job_error captures category + retriable + structured data.
try:
    raise PluginConfigError("bad config", fields_invalid=["model"])
except Exception as e:
    err = map_bare_exception_to_job_error(e, plugin_name="whisper")

assert err.category == 'user_input'
assert err.retriable is True
assert err.fields_invalid == ["model"]
assert err.plugin_name == "whisper"
assert err.traceback is not None and "PluginConfigError" in err.traceback
assert err.occurred_at is not None
# Python 3.12+ compat: occurred_at must be timezone-aware (datetime.utcnow()
# is deprecated and returns naive datetime; we use datetime.now(timezone.utc)).
assert err.occurred_at.tzinfo is not None, \
    "occurred_at should be timezone-aware (CR-5 Python 3.12+ future-proof form)"

# Resource error: resource_shortfall propagates.
try:
    raise PluginResourceError(
        "oom",
        resource_shortfall=ResourceShortfall(resource='gpu_vram_mb', needed=16000, available=8000),
    )
except Exception as e:
    err = map_bare_exception_to_job_error(e)

assert err.category == 'resource'
assert err.retriable is True
assert err.resource_shortfall.needed == 16000

# Bare ValueError gets default user_input + retriable=True.
try:
    raise ValueError("unmapped bare")
except Exception as e:
    err = map_bare_exception_to_job_error(e)

assert err.category == 'user_input'
assert err.retriable is True
assert err.fields_invalid is None  # bare ValueError has no fields_invalid attribute

# Bare RuntimeError gets default fatal + retriable=False.
try:
    raise RuntimeError("unknown")
except Exception as e:
    err = map_bare_exception_to_job_error(e)

assert err.category == 'fatal'
assert err.retriable is False

# TracebackPolicy.NONE suppresses traceback + message.
try:
    raise ValueError("secret")
except Exception as e:
    err = map_bare_exception_to_job_error(e, traceback_policy=TracebackPolicy.NONE)

assert err.traceback is None
assert err.message == ""
assert err.original_exc_repr  # repr is always kept (debug breadcrumb)

print("✓ map_bare_exception_to_job_error preserves structured data + honors TracebackPolicy")
✓ map_bare_exception_to_job_error preserves structured data + honors TracebackPolicy