Typed exception hierarchy, the JobError dataclass, and default classification of bare Python exceptions. This is the substrate’s CR-5 implementation, per the 2026-05-19 substrate audit.
Category model
Every substrate-recognized exception carries a category ClassVar that tells the JobQueue / scheduler / operator UI which retry-or-not-retry treatment is appropriate. Four categories:
| Category   | Retriable by default                | Meaning                                          |
|------------|-------------------------------------|--------------------------------------------------|
| user_input | Yes (the user can fix and resubmit) | Bad config, missing file, invalid argument       |
| transient  | Yes (retry may succeed)             | Network blip, timeout, temporary resource lock   |
| resource   | Yes (after eviction)                | Out of GPU VRAM, disk, system RAM                |
| fatal      | No                                  | Bug, broken plugin install, irrecoverable state  |
MRO discipline: PluginInputError is the only category base that also inherits from ValueError. The semantic argument: except ValueError: expresses intent to catch invalid-argument errors. Letting a bare except ValueError: catch transient, resource, or fatal errors would silently broaden that intent, which we specifically do not want.
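The category model and the MRO discipline can be illustrated with a minimal sketch. The class names mirror this page, but the class bodies and the `handle` helper are assumptions written for illustration, not the substrate's actual source.

```python
from typing import ClassVar, List, Optional


class PluginError(Exception):
    """Illustrative root: every subclass declares how it should be routed."""
    category: ClassVar[str] = "fatal"
    default_retriable: ClassVar[bool] = False


class PluginInputError(PluginError, ValueError):
    """The only category base that is also a ValueError."""
    category: ClassVar[str] = "user_input"
    default_retriable: ClassVar[bool] = True

    def __init__(self, message: str, fields_invalid: Optional[List[str]] = None):
        super().__init__(message)
        self.fields_invalid = fields_invalid


class PluginTransientError(PluginError):
    """Deliberately NOT a ValueError."""
    category: ClassVar[str] = "transient"
    default_retriable: ClassVar[bool] = True


def handle(exc: Exception) -> str:
    """Hypothetical legacy catch site that only wants invalid-argument errors."""
    try:
        raise exc
    except ValueError:
        return "caught as ValueError"
    except PluginError as err:
        return f"routed by category: {err.category}"


input_result = handle(PluginInputError("bad model name", fields_invalid=["model"]))
transient_result = handle(PluginTransientError("network blip"))
mro_names = [cls.__name__ for cls in PluginInputError.__mro__]
```

In this sketch the input error is caught by the legacy `except ValueError:` branch, while the transient error falls through to category-based routing, which is the behavior the discipline is meant to protect.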
PluginResourceError
Resource exhaustion: GPU VRAM, system RAM, or a full disk.
JobQueue’s reactive-eviction flow (CR-7) routes resource errors to retry after attempting to free the named resource. Plugin authors set resource_shortfall so the substrate knows what to evict.
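A rough sketch of how a reactive-eviction retry could consume resource_shortfall is shown here. The ResourceShortfall field names follow the regression tests on this page; `run_with_eviction`, `fake_job`, and `fake_evict` are hypothetical helpers, and the real CR-7 flow may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ResourceShortfall:
    resource: str   # e.g. 'gpu_vram_mb'
    needed: int
    available: int


class PluginResourceError(Exception):
    """Stand-in for the real class; only the attribute used below is modeled."""

    def __init__(self, message: str, resource_shortfall: Optional[ResourceShortfall] = None):
        super().__init__(message)
        self.resource_shortfall = resource_shortfall


def run_with_eviction(job: Callable[[], str],
                      evict: Callable[[Optional[str]], None],
                      max_attempts: int = 2) -> str:
    """Retry a job after asking `evict` to free the named resource (assumed policy)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except PluginResourceError as err:
            if attempt == max_attempts:
                raise  # out of attempts: surface the resource error
            shortfall = err.resource_shortfall
            # None means "the plugin did not say which resource ran out".
            evict(shortfall.resource if shortfall is not None else None)


# Simulated environment: the job needs 8000 MB, only 4000 MB are free at first.
state = {"free_vram_mb": 4000}
evicted: List[Optional[str]] = []


def fake_job() -> str:
    if state["free_vram_mb"] < 8000:
        raise PluginResourceError(
            "oom", ResourceShortfall("gpu_vram_mb", needed=8000, available=state["free_vram_mb"])
        )
    return "ok"


def fake_evict(resource: Optional[str]) -> None:
    evicted.append(resource)
    state["free_vram_mb"] = 12000  # pretend another model was unloaded


result = run_with_eviction(fake_job, fake_evict)
```

The point of the sketch is that the eviction step can act on a structured resource name rather than parsing the exception message.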
PluginTransientError
def PluginTransientError(
    message:str, # Human-readable description
    retry_after_seconds:Optional=None, # Hint for backoff strategies
):
Substrate / JobQueue may retry on its own initiative. Plugin authors raise this when they know the failure is recoverable.
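As an illustration of how a caller might honor the retry_after_seconds hint, here is a small sketch. The `retry_transient` helper, its default delay, and the attempt limit are assumptions, not the JobQueue's actual policy; the exception class is a stand-in with the same constructor arguments as the signature above.

```python
import time
from typing import Callable, List, Optional, Tuple


class PluginTransientError(Exception):
    """Stand-in with the constructor arguments shown in the signature above."""

    def __init__(self, message: str, retry_after_seconds: Optional[float] = None):
        super().__init__(message)
        self.retry_after_seconds = retry_after_seconds


def retry_transient(job: Callable[[], str],
                    max_attempts: int = 3,
                    default_delay: float = 0.01,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[str, List[float]]:
    """Retry on transient errors, preferring the exception's own backoff hint."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays: List[float] = []
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), delays
        except PluginTransientError as err:
            if attempt == max_attempts:
                raise
            hint = err.retry_after_seconds
            delay = hint if hint is not None else default_delay
            delays.append(delay)
            sleep(delay)
    raise AssertionError("unreachable")


calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise PluginTransientError("rate limited", retry_after_seconds=5.0)
    if calls["n"] == 2:
        raise PluginTransientError("network blip")  # no hint: use the default delay
    return "done"


# Inject a no-op sleep so the example runs instantly.
result, delays = retry_transient(flaky, sleep=lambda seconds: None)
```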
PluginInputError
def PluginInputError(
    message:str, # Human-readable description
    fields_invalid:Optional=None, # Names of inputs that failed validation
):
User-fixable error: bad config, invalid argument, missing file.
Multi-inherits ValueError so SG-8-era except ValueError: catch sites that legitimately want input errors keep working through the SG-47 migration window. The MRO is PluginInputError → PluginError → ValueError → Exception; other category bases (PluginTransientError, PluginResourceError, PluginFatalError) deliberately do NOT extend ValueError because their failure modes are not semantically value errors.
Subclasses of PluginError declare category and default_retriable ClassVars so the JobQueue and scheduler can route a failure without sniffing exception text. Bare Python exceptions raised by plugin code go through map_bare_exception_to_job_error to acquire a default category.
Substrate-raised typed exceptions
These concrete exception types are defined in CR-5 (this file) so CR-2 / CR-6 / SG-14 can raise them when they land. Each anchors under the appropriate category base so its catch behavior is correct from day one.
PluginDisabledError
def PluginDisabledError(plugin_name:str):
JobQueue / execute_plugin rejected: the plugin is currently disabled.
User-fixable (re-enable the plugin). Inherits PluginInputError’s ValueError MRO so existing except ValueError: callers see it as an input error. Raised by CR-2’s enable/disable wiring once that lands.
PluginNotLoadedError
def PluginNotLoadedError(plugin_name:str):
Caller submitted to a plugin that was never loaded.
Fatal category because this is a programmer / orchestration bug, not a user-fixable condition. NOT a ValueError — the right reader intent is except PluginNotLoadedError: (or the broader except PluginError:), not a blanket except ValueError:.
PluginTimeoutError
A per-job timeout fired before the plugin finished.
Transient category: a retry may succeed if the slow operation completes faster next time. Carries retry_after_seconds from PluginTransientError. Raised by SG-14’s per-job timeout primitive when that lands.
PluginCancelledError
def PluginCancelledError(plugin_name:str):
Cooperative cancellation signal raised from PluginInterface.check_cancel().
Anchors under PluginTransientError because cancellation is in-principle re-runnable — a future attempt with the same inputs won’t auto-fail if the cancel flag isn’t set. But default_retriable is False: cancellation was a deliberate operator action, so the substrate should NOT auto-retry. Job-monitor / JobQueue render cancelled jobs with their own state (separate from “failed”); the JobError category remains transient so consumers reading the typed taxonomy can group recoverable signals.
Plugin authors raise this implicitly via self.check_cancel() inside execute(); substrate sets the underlying _cancel_requested flag via cancel(). See CR-4’s cancellation primitives for the cooperative-cancel protocol.
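The cooperative-cancel flow described above might look roughly like the following. The names check_cancel, cancel, execute, and _cancel_requested come from this page; `FakePlugin`, the `on_chunk_done` callback, and the message format are assumptions for illustration only.

```python
from typing import Callable, List, Optional


class PluginCancelledError(Exception):
    """Stand-in: transient category, but never auto-retried."""
    category = "transient"
    default_retriable = False

    def __init__(self, plugin_name: str):
        super().__init__(f"{plugin_name}: cancelled by operator")
        self.plugin_name = plugin_name


class FakePlugin:
    """Hypothetical plugin that polls the cancel flag between units of work."""
    name = "whisper"

    def __init__(self) -> None:
        self._cancel_requested = False
        self.processed: List[str] = []

    def cancel(self) -> None:
        # Called by the substrate; the plugin only observes the flag.
        self._cancel_requested = True

    def check_cancel(self) -> None:
        if self._cancel_requested:
            raise PluginCancelledError(self.name)

    def execute(self, chunks: List[str],
                on_chunk_done: Optional[Callable[[str], None]] = None) -> List[str]:
        for chunk in chunks:
            self.check_cancel()  # cooperative checkpoint before each chunk
            self.processed.append(chunk.upper())
            if on_chunk_done is not None:
                on_chunk_done(chunk)
        return self.processed


plugin = FakePlugin()
try:
    # Simulate an operator cancelling right after the first chunk completes.
    plugin.execute(["a", "b", "c"], on_chunk_done=lambda chunk: plugin.cancel())
    state = "completed"
except PluginCancelledError as err:
    state = "cancelled"  # rendered as its own state, not as "failed"
    cancel_message = str(err)
```

Because cancellation is only observed at check_cancel() calls, work already finished before the checkpoint (the first chunk here) is not rolled back.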
WorkerOOMError
The worker subprocess died with a kill signal during an active execute call.
CR-7 Track A — substrate-side OOM detection: when an HTTP call to the worker faults and the subprocess has died with returncode == -signal.SIGKILL (or the platform equivalent), the substrate raises this. The kernel OOM-killer is the most common cause of SIGKILL during normal execute paths, so the substrate treats SIGKILL-during-call as “assume OOM” and surfaces a typed resource error for the reactive retry path.
resource_shortfall is None for Track A — the substrate only saw “worker died from kill-signal” and has no per-resource needed/available numbers. Track B (per SG-47’s sub-task: plugin-side wrapping of torch.cuda.OutOfMemoryError et al.) raises PluginResourceError directly with a populated ResourceShortfall because the plugin had the context. Both land at the same except PluginResourceError site in CR-7’s reactive retry loop.
process_returncode carries the observed exit code for debugging / classification (e.g. operators can distinguish kernel-OOM SIGKILL from other signals if they read it). Defaults to None for callers that don’t have it on hand.
PluginConfigError (reparented from utils.validation per CR-5)
Originally defined in utils/validation.py by SG-8 as a ValueError subclass. CR-5 reparents it under PluginInputError. The reparenting preserves except ValueError: compat (via PluginInputError’s ValueError MRO) and unifies the field-validation attribute name with the rest of the input-error hierarchy: SG-8’s unknown_keys becomes fields_invalid (canonical).
Backward-compat handling for the SG-8-era kwarg + attribute:
unknown_keys= keyword in __init__ is accepted but emits DeprecationWarning.
unknown_keys is a read-only property aliasing fields_invalid.
Both are tagged # REMOVE-AFTER-OVERHAUL for SG-48 sweep.
Why two REMOVE-AFTER-OVERHAUL tags rather than one: the kwarg and the property address different migration paths. The kwarg shim helps code that constructs the exception; the property shim helps code that inspects the exception after catching it. Either can be removed independently once SG-47 cascades.
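The two shims can be sketched as follows. This is an assumed shape, not the library's source: the stand-in subclasses ValueError directly to stay self-contained (the real class sits under PluginInputError), and the warning text is made up.

```python
import warnings
from typing import List, Optional


class PluginConfigError(ValueError):
    """Stand-in showing the kwarg shim and the property shim side by side."""

    def __init__(self, message: str,
                 fields_invalid: Optional[List[str]] = None,
                 config_class_name: str = "",
                 unknown_keys: Optional[List[str]] = None):  # deprecated kwarg
        super().__init__(message)
        if unknown_keys is not None:
            # Kwarg shim: helps code that CONSTRUCTS the exception.
            warnings.warn(
                "unknown_keys= is deprecated; pass fields_invalid= instead",
                DeprecationWarning,
                stacklevel=2,
            )
            if fields_invalid is None:
                fields_invalid = unknown_keys  # fields_invalid wins when both are given
        self.fields_invalid = fields_invalid
        self.config_class_name = config_class_name

    @property
    def unknown_keys(self) -> Optional[List[str]]:
        # Property shim: helps code that INSPECTS the exception after catching it.
        return self.fields_invalid


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = PluginConfigError("legacy call", unknown_keys=["x"])
    both = PluginConfigError("both", fields_invalid=["a"], unknown_keys=["b"])

warned = [w for w in caught if issubclass(w.category, DeprecationWarning)]
```

Since the property has no setter, assigning to unknown_keys raises AttributeError, which matches the read-only alias described above.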
PluginConfigError
def PluginConfigError(
    message:str, # Human-readable description
    fields_invalid:Optional=None, # Canonical: list of bad config keys
    config_class_name:str='', # Dataclass / plugin name for the schema
    unknown_keys:Optional=None, # REMOVE-AFTER-OVERHAUL: drop unknown_keys kwarg after SG-47 cascade completes
):
Unknown / invalid keys in a config dict against a plugin’s config schema.
Reparented from cjm_plugin_system.utils.validation (Wave 2 / SG-8) under CR-5. Inherits PluginInputError’s ValueError MRO automatically. config_class_name is the dataclass / plugin name whose schema was violated.
JobError + ResourceShortfall + TracebackPolicy
When a plugin job fails, the JobQueue (CR-6) records a JobError summary on the completed Job. The summary captures everything a frontend / operator needs to understand and (optionally) retry the failure without re-running the plugin:
category lets UI decide retry button affordances.
retriable carries the substrate’s policy on whether to auto-retry.
original_exc_repr + optional traceback give post-mortem context.
fields_invalid / resource_shortfall are category-specific structured data.
TracebackPolicy controls how much detail the substrate records. Default FULL is what dev mode wants; REPR_ONLY and NONE are future opt-outs for security-sensitive multi-user deployments.
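To make the shape concrete, here is a reduced sketch of a JobError-like record and how a traceback policy could gate what is stored. Only the FULL value ('full') appears on this page; the other two enum values, the reduced field set, the `summarize` helper, and the REPR_ONLY behavior (message kept, traceback dropped) are assumptions.

```python
import traceback as tb
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class TracebackPolicy(Enum):
    FULL = "full"
    REPR_ONLY = "repr_only"  # assumed value
    NONE = "none"            # assumed value


@dataclass
class JobError:
    """Reduced field set for illustration; the real dataclass carries more."""
    category: str
    retriable: bool
    message: str
    original_exc_repr: str
    traceback: Optional[str] = None
    fields_invalid: Optional[List[str]] = None
    occurred_at: Optional[datetime] = None


def summarize(exc: BaseException, category: str, retriable: bool,
              policy: TracebackPolicy = TracebackPolicy.FULL) -> JobError:
    """Build a summary, recording only as much detail as the policy allows."""
    keep_traceback = policy is TracebackPolicy.FULL
    return JobError(
        category=category,
        retriable=retriable,
        message="" if policy is TracebackPolicy.NONE else str(exc),
        original_exc_repr=repr(exc),  # always kept as a debug breadcrumb
        traceback=(
            "".join(tb.format_exception(type(exc), exc, exc.__traceback__))
            if keep_traceback else None
        ),
        fields_invalid=getattr(exc, "fields_invalid", None),
        occurred_at=datetime.now(timezone.utc),  # timezone-aware timestamp
    )


try:
    raise ValueError("secret")
except ValueError as e:
    full_summary = summarize(e, "user_input", True)
    redacted = summarize(e, "user_input", True, TracebackPolicy.NONE)
```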
JobError
Structured failure summary recorded on a completed Job.
Populated by the JobQueue when a plugin execution fails (CR-6 owns the population logic; CR-5 owns the shape). Sufficient for UI to render a failure card + retry affordance without re-running the plugin.
ResourceShortfall
Quantitative gap between what a plugin needed and what was available.
Default classification of bare Python exceptions
Plugin authors will gradually migrate to raising PluginError subclasses (SG-47 cascade). Until then, the JobQueue still needs to classify bare ValueError / TimeoutError / etc. into one of the four categories so retry policy is correct from day one.
The mapping walks the exception’s __mro__ against a substrate-provided lookup. First MRO ancestor that matches wins. Default for everything else is fatal — conservative: don’t auto-retry an exception we can’t classify.
classify_exception
def classify_exception(
    exc:BaseException, # The exception to classify
)->Literal['user_input', 'transient', 'resource', 'fatal']: # Category
Return the substrate category for any exception.
PluginError subclasses report their own declared category. Bare Python exceptions are mapped via __mro__ walk against _BARE_EXCEPTION_CATEGORY_MAP; the first ancestor in the table wins. Unrecognized exceptions classify as fatal (don’t auto-retry the unknown).
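The MRO walk can be sketched in a few lines. The table entries below are the ones exercised by the regression tests on this page; the real _BARE_EXCEPTION_CATEGORY_MAP may contain more. `CATEGORY_MAP`, `classify_sketch`, and `BadShapeError` are hypothetical names used only for this illustration.

```python
from typing import Dict, Type

# Entries taken from the regression tests; the real table may be larger.
CATEGORY_MAP: Dict[Type[BaseException], str] = {
    ValueError: "user_input",
    TypeError: "user_input",
    FileNotFoundError: "user_input",
    TimeoutError: "transient",
    ConnectionError: "transient",
    MemoryError: "resource",
}


class PluginError(Exception):
    category = "fatal"


class PluginTransientError(PluginError):
    category = "transient"


def classify_sketch(exc: BaseException) -> str:
    """Declared category first, then first matching MRO ancestor, else 'fatal'."""
    if isinstance(exc, PluginError):
        return exc.category
    for ancestor in type(exc).__mro__:  # most specific class first
        if ancestor in CATEGORY_MAP:
            return CATEGORY_MAP[ancestor]
    return "fatal"  # conservative default: do not auto-retry the unknown


class BadShapeError(ValueError):
    """Hypothetical plugin-defined subclass of a mapped built-in."""


results = {
    "subclass_of_value_error": classify_sketch(BadShapeError("bad")),
    "connection_reset": classify_sketch(ConnectionResetError("net")),
    "key_error": classify_sketch(KeyError("k")),
    "declared": classify_sketch(PluginTransientError("slow")),
}
```

Walking the MRO rather than matching exact types means subclasses such as ConnectionResetError inherit the category of their nearest mapped ancestor without needing their own table entry.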
map_bare_exception_to_job_error
def map_bare_exception_to_job_error(
    exc:BaseException, # The raised exception
    plugin_name:Optional=None, # Name of the plugin that raised
    plugin_instance_id:Optional=None, # Per CR-10
    traceback_policy:TracebackPolicy=<TracebackPolicy.FULL: 'full'>, # How much detail to record
    occurred_at:Optional=None, # Override; defaults to datetime.now(timezone.utc)
)->JobError:
Convert any exception into a structured JobError.
PluginError subclasses contribute their category-specific structured data (fields_invalid for input errors, resource_shortfall for resource errors, retry_after_seconds for transient errors). Bare exceptions get the default category-based retriable flag and no structured side-channel.
Regression tests
These tests pin the MRO discipline, the backward-compat shim behavior, and the default classification. The MRO assertions are particularly load-bearing: a future refactor that accidentally made PluginError itself inherit ValueError would let a bare except ValueError: catch transient, resource, and fatal errors, which we explicitly do not want.
# MRO discipline: only PluginInputError tree is catchable as ValueError.
input_err = PluginInputError("bad", fields_invalid=["foo"])
assert isinstance(input_err, ValueError)
assert isinstance(input_err, PluginError)
assert isinstance(input_err, Exception)
assert input_err.category == 'user_input'
assert input_err.default_retriable is True
assert input_err.fields_invalid == ["foo"]

transient_err = PluginTransientError("slow", retry_after_seconds=5.0)
assert not isinstance(transient_err, ValueError), \
    "PluginTransientError must NOT inherit ValueError (semantic discipline)"
assert isinstance(transient_err, PluginError)
assert transient_err.category == 'transient'
assert transient_err.retry_after_seconds == 5.0

resource_err = PluginResourceError(
    "oom",
    resource_shortfall=ResourceShortfall(resource='gpu_vram_mb', needed=8000, available=4000),
)
assert not isinstance(resource_err, ValueError)
assert resource_err.category == 'resource'
assert resource_err.resource_shortfall.needed == 8000

fatal_err = PluginFatalError("crashed")
assert not isinstance(fatal_err, ValueError)
assert fatal_err.category == 'fatal'
assert fatal_err.default_retriable is False

print("✓ MRO discipline: only PluginInputError tree extends ValueError")
✓ MRO discipline: only PluginInputError tree extends ValueError
# Substrate-side typed exceptions anchor under the correct category.
disabled = PluginDisabledError("whisper")
assert isinstance(disabled, PluginInputError)
assert isinstance(disabled, ValueError), "PluginDisabledError must be catchable as ValueError"
assert disabled.category == 'user_input'
assert disabled.plugin_name == "whisper"

not_loaded = PluginNotLoadedError("whisper")
assert isinstance(not_loaded, PluginFatalError)
assert not isinstance(not_loaded, ValueError), \
    "PluginNotLoadedError must NOT be catchable as ValueError (it's a fatal bug)"
assert not_loaded.category == 'fatal'

timeout = PluginTimeoutError("whisper", timeout_seconds=30.0, retry_after_seconds=60.0)
assert isinstance(timeout, PluginTransientError)
assert not isinstance(timeout, ValueError)
assert timeout.category == 'transient'
assert timeout.timeout_seconds == 30.0
assert timeout.retry_after_seconds == 60.0

# CR-4: PluginCancelledError extends PluginTransientError but is non-retriable
# (deliberate operator action: substrate should not auto-retry cancelled jobs).
cancelled = PluginCancelledError("whisper")
assert isinstance(cancelled, PluginTransientError)
assert isinstance(cancelled, PluginError)
assert not isinstance(cancelled, ValueError), \
    "PluginCancelledError must NOT be catchable as ValueError (it's a control-flow signal, not a value error)"
assert cancelled.category == 'transient', "category=transient: cancellation is in-principle re-runnable"
assert cancelled.default_retriable is False, \
    "default_retriable=False: substrate must not auto-retry operator-cancelled jobs"
assert cancelled.plugin_name == "whisper"
assert "cancelled by operator" in str(cancelled)

# CR-7 Track A: WorkerOOMError extends PluginResourceError with default_retriable=True
# inherited; carries process_returncode for operator debugging; no ResourceShortfall.
oom = WorkerOOMError("whisper", process_returncode=-9)
assert isinstance(oom, PluginResourceError), "must catch under PluginResourceError"
assert isinstance(oom, PluginError)
assert not isinstance(oom, ValueError), \
    "resource errors are not ValueErrors"
assert oom.category == 'resource', "CR-7 reactive retry dispatches on category=resource"
assert oom.default_retriable is True, \
    "default_retriable=True: OOM is retriable after eviction (the whole point of CR-7)"
assert oom.plugin_name == "whisper"
assert oom.process_returncode == -9
assert oom.resource_shortfall is None, \
    "Track A: substrate doesn't know needed/available; only Track B (plugin-side raise) does"
assert "whisper" in str(oom)
assert "returncode=-9" in str(oom)

# CR-7: WorkerOOMError catches at the PluginResourceError site (the shared
# CR-7 reactive retry catch-point). Track A + Track B converge here.
def fake_track_a_raise():
    raise WorkerOOMError("voxtral", process_returncode=-9)

def fake_track_b_raise():
    raise PluginResourceError(
        "voxtral: CUDA OOM",
        resource_shortfall=ResourceShortfall(
            resource='gpu_vram_mb', needed=24000, available=8000,
        ),
    )

for raiser in (fake_track_a_raise, fake_track_b_raise):
    caught = False
    try:
        raiser()
    except PluginResourceError:
        caught = True
    assert caught, f"{raiser.__name__} must catch under PluginResourceError"

# Custom message override path
oom_custom = WorkerOOMError("whisper", message="custom diagnostic")
assert str(oom_custom) == "custom diagnostic"
assert oom_custom.plugin_name == "whisper"
assert oom_custom.process_returncode is None

print("✓ Substrate-side typed exceptions anchor under the right category")
# PluginConfigError reparenting: ValueError MRO preserved, fields_invalid canonical.
err = PluginConfigError(
    "unknown keys",
    fields_invalid=["foo", "bar"],
    config_class_name="WhisperConfig",
)
assert isinstance(err, PluginInputError)
assert isinstance(err, ValueError), "SG-8 era except ValueError: must still catch this"
assert err.fields_invalid == ["foo", "bar"]
assert err.config_class_name == "WhisperConfig"
assert err.unknown_keys == ["foo", "bar"], "property alias must mirror fields_invalid"

# Deprecated unknown_keys kwarg still works but warns.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_err = PluginConfigError(
        "legacy call",
        unknown_keys=["x"],
        config_class_name="WhisperConfig",
    )
assert any(
    issubclass(w.category, DeprecationWarning) and "unknown_keys" in str(w.message)
    for w in caught
), "deprecated unknown_keys= kwarg must emit DeprecationWarning"
assert legacy_err.fields_invalid == ["x"]
assert legacy_err.unknown_keys == ["x"]

# Both kwargs together: fields_invalid wins, still warns.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    both_err = PluginConfigError(
        "both",
        fields_invalid=["a"],
        unknown_keys=["b"],
    )
assert any(issubclass(w.category, DeprecationWarning) for w in caught)
assert both_err.fields_invalid == ["a"], "fields_invalid wins when both kwargs provided"

print("✓ PluginConfigError reparenting + backward-compat shims work")
✓ PluginConfigError reparenting + backward-compat shims work
# Default classification of bare Python exceptions.
assert classify_exception(ValueError("bad")) == 'user_input'
assert classify_exception(TypeError("bad")) == 'user_input'
assert classify_exception(FileNotFoundError("missing")) == 'user_input'
assert classify_exception(TimeoutError("slow")) == 'transient'
assert classify_exception(ConnectionError("net")) == 'transient'
assert classify_exception(MemoryError("oom")) == 'resource'
assert classify_exception(RuntimeError("unknown")) == 'fatal'

# PluginError subclasses report their own declared category, not the
# inherited-builtin's category. PluginInputError extends ValueError but its
# category is 'user_input' (the declared value), not derived from ValueError.
assert classify_exception(PluginInputError("x")) == 'user_input'
assert classify_exception(PluginTransientError("x")) == 'transient'
assert classify_exception(PluginResourceError("x")) == 'resource'
assert classify_exception(PluginFatalError("x")) == 'fatal'

# PluginNotLoadedError is fatal even though no built-in maps to fatal by default.
assert classify_exception(PluginNotLoadedError("whisper")) == 'fatal'

print("✓ Default exception classification correct")
✓ Default exception classification correct
# map_bare_exception_to_job_error captures category + retriable + structured data.
try:
    raise PluginConfigError("bad config", fields_invalid=["model"])
except Exception as e:
    err = map_bare_exception_to_job_error(e, plugin_name="whisper")
assert err.category == 'user_input'
assert err.retriable is True
assert err.fields_invalid == ["model"]
assert err.plugin_name == "whisper"
assert err.traceback is not None and "PluginConfigError" in err.traceback
assert err.occurred_at is not None
# Python 3.12+ compat: occurred_at must be timezone-aware (datetime.utcnow()
# is deprecated and returns naive datetime; we use datetime.now(timezone.utc)).
assert err.occurred_at.tzinfo is not None, \
    "occurred_at should be timezone-aware (CR-5 Python 3.12+ future-proof form)"

# Resource error: resource_shortfall propagates.
try:
    raise PluginResourceError(
        "oom",
        resource_shortfall=ResourceShortfall(resource='gpu_vram_mb', needed=16000, available=8000),
    )
except Exception as e:
    err = map_bare_exception_to_job_error(e)
assert err.category == 'resource'
assert err.retriable is True
assert err.resource_shortfall.needed == 16000

# Bare ValueError gets default user_input + retriable=True.
try:
    raise ValueError("unmapped bare")
except Exception as e:
    err = map_bare_exception_to_job_error(e)
assert err.category == 'user_input'
assert err.retriable is True
assert err.fields_invalid is None  # bare ValueError has no fields_invalid attribute

# Bare RuntimeError gets default fatal + retriable=False.
try:
    raise RuntimeError("unknown")
except Exception as e:
    err = map_bare_exception_to_job_error(e)
assert err.category == 'fatal'
assert err.retriable is False

# TracebackPolicy.NONE suppresses traceback + message.
try:
    raise ValueError("secret")
except Exception as e:
    err = map_bare_exception_to_job_error(e, traceback_policy=TracebackPolicy.NONE)
assert err.traceback is None
assert err.message == ""
assert err.original_exc_repr  # repr is always kept (debug breadcrumb)

print("✓ map_bare_exception_to_job_error preserves structured data + honors TracebackPolicy")
✓ map_bare_exception_to_job_error preserves structured data + honors TracebackPolicy