Typed exception hierarchy, JobError dataclass, and default classification of bare Python exceptions. This is the substrate’s CR-5 implementation, per the 2026-05-19 substrate audit.
Category model
Every substrate-recognized exception carries a category ClassVar that tells the JobQueue / scheduler / operator UI which retry-or-not-retry treatment is appropriate. Four categories:
| Category | Retriable by default | Meaning |
|---|---|---|
| user_input | Yes (the user can fix and resubmit) | Bad config, missing file, invalid argument |
| transient | Yes (retry may succeed) | Network blip, timeout, temporary resource lock |
| resource | Yes (after eviction) | Out of GPU VRAM, disk, system RAM |
| fatal | No | Bug, broken capability install, irrecoverable state |
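As a rough illustration of the category model, the sketch below shows how class-level `category` and `default_retriable` attributes can drive routing without inspecting exception text. Every class name here is a hypothetical stand-in rather than the substrate's real class, and the treatment labels returned by `treatment` are invented for the example.

```python
from typing import ClassVar, Literal

Category = Literal["user_input", "transient", "resource", "fatal"]


class SketchCapabilityError(Exception):
    """Hypothetical stand-in for the substrate's CapabilityError base."""
    category: ClassVar[Category] = "fatal"
    default_retriable: ClassVar[bool] = False


class SketchInputError(SketchCapabilityError):
    category = "user_input"
    default_retriable = True   # the user can fix and resubmit


class SketchTransientError(SketchCapabilityError):
    category = "transient"
    default_retriable = True   # retry may succeed


class SketchResourceError(SketchCapabilityError):
    category = "resource"
    default_retriable = True   # retry after eviction


class SketchFatalError(SketchCapabilityError):
    category = "fatal"
    default_retriable = False  # never auto-retry


def treatment(exc: SketchCapabilityError) -> str:
    """Pick a treatment label (assumed names) from class-level data alone."""
    if not exc.default_retriable:
        return "fail"
    return {
        "user_input": "ask_user_to_resubmit",
        "transient": "retry_with_backoff",
        "resource": "evict_then_retry",
    }.get(exc.category, "fail")
```

Because the routing reads only `category` and `default_retriable`, a subclass can keep its category while overriding the retry default, which is the pattern the cancellation error described later relies on.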
MRO discipline: CapabilityInputError is the only category that multiply inherits ValueError. The semantic argument: except ValueError: expresses intent to catch invalid-argument errors. Letting transient / resource / fatal errors be caught by a bare except ValueError: would silently broaden that intent, which we specifically do not want.
CapabilityResourceError
Resource exhaustion: GPU VRAM, system RAM, disk full.
JobQueue’s reactive-eviction flow (CR-7) routes resource errors to retry after attempting to free the named resource. Capability authors set resource_shortfall so the substrate knows what to evict.
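The reactive-eviction idea can be sketched roughly as follows. This is not CR-7's implementation: `ResourceErrorSketch`, `ShortfallSketch`, `run_with_reactive_eviction`, and the `evict` callback are hypothetical names, and only the `resource` / `needed` / `available` field names are taken from the tests in this document.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ShortfallSketch:
    """Stand-in for ResourceShortfall."""
    resource: str
    needed: int
    available: int


class ResourceErrorSketch(Exception):
    """Stand-in for CapabilityResourceError."""

    def __init__(self, message: str,
                 resource_shortfall: Optional[ShortfallSketch] = None):
        super().__init__(message)
        self.resource_shortfall = resource_shortfall


def run_with_reactive_eviction(job: Callable[[], str],
                               evict: Callable[[Optional[str]], None],
                               max_attempts: int = 2) -> str:
    """Run job; on a resource error, free the named resource and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except ResourceErrorSketch as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the typed error
            shortfall = exc.resource_shortfall
            # None means the resource is unknown; the evictor decides what to free.
            evict(shortfall.resource if shortfall else None)
    raise ValueError("max_attempts must be >= 1")


# Usage: the first call fails with a named shortfall, the retry succeeds.
evicted = []
calls = {"n": 0}


def flaky_job() -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ResourceErrorSketch(
            "CUDA OOM", ShortfallSketch("gpu_vram_mb", 24000, 8000))
    return "done"


result = run_with_reactive_eviction(flaky_job, evicted.append)
```

Passing the resource name to the evictor is what makes a populated shortfall useful: without it, the retry loop can only evict blindly.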
CapabilityTransientError
def CapabilityTransientError(
    message:str, # Human-readable description
    retry_after_seconds:Optional=None, # Hint for backoff strategies
):
Substrate / JobQueue may retry on its own initiative. Capability authors raise this when they know the failure is recoverable.
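One way a caller might honor the `retry_after_seconds` hint is sketched below. The exception class is a stand-in, and the exponential fallback (`base`, `cap`) is an assumption made for the example; the source does not state which backoff strategy the substrate uses.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientErrorSketch(Exception):
    """Stand-in for CapabilityTransientError."""

    def __init__(self, message: str,
                 retry_after_seconds: Optional[float] = None):
        super().__init__(message)
        self.retry_after_seconds = retry_after_seconds


def backoff_delay(exc: TransientErrorSketch, attempt: int,
                  base: float = 0.5, cap: float = 30.0) -> float:
    """Prefer the exception's hint; otherwise use capped exponential backoff."""
    if exc.retry_after_seconds is not None:
        return exc.retry_after_seconds
    return min(base * 2 ** (attempt - 1), cap)


def call_with_retry(fn: Callable[[], T], max_attempts: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry fn on transient errors, sleeping between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientErrorSketch as exc:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(backoff_delay(exc, attempt))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the retry loop can be exercised in tests without real waiting.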
CapabilityInputError
def CapabilityInputError(
    message:str, # Human-readable description
    fields_invalid:Optional=None, # Names of inputs that failed validation
):
User-fixable error: bad config, invalid argument, missing file.
Like the other category bases (CapabilityTransientError, CapabilityResourceError, CapabilityFatalError), it extends only CapabilityError; the right reader intent is except CapabilityInputError: (or the broader except CapabilityError:).
CapabilityError
Base for substrate-recognized capability exceptions.
Subclasses declare a category and default_retriable ClassVar so the JobQueue + scheduler can route the failure without sniffing exception text. Bare Python exceptions raised by capability code go through map_bare_exception_to_job_error to acquire a default category.
Substrate-raised typed exceptions
These concrete exception types are defined in CR-5 (this file) so CR-2 / CR-6 / SG-14 can raise them when they land. Each anchors under the appropriate category base so its catch behavior is correct from day one.
CapabilityNotLoadedError
Caller submitted to a capability that was never loaded.
Fatal category because this is a programmer / orchestration bug, not a user-fixable condition. The right reader intent is except CapabilityNotLoadedError: (or the broader except CapabilityError:).
CapabilityTimeoutError
A per-job timeout fired before the capability finished.
Transient category — retry may succeed if the slow operation completes faster next time. Carries retry_after_seconds from CapabilityTransientError. Raised by SG-14’s per-job timeout primitive when that lands.
CapabilityCancelledError
Cooperative cancellation signal raised from ToolCapability.check_cancel().
Anchors under CapabilityTransientError because cancellation is in-principle re-runnable — a future attempt with the same inputs won’t auto-fail if the cancel flag isn’t set. But default_retriable is False: cancellation was a deliberate operator action, so the substrate should NOT auto-retry. Job-monitor / JobQueue render cancelled jobs with their own state (separate from “failed”); the JobError category remains transient so consumers reading the typed taxonomy can group recoverable signals.
Capability authors raise this implicitly via self.check_cancel() inside execute(); substrate sets the underlying _cancel_requested flag via cancel(). See CR-4’s cancellation primitives for the cooperative-cancel protocol.
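The cooperative-cancel protocol described above might look roughly like this from a capability author's side. The classes are stand-ins, and backing the `_cancel_requested` flag with a `threading.Event` is an assumption; the source only says the substrate sets the flag via `cancel()`.

```python
import threading
from typing import List


class CancelledSketch(Exception):
    """Stand-in for CapabilityCancelledError."""


class ToolCapabilitySketch:
    """Stand-in base exposing cancel() and check_cancel()."""

    def __init__(self) -> None:
        # Assumed representation of the _cancel_requested flag.
        self._cancel_requested = threading.Event()

    def cancel(self) -> None:
        """Called by the substrate side to request cancellation."""
        self._cancel_requested.set()

    def check_cancel(self) -> None:
        """Called by capability code at safe points inside execute()."""
        if self._cancel_requested.is_set():
            raise CancelledSketch("cancelled by operator")


class UppercaseChunks(ToolCapabilitySketch):
    """Hypothetical capability that polls for cancellation per chunk."""

    def execute(self, chunks: List[str]) -> List[str]:
        done = []
        for chunk in chunks:
            self.check_cancel()  # cooperative: only stops between chunks
            done.append(chunk.upper())
        return done
```

Cancellation is only as responsive as the polling: a capability that never calls `check_cancel()` inside a long loop will run to completion regardless of the flag.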
WorkerOOMError
The worker subprocess died with a kill-signal during an active execute call.
CR-7 Track A — substrate-side OOM detection: when an HTTP call to the worker faults and the subprocess has died with returncode == -signal.SIGKILL (or the platform equivalent), the substrate raises this. The kernel OOM-killer is the most common cause of SIGKILL during normal execute paths, so the substrate treats SIGKILL-during-call as “assume OOM” and surfaces a typed resource error for the reactive retry path.
resource_shortfall is None for Track A — the substrate only saw “worker died from kill-signal” and has no per-resource needed/available numbers. Track B (per SG-47’s sub-task: capability-side wrapping of torch.cuda.OutOfMemoryError et al.) raises CapabilityResourceError directly with a populated ResourceShortfall because the capability had the context. Both land at the same except CapabilityResourceError site in CR-7’s reactive retry loop.
process_returncode carries the observed exit code for debugging / classification (e.g. operators can distinguish kernel-OOM SIGKILL from other signals if they read it). Defaults to None for callers that don’t have it on hand.
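The Track A check on the worker's exit code can be sketched as a small predicate. The function name `looks_like_oom_kill` is hypothetical; the only library fact relied on is that `subprocess.Popen.returncode` is `-N` on POSIX when the child was terminated by signal `N`.

```python
import signal
from typing import Optional


def looks_like_oom_kill(returncode: Optional[int]) -> bool:
    """Return True when a worker exit code means 'terminated by SIGKILL'.

    On POSIX, subprocess reports death-by-signal N as returncode == -N.
    SIGKILL does not exist on every platform, so it is looked up defensively.
    """
    sigkill = getattr(signal, "SIGKILL", None)
    if returncode is None or sigkill is None:
        return False  # still running, unknown, or no SIGKILL on this platform
    return returncode == -int(sigkill)
```

A SIGKILL exit is only a heuristic for OOM, which is why keeping the raw `process_returncode` on the exception helps operators rule out other causes.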
CapabilityConfigError (reparented from utils.validation per CR-5)
Originally defined in utils/validation.py by SG-8 as a ValueError subclass. CR-5 reparents it under CapabilityInputError, preserving except ValueError: compat (via CapabilityInputError’s ValueError MRO) and using fields_invalid as the canonical attribute name, consistent with the rest of the input-error hierarchy.
CapabilityConfigError
def CapabilityConfigError(
    message:str, # Human-readable description
    fields_invalid:Optional=None, # Canonical: list of bad config keys
    config_class_name:str='', # Dataclass / capability name for the schema
):
Unknown / invalid keys in a config dict against a capability’s config schema.
Reparented from cjm_substrate.utils.validation (Wave 2 / SG-8) under CR-5. Inherits CapabilityInputError’s ValueError MRO automatically. config_class_name is the dataclass / capability name whose schema was violated.
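To make the two attributes concrete, here is a minimal sketch of checking a config dict against a dataclass schema and reporting unknown keys. `ConfigErrorSketch`, `WhisperConfigSketch`, and `validate_config_keys` are hypothetical; the real validation lives in the substrate's utilities and may differ, and the sketch does not model the exception's actual base classes.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Optional


class ConfigErrorSketch(Exception):
    """Stand-in for CapabilityConfigError (base classes not modeled)."""

    def __init__(self, message: str,
                 fields_invalid: Optional[List[str]] = None,
                 config_class_name: str = ""):
        super().__init__(message)
        self.fields_invalid = fields_invalid
        self.config_class_name = config_class_name


@dataclass
class WhisperConfigSketch:
    """Hypothetical config schema used only for this example."""
    model: str = "base"
    beam_size: int = 5


def validate_config_keys(config: Dict[str, Any], schema: type) -> None:
    """Raise if config contains keys the dataclass schema does not declare."""
    known = {f.name for f in fields(schema)}
    unknown = sorted(key for key in config if key not in known)
    if unknown:
        raise ConfigErrorSketch(
            f"Unknown config keys for {schema.__name__}: {unknown}",
            fields_invalid=unknown,
            config_class_name=schema.__name__,
        )
```

Reporting the offending keys in `fields_invalid` rather than only in the message lets a UI highlight the exact inputs to fix.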
JobError + ResourceShortfall + TracebackPolicy
When a capability job fails, the JobQueue (CR-6) records a JobError summary on the completed Job. The summary captures everything a frontend / operator needs to understand and (optionally) retry the failure without re-running the capability:
category lets UI decide retry button affordances.
retriable carries the substrate’s policy on whether to auto-retry.
original_exc_repr + optional traceback give post-mortem context.
fields_invalid / resource_shortfall are category-specific structured data.
TracebackPolicy controls how much detail the substrate records. Default FULL is what dev mode wants; REPR_ONLY and NONE are future opt-outs for security-sensitive multi-user deployments.
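The shape described above could be consumed by a frontend along these lines. This is a sketch under assumptions: `JobErrorSketch` carries only a subset of the fields named in this section, and `failure_card` with its dictionary keys is invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class JobErrorSketch:
    """Stand-in carrying a subset of the JobError fields named above."""
    category: str
    retriable: bool
    message: str
    original_exc_repr: str
    traceback: Optional[str] = None
    fields_invalid: Optional[List[str]] = None


def failure_card(err: JobErrorSketch) -> Dict[str, Any]:
    """Build a UI payload (hypothetical keys) from the summary alone."""
    card: Dict[str, Any] = {
        "title": f"{err.category} failure",
        "show_retry_button": err.retriable,
        # Fall back to the repr when the message was suppressed.
        "detail": err.message or err.original_exc_repr,
    }
    if err.fields_invalid:
        card["highlight_fields"] = list(err.fields_invalid)
    return card
```

Nothing in `failure_card` touches the capability or the original exception object, which is the point of recording a self-contained summary.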
JobError
Structured failure summary recorded on a completed Job.
Populated by the JobQueue when a capability execution fails (CR-6 owns the population logic; CR-5 owns the shape). Sufficient for UI to render a failure card + retry affordance without re-running the capability.
ResourceShortfall
Quantitative gap between what a capability needed and what was available.
Default classification of bare Python exceptions
Capability authors will gradually migrate to raising CapabilityError subclasses (SG-47 cascade). Until then, the JobQueue still needs to classify bare ValueError / TimeoutError / etc. into one of the four categories so retry policy is correct from day one.
The mapping walks the exception’s __mro__ against a substrate-provided lookup. First MRO ancestor that matches wins. Default for everything else is fatal — conservative: don’t auto-retry an exception we can’t classify.
classify_exception
def classify_exception(
    exc:BaseException, # The exception to classify
)->Literal['user_input', 'transient', 'resource', 'fatal']: # Category
Return the substrate category for any exception.
CapabilityError subclasses report their own declared category. Bare Python exceptions are mapped via __mro__ walk against _BARE_EXCEPTION_CATEGORY_MAP; the first ancestor in the table wins. Unrecognized exceptions classify as fatal (don’t auto-retry the unknown).
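The MRO walk can be illustrated with a small stand-alone sketch. The lookup table below is an assumption reconstructed from the regression tests in this document, not the actual contents of `_BARE_EXCEPTION_CATEGORY_MAP`, and `DeclaredErrorSketch` stands in for `CapabilityError`.

```python
from typing import Dict, Type

# Assumed table, inferred from the regression tests in this document.
_BARE_MAP_SKETCH: Dict[Type[BaseException], str] = {
    ValueError: "user_input",
    TypeError: "user_input",
    FileNotFoundError: "user_input",
    TimeoutError: "transient",
    ConnectionError: "transient",
    MemoryError: "resource",
}


class DeclaredErrorSketch(Exception):
    """Stand-in for CapabilityError: carries its own declared category."""
    category = "fatal"


class DeclaredTransientSketch(DeclaredErrorSketch):
    category = "transient"


def classify_sketch(exc: BaseException) -> str:
    """Declared category wins; otherwise the first MRO ancestor in the table."""
    if isinstance(exc, DeclaredErrorSketch):
        return type(exc).category
    for ancestor in type(exc).__mro__:
        if ancestor in _BARE_MAP_SKETCH:
            return _BARE_MAP_SKETCH[ancestor]
    return "fatal"  # conservative default: do not auto-retry the unknown
```

Walking `__mro__` means subclasses inherit a classification for free: `ConnectionResetError` resolves through `ConnectionError`, while `PermissionError` only reaches the unmapped `OSError` and falls through to fatal.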
map_bare_exception_to_job_error
def map_bare_exception_to_job_error(
    exc:BaseException, # The raised exception
    capability_name:Optional=None, # Name of the capability that raised
    capability_instance_id:Optional=None, # Per CR-10
    traceback_policy:TracebackPolicy=<TracebackPolicy.FULL: 'full'>, # How much detail to record
    occurred_at:Optional=None, # Override; defaults to datetime.now(timezone.utc)
)->JobError:
Convert any exception into a structured JobError.
CapabilityError subclasses contribute their category-specific structured data (fields_invalid for input errors, resource_shortfall for resource errors, retry_after_seconds for transient errors). Bare exceptions get the default category-based retriable flag and no structured side-channel.
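A simplified version of the conversion might look like the following. It is a sketch, not the substrate's code: the enum values other than `'full'` are guesses, the `REPR_ONLY` behavior (keep message and repr, drop the traceback) is an assumption, and category and retriable handling are left out to keep the focus on the structured side-channel and the traceback policy.

```python
import traceback as tb
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class PolicySketch(Enum):
    FULL = "full"
    REPR_ONLY = "repr_only"  # assumed value
    NONE = "none"            # assumed value


@dataclass
class ConvertedErrorSketch:
    message: str
    original_exc_repr: str
    traceback: Optional[str]
    fields_invalid: Optional[List[str]]
    retry_after_seconds: Optional[float]
    occurred_at: datetime


def to_job_error_sketch(exc: BaseException,
                        policy: PolicySketch = PolicySketch.FULL,
                        occurred_at: Optional[datetime] = None,
                        ) -> ConvertedErrorSketch:
    """Copy structured attributes when present and apply the policy."""
    text = None
    if policy is PolicySketch.FULL:
        text = "".join(tb.format_exception(type(exc), exc, exc.__traceback__))
    return ConvertedErrorSketch(
        message="" if policy is PolicySketch.NONE else str(exc),
        original_exc_repr=repr(exc),  # always kept as a debug breadcrumb
        traceback=text,
        # Bare exceptions lack these attributes, so they default to None.
        fields_invalid=getattr(exc, "fields_invalid", None),
        retry_after_seconds=getattr(exc, "retry_after_seconds", None),
        occurred_at=occurred_at or datetime.now(timezone.utc),
    )
```

Using `getattr` with a `None` default is what lets one code path serve both typed capability errors and bare Python exceptions.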
Regression tests
These tests pin the MRO discipline, the backward-compat shim behavior, and the default classification. The MRO assertions are particularly load-bearing: a future refactor that accidentally made CapabilityError itself extend ValueError would let except ValueError: catch transient, resource, and fatal errors, which we explicitly do not want.
✓ MRO discipline: only PluginInputError tree extends ValueError
# Substrate-side typed exceptions anchor under the correct category.
disabled = CapabilityDisabledError("whisper")
assert isinstance(disabled, CapabilityInputError)
assert not isinstance(disabled, ValueError), "SG-48 dropped the ValueError base"
assert disabled.category == 'user_input'
assert disabled.capability_name == "whisper"

not_loaded = CapabilityNotLoadedError("whisper")
assert isinstance(not_loaded, CapabilityFatalError)
assert not isinstance(not_loaded, ValueError), \
    "CapabilityNotLoadedError must NOT be catchable as ValueError (it's a fatal bug)"
assert not_loaded.category == 'fatal'

timeout = CapabilityTimeoutError("whisper", timeout_seconds=30.0, retry_after_seconds=60.0)
assert isinstance(timeout, CapabilityTransientError)
assert not isinstance(timeout, ValueError)
assert timeout.category == 'transient'
assert timeout.timeout_seconds == 30.0
assert timeout.retry_after_seconds == 60.0

# CR-4: CapabilityCancelledError extends CapabilityTransientError but is non-retriable
# (deliberate operator action; substrate should not auto-retry cancelled jobs).
cancelled = CapabilityCancelledError("whisper")
assert isinstance(cancelled, CapabilityTransientError)
assert isinstance(cancelled, CapabilityError)
assert not isinstance(cancelled, ValueError), \
    "CapabilityCancelledError must NOT be catchable as ValueError (it's a control-flow signal, not a value error)"
assert cancelled.category == 'transient', "category=transient: cancellation is in-principle re-runnable"
assert cancelled.default_retriable is False, \
    "default_retriable=False: substrate must not auto-retry operator-cancelled jobs"
assert cancelled.capability_name == "whisper"
assert "cancelled by operator" in str(cancelled)

# CR-7 Track A: WorkerOOMError extends CapabilityResourceError with default_retriable=True
# inherited; carries process_returncode for operator debugging; no ResourceShortfall.
oom = WorkerOOMError("whisper", process_returncode=-9)
assert isinstance(oom, CapabilityResourceError), "must catch under CapabilityResourceError"
assert isinstance(oom, CapabilityError)
assert not isinstance(oom, ValueError), "resource errors are not ValueErrors"
assert oom.category == 'resource', "CR-7 reactive retry dispatches on category=resource"
assert oom.default_retriable is True, \
    "default_retriable=True: OOM is retriable after eviction (the whole point of CR-7)"
assert oom.capability_name == "whisper"
assert oom.process_returncode == -9
assert oom.resource_shortfall is None, \
    "Track A: substrate doesn't know needed/available; only Track B (capability-side raise) does"
assert "whisper" in str(oom)
assert "returncode=-9" in str(oom)

# CR-7: WorkerOOMError catches at the CapabilityResourceError site (the shared
# CR-7 reactive retry catch-point). Track A + Track B converge here.
def fake_track_a_raise():
    raise WorkerOOMError("voxtral", process_returncode=-9)

def fake_track_b_raise():
    raise CapabilityResourceError(
        "voxtral: CUDA OOM",
        resource_shortfall=ResourceShortfall(
            resource='gpu_vram_mb', needed=24000, available=8000,
        ),
    )

for raiser in (fake_track_a_raise, fake_track_b_raise):
    caught = False
    try:
        raiser()
    except CapabilityResourceError:
        caught = True
    assert caught, f"{raiser.__name__} must catch under CapabilityResourceError"

# Custom message override path
oom_custom = WorkerOOMError("whisper", message="custom diagnostic")
assert str(oom_custom) == "custom diagnostic"
assert oom_custom.capability_name == "whisper"
assert oom_custom.process_returncode is None
print("✓ Substrate-side typed exceptions anchor under the right category")
✓ PluginConfigError reparenting + backward-compat shims work
# Default classification of bare Python exceptions.
assert classify_exception(ValueError("bad")) == 'user_input'
assert classify_exception(TypeError("bad")) == 'user_input'
assert classify_exception(FileNotFoundError("missing")) == 'user_input'
assert classify_exception(TimeoutError("slow")) == 'transient'
assert classify_exception(ConnectionError("net")) == 'transient'
assert classify_exception(MemoryError("oom")) == 'resource'
assert classify_exception(RuntimeError("unknown")) == 'fatal'

# CapabilityError subclasses report their own declared category, not the
# inherited-builtin's category. CapabilityInputError extends ValueError but its
# category is 'user_input' (the declared value), not derived from ValueError.
assert classify_exception(CapabilityInputError("x")) == 'user_input'
assert classify_exception(CapabilityTransientError("x")) == 'transient'
assert classify_exception(CapabilityResourceError("x")) == 'resource'
assert classify_exception(CapabilityFatalError("x")) == 'fatal'

# CapabilityNotLoadedError is fatal even though no built-in maps to fatal by default.
assert classify_exception(CapabilityNotLoadedError("whisper")) == 'fatal'
print("✓ Default exception classification correct")
✓ Default exception classification correct
# map_bare_exception_to_job_error captures category + retriable + structured data.
try:
    raise CapabilityConfigError("bad config", fields_invalid=["model"])
except Exception as e:
    err = map_bare_exception_to_job_error(e, capability_name="whisper")
assert err.category == 'user_input'
assert err.retriable is True
assert err.fields_invalid == ["model"]
assert err.capability_name == "whisper"
assert err.traceback is not None and "CapabilityConfigError" in err.traceback
assert err.occurred_at is not None
# Python 3.12+ compat: occurred_at must be timezone-aware (datetime.utcnow()
# is deprecated and returns naive datetime; we use datetime.now(timezone.utc)).
assert err.occurred_at.tzinfo is not None, \
    "occurred_at should be timezone-aware (CR-5 Python 3.12+ future-proof form)"

# Resource error: resource_shortfall propagates.
try:
    raise CapabilityResourceError(
        "oom",
        resource_shortfall=ResourceShortfall(resource='gpu_vram_mb', needed=16000, available=8000),
    )
except Exception as e:
    err = map_bare_exception_to_job_error(e)
assert err.category == 'resource'
assert err.retriable is True
assert err.resource_shortfall.needed == 16000

# Bare ValueError gets default user_input + retriable=True.
try:
    raise ValueError("unmapped bare")
except Exception as e:
    err = map_bare_exception_to_job_error(e)
assert err.category == 'user_input'
assert err.retriable is True
assert err.fields_invalid is None  # bare ValueError has no fields_invalid attribute

# Bare RuntimeError gets default fatal + retriable=False.
try:
    raise RuntimeError("unknown")
except Exception as e:
    err = map_bare_exception_to_job_error(e)
assert err.category == 'fatal'
assert err.retriable is False

# TracebackPolicy.NONE suppresses traceback + message.
try:
    raise ValueError("secret")
except Exception as e:
    err = map_bare_exception_to_job_error(e, traceback_policy=TracebackPolicy.NONE)
assert err.traceback is None
assert err.message == ""
assert err.original_exc_repr  # repr is always kept (debug breadcrumb)
print("✓ map_bare_exception_to_job_error preserves structured data + honors TracebackPolicy")
✓ map_bare_exception_to_job_error preserves structured data + honors TracebackPolicy