Capability Error Taxonomy

Typed exception hierarchy + JobError dataclass + default classification of bare Python exceptions. The substrate’s CR-5 implementation per the 2026-05-19 substrate audit.

Category model

Every substrate-recognized exception carries a category ClassVar that tells the JobQueue / scheduler / operator UI which retry-or-not-retry treatment is appropriate. Four categories:

Category   | Retriable by default                | Meaning
user_input | Yes (the user can fix and resubmit) | Bad config, missing file, invalid argument
transient  | Yes (retry may succeed)             | Network blip, timeout, temporary resource lock
resource   | Yes (after eviction)                | Out of GPU VRAM, disk, system RAM
fatal      | No                                  | Bug, broken capability install, irrecoverable state

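MRO discipline: no capability error category inherits ValueError. CapabilityInputError was originally the one category that multiply inherited ValueError, but SG-48 dropped that base, so it now extends only CapabilityError like the other three. The semantic argument: except ValueError: expresses intent to catch invalid-argument errors raised by ordinary Python code. Letting capability errors (transient / resource / fatal ones in particular) be caught by a bare except ValueError: would silently broaden that intent, which we specifically do not want.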


CapabilityFatalError


def CapabilityFatalError(
    *args, **kwargs
):

Bug / irrecoverable state. The capability cannot complete this job; retrying won’t help.

Capability authors raise this when they know the failure is permanent for the given inputs. The substrate does NOT retry fatal errors.


CapabilityResourceError


def CapabilityResourceError(
    message:str, # Human-readable description
    resource_shortfall:Optional=None, # Quantitative gap
):

Resource exhaustion: GPU VRAM, system RAM, disk full.

JobQueue’s reactive-eviction flow (CR-7) routes resource errors to retry after attempting to free the named resource. Capability authors set resource_shortfall so the substrate knows what to evict.
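
A minimal sketch of the evict-then-retry flow described above, not the CR-7 implementation. The two classes are local stand-ins that mirror only the documented fields so the snippet runs on its own; run_with_reactive_eviction, the evict callback, and the single-eviction budget are hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ResourceShortfall:
    """Stand-in mirroring the documented fields (resource, needed, available)."""
    resource: str
    needed: float
    available: float


class CapabilityResourceError(Exception):
    """Stand-in; the documented class extends CapabilityError."""

    def __init__(self, message: str,
                 resource_shortfall: Optional[ResourceShortfall] = None):
        super().__init__(message)
        self.resource_shortfall = resource_shortfall


def run_with_reactive_eviction(job: Callable[[], str],
                               evict: Callable[[Optional[str]], None],
                               max_evictions: int = 1) -> str:
    """Hypothetical helper: on a resource error, free the named resource and retry."""
    for attempt in range(max_evictions + 1):
        try:
            return job()
        except CapabilityResourceError as err:
            if attempt == max_evictions:
                raise  # eviction budget spent; surface the error
            shortfall = err.resource_shortfall
            # Without a shortfall the caller does not know what to free,
            # so the evict callback receives None and must choose a fallback.
            evict(shortfall.resource if shortfall else None)
    raise AssertionError("unreachable")


# Usage: the first attempt fails for lack of VRAM, eviction frees some, the retry succeeds.
free_vram_mb = [4000.0]
evicted: List[Optional[str]] = []


def job() -> str:
    if free_vram_mb[0] < 8000:
        raise CapabilityResourceError(
            "model does not fit",
            resource_shortfall=ResourceShortfall("gpu_vram_mb", 8000, free_vram_mb[0]),
        )
    return "ok"


def evict(resource: Optional[str]) -> None:
    evicted.append(resource)
    free_vram_mb[0] = 12000.0


result = run_with_reactive_eviction(job, evict)
```

Passing the resource name to the eviction callback is why the prose asks capability authors to populate resource_shortfall: an error without it still reaches the same handler, but the handler has to guess what to free.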


CapabilityTransientError


def CapabilityTransientError(
    message:str, # Human-readable description
    retry_after_seconds:Optional=None, # Hint for backoff strategies
):

Temporary failure: timeout, network blip, brief resource contention.

Substrate / JobQueue may retry on its own initiative. Capability authors raise this when they know the failure is recoverable.
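
One way a caller might honor retry_after_seconds is sketched below. The exception class is a local stand-in; retry_transient, its parameters, and the exponential fallback are assumptions for illustration and do not describe the JobQueue's actual policy. The default_retriable check reflects the documented rule that some transient subclasses (such as CapabilityCancelledError) must not be auto-retried.

```python
import time
from typing import Callable, List, Optional


class CapabilityTransientError(Exception):
    """Stand-in; the documented class extends CapabilityError."""
    default_retriable = True

    def __init__(self, message: str, retry_after_seconds: Optional[float] = None):
        super().__init__(message)
        self.retry_after_seconds = retry_after_seconds


def retry_transient(job: Callable[[], str],
                    max_attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Hypothetical helper: retry transient failures, preferring the capability's hint."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except CapabilityTransientError as err:
            if not err.default_retriable or attempt == max_attempts:
                raise
            delay = err.retry_after_seconds
            if delay is None:
                # No hint given: fall back to exponential backoff.
                delay = base_delay * 2 ** (attempt - 1)
            sleep(delay)
    raise AssertionError("unreachable")


# Usage with a fake sleep so the example finishes instantly.
failures = [CapabilityTransientError("rate limited", retry_after_seconds=2.0),
            CapabilityTransientError("network blip")]
delays: List[float] = []


def flaky_job() -> str:
    if failures:
        raise failures.pop(0)
    return "done"


result = retry_transient(flaky_job, sleep=delays.append)
```

The first failure supplies a hint, so the loop waits 2.0 seconds; the second does not, so the fallback for attempt 2 (0.5 * 2) is used.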


CapabilityInputError


def CapabilityInputError(
    message:str, # Human-readable description
    fields_invalid:Optional=None, # Names of inputs that failed validation
):

User-fixable error: bad config, invalid argument, missing file.

Like the other category bases (CapabilityTransientError, CapabilityResourceError, CapabilityFatalError), it extends only CapabilityError; the right reader intent is except CapabilityInputError: (or the broader except CapabilityError:).


CapabilityError


def CapabilityError(
    *args, **kwargs
):

Base for substrate-recognized capability exceptions.

Subclasses declare a category and default_retriable ClassVar so the JobQueue + scheduler can route the failure without sniffing exception text. Bare Python exceptions raised by capability code go through map_bare_exception_to_job_error to acquire a default category.
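
A minimal sketch of how declared ClassVars allow routing without inspecting exception text. The class names match this page, but the bodies are local stand-ins: the base-class defaults, the route function, and the treatment labels are hypothetical.

```python
from typing import ClassVar, Dict, Literal

Category = Literal["user_input", "transient", "resource", "fatal"]


class CapabilityError(Exception):
    # Assumed base defaults: an unspecialised error is treated as fatal.
    category: ClassVar[Category] = "fatal"
    default_retriable: ClassVar[bool] = False


class CapabilityInputError(CapabilityError):
    category: ClassVar[Category] = "user_input"
    default_retriable: ClassVar[bool] = True


class CapabilityTransientError(CapabilityError):
    category: ClassVar[Category] = "transient"
    default_retriable: ClassVar[bool] = True


class CapabilityResourceError(CapabilityError):
    category: ClassVar[Category] = "resource"
    default_retriable: ClassVar[bool] = True


class CapabilityFatalError(CapabilityError):
    category: ClassVar[Category] = "fatal"
    default_retriable: ClassVar[bool] = False


class CapabilityCancelledError(CapabilityTransientError):
    # Same category as its parent, but never auto-retried.
    default_retriable: ClassVar[bool] = False


# Hypothetical treatment labels, one per retriable category.
_TREATMENT: Dict[str, str] = {
    "user_input": "await_user_fix",
    "transient": "retry_with_backoff",
    "resource": "evict_then_retry",
}


def route(exc: CapabilityError) -> str:
    """Pick a treatment from the declared ClassVars only."""
    if not exc.default_retriable:
        return "do_not_retry"
    return _TREATMENT.get(exc.category, "do_not_retry")


routes = [route(e) for e in (
    CapabilityInputError("bad key"),
    CapabilityTransientError("timeout"),
    CapabilityResourceError("oom"),
    CapabilityFatalError("bug"),
    CapabilityCancelledError("stopped"),
)]
```

Because category and default_retriable are separate attributes, a subclass can keep its parent's category for grouping while overriding the retry default, which is the pattern CapabilityCancelledError uses.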

Substrate-raised typed exceptions

These concrete exception types are defined in CR-5 (this file) so CR-2 / CR-6 / SG-14 can raise them when they land. Each anchors under the appropriate category base so its catch behavior is correct from day one.


CapabilityDisabledError


def CapabilityDisabledError(
    capability_name:str
):

JobQueue / execute_capability rejected the job because the capability is currently disabled.

User-fixable (re-enable the capability). Raised by CR-2’s enable/disable wiring once that lands.


CapabilityNotLoadedError


def CapabilityNotLoadedError(
    capability_name:str
):

Caller submitted to a capability that was never loaded.

Fatal category because this is a programmer / orchestration bug, not a user-fixable condition. The right reader intent is except CapabilityNotLoadedError: (or the broader except CapabilityError:).


CapabilityTimeoutError


def CapabilityTimeoutError(
    capability_name:str, timeout_seconds:float, retry_after_seconds:Optional=None
):

A per-job timeout fired before the capability finished.

Transient category — retry may succeed if the slow operation completes faster next time. Carries retry_after_seconds from CapabilityTransientError. Raised by SG-14’s per-job timeout primitive when that lands.


CapabilityCancelledError


def CapabilityCancelledError(
    capability_name:str
):

Cooperative cancellation signal raised from ToolCapability.check_cancel().

Anchors under CapabilityTransientError because cancellation is in-principle re-runnable — a future attempt with the same inputs won’t auto-fail if the cancel flag isn’t set. But default_retriable is False: cancellation was a deliberate operator action, so the substrate should NOT auto-retry. Job-monitor / JobQueue render cancelled jobs with their own state (separate from “failed”); the JobError category remains transient so consumers reading the typed taxonomy can group recoverable signals.

Capability authors raise this implicitly via self.check_cancel() inside execute(); substrate sets the underlying _cancel_requested flag via cancel(). See CR-4’s cancellation primitives for the cooperative-cancel protocol.
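
The cooperative-cancel protocol might look roughly like the sketch below. ToyCapability is a hypothetical stand-in for ToolCapability, and backing _cancel_requested with threading.Event is an assumption; only the names check_cancel(), cancel(), execute(), and _cancel_requested come from the text above.

```python
import threading
from typing import Callable, List


class CapabilityCancelledError(Exception):
    """Stand-in; the documented class extends CapabilityTransientError."""
    default_retriable = False

    def __init__(self, capability_name: str):
        super().__init__(f"{capability_name}: cancelled by operator")
        self.capability_name = capability_name


class ToyCapability:
    """Hypothetical capability that polls for cancellation between work units."""
    name = "toy"

    def __init__(self) -> None:
        # Assumption: the flag is an Event so cancel() is safe from another thread.
        self._cancel_requested = threading.Event()

    def cancel(self) -> None:
        self._cancel_requested.set()

    def check_cancel(self) -> None:
        if self._cancel_requested.is_set():
            raise CapabilityCancelledError(self.name)

    def execute(self, chunks: List[str], on_progress: Callable[[str], None]) -> None:
        for chunk in chunks:
            self.check_cancel()  # cheap poll before each unit of work
            on_progress(chunk.upper())


# Usage: the "operator" cancels after two chunks have been processed.
cap = ToyCapability()
processed: List[str] = []


def on_progress(item: str) -> None:
    processed.append(item)
    if len(processed) == 2:
        cap.cancel()


try:
    cap.execute(["a", "b", "c", "d"], on_progress)
    outcome = "completed"
except CapabilityCancelledError as err:
    outcome = str(err)
```

Cancellation only takes effect at the next check_cancel() call, so how quickly a job stops depends on how often execute() polls.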


WorkerOOMError


def WorkerOOMError(
    capability_name:str, process_returncode:Optional=None, message:Optional=None
):

The worker subprocess died with a kill-signal during an active execute call.

CR-7 Track A — substrate-side OOM detection: when an HTTP call to the worker faults and the subprocess has died with returncode == -signal.SIGKILL (or the platform equivalent), the substrate raises this. The kernel OOM-killer is the most common cause of SIGKILL during normal execute paths, so the substrate treats SIGKILL-during-call as “assume OOM” and surfaces a typed resource error for the reactive retry path.

resource_shortfall is None for Track A — the substrate only saw “worker died from kill-signal” and has no per-resource needed/available numbers. Track B (per SG-47’s sub-task: capability-side wrapping of torch.cuda.OutOfMemoryError et al.) raises CapabilityResourceError directly with a populated ResourceShortfall because the capability had the context. Both land at the same except CapabilityResourceError site in CR-7’s reactive retry loop.

process_returncode carries the observed exit code for debugging / classification (e.g. operators can distinguish kernel-OOM SIGKILL from other signals if they read it). Defaults to None for callers that don’t have it on hand.
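
The Track A check can be sketched as follows, assuming POSIX semantics where a process killed by signal N reports a returncode of -N. WorkerOOMError here is a local stand-in, its default message wording is invented for the example, and oom_error_for_dead_worker is a hypothetical helper name.

```python
import signal
from typing import Optional

# POSIX assumption: SIGKILL is signal 9. The fallback only keeps the sketch
# importable on platforms where signal.SIGKILL is not defined.
_SIGKILL = int(getattr(signal, "SIGKILL", 9))


class WorkerOOMError(Exception):
    """Stand-in; the documented class extends CapabilityResourceError."""

    def __init__(self, capability_name: str,
                 process_returncode: Optional[int] = None,
                 message: Optional[str] = None):
        if message is None:
            # Illustrative wording, not the library's actual message.
            message = (f"{capability_name}: worker killed "
                       f"(returncode={process_returncode})")
        super().__init__(message)
        self.capability_name = capability_name
        self.process_returncode = process_returncode
        self.resource_shortfall = None  # Track A has no needed/available numbers


def oom_error_for_dead_worker(capability_name: str,
                              returncode: Optional[int]) -> Optional[WorkerOOMError]:
    """Return a WorkerOOMError if the worker died from SIGKILL, else None."""
    if returncode == -_SIGKILL:
        return WorkerOOMError(capability_name, process_returncode=returncode)
    return None


killed = oom_error_for_dead_worker("whisper", -9)    # SIGKILL: assume OOM
terminated = oom_error_for_dead_worker("whisper", -15)  # SIGTERM: not treated as OOM
still_running = oom_error_for_dead_worker("whisper", None)
```

Treating every SIGKILL as OOM is a heuristic, which is why the observed returncode is kept on the exception for operators who need to tell the cases apart.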

CapabilityConfigError (reparented from utils.validation per CR-5)

Originally defined in utils/validation.py by SG-8 as a ValueError subclass. CR-5 reparents it under CapabilityInputError and uses fields_invalid as the canonical attribute name, consistent with the rest of the input-error hierarchy. Since SG-48 dropped CapabilityInputError’s ValueError base, except ValueError: no longer catches it; catch CapabilityConfigError (or the broader CapabilityInputError) instead.


CapabilityConfigError


def CapabilityConfigError(
    message:str, # Human-readable description
    fields_invalid:Optional=None, # Canonical: list of bad config keys
    config_class_name:str='', # Dataclass / capability name for the schema
):

Unknown / invalid keys in a config dict against a capability’s config schema.

Reparented from cjm_substrate.utils.validation (Wave 2 / SG-8) under CR-5. As a CapabilityInputError subclass it is no longer a ValueError (SG-48 dropped that base). config_class_name is the dataclass / capability name whose schema was violated.
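
A sketch of the kind of check that could raise this error, assuming the schema is a dataclass. CapabilityConfigError is a local stand-in with the documented attributes; WhisperConfig and validate_config_keys are hypothetical and are not the validation helper shipped in utils.validation.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Optional


class CapabilityConfigError(Exception):
    """Stand-in; the documented class extends CapabilityInputError."""

    def __init__(self, message: str,
                 fields_invalid: Optional[List[str]] = None,
                 config_class_name: str = ""):
        super().__init__(message)
        self.fields_invalid = fields_invalid
        self.config_class_name = config_class_name


@dataclass
class WhisperConfig:
    """Hypothetical config schema."""
    model: str = "base"
    language: str = "en"


def validate_config_keys(config: Dict[str, Any], schema: type) -> None:
    """Raise CapabilityConfigError naming every key the schema does not define."""
    known = {f.name for f in fields(schema)}
    unknown = sorted(key for key in config if key not in known)
    if unknown:
        raise CapabilityConfigError(
            f"unknown config keys for {schema.__name__}: {unknown}",
            fields_invalid=unknown,
            config_class_name=schema.__name__,
        )


validate_config_keys({"model": "large"}, WhisperConfig)  # valid: no exception

try:
    validate_config_keys({"model": "large", "temprature": 0.2, "beams": 5},
                         WhisperConfig)
    caught = None
except CapabilityConfigError as err:
    caught = err
```

Collecting all unknown keys before raising lets a UI highlight every bad field in one pass instead of one per resubmission.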

JobError + ResourceShortfall + TracebackPolicy

When a capability job fails, the JobQueue (CR-6) records a JobError summary on the completed Job. The summary captures everything a frontend / operator needs to understand and (optionally) retry the failure without re-running the capability:

  • category lets UI decide retry button affordances.
  • retriable carries the substrate’s policy on whether to auto-retry.
  • original_exc_repr + optional traceback give post-mortem context.
  • fields_invalid / resource_shortfall are category-specific structured data.

TracebackPolicy controls how much detail the substrate records. Default FULL is what dev mode wants; REPR_ONLY and NONE are future opt-outs for security-sensitive multi-user deployments.
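
The sketch below illustrates how a policy might gate recorded detail. Only the FULL member and its value 'full' appear on this page; the REPR_ONLY and NONE values, the record_detail helper, and the choice to keep the message under REPR_ONLY are assumptions. The NONE behavior (no traceback, empty message, repr kept) follows the regression tests further down.

```python
import traceback as tb
from enum import Enum
from typing import Dict, Optional


class TracebackPolicy(Enum):
    FULL = "full"
    REPR_ONLY = "repr_only"  # assumed value
    NONE = "none"            # assumed value


def record_detail(exc: BaseException,
                  policy: TracebackPolicy) -> Dict[str, Optional[str]]:
    """Hypothetical helper returning the detail fields a JobError would carry."""
    detail: Dict[str, Optional[str]] = {
        "original_exc_repr": repr(exc),  # always kept as a debug breadcrumb
        "message": str(exc),
        "traceback": None,
    }
    if policy is TracebackPolicy.FULL:
        detail["traceback"] = "".join(
            tb.format_exception(type(exc), exc, exc.__traceback__)
        )
    elif policy is TracebackPolicy.NONE:
        detail["message"] = ""
    # REPR_ONLY (assumed): keep message and repr, drop the traceback.
    return detail


try:
    raise ValueError("secret")
except ValueError as exc:
    full = record_detail(exc, TracebackPolicy.FULL)
    repr_only = record_detail(exc, TracebackPolicy.REPR_ONLY)
    none = record_detail(exc, TracebackPolicy.NONE)
```

Note that under this sketch the repr still contains the exception arguments, so a deployment that must hide message content entirely would need to scrub original_exc_repr as well.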


JobError


def JobError(
    category:Literal, message:str, retriable:bool, original_exc_repr:str, traceback:Optional=None,
    retry_after_seconds:Optional=None, fields_invalid:Optional=None, resource_shortfall:Optional=None,
    capability_name:Optional=None, capability_instance_id:Optional=None, occurred_at:Optional=None
)->None:

Structured failure summary recorded on a completed Job.

Populated by the JobQueue when a capability execution fails (CR-6 owns the population logic; CR-5 owns the shape). Sufficient for UI to render a failure card + retry affordance without re-running the capability.


TracebackPolicy


def TracebackPolicy(
    *args, **kwds
):

How much exception detail the substrate records on a JobError.


ResourceShortfall


def ResourceShortfall(
    resource:Literal, needed:float, available:float
)->None:

Quantitative gap between what a capability needed and what was available.

Default classification of bare Python exceptions

Capability authors will gradually migrate to raising CapabilityError subclasses (SG-47 cascade). Until then, the JobQueue still needs to classify bare ValueError / TimeoutError / etc. into one of the four categories so retry policy is correct from day one.

The mapping walks the exception’s __mro__ against a substrate-provided lookup. First MRO ancestor that matches wins. Default for everything else is fatal — conservative: don’t auto-retry an exception we can’t classify.
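
The MRO walk can be sketched in a few lines. The table entries below are the ones exercised by the regression tests on this page; the real _BARE_EXCEPTION_CATEGORY_MAP may contain more, and classify_bare and BadSetting are hypothetical names for this example.

```python
from typing import Dict, Type

# Entries taken from the regression tests; the substrate's table may be larger.
_BARE_MAP: Dict[Type[BaseException], str] = {
    ValueError: "user_input",
    TypeError: "user_input",
    FileNotFoundError: "user_input",
    TimeoutError: "transient",
    ConnectionError: "transient",
    MemoryError: "resource",
}


def classify_bare(exc: BaseException) -> str:
    """Return the category of the first MRO ancestor found in the table."""
    for ancestor in type(exc).__mro__:
        if ancestor in _BARE_MAP:
            return _BARE_MAP[ancestor]
    return "fatal"  # conservative default: do not auto-retry the unknown


class BadSetting(ValueError):
    """Hypothetical user-defined subclass with no table entry of its own."""


categories = {
    "BadSetting": classify_bare(BadSetting("x")),                  # via ValueError
    "ConnectionResetError": classify_bare(ConnectionResetError()),  # via ConnectionError
    "FileNotFoundError": classify_bare(FileNotFoundError("f")),    # direct entry
    "KeyError": classify_bare(KeyError("k")),                      # no ancestor listed
}
```

Walking the MRO in order means the most specific listed ancestor wins, so FileNotFoundError resolves through its own entry before the walk ever reaches OSError.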


classify_exception


def classify_exception(
    exc:BaseException, # The exception to classify
)->Literal['user_input', 'transient', 'resource', 'fatal']: # Category

Return the substrate category for any exception.

CapabilityError subclasses report their own declared category. Bare Python exceptions are mapped via __mro__ walk against _BARE_EXCEPTION_CATEGORY_MAP; the first ancestor in the table wins. Unrecognized exceptions classify as fatal (don’t auto-retry the unknown).


map_bare_exception_to_job_error


def map_bare_exception_to_job_error(
    exc:BaseException, # The raised exception
    capability_name:Optional=None, # Name of the capability that raised
    capability_instance_id:Optional=None, # Per CR-10
    traceback_policy:TracebackPolicy=<TracebackPolicy.FULL: 'full'>, # How much detail to record
    occurred_at:Optional=None, # Override; defaults to datetime.now(timezone.utc)
)->JobError:

Convert any exception into a structured JobError.

CapabilityError subclasses contribute their category-specific structured data (fields_invalid for input errors, resource_shortfall for resource errors, retry_after_seconds for transient errors). Bare exceptions get the default category-based retriable flag and no structured side-channel.

Regression tests

These exercises pin the MRO discipline, the CapabilityConfigError reparenting, and the default classification. The MRO assertions are particularly load-bearing: a future refactor that accidentally reintroduces a ValueError base anywhere in the CapabilityError tree would let except ValueError: catch capability errors, which we explicitly do not want.

# MRO discipline: capability errors do NOT extend ValueError (SG-48 dropped
# the CapabilityInputError ValueError base).
input_err = CapabilityInputError("bad", fields_invalid=["foo"])
assert not isinstance(input_err, ValueError), "SG-48 dropped the ValueError base"
assert isinstance(input_err, CapabilityError)
assert isinstance(input_err, Exception)
assert input_err.category == 'user_input'
assert input_err.default_retriable is True
assert input_err.fields_invalid == ["foo"]

transient_err = CapabilityTransientError("slow", retry_after_seconds=5.0)
assert not isinstance(transient_err, ValueError), \
    "CapabilityTransientError must NOT inherit ValueError (semantic discipline)"
assert isinstance(transient_err, CapabilityError)
assert transient_err.category == 'transient'
assert transient_err.retry_after_seconds == 5.0

resource_err = CapabilityResourceError(
    "oom",
    resource_shortfall=ResourceShortfall(resource='gpu_vram_mb', needed=8000, available=4000),
)
assert not isinstance(resource_err, ValueError)
assert resource_err.category == 'resource'
assert resource_err.resource_shortfall.needed == 8000

fatal_err = CapabilityFatalError("crashed")
assert not isinstance(fatal_err, ValueError)
assert fatal_err.category == 'fatal'
assert fatal_err.default_retriable is False

print("✓ MRO discipline: no capability error extends ValueError")
✓ MRO discipline: no capability error extends ValueError
# Substrate-side typed exceptions anchor under the correct category.
disabled = CapabilityDisabledError("whisper")
assert isinstance(disabled, CapabilityInputError)
assert not isinstance(disabled, ValueError), "SG-48 dropped the ValueError base"
assert disabled.category == 'user_input'
assert disabled.capability_name == "whisper"

not_loaded = CapabilityNotLoadedError("whisper")
assert isinstance(not_loaded, CapabilityFatalError)
assert not isinstance(not_loaded, ValueError), \
    "CapabilityNotLoadedError must NOT be catchable as ValueError (it's a fatal bug)"
assert not_loaded.category == 'fatal'

timeout = CapabilityTimeoutError("whisper", timeout_seconds=30.0, retry_after_seconds=60.0)
assert isinstance(timeout, CapabilityTransientError)
assert not isinstance(timeout, ValueError)
assert timeout.category == 'transient'
assert timeout.timeout_seconds == 30.0
assert timeout.retry_after_seconds == 60.0

# CR-4: CapabilityCancelledError extends CapabilityTransientError but is non-retriable
# (deliberate operator action — substrate should not auto-retry cancelled jobs).
cancelled = CapabilityCancelledError("whisper")
assert isinstance(cancelled, CapabilityTransientError)
assert isinstance(cancelled, CapabilityError)
assert not isinstance(cancelled, ValueError), \
    "CapabilityCancelledError must NOT be catchable as ValueError (it's a control-flow signal, not a value error)"
assert cancelled.category == 'transient', "category=transient: cancellation is in-principle re-runnable"
assert cancelled.default_retriable is False, \
    "default_retriable=False: substrate must not auto-retry operator-cancelled jobs"
assert cancelled.capability_name == "whisper"
assert "cancelled by operator" in str(cancelled)

# CR-7 Track A: WorkerOOMError extends CapabilityResourceError with default_retriable=True
# inherited; carries process_returncode for operator debugging; no ResourceShortfall.
oom = WorkerOOMError("whisper", process_returncode=-9)
assert isinstance(oom, CapabilityResourceError), "must catch under CapabilityResourceError"
assert isinstance(oom, CapabilityError)
assert not isinstance(oom, ValueError), "resource errors are not ValueErrors"
assert oom.category == 'resource', "CR-7 reactive retry dispatches on category=resource"
assert oom.default_retriable is True, \
    "default_retriable=True: OOM is retriable after eviction (the whole point of CR-7)"
assert oom.capability_name == "whisper"
assert oom.process_returncode == -9
assert oom.resource_shortfall is None, \
    "Track A: substrate doesn't know needed/available; only Track B (capability-side raise) does"
assert "whisper" in str(oom)
assert "returncode=-9" in str(oom)

# CR-7: WorkerOOMError catches at the CapabilityResourceError site (the shared
# CR-7 reactive retry catch-point). Track A + Track B converge here.
def fake_track_a_raise():
    raise WorkerOOMError("voxtral", process_returncode=-9)

def fake_track_b_raise():
    raise CapabilityResourceError(
        "voxtral: CUDA OOM",
        resource_shortfall=ResourceShortfall(
            resource='gpu_vram_mb', needed=24000, available=8000,
        ),
    )

for raiser in (fake_track_a_raise, fake_track_b_raise):
    caught = False
    try:
        raiser()
    except CapabilityResourceError:
        caught = True
    assert caught, f"{raiser.__name__} must catch under CapabilityResourceError"

# Custom message override path
oom_custom = WorkerOOMError("whisper", message="custom diagnostic")
assert str(oom_custom) == "custom diagnostic"
assert oom_custom.capability_name == "whisper"
assert oom_custom.process_returncode is None

print("✓ Substrate-side typed exceptions anchor under the right category")
# CapabilityConfigError reparenting under CapabilityInputError; fields_invalid canonical.
err = CapabilityConfigError(
    "unknown keys",
    fields_invalid=["foo", "bar"],
    config_class_name="WhisperConfig",
)
assert isinstance(err, CapabilityInputError)
assert not isinstance(err, ValueError), "SG-48 dropped the ValueError base"
assert err.fields_invalid == ["foo", "bar"]
assert err.config_class_name == "WhisperConfig"

print("✓ CapabilityConfigError reparenting works")
✓ CapabilityConfigError reparenting works
# Default classification of bare Python exceptions.
assert classify_exception(ValueError("bad")) == 'user_input'
assert classify_exception(TypeError("bad")) == 'user_input'
assert classify_exception(FileNotFoundError("missing")) == 'user_input'
assert classify_exception(TimeoutError("slow")) == 'transient'
assert classify_exception(ConnectionError("net")) == 'transient'
assert classify_exception(MemoryError("oom")) == 'resource'
assert classify_exception(RuntimeError("unknown")) == 'fatal'

# CapabilityError subclasses report their own declared category rather than
# going through the bare-exception table. CapabilityInputError no longer extends
# ValueError (SG-48); its 'user_input' category is the declared value.
assert classify_exception(CapabilityInputError("x")) == 'user_input'
assert classify_exception(CapabilityTransientError("x")) == 'transient'
assert classify_exception(CapabilityResourceError("x")) == 'resource'
assert classify_exception(CapabilityFatalError("x")) == 'fatal'

# CapabilityNotLoadedError is fatal even though no built-in maps to fatal by default.
assert classify_exception(CapabilityNotLoadedError("whisper")) == 'fatal'

print("✓ Default exception classification correct")
✓ Default exception classification correct
# map_bare_exception_to_job_error captures category + retriable + structured data.
try:
    raise CapabilityConfigError("bad config", fields_invalid=["model"])
except Exception as e:
    err = map_bare_exception_to_job_error(e, capability_name="whisper")

assert err.category == 'user_input'
assert err.retriable is True
assert err.fields_invalid == ["model"]
assert err.capability_name == "whisper"
assert err.traceback is not None and "CapabilityConfigError" in err.traceback
assert err.occurred_at is not None
# Python 3.12+ compat: occurred_at must be timezone-aware (datetime.utcnow()
# is deprecated and returns naive datetime; we use datetime.now(timezone.utc)).
assert err.occurred_at.tzinfo is not None, \
    "occurred_at should be timezone-aware (CR-5 Python 3.12+ future-proof form)"

# Resource error: resource_shortfall propagates.
try:
    raise CapabilityResourceError(
        "oom",
        resource_shortfall=ResourceShortfall(resource='gpu_vram_mb', needed=16000, available=8000),
    )
except Exception as e:
    err = map_bare_exception_to_job_error(e)

assert err.category == 'resource'
assert err.retriable is True
assert err.resource_shortfall.needed == 16000

# Bare ValueError gets default user_input + retriable=True.
try:
    raise ValueError("unmapped bare")
except Exception as e:
    err = map_bare_exception_to_job_error(e)

assert err.category == 'user_input'
assert err.retriable is True
assert err.fields_invalid is None  # bare ValueError has no fields_invalid attribute

# Bare RuntimeError gets default fatal + retriable=False.
try:
    raise RuntimeError("unknown")
except Exception as e:
    err = map_bare_exception_to_job_error(e)

assert err.category == 'fatal'
assert err.retriable is False

# TracebackPolicy.NONE suppresses traceback + message.
try:
    raise ValueError("secret")
except Exception as e:
    err = map_bare_exception_to_job_error(e, traceback_policy=TracebackPolicy.NONE)

assert err.traceback is None
assert err.message == ""
assert err.original_exc_repr  # repr is always kept (debug breadcrumb)

print("✓ map_bare_exception_to_job_error preserves structured data + honors TracebackPolicy")
✓ map_bare_exception_to_job_error preserves structured data + honors TracebackPolicy