Cache Paths

Per-(input-content, config) deterministic cache directories for plugin outputs

_sanitize_stem

Replace filesystem-unsafe characters in an input file’s stem so the resulting directory name is portable across Linux / macOS / Windows. A length cap guards against pathologically long source filenames.
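A minimal sketch of what such a sanitizer could look like. The character set, the 80-character cap, the "unnamed" fallback, and the name sanitize_stem_sketch are all assumptions made for illustration; the module's actual _sanitize_stem may differ on each point.

```python
import re

# Assumed character set: reserved on Windows, plus ASCII control characters.
_UNSAFE_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
# Assumed cap; the real module's limit is not stated in this document.
_MAX_STEM_LENGTH = 80


def sanitize_stem_sketch(stem: str, max_length: int = _MAX_STEM_LENGTH) -> str:
    """Replace unsafe characters with underscores, then cap the length."""
    cleaned = _UNSAFE_CHARS.sub("_", stem)
    # Truncate first, then trim: Windows rejects names ending in a dot or space,
    # and truncation could otherwise leave one at the end.
    cleaned = cleaned[:max_length].strip(" .")
    # Never return an empty path component.
    return cleaned or "unnamed"
```

Replacing rather than dropping unsafe characters keeps the directory name recognizably close to the source filename, which helps when browsing the cache by hand.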

Stat-cache for content hashes

Computing a SHA-256 over a multi-hour podcast WAV (~1–2 GB) takes a few seconds even with hash_file’s streaming reader. Rehashing on every cache lookup would be untenable for chained-plugin workflows, where the same input might be referenced dozens of times in a single workflow run.
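For context, a streaming reader of the kind mentioned above can be sketched as follows. This is not the real hash_file; the name hash_file_sketch and the 1 MiB chunk size are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def hash_file_sketch(path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    hasher = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF),
        # so memory use stays bounded by chunk_size regardless of file size.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Streaming bounds memory but not time: every byte still has to be read and hashed, which is why a multi-gigabyte input stays expensive.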

The substrate maintains a small SQLite stat-cache at <substrate_data_dir>/input_hash_cache.db mapping (absolute_path, mtime_ns, size) → content_hash. Lookups that hit the cache return in microseconds; cold lookups compute the hash once and write it back.

The cache uses a module-level threading.Lock to serialize SQLite writes (SQLite handles concurrent reads fine, but writes from multiple threads need coordination at the Python layer to avoid “database is locked” errors). Reads are still fast because SQLite’s WAL mode (set on connect) allows concurrent readers alongside a single writer.

mtime_ns is preferred over mtime (float seconds): nanosecond precision distinguishes two writes in quick succession that a 1-second mtime resolution would conflate. size is a secondary check, a paranoid defense against filesystem mtime-resolution issues.
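The lookup-then-write-back flow described in this section can be sketched as below. The table name, schema, and function names are hypothetical, and the sketch opens a connection per call for simplicity; the real module may manage connections and schema differently.

```python
import hashlib
import sqlite3
import threading
from pathlib import Path

_WRITE_LOCK = threading.Lock()  # serializes writers at the Python layer


def _connect(db_path) -> sqlite3.Connection:
    conn = sqlite3.connect(str(db_path))
    conn.execute("PRAGMA journal_mode=WAL")  # concurrent readers, single writer
    with _WRITE_LOCK:
        # Hypothetical schema: one row per path, replaced when the file changes.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS input_hashes ("
            "path TEXT PRIMARY KEY, mtime_ns INTEGER, size INTEGER, content_hash TEXT)"
        )
        conn.commit()
    return conn


def cached_content_hash(input_path, db_path) -> str:
    """Return the file's SHA-256, reusing a stored value if mtime_ns and size match."""
    path = Path(input_path).resolve()
    st = path.stat()
    key = (str(path), st.st_mtime_ns, st.st_size)
    conn = _connect(db_path)
    try:
        row = conn.execute(
            "SELECT content_hash FROM input_hashes "
            "WHERE path = ? AND mtime_ns = ? AND size = ?",
            key,
        ).fetchone()
        if row is not None:
            return row[0]  # warm hit: no file read at all
        # Cold lookup: stream the file once, then write the result back.
        hasher = hashlib.sha256()
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                hasher.update(chunk)
        digest = hasher.hexdigest()
        with _WRITE_LOCK:
            conn.execute(
                "INSERT OR REPLACE INTO input_hashes "
                "(path, mtime_ns, size, content_hash) VALUES (?, ?, ?, ?)",
                (*key, digest),
            )
            conn.commit()
        return digest
    finally:
        conn.close()
```

Keying the row on path alone while matching on all three fields means a modified file overwrites its old row instead of leaving it behind, so the table stays at one row per input.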

cache_dir_for_config

The main entry point. Returns (and optionally creates) a deterministic per-(input-content, config) cache directory.

Plugins use this in lieu of hand-rolled <plugin_data_dir>/<action>/<stem> output-path derivation. The user’s ffmpeg segmentation bug (segment_audio with different max_segment_duration values overwriting each other in the same directory) is the canonical motivating example — fixing it requires the config to enter the cache key, which this helper makes mandatory by construction.


cache_dir_for_config


def cache_dir_for_config(
    plugin_data_dir:Union, # The plugin's own data subdirectory (typically <cfg.plugin_data_dir>/<plugin_name>)
    input_path:Union, # The input file the plugin operates on
    action:str, # The plugin action name (e.g., "segment_audio", "convert", "execute")
    config_dict:Dict, # The plugin's effective config for this action
    input_hash_length:int=6, # Truncation length for the input content hash in the directory name
    config_hash_length:int=12, # Truncation length for the config hash in the directory name
    create:bool=True, # Auto-create the directory (parents=True, exist_ok=True)
    hash_input_content:bool=True, # If False, hash str(input_path) instead (e.g., URL inputs)
    skip_input_cache:bool=False, # If True, bypass the stat-cache (always recompute content hash)
)->Path: # The deterministic cache directory path

Return (and optionally create) a per-(input-content, config) cache directory.

Path layout::

<plugin_data_dir>/<action>/<sanitized-stem>/<input_hash[:N]>_<config_hash[:M]>/

The same (input_content, action, config_dict) always resolves to the same path; any change to input content OR config produces a different path. This means:

  1. Different configs go to different directories — no silent overwrite.
  2. Stale artifacts cannot accumulate inside a directory: each unique (input_content, config) tuple has its OWN directory, so outputs from one config never mix with another’s.
  3. For chained plugin sequences, upstream config changes propagate through content changes: if plugin A’s output content depends on A’s config and plugin B reads that output, B’s cache key automatically reflects A’s config indirectly.

hash_input_content=False switches to hashing the string form of input_path instead of file content — for plugins whose “input” is a URL, a database row ID, or another non-file identifier. Sequence chaining via content propagation only works for true file inputs.

skip_input_cache=True recomputes the input content hash even if the stat-cache has a record. Useful for plugins that just wrote the input file and want to record its canonical hash without stale-cache risk.

Raises FileNotFoundError if input_path doesn’t exist and hash_input_content=True. Raises OSError on directory-create failure when create=True.
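The path derivation can be illustrated with a simplified sketch. How the real helper hashes config_dict is not stated here; canonical JSON with sorted keys is an assumption, as are the names config_hash_sketch and cache_dir_sketch. The sketch also reads the whole file in one call, uses the raw stem without sanitizing it, and bypasses the stat-cache.

```python
import hashlib
import json
from pathlib import Path


def config_hash_sketch(config_dict) -> str:
    """Hash a config via canonical JSON (assumed scheme) so key order is irrelevant."""
    canonical = json.dumps(
        config_dict, sort_keys=True, separators=(",", ":"), default=str
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cache_dir_sketch(
    plugin_data_dir,
    input_path,
    action: str,
    config_dict,
    input_hash_length: int = 6,
    config_hash_length: int = 12,
    create: bool = True,
) -> Path:
    """Build <plugin_data_dir>/<action>/<stem>/<input_hash[:N]>_<config_hash[:M]>/."""
    input_path = Path(input_path)
    # read_bytes() raises FileNotFoundError for a missing input.
    input_hash = hashlib.sha256(input_path.read_bytes()).hexdigest()
    leaf = (
        f"{input_hash[:input_hash_length]}_"
        f"{config_hash_sketch(config_dict)[:config_hash_length]}"
    )
    cache_dir = Path(plugin_data_dir) / action / input_path.stem / leaf
    if create:
        cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir
```

Under this sketch, calling it for the same input with {"max_segment_duration": 30} and then {"max_segment_duration": 60} yields two sibling directories under the same stem, which is the behavior that prevents the segmentation overwrite described earlier.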

list_cache_entries + prune_cache_for_input

Operator-facing affordances for inspecting and cleaning up the cache. The <plugin_data_dir>/<action>/<stem>/ parent contains one directory per unique config variant the plugin has been run with. list_cache_entries enumerates them; prune_cache_for_input deletes them (optionally preserving a specified set).


prune_cache_for_input


def prune_cache_for_input(
    plugin_data_dir:Union, # The plugin's own data subdirectory
    input_path:Union, # The input file whose cache entries to prune
    action:str, # The plugin action name
    keep:Optional=None, # Paths to preserve through the sweep (as returned by list_cache_entries)
    dry_run:bool=False, # If True, return what WOULD be deleted without touching filesystem
)->List: # Paths that were (or would be) deleted

Delete per-config cache directories for (input, action), optionally preserving a keep set.

Pairs with list_cache_entries for inspect-then-prune workflows: list candidates, choose which to keep, then call prune with the keep set. keep=None deletes ALL entries.

dry_run=True returns the would-delete list without touching the filesystem — useful for operator confirmation before destructive ops.

Returns the list of deleted (or would-delete) paths.


list_cache_entries


def list_cache_entries(
    plugin_data_dir:Union, # The plugin's own data subdirectory
    input_path:Union, # The input file whose cache entries to list
    action:str, # The plugin action name
)->List: # All config-hash directories for this (input, action)

Enumerate all per-config cache directories for a given (input, action).

Returns the paths of every <input_hash>_<config_hash> directory under <plugin_data_dir>/<action>/<sanitized-stem>/. Each entry corresponds to a unique (input_content, config) tuple — operators can inspect their contents, diff them, or pass selected ones to prune_cache_for_input to keep them through a sweep.

Returns an empty list if the parent directory doesn’t exist (plugin never ran this action for this input).
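The inspect-then-prune workflow can be sketched against a bare parent directory. The functions below are simplified stand-ins with hypothetical names: they take the <plugin_data_dir>/<action>/<sanitized-stem>/ parent directly instead of deriving it from (plugin_data_dir, input_path, action) as the real helpers do.

```python
import shutil
from pathlib import Path


def list_entries_sketch(parent):
    """Return the per-config subdirectories of parent, or [] if parent is missing."""
    parent = Path(parent)
    if not parent.is_dir():
        return []
    return sorted(p for p in parent.iterdir() if p.is_dir())


def prune_entries_sketch(parent, keep=None, dry_run: bool = False):
    """Delete every entry not in keep; with dry_run=True, only report them."""
    # Resolve both sides so relative and absolute spellings of a path compare equal.
    keep_set = {Path(p).resolve() for p in (keep or [])}
    doomed = [p for p in list_entries_sketch(parent) if p.resolve() not in keep_set]
    if not dry_run:
        for entry in doomed:
            shutil.rmtree(entry)
    return doomed
```

A typical sequence is to list the entries, run the prune with dry_run=True to confirm what would go, and then repeat the call with the chosen keep set and dry_run=False.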

Tests

Exercise the helpers end-to-end against a tempdir + tempfile so the tests for the cache_paths module don’t depend on any specific plugin’s data dir.