Content Hashing Utilities

Shared cryptographic hashing primitives for content integrity verification

hash_bytes

Computes a cryptographic hash of byte content, returning a self-describing string in "algo:hexdigest" format. This format embeds the algorithm name, making hashes forward-compatible if the algorithm changes.


hash_bytes


def hash_bytes(
    content:bytes, # Byte content to hash
    algo:str='sha256', # Hash algorithm name (e.g., "sha256", "sha3_256")
)->str: # Hash string in "algo:hexdigest" format

Compute a hash of byte content.

result = hash_bytes(b"hello world")
print(f"hash_bytes result: {result}")

# Check format
algo, digest = result.split(":", 1)
assert algo == "sha256"
assert len(digest) == 64  # SHA-256 produces 64 hex chars

# Deterministic
assert hash_bytes(b"hello world") == hash_bytes(b"hello world")

# Different content produces different hash
assert hash_bytes(b"hello world") != hash_bytes(b"hello World")

print("hash_bytes tests passed")
hash_bytes result: sha256:b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9
hash_bytes tests passed
# Custom algorithm
sha512_result = hash_bytes(b"test", algo="sha512")
print(f"SHA-512 result: {sha512_result[:30]}...")
assert sha512_result.startswith("sha512:")
assert len(sha512_result.split(":")[1]) == 128  # SHA-512 produces 128 hex chars

print("Custom algorithm test passed")
SHA-512 result: sha512:ee26b0dd4af7e749aa1a8ee...
Custom algorithm test passed

hash_file

Stream-hashes a file without loading it entirely into memory. Uses chunked reads suitable for large files (audio, video, etc.).


hash_file


def hash_file(
    path:Union, # Path to file to hash
    algo:str='sha256', # Hash algorithm name
    chunk_size:int=8192, # Read chunk size in bytes
)->str: # Hash string in "algo:hexdigest" format

Stream-hash a file without loading it entirely into memory.

import tempfile
import os

# Create a temp file with known content
with tempfile.NamedTemporaryFile(delete=False, mode='wb') as tmp:
    tmp.write(b"hello world")
    tmp_path = tmp.name

# Hash the file
file_hash = hash_file(tmp_path)
print(f"hash_file result: {file_hash}")

# Should match hash_bytes of the same content
assert file_hash == hash_bytes(b"hello world")

# Test with Path object
assert hash_file(Path(tmp_path)) == file_hash

# Cleanup
os.unlink(tmp_path)
print("hash_file tests passed")
hash_file result: sha256:b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9
hash_file tests passed

verify_hash

Verifies byte content against an expected hash string. Automatically extracts the algorithm from the "algo:hexdigest" format.


verify_hash


def verify_hash(
    content:bytes, # Content to verify
    expected:str, # Expected hash in "algo:hexdigest" format
)->bool: # True if content matches expected hash

Verify content against an expected hash string.

original = b"hello world"
h = hash_bytes(original)

# Matching content
assert verify_hash(original, h) == True

# Modified content
assert verify_hash(b"hello World", h) == False

# Works with different algorithms
h_sha512 = hash_bytes(original, algo="sha512")
assert verify_hash(original, h_sha512) == True
assert verify_hash(b"tampered", h_sha512) == False

print("verify_hash tests passed")
verify_hash tests passed

hash_dict_canonical


def hash_dict_canonical(
    data:Optional, # Dict to hash (or None — treated as {})
    algo:str='sha256', # Hash algorithm name
)->str: # Hash string in "algo:hexdigest" format

Hash a dict via canonical JSON encoding.

Canonicalization: json.dumps(data, sort_keys=True, separators=(",", ":")). Sorted keys eliminate dict-insertion-order variance; minimal separators eliminate whitespace variance. Result is deterministic across Python versions and machines.

# Insertion-order independence — sorted-keys canonicalization eliminates variance
d_a = {"model": "base", "device": "cuda"}
d_b = {"device": "cuda", "model": "base"}
assert hash_dict_canonical(d_a) == hash_dict_canonical(d_b)

# Different values produce different hashes
d_c = {"model": "base", "device": "cpu"}
assert hash_dict_canonical(d_a) != hash_dict_canonical(d_c)

# None and {} hash identically (canonical-empty)
assert hash_dict_canonical(None) == hash_dict_canonical({})

# Algorithm-tagged format
result = hash_dict_canonical(d_a)
assert result.startswith("sha256:")
print(f"hash_dict_canonical(d_a) = {result}")

# Nested dicts also canonicalize stably
nested_a = {"outer": {"a": 1, "b": 2}, "trailing": True}
nested_b = {"trailing": True, "outer": {"b": 2, "a": 1}}
assert hash_dict_canonical(nested_a) == hash_dict_canonical(nested_b)

print("hash_dict_canonical tests passed")

hash_dict_canonical

Hashes a dict via deterministic JSON canonicalization (sorted keys, no whitespace) before delegating to hash_bytes. Two callers in the substrate:

  • CR-8 compute_config_schema_hash — hashes a plugin’s JSON Schema for drift detection (manifest-stored vs live worker).
  • CR-7 compute_config_hash — hashes a plugin instance’s effective config to key EmpiricalResourceRecord storage by (instance_id, config_hash).

Same canonicalization rules → same stability guarantees across Python versions, dict insertion orders, and machines. None is treated as {} so a missing schema/config still produces a deterministic hash rather than raising.