Standardized SQLite storage for transcription results with content hashing
TranscriptionRow
A dataclass representing a single row in the standardized transcriptions table. This provides a type-safe way to work with stored transcription results.
Row: job_id=job_abc123, text=Hello world...
Audio hash: sha256:aaaaaaaaaaaaa...
Text hash: sha256:bbbbbbbbbbbbb...
TranscriptionStorage
Standardized SQLite storage that all transcription plugins should use. Defines the canonical schema for the transcriptions table with content hash columns for traceability.
Schema:
CREATETABLEIFNOTEXISTS transcriptions (idINTEGERPRIMARYKEY AUTOINCREMENT, job_id TEXT UNIQUENOTNULL, audio_path TEXT NOTNULL, audio_hash TEXT NOTNULL, text TEXT NOTNULL, text_hash TEXT NOTNULL, segments JSON, metadata JSON, created_at REALNOTNULL);
The audio_hash and text_hash columns use the self-describing "algo:hexdigest" format (e.g., "sha256:a3f2b8..."), enabling downstream consumers to verify content integrity.
# Save a transcription result with hashestest_text ="Laying Plans Sun Tzu said, The art of war is of vital importance to the state."text_hash = hash_bytes(test_text.encode())audio_hash ="sha256:"+"e3b0c44298"*6+"e3b0"# Simulated audio hashstorage.save( job_id="job_test_001", audio_path="/tmp/test_audio.mp3", audio_hash=audio_hash, text=test_text, text_hash=text_hash, segments=[{"start": 0.0, "end": 5.0, "text": test_text}], metadata={"model": "whisper-large-v3", "language": "en"})print(f"Saved job_test_001")print(f"Text hash: {text_hash}")
Saved job_test_001
Text hash: sha256:83efd1674de9fcf20e5c2edacf9246f7f34ad04bf07ddcb2b4e2765269e1edd1
Retrieved: job_test_001
Text: Laying Plans Sun Tzu said, The art of wa...
Audio hash: sha256:e3b0c44298e3b0c44298e3b...
Text hash: sha256:83efd1674de9fcf20e5c2ed...
Created at: 1770425259.7641876
# Save another and test list_jobsstorage.save( job_id="job_test_002", audio_path="/tmp/test_audio_2.mp3", audio_hash="sha256:"+"f"*64, text="Second transcription.", text_hash=hash_bytes(b"Second transcription."))jobs = storage.list_jobs()assertlen(jobs) ==2# Newest firstassert jobs[0].job_id =="job_test_002"assert jobs[1].job_id =="job_test_001"print(f"list_jobs returned {len(jobs)} rows (newest first): {[j.job_id for j in jobs]}")
list_jobs returned 2 rows (newest first): ['job_test_002', 'job_test_001']
# Test text verificationassert storage.verify_text("job_test_001") ==Trueprint("verify_text with unchanged text: True")# Tamper with text directly in DBwith sqlite3.connect(tmp_db.name) as con: con.execute("UPDATE transcriptions SET text = 'TAMPERED' WHERE job_id = 'job_test_001'")assert storage.verify_text("job_test_001") ==Falseprint("verify_text after tampering: False")# Missing job returns Noneassert storage.verify_text("nonexistent") isNoneprint("verify_text for missing job: None")
verify_text with unchanged text: True
verify_text after tampering: False
verify_text for missing job: None