Skip to content

ADR-005 — Real persisted embeddings with durable symbol identity

Date: 29/03/2026 Status: Accepted

Context

Issue #1 asks codira to move from its current placeholder local embedding backend to real persisted embeddings while preserving the explicit manual indexing model.

The repository already persists semantic artifacts in SQLite and already uses explicit invalidation when the embedding backend version changes. That is not yet sufficient for efficient reuse within a changed file:

  • unchanged files can already reuse all stored embeddings
  • changed files currently force regeneration of all symbol embeddings owned by the file
  • small local edits therefore discard many still-valid symbol embeddings

The root cause is architectural rather than model-specific:

  • embedding rows are tied to transient symbol-row identifiers
  • the analyzer border contract does not expose a durable symbol identity
  • the indexer cannot diff old and new semantic units within one changed file

The issue therefore requires two linked changes rather than only a backend swap:

  • replace the placeholder embedding backend with a real local backend
  • extend analyzer output with durable symbol identity so symbol-level reuse becomes deterministic

The solution must still preserve these invariants:

  • indexing remains explicit through codira index
  • no background indexing or query-time mutation is introduced
  • embeddings are invalidated when semantic input changes
  • embeddings are invalidated when backend or backend-version metadata changes
  • contributor-facing contracts, docs, and tests remain first-class

Decision

Adopt a real persisted embedding path together with analyzer-owned durable symbol identity.

Real Embedding Backend

codira will replace the current placeholder hash backend with a real local embedding backend.

The active backend contract will continue to expose explicit metadata used for deterministic invalidation:

  • backend name
  • backend version
  • embedding dimension
  • any fixed model identity required by the chosen backend contract

The dependency stack and local model provisioning rules will be documented explicitly. Indexing must fail fast when the configured backend cannot be used locally; implicit remote APIs or hidden background downloads are out of scope.

Durable Symbol Identity at the Analyzer Boundary

The analyzer contract will evolve so normalized artifacts carry a stable symbol identity produced by the analyzer itself.

That identity must be:

  • deterministic
  • language-aware
  • independent of transient database row ids
  • independent of source line numbers and parse-node byte offsets
  • stable under unrelated edits elsewhere in the file
  • changed when the symbol's semantic identity changes

This contract extension belongs at the analyzer boundary rather than inside the backend because only analyzers have the language-specific knowledge needed to define symbol sameness correctly.

Symbol-Level Reuse for Changed Files

When a file changes, the indexer will no longer treat the file as an all-or-nothing embedding unit.

Instead, for that file it will:

  • compare the old persisted stable-id set with the new analyzer output
  • delete symbols present only in the old set
  • insert symbols present only in the new set
  • reuse stored vectors for symbols whose stable identity and semantic payload hash are unchanged
  • recompute embeddings only for symbols whose semantic payload changed or whose backend metadata no longer matches

This preserves determinism while avoiding unnecessary regeneration for large files with small edits.

Storage Strategy

The storage schema will persist:

  • stable symbol identity for indexed symbols
  • content hashes for the exact embedding text payloads
  • backend metadata required for deterministic invalidation
  • embedding vectors as binary float32 payloads

The existing explicit indexing model remains unchanged:

  • embeddings are computed only during indexing
  • queries read persisted vectors only
  • no background service is introduced

Consequences

Positive

  • real embeddings become a durable indexed artifact rather than a placeholder semantic channel
  • changed files can preserve unchanged symbol embeddings deterministically
  • symbol disappearance and rename handling become explicit set-diff operations rather than side effects of file-wide replacement
  • the analyzer/backend separation remains intact because the language-aware identity logic lives in analyzer output
  • future analyzers can participate in symbol-level reuse by implementing the same stable-id contract

Negative

  • the analyzer contract becomes stricter and existing analyzers and analyzer tests must be updated
  • schema and persistence logic become more involved than the current file-scoped replacement model
  • dependency management and local model provisioning add operator overhead

Neutral / Trade-offs

  • the first implementation may still delete and recreate non-semantic file-owned rows while preserving embeddings through stable-id reuse
  • stable identity is intentionally semantic rather than source-location-based, so symbol moves within a file do not imply invalidation by themselves
  • later work may further optimize candidate selection or partial persistence, but that is not required to establish the contract

Execution Rules

  • Use the dedicated issue branch issue/1-real-embeddings.
  • Keep the execution ledger current as work lands.
  • Make multiple commits, with at least one commit per phase.
  • Split large phases into smaller atomic commits when needed.
  • Keep tests, docstrings, and documentation in scope for every phase.
  • Merge back to main with a squash commit that closes issue #1.

Phase Ledger

Mark each phase as work lands.

  • [x] Phase 1 — Branch bootstrap and execution scaffold
  • [x] Phase 2 — ADR and execution ledger
  • [x] Phase 3 — Dependency and local-model provisioning updates
  • [x] Phase 4 — Analyzer contract extension for durable symbol identity
  • [x] Phase 5 — Built-in analyzer stable-id implementation
  • [x] Phase 6 — Schema and storage migration for stable symbol reuse
  • [x] Phase 7 — Symbol-level reuse and invalidation in indexing
  • [x] Phase 8 — Real embedding backend integration
  • [x] Phase 9 — Query, explain, and inventory updates
  • [x] Phase 10 — Tests, docs, and merge preparation

Decision

Consequences