ADR-005: Real persisted embeddings with durable symbol identity
Date: 29/03/2026
Status: Accepted
Context
Issue #1 asks codira to move from its current placeholder local
embedding backend to real persisted embeddings while preserving the explicit
manual indexing model.
The repository already persists semantic artifacts in SQLite and already uses explicit invalidation when the embedding backend version changes. That is not yet sufficient for efficient reuse within a changed file:
- unchanged files can already reuse all stored embeddings
- changed files currently force regeneration of all symbol embeddings owned by the file
- small local edits therefore discard many still-valid symbol embeddings
The root cause is architectural rather than model-specific:
- embedding rows are tied to transient symbol-row identifiers
- the analyzer boundary contract does not expose a durable symbol identity
- the indexer cannot diff old and new semantic units within one changed file
The issue therefore requires two linked changes rather than only a backend swap:
- replace the placeholder embedding backend with a real local backend
- extend analyzer output with durable symbol identity so symbol-level reuse becomes deterministic
The solution must still preserve these invariants:
- indexing remains explicit through `codira index`
- no background indexing or query-time mutation is introduced
- embeddings are invalidated when semantic input changes
- embeddings are invalidated when backend or backend-version metadata changes
- contributor-facing contracts, docs, and tests remain first-class
Decision
Adopt a real persisted embedding path together with analyzer-owned durable symbol identity.
Real Embedding Backend
codira will replace the current placeholder hash backend with a real
local embedding backend.
The active backend contract will continue to expose explicit metadata used for deterministic invalidation:
- backend name
- backend version
- embedding dimension
- any fixed model identity required by the chosen backend contract
The dependency stack and local model provisioning rules will be documented explicitly. Indexing must fail fast when the configured backend cannot be used locally; implicit remote APIs or hidden background downloads are out of scope.
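The metadata contract and the fail-fast rule can be illustrated with a minimal sketch. Every name here (`BackendMetadata`, `needs_invalidation`, `require_local_model`, `BackendUnavailableError`) is hypothetical and chosen for illustration; the actual codira backend interface may be shaped differently.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class BackendMetadata:
    """Hypothetical record of the fields used for invalidation."""

    name: str
    version: str
    dimension: int
    model_id: str


def needs_invalidation(stored: Optional[BackendMetadata], active: BackendMetadata) -> bool:
    """A stored vector is reusable only when every metadata field matches."""
    return stored != active


class BackendUnavailableError(RuntimeError):
    """Raised when the configured backend cannot be used locally."""


def require_local_model(model_dir: Path) -> Path:
    """Fail fast: the model must already exist on disk, nothing is downloaded."""
    if not model_dir.is_dir():
        raise BackendUnavailableError(f"local model directory not found: {model_dir}")
    return model_dir
```

Comparing the whole record, rather than only the version, means a change of dimension or model identity also invalidates stored vectors.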
Durable Symbol Identity at the Analyzer Boundary
The analyzer contract will evolve so normalized artifacts carry a stable symbol identity produced by the analyzer itself.
That identity must be:
- deterministic
- language-aware
- independent of transient database row ids
- independent of source line numbers and parse-node byte offsets
- stable under unrelated edits elsewhere in the file
- changed when the symbol's semantic identity changes
This contract extension belongs at the analyzer boundary rather than inside the backend because only analyzers have the language-specific knowledge needed to define symbol sameness correctly.
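As a rough sketch of what such an identity could look like, the function below hashes only semantic components. The choice of components (language, symbol kind, qualified name) and the name `stable_symbol_id` are assumptions for illustration; a real analyzer may need more inputs, for example a module path or an overload signature.

```python
import hashlib


def stable_symbol_id(language: str, kind: str, qualified_name: str) -> str:
    """Derive an id from semantic identity only.

    Row ids, line numbers, and byte offsets are deliberately not inputs,
    so edits elsewhere in the file cannot change the result.
    """
    # The unit separator keeps ("a", "bc") and ("ab", "c") from colliding.
    key = "\x1f".join((language, kind, qualified_name))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# The same symbol yields the same id wherever it sits in the file.
before_edit = stable_symbol_id("python", "function", "pkg.mod.Parser.parse")
after_edit = stable_symbol_id("python", "function", "pkg.mod.Parser.parse")

# Under this scheme a rename is a different symbol.
renamed = stable_symbol_id("python", "function", "pkg.mod.Parser.parse_all")
```

Here `before_edit` and `after_edit` are equal while `renamed` differs, which is the behaviour the list above asks for.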
Symbol-Level Reuse for Changed Files
When a file changes, the indexer will no longer treat the file as an all-or-nothing embedding unit.
Instead, for that file it will:
- compare the old persisted stable-id set with the new analyzer output
- delete symbols present only in the old set
- insert symbols present only in the new set
- reuse stored vectors for symbols whose stable identity and semantic payload hash are unchanged
- recompute embeddings only for symbols whose semantic payload changed or whose backend metadata no longer matches
This preserves determinism while avoiding unnecessary regeneration for large files with small edits.
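The steps above can be sketched as a pure planning function over two mappings. The names `ReusePlan` and `plan_reuse`, and the idea of collapsing backend metadata into a single tag string, are assumptions made for this example rather than the project's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReusePlan:
    delete: set      # stable ids present only in the old set
    insert: set      # stable ids present only in the new set
    reuse: set       # ids whose stored vector can be kept
    recompute: set   # ids whose vector must be regenerated


def plan_reuse(old: dict, new: dict, active_backend: str) -> ReusePlan:
    """Plan symbol-level reuse for one changed file.

    old: stable id -> (payload hash, backend tag) as persisted
    new: stable id -> payload hash from the fresh analyzer output
    """
    shared = old.keys() & new.keys()
    # Reuse requires both an unchanged payload and matching backend metadata.
    reuse = {sid for sid in shared if old[sid] == (new[sid], active_backend)}
    return ReusePlan(
        delete=set(old.keys() - new.keys()),
        insert=set(new.keys() - old.keys()),
        reuse=reuse,
        recompute=shared - reuse,
    )


old_rows = {
    "a": ("h1", "b1"),   # unchanged
    "b": ("h2", "b1"),   # payload edited
    "c": ("h3", "b0"),   # stored under an older backend
    "d": ("h4", "b1"),   # removed from the file
}
new_symbols = {"a": "h1", "b": "h2-edited", "c": "h3", "e": "h5"}
plan = plan_reuse(old_rows, new_symbols, active_backend="b1")
```

In this example only `a` is reused; `b` and `c` are recomputed, `d` is deleted, and `e` is inserted. Inserted symbols also need a fresh embedding, so the work list for the backend is `insert` together with `recompute`.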
Storage Strategy
The storage schema will persist:
- stable symbol identity for indexed symbols
- content hashes for the exact embedding text payloads
- backend metadata required for deterministic invalidation
- embedding vectors as binary float32 payloads
The existing explicit indexing model remains unchanged:
- embeddings are computed only during indexing
- queries read persisted vectors only
- no background service is introduced
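A minimal sketch of one way to persist these fields in SQLite follows. The table name, column names, and helper functions are hypothetical, and the vectors are encoded as little-endian float32 purely as an assumption for the example; the real schema and byte order may differ.

```python
import hashlib
import sqlite3
import struct
from typing import Optional

# Hypothetical schema: one row per indexed symbol.
SCHEMA = """
CREATE TABLE symbol_embeddings (
    stable_id       TEXT PRIMARY KEY,
    payload_hash    TEXT NOT NULL,
    backend_name    TEXT NOT NULL,
    backend_version TEXT NOT NULL,
    dimension       INTEGER NOT NULL,
    vector          BLOB NOT NULL
)
"""


def payload_hash(text: str) -> str:
    """Hash the exact embedding text payload."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def encode_vector(values: list) -> bytes:
    """Pack floats as little-endian float32 (4 bytes per value)."""
    return struct.pack(f"<{len(values)}f", *values)


def decode_vector(blob: bytes) -> list:
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def store_embedding(conn, stable_id, text, backend_name, backend_version, vector):
    """Index-time write path: the only place a vector is persisted."""
    conn.execute(
        "INSERT OR REPLACE INTO symbol_embeddings VALUES (?, ?, ?, ?, ?, ?)",
        (stable_id, payload_hash(text), backend_name, backend_version,
         len(vector), encode_vector(vector)),
    )


def load_vector(conn, stable_id) -> Optional[list]:
    """Query-time read path: reads persisted bytes, never computes."""
    row = conn.execute(
        "SELECT vector FROM symbol_embeddings WHERE stable_id = ?", (stable_id,)
    ).fetchone()
    return None if row is None else decode_vector(row[0])


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
store_embedding(conn, "sym-1", "def parse(): ...", "local", "1", [0.5, -1.25, 2.0])
loaded = load_vector(conn, "sym-1")
```

The example values are exactly representable in float32, so they round-trip unchanged; arbitrary Python floats lose precision when narrowed to 32 bits.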
Consequences
Positive
- real embeddings become a durable indexed artifact rather than a placeholder semantic channel
- changed files can preserve unchanged symbol embeddings deterministically
- symbol disappearance and rename handling become explicit set-diff operations rather than side effects of file-wide replacement
- the analyzer/backend separation remains intact because the language-aware identity logic lives in analyzer output
- future analyzers can participate in symbol-level reuse by implementing the same stable-id contract
Negative
- the analyzer contract becomes stricter and existing analyzers and analyzer tests must be updated
- schema and persistence logic become more involved than the current file-scoped replacement model
- dependency management and local model provisioning add operator overhead
Neutral / Trade-offs
- the first implementation may still delete and recreate non-semantic file-owned rows while preserving embeddings through stable-id reuse
- stable identity is intentionally semantic rather than source-location-based, so symbol moves within a file do not imply invalidation by themselves
- later work may further optimize candidate selection or partial persistence, but that is not required to establish the contract
Execution Rules
- Use the dedicated issue branch `issue/1-real-embeddings`.
- Keep the execution ledger current as work lands.
- Make multiple commits, with at least one commit per phase.
- Split large phases into smaller atomic commits when needed.
- Keep tests, docstrings, and documentation in scope for every phase.
- Merge back to `main` with a squash commit that closes issue #1.
Phase Ledger
Mark each phase as work lands.
- [x] Phase 1 — Branch bootstrap and execution scaffold
- [x] Phase 2 — ADR and execution ledger
- [x] Phase 3 — Dependency and local-model provisioning updates
- [x] Phase 4 — Analyzer contract extension for durable symbol identity
- [x] Phase 5 — Built-in analyzer stable-id implementation
- [x] Phase 6 — Schema and storage migration for stable symbol reuse
- [x] Phase 7 — Symbol-level reuse and invalidation in indexing
- [x] Phase 8 — Real embedding backend integration
- [x] Phase 9 — Query, explain, and inventory updates
- [x] Phase 10 — Tests, docs, and merge preparation