ADR-004 — Pluggable Backend and Analyzer Migration Plan¶
Date: 28/03/2026
Status: Accepted
Context¶
codira currently has a strong implicit coupling between:
- Python-specific analysis
- SQLite-specific persistence and query execution
- CLI/query surfaces that directly depend on SQLite-oriented helpers
Open issues make the next architectural direction explicit:
- issue #1 requires a cleaner persistence boundary for embeddings and their invalidation metadata
- issue #2 requires pluggable language analyzers so multi-language repositories become a first-class target
The migration must preserve determinism and maintainability, and it must expand the architecture documentation alongside the implementation code.
Decision¶
Adopt a two-family plugin architecture and execute the migration on a dedicated branch through a sequence of small, reviewable commits.
Plugin Families¶
codira will distinguish two separate extension families:
- IndexBackend: exactly one storage/query backend is active for a given repository index.
- LanguageAnalyzer: multiple analyzers may be active in the same indexing run so one repository can be indexed across multiple languages and dialects.
This asymmetry is intentional:
- storage is an instance-level policy decision
- analyzers are repository-content capabilities
Documentation and Tests Are First-Class¶
The migration is not code-only work.
Each architectural step must include, where applicable:
- tests that freeze or extend behavior
- architecture documentation updates
- ADRs for durable decisions that would otherwise be lost in commit history
The documentation scope must expand beyond README usage notes to include:
- architecture overviews
- pipeline documentation
- plugin model documentation
- backend/analyzer extension guidance
- ADRs that preserve and enforce decision history
Rationale¶
One active backend per instance avoids a large class of unnecessary complexity:
- competing schema ownership
- inconsistent migration rules
- duplicate query semantics
- split incremental reuse logic
- ambiguous embedding persistence
- reduced determinism
Allowing multiple analyzers in one run is necessary for mixed-language repositories and is aligned with the explicit goal of supporting them without treating non-Python files as unsupported noise.
Treating tests and architecture documentation as first-class citizens reduces the risk that the refactor drifts into undocumented framework churn.
Consequences¶
Positive¶
- clear separation between language analysis and persistence concerns
- explicit support for mixed-language repositories
- deterministic architectural boundaries for future backend and analyzer work
- durable design history through ADRs
- smaller and safer implementation increments
Negative¶
- more upfront design and documentation work before user-visible feature expansion
- more commits and branch management overhead
- stronger discipline required to keep the execution ledger current
Neutral / Trade-offs¶
- README updates should follow architecture stabilization, not lead it
- some migration phases may need more than one commit to keep changes atomic
- additional ADRs may be created as the migration reveals narrower decisions
Migration Plan¶
The migration will proceed through the following phases.
Phase 1 — Branch and Architecture Skeleton¶
Create a dedicated branch for this migration.
Add an architecture documentation skeleton covering:
- system overview
- indexing pipeline
- query pipeline
- plugin model
- storage backends
- language analyzers
Add an ADR template if one does not already exist.
Phase 2 — Characterization Tests¶
Add or extend tests that freeze current behavior for:
- index
- symbol
- calls
- refs
- embeddings
- ctx
- incremental reuse
- embedding invalidation
- deterministic ordering
These tests are guardrails for the refactor, not optional cleanup.
Phase 3 — Core Contracts and Normalized Artifacts¶
Introduce backend-neutral contracts and data structures for:
- LanguageAnalyzer
- AnalysisResult
- IndexBackend
- normalized index artifacts
Document the responsibilities and invariants of those contracts.
Create additional ADRs if symbol identity, artifact ownership, or extension metadata boundaries require durable decisions.
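A minimal sketch of what such backend-neutral contracts could look like, written with typing.Protocol. Only the three type names LanguageAnalyzer, AnalysisResult, and IndexBackend come from this ADR; SymbolArtifact, InMemoryBackend, and every method and field name below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class SymbolArtifact:
    """Hypothetical normalized symbol record; field names are illustrative."""
    name: str
    kind: str
    lineno: int


@dataclass(frozen=True)
class AnalysisResult:
    """Backend-neutral output of one analyzer for one file (assumed shape)."""
    path: str
    language: str
    symbols: tuple[SymbolArtifact, ...] = ()


@runtime_checkable
class LanguageAnalyzer(Protocol):
    name: str

    def supports(self, path: Path) -> bool: ...
    def analyze(self, path: Path, source: str) -> AnalysisResult: ...


@runtime_checkable
class IndexBackend(Protocol):
    name: str

    def persist(self, result: AnalysisResult) -> None: ...
    def find_symbol(self, name: str) -> list[tuple[str, SymbolArtifact]]: ...


class InMemoryBackend:
    """Toy backend used only to show that the contract is storage-agnostic."""
    name = "memory"

    def __init__(self) -> None:
        self._results: dict[str, AnalysisResult] = {}

    def persist(self, result: AnalysisResult) -> None:
        self._results[result.path] = result

    def find_symbol(self, name: str) -> list[tuple[str, SymbolArtifact]]:
        hits = [
            (path, sym)
            for path, res in self._results.items()
            for sym in res.symbols
            if sym.name == name
        ]
        # Sort so results do not depend on insertion order.
        return sorted(hits, key=lambda hit: (hit[0], hit[1].lineno))
```

Because the artifacts are frozen dataclasses and queries return sorted results, a second backend could satisfy the same contract tests without sharing any storage code.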
Phase 4 — SQLite Backend Encapsulation¶
Wrap the current SQLite implementation behind a concrete SQLiteIndexBackend without changing observable CLI behavior.
Keep schema semantics stable during this phase.
Add backend contract tests that SQLite must satisfy.
Phase 5 — Indexer Orchestration Refactor¶
Refactor index_repo into an orchestrator that:
- discovers files
- routes files to analyzers
- collects normalized artifacts
- delegates persistence to the selected backend
The orchestrator must stop depending directly on Python parser internals and raw storage implementation details.
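The four orchestration steps can be sketched as follows. The name index_repo comes from this ADR, but its signature here is an assumption, and SuffixAnalyzer, ListBackend, and Artifact are hypothetical stand-ins. File discovery is replaced by an in-memory mapping so the example stays self-contained.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Artifact:
    """Hypothetical normalized artifact produced by an analyzer."""
    path: str
    language: str
    symbols: tuple


class SuffixAnalyzer:
    """Stand-in analyzer: claims files by suffix, treats each line as a symbol."""

    def __init__(self, language, suffixes):
        self.language = language
        self.suffixes = tuple(suffixes)

    def supports(self, path):
        return PurePosixPath(path).suffix in self.suffixes

    def analyze(self, path, source):
        lines = tuple(line.strip() for line in source.splitlines() if line.strip())
        return Artifact(path, self.language, lines)


class ListBackend:
    """Stand-in backend that records what it was asked to persist."""

    def __init__(self):
        self.stored = []

    def persist(self, artifact):
        self.stored.append(artifact)


def index_repo(files, analyzers, backend):
    """Orchestrate one run: discover, route, collect, then delegate persistence.

    `files` maps repository-relative paths to source text (an assumption that
    replaces real filesystem discovery). Returns the paths no analyzer claimed.
    """
    unclaimed = []
    for path in sorted(files):  # sorted discovery keeps the run deterministic
        analyzer = next((a for a in analyzers if a.supports(path)), None)
        if analyzer is None:
            unclaimed.append(path)
            continue
        backend.persist(analyzer.analyze(path, files[path]))
    return unclaimed
```

The orchestrator in this sketch never imports a parser or a database module; it only calls supports, analyze, and persist.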
Phase 6 — Python Analyzer Extraction¶
Extract the existing Python-specific logic into a PythonAnalyzer.
This includes:
- parsing
- symbol extraction
- call extraction
- callable-reference extraction
- import handling
- docstring audit integration
Phase 7 — Query Abstraction¶
Refactor exact-query and embedding-query paths so they depend on backend interfaces rather than raw SQLite access.
Preserve current CLI output contracts.
Add shared query contract tests where practical.
Phase 8 — Registries and Configuration¶
Introduce registry and configuration mechanisms so:
- one backend is selected for the index
- multiple analyzers can be registered and activated by file routing
Document defaults, selection rules, and failure behavior.
Create an ADR if configuration semantics become materially architectural.
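One possible shape for such a registry, as a sketch only. PluginRegistry, its method names, and the "sqlite" default are assumptions made for illustration; the ADR fixes only the rule that one backend is selected while several analyzers may be active.

```python
class PluginRegistry:
    """Hypothetical registry: many analyzers, exactly one selected backend."""

    def __init__(self):
        self._backends = {}
        self._analyzers = {}

    def register_backend(self, name, factory):
        if name in self._backends:
            raise ValueError(f"duplicate backend: {name}")
        self._backends[name] = factory

    def register_analyzer(self, name, analyzer):
        if name in self._analyzers:
            raise ValueError(f"duplicate analyzer: {name}")
        self._analyzers[name] = analyzer

    def select_backend(self, name="sqlite"):
        """Return the single active backend, failing loudly on unknown names."""
        try:
            factory = self._backends[name]
        except KeyError:
            known = ", ".join(sorted(self._backends)) or "<none>"
            raise LookupError(f"unknown backend {name!r}; available: {known}") from None
        return factory()

    def active_analyzers(self):
        """Return every registered analyzer in name order for stable routing."""
        return [self._analyzers[name] for name in sorted(self._analyzers)]
```

Rejecting duplicates at registration time and listing the known backends in the error message are two simple ways to make failure behavior explicit rather than silent.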
Phase 9 — Second Analyzer Proof¶
Add one non-Python analyzer to validate the abstraction.
C is the preferred first candidate.
The first non-Python analyzer should prioritize:
- symbol extraction
- dependency extraction
- deterministic mixed-language indexing behavior
Phase 10 — Final Documentation Consolidation¶
Expand and reconcile the documentation set so contributors can reconstruct the architecture and the decisions behind it.
This phase should leave behind:
- stable architecture documents
- updated contributor guidance
- updated README references
- a complete ADR trail for the major choices made during the migration
Post-Phase-10 Retrieval Quality Roadmap¶
The migration phases above establish the architecture boundary. The following roadmap defines the preferred order for improving mixed-language retrieval quality without weakening determinism.
Phase 11 — C Analyzer Semantic Parity¶
Expand the C analyzer so it emits richer normalized artifacts.
This phase should prioritize:
- top-level call extraction
- struct, enum, typedef, macro, and global symbol extraction where deterministic
- normalized include artifacts with local versus system include classification
- header and source ownership hints
- nearby comment extraction for semantic text construction
This phase may record direct include edges, but the retrieval layer should not yet depend on them as a first-class graph.
Phase 12 — File-Role Classification¶
Introduce deterministic file-role classification for indexed files.
The initial role set should include:
- implementation
- header or interface
- test
- tooling or script
Prefer repository-structure and path-based rules before deeper heuristics.
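A path-based classifier along these lines might look like the sketch below. The role names follow the list above, but the directory names, filename patterns, and suffixes are illustrative assumptions, as is the function name classify_file_role.

```python
from pathlib import PurePosixPath


def classify_file_role(path):
    """Classify a repository-relative POSIX path using path rules only.

    The specific directory names and suffixes are illustrative assumptions.
    Rules are checked in a fixed order so the result is deterministic.
    """
    p = PurePosixPath(path)
    directories = p.parts[:-1]
    name = p.name
    if "tests" in directories or name.startswith("test_") or name.endswith("_test.c"):
        return "test"
    if "scripts" in directories or "tools" in directories:
        return "tooling"
    if p.suffix in (".h", ".pyi"):
        return "header"
    return "implementation"
```

Checking the test rule first means a header under tests/ is classified as a test file, which is one way to resolve overlapping rules without any content inspection.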
Phase 13 — Evidence-Based Ranking Fusion¶
Replace flat or near-flat channel merging with typed evidence fusion.
The evidence families should include:
- lexical symbol evidence
- semantic text evidence
- graph evidence
- file-role evidence
- language-coverage evidence
Scoring must remain deterministic and explainable.
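As an illustration of deterministic, explainable fusion, the sketch below applies reciprocal-rank fusion with an optional per-family weight. The weighting is an extension added here for illustration, and the function name, the family labels, and the constant k=60 (the value commonly used with reciprocal-rank fusion) are assumptions rather than codira's actual scoring.

```python
def fuse_rankings(rankings, k=60, weights=None):
    """Fuse ranked lists with weighted reciprocal-rank fusion.

    `rankings` maps an evidence-family name to an ordered list of unique
    item ids (best first). Returns (item, score, contributions) tuples,
    where `contributions` records each family's share for explain output.
    """
    weights = weights or {}
    contributions = {}
    for family in sorted(rankings):
        weight = weights.get(family, 1.0)
        for rank, item in enumerate(rankings[family], start=1):
            contributions.setdefault(item, {})[family] = weight / (k + rank)
    fused = [
        (item, sum(parts.values()), parts) for item, parts in contributions.items()
    ]
    # Sort by descending score, then by item id so ties never depend on dict order.
    return sorted(fused, key=lambda row: (-row[1], row[0]))
```

Keeping the per-family contributions next to the final score lets an explain surface show why an item ranked where it did.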
Phase 14 — Diversity-Aware Result Selection¶
Add deterministic diversification after raw ranking.
This phase should prevent one language, module family, or test bundle from crowding out stronger implementation evidence.
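A greedy cap-based pass is one simple way to diversify without reordering. In this sketch the function name, the tuple layout, and the cap values are all hypothetical.

```python
from collections import Counter


def select_diverse(ranked, limit, per_file_cap=2, per_language_cap=3):
    """Greedy, order-preserving selection with per-file and per-language caps.

    `ranked` is a best-first list of (symbol, path, language) tuples.
    Cap values are illustrative defaults, not codira's actual settings.
    """
    by_file = Counter()
    by_language = Counter()
    selected = []
    for symbol, path, language in ranked:
        if len(selected) == limit:
            break
        if by_file[path] >= per_file_cap or by_language[language] >= per_language_cap:
            continue  # skip, but keep scanning for lower-ranked diverse hits
        by_file[path] += 1
        by_language[language] += 1
        selected.append((symbol, path, language))
    return selected
```

Because the pass only skips items and never reorders them, the output is a deterministic subsequence of the raw ranking.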
Phase 15 — Cross-Language Relationship Graph¶
Promote language-specific relationship artifacts into query-usable graph structures.
For C, this phase explicitly includes a first-class include graph covering:
- direct include edges
- reverse include edges
- deterministic transitive include expansion
- header-to-source pairing where ownership is resolvable
This phase should also consider:
- test-to-implementation links
- tooling or configuration to implementation links
- generated-source provenance when present
ctx and explain surfaces should be able to show when include-graph neighbors were used to expand or justify mixed-language results.
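The direct, reverse, and transitive include queries can be illustrated with a small in-memory graph. IncludeGraph and its methods are hypothetical, and a real implementation would read persisted include artifacts rather than a list of edges.

```python
from collections import defaultdict, deque


class IncludeGraph:
    """Toy include graph over repository-relative paths (assumed representation)."""

    def __init__(self, edges):
        self._direct = defaultdict(set)
        self._reverse = defaultdict(set)
        for source, header in edges:
            self._direct[source].add(header)
            self._reverse[header].add(source)

    def direct(self, path):
        return sorted(self._direct.get(path, ()))

    def reverse(self, path):
        return sorted(self._reverse.get(path, ()))

    def transitive(self, path):
        """Breadth-first include closure; neighbors are visited in sorted order."""
        seen = {path}
        order = []
        queue = deque([path])
        while queue:
            current = queue.popleft()
            for header in self.direct(current):
                if header not in seen:  # guards against include cycles
                    seen.add(header)
                    order.append(header)
                    queue.append(header)
        return order
```

Sorting neighbors before visiting them makes the expansion order independent of how edges were inserted, and the seen set keeps cyclic includes from looping.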
Phase 16 — Language-Specific Semantic Text Units¶
Improve the semantic text indexed for each language family.
For C this should combine:
- signatures
- nearby comments
- include context
- header and source ownership context
For Python this should combine:
- docstrings
- assertions
- fixture or setup context
- symbolic ownership context
Phase 17 — Intent-Aware Retrieval Planning¶
Add a deterministic query planner that assembles retrieval bundles by intent.
The initial intents should include:
- behavior or implementation
- test or validation
- configuration
- API surface
- architecture or navigation
The planner should use the lower-layer evidence and graph structures rather than ad hoc string heuristics in the final rendering layer.
Post-Phase-17 Plugin-Coverage and Rebuild Roadmap¶
The migration and retrieval-quality phases above establish the plugin architecture, but they do not yet make plugin coverage complete at index time.
The following roadmap defines the next architectural steps required so third-party analyzers can participate in discovery, repository coverage can be audited deterministically, and an existing index can be detected as stale when plugin availability changes.
Phase 18 — Analyzer-Declared Discovery Metadata¶
Replace hard-coded source-file discovery with analyzer-declared discovery metadata.
This phase should:
- extend the analyzer contract so each analyzer declares the file suffixes or globs it owns
- make scanner discovery derive supported files from the active analyzer set rather than from a core-owned tuple such as ("*.py", "*.c", "*.h")
- keep routing deterministic when multiple analyzers could plausibly accept the same path
- document the discovery contract for third-party analyzer authors
This phase should preserve current built-in behavior for Python and C while removing the hard-coded discovery limitation that prevents future analyzers from participating in indexing.
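A sketch of analyzer-declared discovery, assuming each analyzer exposes a discovery_globs attribute matched against the file name. GlobAnalyzer, route, and discover are hypothetical, and the tie-break rule (longest glob first, then analyzer name) is only one possible way to keep routing deterministic.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


class GlobAnalyzer:
    """Stand-in analyzer exposing only a name and its discovery globs."""

    def __init__(self, name, discovery_globs):
        self.name = name
        self.discovery_globs = tuple(discovery_globs)


def route(path, analyzers):
    """Return the name of the analyzer that owns `path`, or None.

    Tie-break (an assumption of this sketch): the longest matching glob
    wins, then the alphabetically first analyzer name.
    """
    filename = PurePosixPath(path).name
    candidates = [
        (-len(glob), analyzer.name)
        for analyzer in analyzers
        for glob in analyzer.discovery_globs
        if fnmatchcase(filename, glob)
    ]
    return min(candidates)[1] if candidates else None


def discover(paths, analyzers):
    """Map each supported path to its owning analyzer, in sorted path order."""
    owners = {}
    for path in sorted(paths):
        owner = route(path, analyzers)
        if owner is not None:
            owners[path] = owner
    return owners
```

With this shape, adding an analyzer changes the set of discovered files without any edit to the scanner itself.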
Phase 19 — Canonical-Directory Coverage Audit¶
Add deterministic repository coverage auditing against tracked files in canonical source directories.
The initial canonical directories should include:
- src/
- tests/
- scripts/
This phase should:
- inspect tracked files under those directories even when no currently installed analyzer claims them
- classify each relevant file as covered, optionally coverable, or uncovered
- report missing analyzer families for uncovered suffixes or globs
- make codira index surface this coverage state before or during indexing
The goal is for codira to say, deterministically, that a repository appears to need analyzers for languages such as Rust, assembly, Lua, or Pascal when tracked canonical-source files indicate that coverage is incomplete.
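A coverage audit of this kind could be sketched as below. The function name and return shape are hypothetical, the tracked-file list stands in for the output of something like git ls-files, and the sketch distinguishes only covered from uncovered files because this ADR does not define how "optionally coverable" is decided.

```python
from collections import defaultdict
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

CANONICAL_DIRS = ("src", "tests", "scripts")


def audit_coverage(tracked_files, active_globs):
    """Split tracked canonical-directory files into covered and uncovered.

    `active_globs` is the union of the installed analyzers' discovery globs.
    Returns (covered paths, {suffix: uncovered paths}).
    """
    covered = []
    uncovered = defaultdict(list)
    for path in sorted(tracked_files):
        p = PurePosixPath(path)
        if not p.parts or p.parts[0] not in CANONICAL_DIRS:
            continue  # outside canonical directories: not audited
        if any(fnmatchcase(p.name, glob) for glob in active_globs):
            covered.append(path)
        else:
            # Group by suffix; fall back to the file name when there is none.
            uncovered[p.suffix or p.name].append(path)
    return covered, dict(sorted(uncovered.items()))
```

Grouping uncovered files by suffix gives the report a natural place to name the missing analyzer family, for example ".rs" for Rust.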
Phase 20 — Persisted Plugin Inventory and File Ownership¶
Persist the plugin inventory used for one indexing run and record analyzer ownership of indexed files.
This phase should add metadata for:
- active backend name and version
- active analyzer names and versions
- analyzer discovery metadata snapshot
- per-file analyzer ownership for indexed files
- whether the repository was fully covered at index time
This inventory becomes the durable source of truth for deciding whether an existing index still matches the currently installed plugin set.
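To make the persisted inventory concrete, the sketch below stores it in SQLite using the standard sqlite3 module. The table names, column names, and function names are hypothetical and do not describe codira's actual schema.

```python
import sqlite3


def persist_inventory(conn, backend, analyzers, coverage_complete):
    """Store one run's plugin inventory. Table and column names are hypothetical."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS backend_runtime (
            name TEXT NOT NULL,
            version TEXT NOT NULL,
            coverage_complete INTEGER NOT NULL
        );
        CREATE TABLE IF NOT EXISTS analyzer_inventory (
            name TEXT PRIMARY KEY,
            version TEXT NOT NULL,
            discovery_globs TEXT NOT NULL
        );
        DELETE FROM backend_runtime;
        DELETE FROM analyzer_inventory;
        """
    )
    conn.execute(
        "INSERT INTO backend_runtime VALUES (?, ?, ?)",
        (backend[0], backend[1], int(coverage_complete)),
    )
    conn.executemany(
        "INSERT INTO analyzer_inventory VALUES (?, ?, ?)",
        # Globs are sorted before storage so the snapshot text is stable.
        [(name, version, "\n".join(sorted(globs))) for name, version, globs in analyzers],
    )
    conn.commit()


def load_analyzer_versions(conn):
    """Read back the stored analyzer versions as a name-to-version mapping."""
    rows = conn.execute("SELECT name, version FROM analyzer_inventory ORDER BY name")
    return dict(rows.fetchall())
```

Each run replaces the previous snapshot, so the stored rows always describe the plugin set that produced the current index.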
Phase 21 — Plugin-Aware Staleness and Rebuild Policy¶
Make indexing detect when plugin availability changes the validity or completeness of the current index.
This phase should handle at least:
- a new analyzer becoming available for previously uncovered files
- an analyzer version change that should reindex files it owns
- an analyzer being removed after it previously indexed files
- a backend or analyzer inventory mismatch between the database and the current process
The first implementation may conservatively force a broader rebuild, but the policy must remain deterministic and explainable.
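The cases above can be expressed as a comparison between the stored inventory and the current process. The function name and the dictionary shape in this sketch are assumptions; it reports reasons rather than deciding how broad the rebuild should be.

```python
def rebuild_reasons(stored, current):
    """Compare a persisted plugin inventory with the current process.

    Both arguments are dicts shaped like
    {"backend": (name, version), "analyzers": {name: version}},
    which is an assumed shape for illustration. Returns human-readable
    reasons in a fixed order; an empty list means no mismatch was found.
    """
    reasons = []
    if stored["backend"] != current["backend"]:
        reasons.append(f"backend changed: {stored['backend']} -> {current['backend']}")
    old, new = stored["analyzers"], current["analyzers"]
    for name in sorted(set(old) | set(new)):
        if name not in old:
            reasons.append(f"analyzer added: {name}")
        elif name not in new:
            reasons.append(f"analyzer removed: {name}")
        elif old[name] != new[name]:
            reasons.append(f"analyzer version changed: {name} {old[name]} -> {new[name]}")
    return reasons
```

Returning the reasons as text keeps the policy explainable: a conservative first implementation could rebuild everything whenever the list is non-empty and still tell the operator why.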
Phase 22 — Coverage Commands, Policy Flags, and Documentation¶
Expose the new coverage model clearly through CLI and documentation.
This phase should include:
- a dedicated coverage inspection command or equivalent explain surface
- index behavior that can warn on incomplete coverage by default
- a strict mode such as --require-full-coverage that fails when canonical directories contain uncovered tracked files
- dedicated plugin-author and operator documentation describing:
  - analyzer discovery metadata
  - coverage semantics
  - plugin-aware rebuild triggers
  - the distinction between partial and full repository coverage
This phase should leave contributors and plugin authors with a direct route to understand how plugin installation affects indexing completeness.
Execution Rules¶
- Use a dedicated branch for the migration.
- Make multiple commits, with at least one commit per phase.
- Split large phases into smaller atomic commits when needed.
- Keep tests and documentation in-scope for every phase.
- Preserve deterministic behavior unless a later ADR explicitly changes it.
Phase Ledger¶
Mark each phase as work lands.
- [x] Phase 1 — Branch and Architecture Skeleton
- [x] Phase 2 — Characterization Tests
- [x] Phase 3 — Core Contracts and Normalized Artifacts
- [x] Phase 4 — SQLite Backend Encapsulation
- [x] Phase 5 — Indexer Orchestration Refactor
- [x] Phase 6 — Python Analyzer Extraction
- [x] Phase 7 — Query Abstraction
- [x] Phase 8 — Registries and Configuration
- [x] Phase 9 — Second Analyzer Proof
- [x] Phase 10 — Final Documentation Consolidation
- [x] Phase 11 — C Analyzer Semantic Parity
- [x] Phase 12 — File-Role Classification
- [x] Phase 13 — Evidence-Based Ranking Fusion
- [x] Phase 14 — Diversity-Aware Result Selection
- [x] Phase 15 — Cross-Language Relationship Graph
- [x] Phase 16 — Language-Specific Semantic Text Units
- [x] Phase 17 — Intent-Aware Retrieval Planning
- [x] Phase 18 — Analyzer-Declared Discovery Metadata
- [x] Phase 19 — Canonical-Directory Coverage Audit
- [x] Phase 20 — Persisted Plugin Inventory and File Ownership
- [x] Phase 21 — Plugin-Aware Staleness and Rebuild Policy
- [x] Phase 22 — Coverage Commands, Policy Flags, and Documentation
Notes¶
Expected follow-up ADR topics include:
- one active backend per repository instance
- multiple analyzers per indexing run
- C include-graph semantics and header-to-source ownership rules
- normalized artifact model and symbol identity
- embedding persistence and invalidation ownership
- query surfaces depending on backend contracts rather than backend internals
Phase 10 leaves the branch with:
- architecture pages updated to reflect the implemented registry, backend, and analyzer model
- contributor guidance reconciled with the architecture workflow
- README references updated to the current capability set
Phases 11 through 13 extend that baseline with:
- a tree-sitter-backed C analyzer with richer normalized call, declaration, include-kind, and semantic-text artifacts
- deterministic file-role classification used by retrieval and explain output
- explicit merge diagnostics for evidence families, reciprocal-rank fusion, merge-time role contribution, and final merged score
Phase 14 adds deterministic diversity selection across:
- per-file caps
- file-role caps
- mixed-language caps so one language family cannot monopolize the primary context block when another indexed language is also available
Phase 15 adds a first-class include-graph slice for C through:
- exact include-edge queries backed by persisted include artifacts
- deterministic direct and transitive local-include expansion in ctx
- explain-mode diagnostics showing when include-graph edges contributed to module expansion
Phase 16 completes language-specific semantic text units through:
- C embedding payloads that combine signatures, declaration comments, include context, and header-to-source pairing context
- Python callable embedding payloads that combine docstrings with module summaries, symbolic ownership, assertion presence, decorator names, and fixture or setup context
Phase 17 completes intent-aware retrieval planning through:
- deterministic primary intent families for behavior, test, configuration, API-surface, and architecture or navigation queries
- an explicit retrieval plan that owns channel routing and explain-mode diagnostics
- planner-driven gating for docstring issue enrichment, include-graph expansion, and reference collection while preserving earlier retrieval contracts
Phase 18 replaces hard-coded scanner discovery with analyzer-declared metadata through:
- LanguageAnalyzer.discovery_globs as the stable discovery contract
- scanner discovery derived from active analyzer metadata for both Git-backed and filesystem-backed indexing
- third-party analyzer validation that rejects entry points missing discovery metadata
Phase 19 adds canonical-directory coverage auditing through:
- deterministic inspection of tracked files under src/, tests/, and scripts/
- uncovered-file reporting when no active analyzer claims a canonical file
- index summaries that surface partial repository coverage without yet making it fatal
Phase 20 persists plugin inventory and file ownership through:
- analyzer ownership columns on files rows
- backend runtime metadata stored in the database
- analyzer inventory rows carrying version and discovery-glob snapshots
- coverage-complete state recorded alongside the backend runtime snapshot
Phase 21 activates that persisted metadata in rebuild policy through:
- unchanged-file reindexing when analyzer ownership no longer matches
- automatic rebuilds when stored backend runtime inventory changes
- automatic rebuilds when stored analyzer inventory changes
Phase 22 completes the operator-facing surface through:
- a dedicated codira cov inspection command
- strict indexing preflight via codira index --require-full-coverage
- plugin and operator documentation describing partial versus full coverage
Phases 18 through 22 now provide the core indexing-side mechanics needed to make the plugin model index-aware rather than only discovery-aware:
- analyzer-driven file discovery instead of hard-coded core suffixes
- deterministic repository coverage auditing for canonical source directories
- persisted plugin inventory and analyzer ownership metadata in the index
- CLI and documentation surfaces that distinguish partial from full coverage