ADR-004 — Pluggable Backend and Analyzer Migration Plan

Date: 28/03/2026

Status: Accepted

Context

codira currently has a strong implicit coupling between:

  • Python-specific analysis
  • SQLite-specific persistence and query execution
  • CLI/query surfaces that directly depend on SQLite-oriented helpers

Open issues make the next architectural direction explicit:

  • issue #1 requires a cleaner persistence boundary for embeddings and their invalidation metadata
  • issue #2 requires pluggable language analyzers so multi-language repositories become a first-class target

The migration must preserve determinism and maintainability, and it must expand the architecture documentation alongside the implementation code.

Decision

Adopt a two-family plugin architecture and execute the migration on a dedicated branch through a sequence of small, reviewable commits.

Plugin Families

codira will distinguish two separate extension families:

  • IndexBackend: exactly one storage/query backend is active for a given repository index.
  • LanguageAnalyzer: multiple analyzers may be active in the same indexing run, so one repository can be indexed across multiple languages and dialects.

This asymmetry is intentional:

  • storage is an instance-level policy decision
  • analyzers are repository-content capabilities

Documentation and Tests Are First-Class

The migration is not code-only work.

Each architectural step must include, where applicable:

  • tests that freeze or extend behavior
  • architecture documentation updates
  • ADRs for durable decisions that would otherwise be lost in commit history

The documentation scope must expand beyond README usage notes to include:

  • architecture overviews
  • pipeline documentation
  • plugin model documentation
  • backend/analyzer extension guidance
  • ADRs that preserve and enforce decision history

Rationale

One active backend per instance avoids a large class of unnecessary complexity:

  • competing schema ownership
  • inconsistent migration rules
  • duplicate query semantics
  • split incremental reuse logic
  • ambiguous embedding persistence
  • reduced determinism

Allowing multiple analyzers in one run is necessary for mixed-language repositories and is aligned with the explicit goal of supporting them without treating non-Python files as unsupported noise.

Treating tests and architecture documentation as first-class citizens reduces the risk that the refactor drifts into undocumented framework churn.

Consequences

Positive

  • clear separation between language analysis and persistence concerns
  • explicit support for mixed-language repositories
  • deterministic architectural boundaries for future backend and analyzer work
  • durable design history through ADRs
  • smaller and safer implementation increments

Negative

  • more upfront design and documentation work before user-visible feature expansion
  • more commits and branch management overhead
  • stronger discipline required to keep the execution ledger current

Neutral / Trade-offs

  • README updates should follow architecture stabilization, not lead it
  • some migration phases may need more than one commit to keep changes atomic
  • additional ADRs may be created as the migration reveals narrower decisions

Migration Plan

The migration will proceed through the following phases.

Phase 1 — Branch and Architecture Skeleton

Create a dedicated branch for this migration.

Add an architecture documentation skeleton covering:

  • system overview
  • indexing pipeline
  • query pipeline
  • plugin model
  • storage backends
  • language analyzers

Add an ADR template if one does not already exist.

Phase 2 — Characterization Tests

Add or extend tests that freeze current behavior for:

  • index
  • symbol
  • calls
  • refs
  • embeddings
  • ctx
  • incremental reuse
  • embedding invalidation
  • deterministic ordering

These tests are guardrails for the refactor, not optional cleanup.

Phase 3 — Core Contracts and Normalized Artifacts

Introduce backend-neutral contracts and data structures for:

  • LanguageAnalyzer
  • AnalysisResult
  • IndexBackend
  • normalized index artifacts

Document the responsibilities and invariants of those contracts.

Create additional ADRs if symbol identity, artifact ownership, or extension metadata boundaries require durable decisions.
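
As an illustration only, the contracts could be expressed as Python typing.Protocol classes. The names LanguageAnalyzer, AnalysisResult, and IndexBackend come from this ADR; every method, field, and the ToyAnalyzer class below are hypothetical and are not the contracts codira actually ships.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class AnalysisResult:
    """Backend-neutral output for one analyzed file (hypothetical fields)."""

    path: str
    language: str
    symbols: tuple[str, ...] = ()


@runtime_checkable
class LanguageAnalyzer(Protocol):
    """Hypothetical analyzer contract: claim a path, then analyze its source."""

    name: str

    def supports(self, path: Path) -> bool: ...

    def analyze(self, path: Path, source: str) -> AnalysisResult: ...


@runtime_checkable
class IndexBackend(Protocol):
    """Hypothetical backend contract: persist artifacts and answer exact queries."""

    name: str

    def persist(self, result: AnalysisResult) -> None: ...

    def find_symbol(self, name: str) -> list[str]: ...


class ToyAnalyzer:
    """Illustrative analyzer for an invented '.toy' language."""

    name = "toy"

    def supports(self, path: Path) -> bool:
        return path.suffix == ".toy"

    def analyze(self, path: Path, source: str) -> AnalysisResult:
        names = []
        for line in source.splitlines():
            parts = line.split()
            if len(parts) >= 2 and parts[0] == "def":
                names.append(parts[1])
        # Sorting the symbols keeps the artifact independent of source order.
        return AnalysisResult(path.as_posix(), "toy", tuple(sorted(names)))
```

Because the protocols are structural, ToyAnalyzer satisfies LanguageAnalyzer without inheriting from it, which is one way third-party plugins could avoid importing core base classes.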

Phase 4 — SQLite Backend Encapsulation

Wrap the current SQLite implementation behind a concrete SQLiteIndexBackend without changing observable CLI behavior.

Keep schema semantics stable during this phase.

Add backend contract tests that SQLite must satisfy.

Phase 5 — Indexer Orchestration Refactor

Refactor index_repo into an orchestrator that:

  • discovers files
  • routes files to analyzers
  • collects normalized artifacts
  • delegates persistence to the selected backend

The orchestrator must stop depending directly on Python parser internals and raw storage implementation details.
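
The four orchestrator responsibilities can be sketched as follows. This is a simplified illustration, not the real index_repo: file discovery is reduced to an explicit path list, and SuffixAnalyzer, MemoryBackend, and the dictionary artifact shape are hypothetical stand-ins.

```python
from pathlib import Path


class SuffixAnalyzer:
    """Hypothetical analyzer that claims files by suffix."""

    def __init__(self, name: str, suffix: str) -> None:
        self.name = name
        self.suffix = suffix

    def supports(self, path: Path) -> bool:
        return path.suffix == self.suffix

    def analyze(self, path: Path) -> dict:
        return {"path": path.as_posix(), "analyzer": self.name}


class MemoryBackend:
    """Hypothetical backend that keeps artifacts in a list."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def persist(self, artifacts: list[dict]) -> None:
        self.rows.extend(artifacts)


def index_repo(paths, analyzers, backend) -> list[str]:
    """Route files to analyzers, collect artifacts, and delegate persistence.

    Returns the paths that no analyzer claimed.
    """
    artifacts = []
    skipped = []
    for path in sorted(paths):  # sorted input keeps every run deterministic
        analyzer = next((a for a in analyzers if a.supports(path)), None)
        if analyzer is None:
            skipped.append(path.as_posix())
            continue
        artifacts.append(analyzer.analyze(path))
    backend.persist(artifacts)  # the orchestrator never touches storage details
    return skipped
```

Note that the orchestrator only calls supports, analyze, and persist. It has no knowledge of parsers or of how the backend stores rows.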

Phase 6 — Python Analyzer Extraction

Extract the existing Python-specific logic into a PythonAnalyzer.

This includes:

  • parsing
  • symbol extraction
  • call extraction
  • callable-reference extraction
  • import handling
  • docstring audit integration

Phase 7 — Query Abstraction

Refactor exact-query and embedding-query paths so they depend on backend interfaces rather than raw SQLite access.

Preserve current CLI output contracts.

Add shared query contract tests where practical.

Phase 8 — Registries and Configuration

Introduce registry and configuration mechanisms so:

  • one backend is selected for the index
  • multiple analyzers can be registered and activated by file routing

Document defaults, selection rules, and failure behavior.

Create an ADR if configuration semantics become materially architectural.
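
The asymmetry between one backend and many analyzers could look like this in a registry. The PluginRegistry class, its method names, and the default backend name "sqlite" are assumptions made for illustration.

```python
class PluginRegistry:
    """Hypothetical registry: one selected backend, many active analyzers."""

    def __init__(self) -> None:
        self._backends = {}
        self._analyzers = {}

    def register_backend(self, name, factory) -> None:
        if name in self._backends:
            raise ValueError(f"backend {name!r} is already registered")
        self._backends[name] = factory

    def register_analyzer(self, name, factory) -> None:
        if name in self._analyzers:
            raise ValueError(f"analyzer {name!r} is already registered")
        self._analyzers[name] = factory

    def select_backend(self, name: str = "sqlite"):
        """Return exactly one backend instance, or fail with the known names."""
        try:
            factory = self._backends[name]
        except KeyError:
            known = sorted(self._backends)
            raise LookupError(f"unknown backend {name!r}; available: {known}") from None
        return factory()

    def active_analyzers(self) -> list:
        """Return every registered analyzer in a stable, name-sorted order."""
        return [self._analyzers[name]() for name in sorted(self._analyzers)]
```

Failing loudly on an unknown backend name, rather than silently falling back, is one possible failure behavior to document.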

Phase 9 — Second Analyzer Proof

Add one non-Python analyzer to validate the abstraction.

C is the preferred first candidate.

The first non-Python analyzer should prioritize:

  • symbol extraction
  • dependency extraction
  • deterministic mixed-language indexing behavior

Phase 10 — Final Documentation Consolidation

Expand and reconcile the documentation set so contributors can reconstruct the architecture and the decisions behind it.

This phase should leave behind:

  • stable architecture documents
  • updated contributor guidance
  • updated README references
  • a complete ADR trail for the major choices made during the migration

Post-Phase-10 Retrieval Quality Roadmap

The migration phases above establish the architecture boundary. The following roadmap defines the preferred order for improving mixed-language retrieval quality without weakening determinism.

Phase 11 — C Analyzer Semantic Parity

Expand the C analyzer so it emits richer normalized artifacts.

This phase should prioritize:

  • top-level call extraction
  • struct, enum, typedef, macro, and global symbol extraction where deterministic
  • normalized include artifacts with local versus system include classification
  • header and source ownership hints
  • nearby comment extraction for semantic text construction

This phase may record direct include edges, but the retrieval layer should not yet depend on them as a first-class graph.

Phase 12 — File-Role Classification

Introduce deterministic file-role classification for indexed files.

The initial role set should include:

  • implementation
  • header or interface
  • test
  • tooling or script

Prefer repository-structure and path-based rules before deeper heuristics.
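
A path-based classifier in this spirit might look like the sketch below. The specific rules (directory names, the test_ prefix, the suffix set) are illustrative assumptions, not the rules codira uses.

```python
from pathlib import PurePosixPath


def classify_file_role(path: str) -> str:
    """Classify a repository-relative path using structure and name only."""
    p = PurePosixPath(path)
    directories = p.parts[:-1]
    # Rules are checked in a fixed order, so the result is deterministic.
    if "tests" in directories or p.name.startswith("test_"):
        return "test"
    if "scripts" in directories or "tools" in directories:
        return "tooling"
    if p.suffix in {".h", ".pyi"}:
        return "header"
    return "implementation"
```

For example, classify_file_role("tests/test_indexer.py") yields "test" and classify_file_role("src/native/scan.h") yields "header", with no file contents read.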

Phase 13 — Evidence-Based Ranking Fusion

Replace flat or near-flat channel merging with typed evidence fusion.

The evidence families should include:

  • lexical symbol evidence
  • semantic text evidence
  • graph evidence
  • file-role evidence
  • language-coverage evidence

Scoring must remain deterministic and explainable.
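
One deterministic and explainable way to fuse typed evidence is reciprocal-rank fusion, where each evidence family contributes weight / (k + rank) per item. The sketch below assumes that technique. The function name, the weights parameter, and the constant k = 60 are illustrative choices.

```python
from collections import defaultdict


def fuse_rankings(channels, weights=None, k=60):
    """Fuse per-family rankings with weighted reciprocal-rank fusion.

    channels maps an evidence family name to a best-first list of item ids.
    Returns (item, score, per-family contributions) tuples, best first.
    """
    weights = weights or {}
    scores = defaultdict(float)
    contributions = defaultdict(dict)
    for family in sorted(channels):  # fixed iteration order
        weight = weights.get(family, 1.0)
        for rank, item in enumerate(channels[family], start=1):
            part = weight / (k + rank)
            scores[item] += part
            contributions[item][family] = part
    # Ties are broken by item id so equal scores never reorder between runs.
    ordered = sorted(scores, key=lambda item: (-scores[item], item))
    return [(item, scores[item], contributions[item]) for item in ordered]
```

Keeping the per-family contributions next to the final score is what makes the ranking explainable: an explain surface can show which evidence families lifted each result.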

Phase 14 — Diversity-Aware Result Selection

Add deterministic diversification after raw ranking.

This phase should prevent one language, module family, or test bundle from crowding out stronger implementation evidence.
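
A greedy, cap-based pass over the ranked list is one simple way to diversify deterministically. The tuple shape and the cap parameters below are assumptions, and this sketch applies the language cap unconditionally for brevity.

```python
from collections import Counter


def select_diverse(ranked, limit, per_file_cap=2, per_language_cap=3):
    """Pick up to limit results in rank order while enforcing caps.

    ranked is a best-first list of (symbol, file, language) tuples.
    """
    file_counts = Counter()
    language_counts = Counter()
    selected = []
    for symbol, file, language in ranked:
        if len(selected) == limit:
            break
        if file_counts[file] >= per_file_cap:
            continue  # this file already holds its share of slots
        if language_counts[language] >= per_language_cap:
            continue  # leave room for other language families
        selected.append(symbol)
        file_counts[file] += 1
        language_counts[language] += 1
    return selected
```

Because the pass walks the list once in rank order and never reorders, the same ranked input always produces the same selection.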

Phase 15 — Cross-Language Relationship Graph

Promote language-specific relationship artifacts into query-usable graph structures.

For C, this phase explicitly includes a first-class include graph covering:

  • direct include edges
  • reverse include edges
  • deterministic transitive include expansion
  • header-to-source pairing where ownership is resolvable

This phase should also consider:

  • test-to-implementation links
  • tooling or configuration to implementation links
  • generated-source provenance when present

ctx and explain surfaces should be able to show when include-graph neighbors were used to expand or justify mixed-language results.
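
Reverse edges and deterministic transitive expansion can be sketched over a plain adjacency mapping. The function names and the dictionary representation are illustrative. A real implementation would read persisted include artifacts from the backend.

```python
from collections import deque


def build_reverse(edges):
    """Invert direct include edges: header -> sorted list of includers."""
    reverse = {}
    for source in sorted(edges):
        for target in sorted(edges[source]):
            reverse.setdefault(target, []).append(source)
    return reverse


def transitive_includes(edges, start):
    """Return every file reachable from start through include edges.

    Breadth-first, visiting neighbors in sorted order, so the result order
    is stable. The seen set makes include cycles harmless.
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in sorted(edges.get(current, ())):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order
```

With edges such as main.c including util.h and core.h, and core.h including types.h, the expansion from main.c lists the direct neighbors first and types.h afterwards, which gives an explain surface a natural distance ordering.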

Phase 16 — Language-Specific Semantic Text Units

Improve the semantic text indexed for each language family.

For C this should combine:

  • signatures
  • nearby comments
  • include context
  • header and source ownership context

For Python this should combine:

  • docstrings
  • assertions
  • fixture or setup context
  • symbolic ownership context

Phase 17 — Intent-Aware Retrieval Planning

Add a deterministic query planner that assembles retrieval bundles by intent.

The initial intents should include:

  • behavior or implementation
  • test or validation
  • configuration
  • API surface
  • architecture or navigation

The planner should use the lower-layer evidence and graph structures rather than ad hoc string heuristics in the final rendering layer.

Post-Phase-17 Plugin-Coverage and Rebuild Roadmap

The migration and retrieval-quality phases above establish the plugin architecture, but they do not yet make plugin coverage complete at index time.

The following roadmap defines the next architectural steps required so that third-party analyzers can participate in discovery, repository coverage can be audited deterministically, and the index can be recognized as stale when plugin availability changes.

Phase 18 — Analyzer-Declared Discovery Metadata

Replace hard-coded source-file discovery with analyzer-declared discovery metadata.

This phase should:

  • extend the analyzer contract so each analyzer declares the file suffixes or globs it owns
  • make scanner discovery derive supported files from the active analyzer set rather than from a core-owned tuple such as ("*.py", "*.c", "*.h")
  • keep routing deterministic when multiple analyzers could plausibly accept the same path
  • document the discovery contract for third-party analyzer authors

This phase should preserve current built-in behavior for Python and C while removing the hard-coded discovery limitation that prevents future analyzers from participating in indexing.
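
A sketch of discovery derived from analyzer metadata follows. The attribute name discovery_globs matches the contract named later in this ADR. The Analyzer dataclass, the helper functions, and the tie-break rule (the lexicographically smallest analyzer name wins) are assumptions for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Analyzer:
    """Hypothetical analyzer stub carrying only discovery metadata."""

    name: str
    discovery_globs: tuple[str, ...]


def supported_globs(analyzers) -> list[str]:
    """Union of all declared globs, replacing a core-owned tuple."""
    return sorted({glob for a in analyzers for glob in a.discovery_globs})


def route(path: str, analyzers):
    """Return the name of the analyzer that owns path, or None.

    fnmatchcase is used so matching does not vary with the host platform.
    """
    filename = PurePosixPath(path).name
    candidates = sorted(
        a.name
        for a in analyzers
        if any(fnmatchcase(filename, glob) for glob in a.discovery_globs)
    )
    # Assumed tie-break: the smallest name wins when several analyzers match.
    return candidates[0] if candidates else None
```

Other tie-break policies, such as explicit priorities or rejecting overlapping globs at registration time, would be equally deterministic. The point is that the rule is fixed and documented.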

Phase 19 — Canonical-Directory Coverage Audit

Add deterministic repository coverage auditing against tracked files in canonical source directories.

The initial canonical directories should include:

  • src/
  • tests/
  • scripts/

This phase should:

  • inspect tracked files under those directories even when no currently installed analyzer claims them
  • classify each relevant file as covered, optionally coverable, or uncovered
  • report missing analyzer families for uncovered suffixes or globs
  • make codira index surface this coverage state before or during indexing

The goal is for codira to say, deterministically, that a repository appears to need analyzers for languages such as Rust, assembly, Lua, or Pascal when tracked canonical-source files indicate that coverage is incomplete.
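
The audit can be sketched as a pure function over a list of tracked paths, which in practice might come from a command such as git ls-files. The function name and report shape are hypothetical, and the "optionally coverable" state is left out because its semantics are not defined here.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

CANONICAL_DIRS = ("src", "tests", "scripts")


def audit_coverage(tracked_files, active_globs):
    """Classify tracked files under canonical directories as covered or not."""
    covered = []
    uncovered = []
    for path in sorted(tracked_files):
        p = PurePosixPath(path)
        if not p.parts or p.parts[0] not in CANONICAL_DIRS:
            continue  # outside canonical source directories
        if any(fnmatchcase(p.name, glob) for glob in active_globs):
            covered.append(path)
        else:
            uncovered.append(path)
    # Report suffixes (or bare names for suffix-less files) lacking an analyzer.
    missing = sorted(
        {PurePosixPath(p).suffix or PurePosixPath(p).name for p in uncovered}
    )
    return {"covered": covered, "uncovered": uncovered, "missing_suffixes": missing}
```

A repository with src/lib.rs and scripts/build.lua but only Python and C analyzers installed would report [".lua", ".rs"] as missing suffixes, which is the signal needed to suggest additional analyzer families.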

Phase 20 — Persisted Plugin Inventory and File Ownership

Persist the plugin inventory used for one indexing run and record analyzer ownership of indexed files.

This phase should add metadata for:

  • active backend name and version
  • active analyzer names and versions
  • analyzer discovery metadata snapshot
  • per-file analyzer ownership for indexed files
  • whether the repository was fully covered at index time

This inventory becomes the durable source of truth for deciding whether an existing index still matches the currently installed plugin set.

Phase 21 — Plugin-Aware Staleness and Rebuild Policy

Make indexing detect when plugin availability changes the validity or completeness of the current index.

This phase should handle at least:

  • a new analyzer becoming available for previously uncovered files
  • an analyzer version change that should reindex files it owns
  • an analyzer being removed after it previously indexed files
  • a backend or analyzer inventory mismatch between the database and the current process

The first implementation may conservatively force a broader rebuild, but the policy must remain deterministic and explainable.
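
A conservative staleness check can compare the persisted inventory with the inventory of the current process and return human-readable reasons. The dictionary layout and the reason strings below are assumptions.

```python
def rebuild_reasons(stored, current):
    """Return a sorted-by-analyzer list of reasons the index is stale.

    Both arguments are hypothetical inventories of the form
    {"backend": (name, version), "analyzers": {name: version}}.
    An empty list means the stored index still matches the process.
    """
    reasons = []
    if stored["backend"] != current["backend"]:
        reasons.append(
            f"backend changed: {stored['backend']} -> {current['backend']}"
        )
    old = stored["analyzers"]
    new = current["analyzers"]
    for name in sorted(set(old) | set(new)):
        if name not in old:
            reasons.append(f"analyzer added: {name}")
        elif name not in new:
            reasons.append(f"analyzer removed: {name}")
        elif old[name] != new[name]:
            reasons.append(
                f"analyzer version changed: {name} {old[name]} -> {new[name]}"
            )
    return reasons
```

Returning reasons rather than a bare boolean keeps the policy explainable: any non-empty list can trigger a full rebuild at first, and later refinements can narrow the rebuild to the files owned by the affected analyzers.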

Phase 22 — Coverage Commands, Policy Flags, and Documentation

Expose the new coverage model clearly through CLI and documentation.

This phase should include:

  • a dedicated coverage inspection command or equivalent explain surface
  • index behavior that can warn on incomplete coverage by default
  • a strict mode such as --require-full-coverage that fails when canonical directories contain uncovered tracked files
  • dedicated plugin-author and operator documentation describing:
      • analyzer discovery metadata
      • coverage semantics
      • plugin-aware rebuild triggers
      • the distinction between partial and full repository coverage

This phase should leave contributors and plugin authors with a direct route to understand how plugin installation affects indexing completeness.
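
The difference between the default warning and the strict mode can be sketched as a small preflight function. The function name, the message wording, and the exit code 2 are assumptions. Only the --require-full-coverage flag name comes from this ADR.

```python
def coverage_preflight(uncovered, require_full_coverage=False):
    """Return (exit_code, message) for an indexing preflight check.

    uncovered is the list of tracked canonical files no analyzer claims.
    """
    if not uncovered:
        return 0, "coverage: full"
    listing = ", ".join(sorted(uncovered))
    if require_full_coverage:
        # Strict mode: incomplete coverage stops indexing before it starts.
        return 2, f"error: uncovered tracked files: {listing}"
    # Default mode: report partial coverage but let indexing continue.
    return 0, f"warning: partial coverage, uncovered tracked files: {listing}"
```

Keeping the decision in one pure function makes the policy easy to test independently of argument parsing.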

Execution Rules

  • Use a dedicated branch for the migration.
  • Make multiple commits, with at least one commit per phase.
  • Split large phases into smaller atomic commits when needed.
  • Keep tests and documentation in-scope for every phase.
  • Preserve deterministic behavior unless a later ADR explicitly changes it.

Phase Ledger

Mark each phase as work lands.

  • [x] Phase 1 — Branch and Architecture Skeleton
  • [x] Phase 2 — Characterization Tests
  • [x] Phase 3 — Core Contracts and Normalized Artifacts
  • [x] Phase 4 — SQLite Backend Encapsulation
  • [x] Phase 5 — Indexer Orchestration Refactor
  • [x] Phase 6 — Python Analyzer Extraction
  • [x] Phase 7 — Query Abstraction
  • [x] Phase 8 — Registries and Configuration
  • [x] Phase 9 — Second Analyzer Proof
  • [x] Phase 10 — Final Documentation Consolidation
  • [x] Phase 11 — C Analyzer Semantic Parity
  • [x] Phase 12 — File-Role Classification
  • [x] Phase 13 — Evidence-Based Ranking Fusion
  • [x] Phase 14 — Diversity-Aware Result Selection
  • [x] Phase 15 — Cross-Language Relationship Graph
  • [x] Phase 16 — Language-Specific Semantic Text Units
  • [x] Phase 17 — Intent-Aware Retrieval Planning
  • [x] Phase 18 — Analyzer-Declared Discovery Metadata
  • [x] Phase 19 — Canonical-Directory Coverage Audit
  • [x] Phase 20 — Persisted Plugin Inventory and File Ownership
  • [x] Phase 21 — Plugin-Aware Staleness and Rebuild Policy
  • [x] Phase 22 — Coverage Commands, Policy Flags, and Documentation

Notes

Expected follow-up ADR topics include:

  • one active backend per repository instance
  • multiple analyzers per indexing run
  • C include-graph semantics and header-to-source ownership rules
  • normalized artifact model and symbol identity
  • embedding persistence and invalidation ownership
  • query surfaces depending on backend contracts rather than backend internals

Phase 10 leaves the branch with:

  • architecture pages updated to reflect the implemented registry, backend, and analyzer model
  • contributor guidance reconciled with the architecture workflow
  • README references updated to the current capability set

Phases 11 through 13 extend that baseline with:

  • a tree-sitter-backed C analyzer with richer normalized call, declaration, include-kind, and semantic-text artifacts
  • deterministic file-role classification used by retrieval and explain output
  • explicit merge diagnostics for evidence families, reciprocal-rank fusion, merge-time role contribution, and final merged score

Phase 14 adds deterministic diversity selection across:

  • per-file caps
  • file-role caps
  • mixed-language caps so one language family cannot monopolize the primary context block when another indexed language is also available

Phase 15 adds a first-class include-graph slice for C through:

  • exact include-edge queries backed by persisted include artifacts
  • deterministic direct and transitive local-include expansion in ctx
  • explain-mode diagnostics showing when include-graph edges contributed to module expansion

Phase 16 completes language-specific semantic text units through:

  • C embedding payloads that combine signatures, declaration comments, include context, and header-to-source pairing context
  • Python callable embedding payloads that combine docstrings with module summaries, symbolic ownership, assertion presence, decorator names, and fixture or setup context

Phase 17 completes intent-aware retrieval planning through:

  • deterministic primary intent families for behavior, test, configuration, API-surface, and architecture or navigation queries
  • an explicit retrieval plan that owns channel routing and explain-mode diagnostics
  • planner-driven gating for docstring issue enrichment, include-graph expansion, and reference collection while preserving earlier retrieval contracts

Phase 18 replaces hard-coded scanner discovery with analyzer-declared metadata through:

  • LanguageAnalyzer.discovery_globs as the stable discovery contract
  • scanner discovery derived from active analyzer metadata for both Git-backed and filesystem-backed indexing
  • third-party analyzer validation that rejects entry points missing discovery metadata

Phase 19 adds canonical-directory coverage auditing through:

  • deterministic inspection of tracked files under src/, tests/, and scripts/
  • uncovered-file reporting when no active analyzer claims a canonical file
  • index summaries that surface partial repository coverage without yet making it fatal

Phase 20 persists plugin inventory and file ownership through:

  • analyzer ownership columns on files rows
  • backend runtime metadata stored in the database
  • analyzer inventory rows carrying version and discovery-glob snapshots
  • coverage-complete state recorded alongside the backend runtime snapshot

Phase 21 activates that persisted metadata in rebuild policy through:

  • unchanged-file reindexing when analyzer ownership no longer matches
  • automatic rebuilds when stored backend runtime inventory changes
  • automatic rebuilds when stored analyzer inventory changes

Phase 22 completes the operator-facing surface through:

  • a dedicated codira cov inspection command
  • strict indexing preflight via codira index --require-full-coverage
  • plugin and operator documentation describing partial versus full coverage

Phases 18 through 22 now provide the core indexing-side mechanics needed to make the plugin model index-aware rather than only discovery-aware:

  • analyzer-driven file discovery instead of hard-coded core suffixes
  • deterministic repository coverage auditing for canonical source directories
  • persisted plugin inventory and analyzer ownership metadata in the index
  • CLI and documentation surfaces that distinguish partial from full coverage