Decision 0021 - Authoritative Unicode Ontology
Date: 28/02/2026
Status: Accepted
Context
Fontshow historically maintained Unicode-related knowledge through:
- handwritten script ranges
- partially duplicated Unicode tables
- runtime-derived structures
- heuristic normalization layers
This caused several issues:
- divergence from Unicode standards
- maintenance burden
- hidden inconsistencies between modules
- difficulty auditing inference correctness
- non-deterministic evolution of Unicode data
During Step 0 of the Language Inference Overhaul, the project migrated to a standards-grounded Unicode model.
Decision
Fontshow adopts a generated, authoritative Unicode ontology.
All Unicode structural data is now produced from vendored standards:
- Unicode UCD:
  - Blocks.txt
  - Scripts.txt
- ISO-15924 registry
A deterministic generator:
scripts/generate_unicode_tables.py
produces:
src/fontshow/ontology/unicode_tables.py
This module is now the single source of truth for Unicode data.
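The generation step can be illustrated with a minimal sketch. It is not the actual content of scripts/generate_unicode_tables.py: the function names `parse_ucd_ranges` and `render_table` are hypothetical, and only the line format of Blocks.txt and Scripts.txt (`START..END; Value`, with optional `#` comments and single-code-point entries) is taken from the UCD. Mapping Scripts.txt names to ISO-15924 codes is left out here.

```python
import re
from typing import Iterable, List, Tuple

# Matches UCD data lines such as "0000..007F; Basic Latin" (Blocks.txt)
# and "00AA          ; Latin # Lo  FEMININE ORDINAL INDICATOR" (Scripts.txt).
_LINE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*([^#]+?)\s*(?:#.*)?$"
)

Range = Tuple[int, int, str]


def parse_ucd_ranges(lines: Iterable[str]) -> List[Range]:
    """Parse UCD range lines into sorted (start, end, value) tuples."""
    ranges: List[Range] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment-only lines
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized UCD line: {raw!r}")
        start_hex, end_hex, value = match.groups()
        start = int(start_hex, 16)
        end = int(end_hex, 16) if end_hex else start
        ranges.append((start, end, value))
    # Sorting makes the result independent of input order.
    return sorted(ranges)


def render_table(name: str, ranges: List[Range]) -> str:
    """Render ranges as Python source for a generated module (hypothetical layout)."""
    body = "".join(f"    (0x{s:04X}, 0x{e:04X}, {v!r}),\n" for s, e, v in ranges)
    return f"{name} = (\n{body})\n"
```

Because the parsed ranges are sorted before rendering, the same vendored input always yields byte-identical output, which is what makes regeneration auditable in a diff.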
Architectural invariants
- Unicode data MUST NOT be handwritten in runtime modules.
- Runtime code MUST NOT derive Unicode structural tables.
- All Unicode ontology changes occur via regeneration.
- Generated files MUST NOT be manually edited.
- unicode_tables.py represents a frozen snapshot of Unicode semantics.
Data layers
Unicode Standard
↓
Vendored datasets
↓
Generator
↓
unicode_tables.py ← authoritative ontology
↓
Runtime inference
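The bottom layer of this diagram might consume the generated data roughly as follows. This is a sketch under assumptions: the table name `SCRIPT_RANGES`, its tuple layout, and the helper `script_of` are hypothetical, and the ranges are illustrative values rather than a verbatim excerpt of unicode_tables.py.

```python
from bisect import bisect_right
from typing import Optional

# Hypothetical excerpt of a generated table: sorted, non-overlapping
# (start, end, script) ranges. Values are illustrative only.
SCRIPT_RANGES = (
    (0x0041, 0x005A, "LATN"),
    (0x0061, 0x007A, "LATN"),
    (0x0370, 0x0373, "GREK"),
    (0x0400, 0x0484, "CYRL"),
)

# Lookup index over range starts. A real generator could emit this too,
# so that runtime code only reads data and never builds tables itself.
_STARTS = [start for start, _, _ in SCRIPT_RANGES]


def script_of(code_point: int) -> Optional[str]:
    """Return the script of a code point, or None if no range covers it."""
    index = bisect_right(_STARTS, code_point) - 1
    if index < 0:
        return None  # code point lies before the first range
    _, end, script = SCRIPT_RANGES[index]
    return script if code_point <= end else None
```

The runtime side only reads the table; it holds no script ranges of its own, which is the separation the invariants above require.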
Consequences
Positive
- deterministic Unicode behavior
- auditability against standards
- simplified inference reasoning
- elimination of duplicated tables
- stable future refactors
Negative
- regeneration step required when upgrading Unicode version
- generator becomes critical infrastructure
Migration
Implemented in two commits:
- Commit 1: Introduced generator, ISO normalization, and dual-source bridge.
- Commit 2: Removed legacy tables and runtime derivations. unicode_tables.py became authoritative.
Future Work
Language inference redesign (Step 1) depends on the invariants above, in particular on unicode_tables.py being the single source of truth.
Script-Gated Language Inference (Follow-up Rule)
Language inference operates under a script-authoritative model.
Language inference no longer derives languages from Unicode blocks. Scripts are mandatory evidence.
When charset-derived script inference produces a non-empty set of scripts, candidate languages MUST be restricted to languages whose primary script belongs to the inferred script set.
Formally:
LANGUAGE_PRIMARY_SCRIPT(language) ∈ inferred_scripts
This rule applies only to charset-derived inference.
Symbolic fonts (emoji, icon, or symbol fonts) may produce no inferred scripts; in this case the script gate is intentionally disabled and legacy permissive behavior is retained.
Canonical Latin fallback is now defined in terms of inferred scripts (LATN-only) rather than Unicode block presence.
Future inference sources (e.g., name-based heuristics) may adopt the same constraint but require separate evaluation.
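The script gate for charset-derived inference can be sketched as below. Only the name LANGUAGE_PRIMARY_SCRIPT and the LATN code come from this document; the sample mapping, the function `gate_languages`, and the choice to drop languages with no recorded primary script are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical sample data: language tag -> primary script code.
LANGUAGE_PRIMARY_SCRIPT: Dict[str, str] = {
    "en": "LATN",
    "ru": "CYRL",
    "el": "GREK",
    "ar": "ARAB",
}


def gate_languages(
    candidates: Iterable[str], inferred_scripts: FrozenSet[str]
) -> List[str]:
    """Restrict candidate languages to those whose primary script was inferred."""
    candidate_list = list(candidates)
    if not inferred_scripts:
        # No inferred scripts (e.g. emoji or icon fonts): the gate is
        # disabled and the candidates pass through unchanged.
        return candidate_list
    # Assumption: a language without a recorded primary script is dropped.
    return [
        lang
        for lang in candidate_list
        if LANGUAGE_PRIMARY_SCRIPT.get(lang) in inferred_scripts
    ]
```

For example, `gate_languages(["en", "ru", "el"], frozenset({"CYRL", "GREK"}))` keeps only `"ru"` and `"el"`, while an empty script set returns all three candidates.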