Decision 0021 - Authoritative Unicode Ontology

Date: 28/02/2026
Status: Accepted

Context

Fontshow historically maintained Unicode-related knowledge through:

handwritten script ranges
partially duplicated Unicode tables
runtime-derived structures
heuristic normalization layers

This caused several issues:

divergence from Unicode standards
maintenance burden
hidden inconsistencies between modules
difficulty auditing inference correctness
non-deterministic evolution of Unicode data

During Step 0 of the Language Inference Overhaul, the project migrated to a standards-grounded Unicode model.

Decision

Fontshow adopts a generated, authoritative Unicode ontology.

All Unicode structural data is now produced from vendored standards:

Unicode UCD:
    Blocks.txt
    Scripts.txt
ISO 15924 registry

A deterministic generator:

scripts/generate_unicode_tables.py

produces:

src/fontshow/ontology/unicode_tables.py

This module is now the single source of truth for Unicode data.
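
As a rough illustration of the generation step, the sketch below parses UCD-style range lines (the `START..END ; Value # comment` layout used by Blocks.txt and Scripts.txt) and renders them as a Python tuple literal. The names `parse_ucd_ranges`, `render_table`, and `SCRIPT_RANGES`, as well as the output layout, are assumptions made for this example; the actual scripts/generate_unicode_tables.py may be organized differently.

```python
from typing import Iterable, List, Tuple

Range = Tuple[int, int, str]

# A few lines in the Scripts.txt layout, used only as sample input.
SAMPLE = [
    "# sample input",
    "0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR",
    "0370..0373    ; Greek # L&   [4] GREEK CAPITAL LETTER HETA..GREEK SMALL LETTER ARCHAIC SAMPI",
]


def parse_ucd_ranges(lines: Iterable[str]) -> List[Range]:
    """Parse 'START..END ; Value # comment' lines into sorted (first, last, value) tuples."""
    ranges = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        span, value = (part.strip() for part in line.split(";", 1))
        start, _, end = span.partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first  # single code point lines have no '..'
        ranges.append((first, last, value))
    # Sorting makes the result independent of input order, which keeps output deterministic.
    return sorted(ranges)


def render_table(name: str, ranges: Iterable[Range]) -> str:
    """Render ranges as the source text of a module-level tuple (hypothetical layout)."""
    rows = "".join(f"    (0x{a:04X}, 0x{b:04X}, {v!r}),\n" for a, b, v in ranges)
    return f"{name} = (\n{rows})\n"
```

Calling `render_table("SCRIPT_RANGES", parse_ucd_ranges(SAMPLE))` yields the same text on every run for the same vendored input, which is the property the decision relies on.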

Architectural invariants

Unicode data MUST NOT be handwritten in runtime modules.
Runtime code MUST NOT derive Unicode structural tables.
All Unicode ontology changes occur via regeneration.
Generated files MUST NOT be manually edited.
unicode_tables.py represents a frozen snapshot of Unicode semantics.
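
One way the "no manual edits" and "changes occur via regeneration" invariants could be checked mechanically is a drift test that compares freshly generated text with the committed file. This is a hypothetical sketch: this decision does not state that such a check exists, and `digest` and `check_no_drift` are names invented for the example.

```python
import hashlib


def digest(text: str) -> str:
    """Return a SHA-256 hex digest of the given source text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_no_drift(regenerated: str, committed: str) -> None:
    """Raise AssertionError if the committed file differs from generator output.

    Assumption: the caller obtains `regenerated` by running the generator in
    memory and `committed` by reading the checked-in unicode_tables.py.
    """
    if digest(regenerated) != digest(committed):
        raise AssertionError(
            "unicode_tables.py differs from generator output: "
            f"{digest(committed)[:12]} != {digest(regenerated)[:12]}; "
            "regenerate the file instead of editing it by hand"
        )
```

Run in CI, a check along these lines would fail whenever the generated module was edited manually or the vendored inputs changed without regeneration.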

Data layers

Unicode Standard
        ↓
Vendored datasets
        ↓
Generator
        ↓
unicode_tables.py   ← authoritative ontology
        ↓
Runtime inference
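
The last arrow in the diagram can be pictured as a pure lookup against the generated table. The sketch below assumes the table is a sorted tuple of non-overlapping `(first, last, script)` ranges keyed by upper-case ISO 15924 codes; the table excerpt, its shape, and the function names `script_of` and `infer_scripts` are illustrative assumptions, not the actual contents of unicode_tables.py.

```python
from bisect import bisect_right
from typing import FrozenSet, Iterable, Optional

# Hypothetical excerpt of a generated table (sorted, non-overlapping ranges).
SCRIPT_RANGES = (
    (0x0041, 0x005A, "LATN"),
    (0x0061, 0x007A, "LATN"),
    (0x0391, 0x03A1, "GREK"),
    (0x0400, 0x0481, "CYRL"),
)
_STARTS = [first for first, _, _ in SCRIPT_RANGES]


def script_of(codepoint: int) -> Optional[str]:
    """Return the script code for a code point, or None if no range covers it."""
    index = bisect_right(_STARTS, codepoint) - 1  # last range starting at or before it
    if index >= 0:
        _, last, script = SCRIPT_RANGES[index]
        if codepoint <= last:
            return script
    return None


def infer_scripts(codepoints: Iterable[int]) -> FrozenSet[str]:
    """Collect the scripts covered by a font's charset; runtime only reads the table."""
    return frozenset(s for s in map(script_of, codepoints) if s is not None)
```

The runtime side only reads `SCRIPT_RANGES`; it never builds or extends the table, in line with the invariants above.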

Consequences

Positive

deterministic Unicode behavior
auditability against standards
simplified inference reasoning
elimination of duplicated tables
stable future refactors

Negative

regeneration step required when upgrading Unicode version
generator becomes critical infrastructure

Migration

Implemented in two commits:

Commit 1: Introduced generator, ISO normalization, and dual-source bridge.
Commit 2: Removed legacy tables and runtime derivations. unicode_tables.py became authoritative.

Future Work

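The language inference redesign (Step 1) depends on the invariants above, in particular on unicode_tables.py remaining the single source of truth for Unicode data.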

Script-Gated Language Inference (Follow-up Rule)

Language inference operates under a script-authoritative model.

Language inference no longer derives languages from Unicode blocks. Whenever scripts can be inferred from the charset, they are mandatory evidence.

When charset-derived script inference produces a non-empty set of scripts, candidate languages MUST be restricted to languages whose primary script belongs to the inferred script set.

Formally:

LANGUAGE_PRIMARY_SCRIPT(language) ∈ inferred_scripts

This rule applies only to charset-derived inference.

Symbolic fonts (emoji, icon, or symbol fonts) may produce no inferred scripts; in this case the script gate is intentionally disabled and legacy permissive behavior is retained.
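
The gate and its exception for empty script sets can be sketched as a single filter. The contents of `LANGUAGE_PRIMARY_SCRIPT` below, the function name `apply_script_gate`, and the choice to drop languages with no known primary script are assumptions made for illustration; only the membership rule and the disabled-gate case come from the text above.

```python
from typing import AbstractSet, Iterable, List

# Hypothetical sample data: language code -> primary script code.
LANGUAGE_PRIMARY_SCRIPT = {
    "en": "LATN",
    "it": "LATN",
    "ru": "CYRL",
    "el": "GREK",
}


def apply_script_gate(
    candidates: Iterable[str], inferred_scripts: AbstractSet[str]
) -> List[str]:
    """Restrict candidate languages to those whose primary script was inferred."""
    if not inferred_scripts:
        # No scripts inferred (e.g. emoji or icon fonts): the gate is disabled
        # and all candidates pass through unchanged.
        return list(candidates)
    return [
        lang
        for lang in candidates
        # Assumption: a language missing from the mapping is filtered out.
        if LANGUAGE_PRIMARY_SCRIPT.get(lang) in inferred_scripts
    ]
```

For example, with inferred scripts `{"CYRL"}` the candidates `["en", "ru", "el"]` reduce to `["ru"]`, while an empty script set leaves the candidate list untouched.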

Canonical Latin fallback is now defined in terms of inferred scripts (LATN-only) rather than Unicode block presence.

Future inference sources (e.g., name-based heuristics) may adopt the same constraint but require separate evaluation.