Language normalization¶
Language normalization semantics¶
This document defines how Fontshow processes language identifiers extracted from font metadata.
Language handling is intentionally separated into two concerns:
- normalization
- validation
These concepts are related but not equivalent.
Normalization¶
Normalization is a best-effort transformation applied to language tags to improve consistency and downstream usability.
Normalization MAY include:
- canonical casing
- replacement of deprecated subtags
- removal of unsupported private-use subtags
- mapping of legacy identifiers to modern equivalents
Normalization:
- does not guarantee correctness
- does not enforce standards
- does not fail the pipeline
Its purpose is to reduce noise, not to validate input.
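The transformations listed above can be sketched as follows. This is a minimal illustration, not Fontshow's actual implementation: the function name normalize_tag and the DEPRECATED_LANGUAGES table are hypothetical, and the table holds only a small subset of the deprecated codes in the IANA Language Subtag Registry.

```python
# Hypothetical sketch of best-effort normalization; names are illustrative.

# Small subset of deprecated language subtags and their preferred values.
DEPRECATED_LANGUAGES = {"iw": "he", "in": "id", "ji": "yi"}


def normalize_tag(tag: str) -> str:
    """Return a cleaned-up tag. Never raises and never claims validity."""
    subtags = [s for s in tag.strip().split("-") if s]
    if not subtags:
        return tag  # nothing recognizable: hand the input back unchanged

    result = []
    for index, subtag in enumerate(subtags):
        lowered = subtag.lower()
        if index == 0:
            # Primary language subtag: lowercase, then map deprecated codes.
            result.append(DEPRECATED_LANGUAGES.get(lowered, lowered))
        elif lowered == "x":
            # A private-use sequence starts here: drop it and what follows.
            break
        elif len(subtag) == 4 and subtag.isalpha():
            result.append(subtag.title())  # script subtag, e.g. "Latn"
        elif len(subtag) == 2 and subtag.isalpha():
            result.append(subtag.upper())  # region subtag, e.g. "US"
        else:
            result.append(lowered)
    return "-".join(result)


print(normalize_tag("IW-latn-il"))      # he-Latn-IL
print(normalize_tag("en-us-x-custom"))  # en-US
```

The sketch returns a string for every input, including malformed ones, which mirrors the rule that normalization does not fail the pipeline and does not guarantee correctness.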
Validation¶
Validation determines whether a language tag is acceptable under a given policy.
Two validation modes exist:
Permissive (default)¶
- Invalid or deprecated tags are accepted
- Warnings may be emitted
- Processing continues
Strict (--strict-bcp47)¶
- Only tags that conform to BCP 47 (RFC 5646) are allowed
- Deprecated or malformed tags cause failure
- No silent normalization is performed
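The difference between the two modes can be sketched as a single check with a strict switch. The names check_tag and LanguageTagError are hypothetical, the pattern is a deliberately simplified subset of the RFC 5646 grammar (language, optional script, optional region), and the deprecated set is only illustrative.

```python
import re
import warnings

# Simplified well-formedness check: language[-script][-region].
# The full RFC 5646 grammar also allows variants, extensions, and
# private-use subtags, which this sketch omits.
_SIMPLE_TAG = re.compile(
    r"[a-z]{2,3}(-[a-z]{4})?(-([a-z]{2}|[0-9]{3}))?", re.IGNORECASE
)

# Illustrative subset of deprecated language subtags.
_DEPRECATED = {"iw", "in", "ji"}


class LanguageTagError(ValueError):
    """Hypothetical error raised in strict mode."""


def check_tag(tag: str, strict: bool = False) -> None:
    """Warn in permissive mode, raise in strict mode. Never alters the tag."""
    problem = None
    if not _SIMPLE_TAG.fullmatch(tag):
        problem = f"malformed language tag: {tag!r}"
    elif tag.split("-")[0].lower() in _DEPRECATED:
        problem = f"deprecated language tag: {tag!r}"

    if problem is None:
        return
    if strict:
        raise LanguageTagError(problem)  # strict: the tag causes failure
    warnings.warn(problem)               # permissive: warn and continue


check_tag("iw")  # permissive: emits a warning, processing continues
try:
    check_tag("iw", strict=True)
except LanguageTagError as exc:
    print(f"strict mode rejected the tag: {exc}")
```

In this sketch the strict switch would be driven by the --strict-bcp47 option. How Fontshow wires that option internally is not shown here.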
Semantic Validation¶
Semantic validation is performed after language normalization and operates on the fully normalized inventory.
Semantic validation:
- does not modify data
- does not perform normalization
- does not perform schema validation
- only inspects semantic correctness of values
Currently, semantic validation includes:
- detection of invalid or unknown language codes
- validation of normalized language identifiers
Semantic validation may emit warnings.
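A read-only pass of this kind might look like the sketch below. The inventory layout (a list of entries with "family" and "languages" keys), the function name semantic_warnings, and the KNOWN_LANGUAGES set are all assumptions made for illustration. They do not describe Fontshow's real data model or registry.

```python
from typing import Iterable, List, Mapping

# Hypothetical stand-in for a real language-code registry.
KNOWN_LANGUAGES = frozenset({"en", "de", "he", "ja", "zh"})


def semantic_warnings(inventory: Iterable[Mapping]) -> List[str]:
    """Inspect normalized language tags and report unknown codes.

    The inventory is only read, never modified.
    """
    found = []
    for entry in inventory:
        for tag in entry.get("languages", ()):
            primary = tag.split("-")[0].lower()
            if primary not in KNOWN_LANGUAGES:
                family = entry.get("family", "?")
                found.append(f"{family}: unknown language code {tag!r}")
    return found


inventory = [{"family": "Demo Sans", "languages": ["en-US", "qqq"]}]
for message in semantic_warnings(inventory):
    print(message)  # Demo Sans: unknown language code 'qqq'
```

Returning the messages instead of changing the entries keeps the pass consistent with the rule that semantic validation does not modify data.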
Design constraints¶
- Normalization must never imply correctness
- Validation must never silently modify data
- Enforcement must always be explicit
- Behavior must be observable and documented
Non-goals¶
- Automatic language inference
- Linguistic correctness guarantees
- Silent mutation of source metadata