Language normalization semantics

This document defines how Fontshow processes language identifiers extracted from font metadata.

Language handling is intentionally separated into two concerns:

  • normalization
  • validation

These concepts are related but not equivalent.


Normalization

Normalization is a best-effort transformation applied to language tags in order to improve consistency and downstream usability.

Normalization MAY include:

  • canonical casing
  • replacement of deprecated subtags
  • removal of unsupported private-use extensions
  • mapping of legacy identifiers to modern equivalents

Normalization:

  • does not guarantee correctness
  • does not enforce standards
  • does not fail the pipeline

Its purpose is to reduce noise, not to validate input.
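
As an illustration of the transformations listed above, the following is a minimal sketch and not Fontshow's actual implementation. The function name `normalize_language_tag` and both lookup tables are hypothetical; a real implementation would presumably derive such tables from the IANA Language Subtag Registry. Note that the function never raises and never claims the result is valid.

```python
# Hypothetical lookup tables, for illustration only.
DEPRECATED_LANGUAGE_SUBTAGS = {"iw": "he", "in": "id", "ji": "yi"}
LEGACY_TAGS = {"i-klingon": "tlh", "zh-min-nan": "nan"}


def normalize_language_tag(tag: str) -> str:
    """Best-effort cleanup of a language tag; never raises, never validates."""
    # Map whole legacy identifiers to their modern equivalents.
    legacy = LEGACY_TAGS.get(tag.lower())
    if legacy is not None:
        return legacy

    subtags = tag.split("-")

    # Drop a trailing private-use extension ("-x-..."), but leave tags
    # that are entirely private use ("x-...") alone.
    lowered = [subtag.lower() for subtag in subtags]
    if "x" in lowered[1:]:
        subtags = subtags[: lowered.index("x", 1)]

    result = []
    after_singleton = False
    for position, subtag in enumerate(subtags):
        if position == 0:
            # Primary language subtag: lowercase, replace if deprecated.
            language = subtag.lower()
            result.append(DEPRECATED_LANGUAGE_SUBTAGS.get(language, language))
        elif not after_singleton and len(subtag) == 4 and subtag.isalpha():
            result.append(subtag.title())  # script, e.g. "Hant"
        elif not after_singleton and len(subtag) == 2 and subtag.isalpha():
            result.append(subtag.upper())  # region, e.g. "US"
        else:
            result.append(subtag.lower())
        if len(subtag) == 1:
            # Subtags after a singleton (extensions) stay lowercase.
            after_singleton = True
    return "-".join(result)
```

Under these assumptions, `normalize_language_tag("IW-il")` returns `"he-IL"`, while an unrecognizable input such as `"not a tag"` is passed through unchanged instead of causing a failure.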


Validation

Validation determines whether a language tag is acceptable under a given policy.

Two validation modes exist:

Permissive (default)

  • Invalid or deprecated tags are accepted
  • Warnings may be emitted
  • Processing continues

Strict (--strict-bcp47)

  • Only tags that conform to BCP 47 (RFC 5646) are allowed
  • Deprecated or malformed tags cause failure
  • No silent normalization is performed
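
The difference between the two modes can be sketched as follows. This is an illustrative sketch only: `check_language_tag`, `LanguageTagError`, the deprecated-subtag set, and the simplified well-formedness pattern are all assumptions, not Fontshow's API. The pattern covers only language, script, region, and variant subtags and deliberately omits extlang, extensions, private use, and grandfathered tags.

```python
import re
import warnings

# Simplified, hypothetical well-formedness pattern (not the full RFC 5646 grammar).
_SIMPLE_LANGTAG = re.compile(
    r"[A-Za-z]{2,3}"                                # primary language
    r"(-[A-Za-z]{4})?"                              # optional script
    r"(-(?:[A-Za-z]{2}|[0-9]{3}))?"                 # optional region
    r"(-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*"  # optional variants
)

# Hypothetical sample of deprecated primary language subtags.
DEPRECATED_LANGUAGE_SUBTAGS = frozenset({"iw", "in", "ji"})


class LanguageTagError(ValueError):
    """Raised in strict mode when a tag is malformed or deprecated."""


def check_language_tag(tag: str, *, strict: bool = False) -> bool:
    """Return True if the tag is acceptable; the tag itself is never modified."""
    problem = None
    if _SIMPLE_LANGTAG.fullmatch(tag) is None:
        problem = f"malformed language tag: {tag!r}"
    elif tag.split("-")[0].lower() in DEPRECATED_LANGUAGE_SUBTAGS:
        problem = f"deprecated language subtag in: {tag!r}"

    if problem is None:
        return True
    if strict:
        # Strict mode: fail explicitly instead of repairing the tag.
        raise LanguageTagError(problem)
    # Permissive mode: warn and let processing continue.
    warnings.warn(problem, stacklevel=2)
    return False
```

In this sketch, `check_language_tag("iw")` emits a warning and returns `False`, whereas `check_language_tag("iw", strict=True)` raises `LanguageTagError`. Neither call rewrites the tag, which mirrors the rule that strict mode performs no silent normalization.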

Semantic validation

Semantic validation is performed after language normalization and operates on the fully normalized inventory.

Semantic validation:

  • does not modify data
  • does not perform normalization
  • does not perform schema validation
  • only inspects semantic correctness of values

Currently, semantic validation includes:

  • detection of invalid or unknown language codes
  • validation of normalized language identifiers

Semantic validation may emit warnings.
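
A read-only pass of this kind might look like the sketch below. The inventory shape (a sequence of mappings with `name` and `languages` keys), the function name `collect_semantic_warnings`, and the set of known language subtags are assumptions made for illustration; the actual inventory schema is not specified here.

```python
from typing import Any, Iterable, List, Mapping

# Hypothetical sample; a real check would consult a complete subtag registry.
KNOWN_LANGUAGE_SUBTAGS = frozenset({"de", "en", "he", "ja", "zh"})


def collect_semantic_warnings(inventory: Iterable[Mapping[str, Any]]) -> List[str]:
    """Inspect normalized language tags and report problems without changing data."""
    found: List[str] = []
    for entry in inventory:
        name = entry.get("name", "<unnamed>")
        for tag in entry.get("languages", ()):
            # Only the primary language subtag is inspected in this sketch.
            primary = tag.split("-")[0].lower()
            if primary not in KNOWN_LANGUAGE_SUBTAGS:
                found.append(f"{name}: unknown language code {tag!r}")
    return found
```

The function only reads from the inventory and returns messages, so the caller decides how to surface them (for example, as warnings) and the data itself stays untouched.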


Design constraints

  • Normalization must never imply correctness
  • Validation must never silently modify data
  • Enforcement must always be explicit
  • Behavior must be observable and documented

Non-goals

  • Automatic language inference
  • Linguistic correctness guarantees
  • Silent mutation of source metadata