Issue 68 - Loadability Batch Benchmark Results

Scope

Issue #68 evaluates whether the persisted LuaLaTeX loadability probe should run multiple candidate batches in parallel.

dump-fonts and parse-inventory expose bounded batch parallelism through --loadability-jobs. The benchmark path added for this issue exercises the same jobs parameter through:

scripts/benchmark_loadability_batches.sh light 1 2
scripts/benchmark_loadability_batches.sh medium 1 2
scripts/benchmark_loadability_batches.sh heavy 1 2

For a full local inventory, first generate a dedicated input inventory (kept ignored, like the benchmark results), then use Hyperfine to compare explicit job counts against it:

fontshow dump-fonts \
  --paths /path/to/fonts \
  --cache-dir tests/fixtures/benchmark_results/full-input-cache \
  --output tests/fixtures/full_loadability_benchmark_inventory.json

hyperfine \
  --warmup 1 \
  --runs 3 \
  --export-json tests/fixtures/benchmark_results/loadability-full-jobs.json \
  --command-name "loadability jobs=1" \
  "python scripts/run_loadability_probe.py tests/fixtures/full_loadability_benchmark_inventory.json --output /tmp/loadability-full-jobs-1.json --jobs 1" \
  --command-name "loadability jobs=2" \
  "python scripts/run_loadability_probe.py tests/fixtures/full_loadability_benchmark_inventory.json --output /tmp/loadability-full-jobs-2.json --jobs 2" \
  --command-name "loadability jobs=4" \
  "python scripts/run_loadability_probe.py tests/fixtures/full_loadability_benchmark_inventory.json --output /tmp/loadability-full-jobs-4.json --jobs 4" \
  --command-name "loadability jobs=8" \
  "python scripts/run_loadability_probe.py tests/fixtures/full_loadability_benchmark_inventory.json --output /tmp/loadability-full-jobs-8.json --jobs 8" \
  --command-name "loadability jobs=12" \
  "python scripts/run_loadability_probe.py tests/fixtures/full_loadability_benchmark_inventory.json --output /tmp/loadability-full-jobs-12.json --jobs 12"

The Hyperfine JSON exports are generated locally and are intentionally ignored rather than committed:

tests/fixtures/benchmark_results/loadability-light.json
tests/fixtures/benchmark_results/loadability-medium.json
tests/fixtures/benchmark_results/loadability-heavy.json
tests/fixtures/benchmark_results/loadability-full-jobs.json
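Because the exports are not committed, the result tables have to be rebuilt from a local run. The following is a minimal sketch for reading one export, assuming the usual Hyperfine JSON layout (a top-level `results` list whose entries carry `command`, `mean`, `stddev`, `user`, and `system` in seconds). The function names are hypothetical.

```python
import json
from pathlib import Path


def summarize(export_path):
    """Return (command, mean, stddev, user, system) tuples from a hyperfine JSON export."""
    data = json.loads(Path(export_path).read_text(encoding="utf-8"))
    return [
        (r["command"], r["mean"], r.get("stddev"), r["user"], r["system"])
        for r in data["results"]
    ]


def format_rows(rows):
    """Format summary tuples as tab-separated lines with times in seconds."""
    lines = []
    for command, mean, stddev, user, system in rows:
        # stddev can be missing or null when only a single run was measured.
        spread = "n/a" if stddev is None else f"{stddev:.3f} s"
        lines.append(f"{command}\t{mean:.3f} s\t{spread}\t{user:.3f} s\t{system:.3f} s")
    return lines
```

For example, `print("\n".join(format_rows(summarize("tests/fixtures/benchmark_results/loadability-full-jobs.json"))))` would print one line per benchmarked command once that file exists locally.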

Local Measurement Context

Date: 2026-04-18
Host: verona
Kernel: Linux 6.18.18-gentoo-dist
CPU: Intel Core i7-8700K, 6 cores / 12 threads
TeX engine: LuaHBTeX 1.18.0, TeX Live 2024 Gentoo Linux
Hyperfine settings: 1 warmup, 3 measured runs
Batch size: 32 candidates

Results

Profile	Fonts	Jobs	Mean	Stddev	User	System
light	8	1	1.486 s	0.029 s	1.369 s	0.115 s
light	8	2	1.429 s	0.158 s	1.330 s	0.098 s
medium	32	1	4.810 s	0.236 s	4.624 s	0.178 s
medium	32	2	4.847 s	0.258 s	4.672 s	0.160 s
heavy	72	1	10.660 s	0.265 s	10.222 s	0.396 s
heavy	72	2	6.412 s	0.228 s	10.998 s	0.393 s

Full Local Inventory Results

The full local inventory run used the same host and TeX context as the fixture measurements, but with the user's complete font tree.

Jobs	Mean	Stddev	User	System	Speedup vs serial
1	245.164 s	0.643 s	226.619 s	17.975 s	1.00x
2	127.385 s	1.601 s	234.325 s	18.076 s	1.92x
4	70.060 s	2.776 s	254.046 s	18.998 s	3.50x
8	47.287 s	1.578 s	335.482 s	22.840 s	5.18x
12	40.762 s	2.350 s	396.046 s	27.467 s	6.01x
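The speedup column can be recomputed from the mean times as serial mean divided by parallel mean. The sketch below also derives parallel efficiency (speedup divided by job count); that figure is not part of the original measurements and is shown only as a way to read the table.

```python
# Mean wall-clock seconds per job count, copied from the table above.
MEAN_SECONDS = {1: 245.164, 2: 127.385, 4: 70.060, 8: 47.287, 12: 40.762}


def scaling(means):
    """Map each job count to (speedup vs. serial, parallel efficiency)."""
    serial = means[1]
    return {
        jobs: (serial / mean, serial / mean / jobs)
        for jobs, mean in means.items()
    }


for jobs, (speedup, efficiency) in scaling(MEAN_SECONDS).items():
    print(f"jobs={jobs:>2}  speedup={speedup:.2f}x  efficiency={efficiency:.0%}")
```

On these numbers, efficiency falls from roughly 96% at jobs=2 to roughly 50% at jobs=12, which is consistent with the host having 6 physical cores behind its 12 threads.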

All replay outputs had the same SHA-256 digest:

524ba602d5b08cd82a480bc937861c96c01dfec793ca569518d0ccc20a4a9d28
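A digest comparison like this can be scripted with the standard library. This is a minimal sketch, not the project's own check; the function names are hypothetical and the file names follow the `--output` paths used in the Hyperfine command.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def all_identical(paths):
    """True when every listed file has the same digest (False for an empty list)."""
    return len({sha256_of(p) for p in paths}) == 1
```

For the full inventory run, the paths would be `[f"/tmp/loadability-full-jobs-{n}.json" for n in (1, 2, 4, 8, 12)]`.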

Interpretation

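The light profile (8 fonts) is too small to justify parallel scheduling: the 0.057 s difference between jobs=1 and jobs=2 is well inside the 0.158 s standard deviation of the jobs=2 run. The medium profile (32 fonts) has exactly one default-size candidate chunk, so jobs=2 cannot create useful parallel work and is effectively neutral (4.810 s versus 4.847 s).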

The heavy profile creates multiple candidate chunks, and jobs=2 cut wall-clock time from 10.660 s to 6.412 s, a speedup of about 1.66x on this machine. User CPU time rose from 10.222 s to 10.998 s, which is expected when two LuaLaTeX processes run concurrently. System time remained stable (0.396 s versus 0.393 s), and this run did not expose TeX-cache failures.
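The chunk argument can be made concrete with a small model. This sketch assumes one loadability candidate per font and a probe cost proportional to the number of candidates in a chunk; both are simplifying assumptions, not measured facts, and the function names are hypothetical.

```python
BATCH_SIZE = 32  # batch size reported in the measurement context


def chunk_sizes(candidates, batch_size=BATCH_SIZE):
    """Split a candidate count into full chunks plus one optional remainder chunk."""
    full, rest = divmod(candidates, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def ideal_speedup(candidates, jobs, batch_size=BATCH_SIZE):
    """Upper bound on speedup when chunk cost is proportional to chunk size."""
    if candidates <= 0 or jobs <= 1:
        return 1.0
    loads = [0] * jobs
    # Chunks arrive largest-first, so greedily give each one to the least-loaded worker.
    for size in chunk_sizes(candidates, batch_size):
        loads[loads.index(min(loads))] += size
    return candidates / max(loads)


for profile, fonts in (("light", 8), ("medium", 32), ("heavy", 72)):
    print(profile, chunk_sizes(fonts), f"{ideal_speedup(fonts, jobs=2):.2f}x")
```

Under this model the light and medium profiles each produce a single chunk and cannot go faster than 1.00x, while the heavy profile splits into chunks of 32, 32, and 8 candidates with a 1.80x ceiling at jobs=2. The measured 1.66x sits below that ceiling.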

The full local inventory shows near-linear scaling through jobs=4 (3.50x), continued useful scaling at jobs=8 (5.18x), and a smaller but still real gain at jobs=12 (6.01x). The jobs=12 result is the fastest on this 12-thread machine, reducing wall-clock time from about 245 s to about 41 s while preserving byte-identical output. User and system CPU time both increase at higher job counts (user time grows from about 227 s at jobs=1 to about 396 s at jobs=12), so jobs=12 is suitable when throughput is the priority and the machine can be dedicated to the run.

Recommendation

Use jobs=4 as the bounded default for CLI loadability probing. The measured benefit appears only once an inventory has more than one loadability chunk, and the parallel path remains workload- and machine-dependent.

For user-facing controls, keep the setting bounded:

default: jobs=4
smallest value with a measured benefit: jobs=2
recommended full-inventory starting points: jobs=4 or jobs=8
use jobs=12 only when benchmarked locally and the machine can absorb the CPU load
only apply parallelism when candidate count exceeds the configured batch size
require deterministic result collation by candidate index
document TeX-cache contention as the main operational risk
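Two of the controls above, applying parallelism only when there is more than one chunk and collating results deterministically by candidate index, can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `probe_chunk` is a hypothetical stand-in for one LuaLaTeX batch run, and a thread pool is used on the assumption that the real work happens in a subprocess.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def probe_chunk(chunk):
    """Hypothetical stand-in for one batch probe; returns one result per candidate."""
    return [f"checked:{name}" for name in chunk]


def run_probe(candidates, jobs=4, batch_size=32, probe=probe_chunk):
    """Probe all candidates and return results in original candidate order."""
    chunks = chunked(candidates, batch_size)
    if jobs <= 1 or len(chunks) <= 1:
        # Serial path: a single chunk gives the pool nothing to parallelize.
        per_chunk = [probe(chunk) for chunk in chunks]
    else:
        with ThreadPoolExecutor(max_workers=min(jobs, len(chunks))) as pool:
            # Executor.map yields results in submission order, not completion
            # order, so collation stays deterministic.
            per_chunk = list(pool.map(probe, chunks))
    return [result for results in per_chunk for result in results]
```

With this shape, `run_probe(names, jobs=1)` and `run_probe(names, jobs=4)` return the same list for the same input, which is the property the SHA-256 comparison checks on the real outputs.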