Tokenizer design significantly impacts language model performance,
yet evaluating tokenizer quality remains challenging. While text compression has emerged as a common intrinsic metric, recent work questions its reliability as a quality indicator. We investigate whether evaluating tokenizers on smaller models (350M parameters) reliably predicts their impact at larger scales (2.7B parameters).
Through experiments with established tokenizers from widely adopted language models, we find that tokenizer choice minimally impacts English tasks but yields significant, scale-consistent differences in machine translation performance.
Based on these findings, we propose additional intrinsic metrics that correlate more strongly with downstream performance than text compression.
We combine these metrics into an evaluation framework that enables more reliable intrinsic tokenizer comparisons.
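To make the text-compression metric concrete, the sketch below computes bytes per token over a sample corpus, a common way to operationalize compression (higher means fewer tokens per byte of text). This is a minimal illustration assuming the Hugging Face `transformers` library; the `gpt2` tokenizer and the sample text are illustrative placeholders, not the paper's experimental setup.

```python
# Minimal sketch: text compression as an intrinsic tokenizer metric,
# measured as UTF-8 bytes per token (higher = stronger compression).
# Assumes Hugging Face `transformers`; tokenizer choice is illustrative.
from transformers import AutoTokenizer

def bytes_per_token(tokenizer, texts):
    # Total corpus size in bytes divided by total token count.
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(
        len(tokenizer.encode(t, add_special_tokens=False)) for t in texts
    )
    return total_bytes / total_tokens

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
sample = ["Tokenizer design significantly impacts language model performance."]
print(f"bytes/token: {bytes_per_token(tokenizer, sample):.2f}")
```

Comparing this single number across tokenizers is exactly the intrinsic evaluation whose reliability the paper questions; the proposed framework supplements it with metrics that track downstream performance more closely.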
- † Work done while at Apple
- ‡ University of Copenhagen & ROCKWOOL Foundation Research Unit