What is Tokenization Drift and How to Fix It?
Tokenization drift occurs when small changes in input formatting cause unpredictable shifts in model behavior, and it can be addressed through automated prompt optimization.
A model's performance can degrade suddenly, without any changes to the data, pipeline, or logic. The root cause often lies in how the input is tokenized. Before processing text, a model converts it into token IDs, and even minor formatting differences can produce entirely different token sequences.
This phenomenon is known as tokenization drift. When a model's input deviates from the learned patterns, it is no longer operating within its familiar distribution. The result isn't confusion; it's a model doing its best on inputs it was never optimized to handle.
For instance, during instruction tuning, models learn not only tasks but also the structure in which those tasks are presented. The GPT-2 tokenizer, which uses Byte-Pair Encoding (the same family of scheme, with different vocabularies, behind GPT-4, LLaMA, and Mistral), can be used to demonstrate how small formatting changes affect tokens. A simple test with seven words, with and without a leading space, shows that not a single pair produces the same token IDs.
This means the model doesn't just see a different ID; it sees a different sequence length, which shifts how attention is computed for everything that follows. To measure drift across prompts, a simple metric can be built. By creating variations of a standard SFT prompt format and measuring token overlap, it becomes clear that even small formatting changes can significantly alter the token sequence.
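One way to build such a drift metric is a Jaccard-style overlap between the token-ID sequences of two prompt variants. The exact formula and the sample IDs below are assumptions for illustration; the article only says "token overlap":

```python
# Sketch of a token-overlap drift metric (Jaccard over token-ID multisets).
# Lower overlap = more tokenization drift between two prompt formats.
from collections import Counter

def token_overlap(ids_a, ids_b):
    """Jaccard overlap of two token-ID sequences, counting multiplicity."""
    a, b = Counter(ids_a), Counter(ids_b)
    inter = sum((a & b).values())   # tokens shared by both variants
    union = sum((a | b).values())   # all tokens across both variants
    return inter / union if union else 1.0

# Hypothetical token IDs for two formatting variants of the same prompt.
original = [50, 12, 873, 11, 402]
variant  = [50, 13, 874, 11, 402]   # two IDs shifted by a formatting change
print(token_overlap(original, variant))  # 3 shared of 7 total -> ~0.43
```

An overlap of 1.0 means the two formats tokenize identically; anything lower quantifies how far the variant has drifted from the original sequence.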
Automated Prompt Optimization (APO) can be used to pick formats that keep inputs consistent and reliable. In APO, a small validation set is used to test multiple prompt templates that differ in how closely they match the original SFT format. Templates that deviate more get lower effective accuracy, while those closer to the original format perform better.
The results are clear: most variants perform poorly, while the one that closely matches the SFT template stands out with high effectiveness. The APO loop simply picks the best-performing template automatically, ensuring that model performance remains stable. This approach can be used at scale with actual model outputs, testing multiple prompt formats, scoring them, and locking in the one that keeps performance stable.
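The selection loop described above can be sketched in a few lines. The template strings, the validation example, and the scoring numbers here are all hypothetical stand-ins; in a real APO loop the scorer would run the model on the validation set and return measured accuracy:

```python
# Sketch of an APO selection loop: score candidate prompt templates on a
# validation set and lock in the best one. All names and numbers below are
# illustrative assumptions, not from the article.

templates = {
    "sft_match":   "### Instruction:\n{q}\n\n### Response:\n",
    "no_newlines": "### Instruction: {q} ### Response:",
    "plain":       "{q}\nAnswer:",
}

validation_set = [("What is 2+2?", "4")]  # toy validation example

def evaluate(name, template, examples):
    """Placeholder scorer: a real implementation would format each example
    with `template`, query the model, and compute accuracy."""
    fake_accuracy = {"sft_match": 0.92, "no_newlines": 0.55, "plain": 0.48}
    return fake_accuracy[name]

# Pick the template with the highest validation score.
best = max(templates, key=lambda n: evaluate(n, templates[n], validation_set))
print(best)  # the loop locks in the format closest to the SFT template
```

The key design point is that the loop never inspects the templates directly; it only compares measured scores, so the format closest to the model's training distribution wins automatically.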
Source: MarkTechPost