Perplexity AI Open-Sources Unigram Tokenizer That Achieves 5x Lower p50 Latency Than Hugging Face Tokenizers Crate
Perplexity AI's research team reimplemented their Unigram tokenizer from scratch in Rust and open-sourced the code, achieving 5x lower p50 latency than the Hugging Face tokenizers crate.

Perplexity AI's research team has reimplemented their Unigram tokenizer from scratch in Rust and open-sourced the code in pplx-garden, their inference technology repository. The new encoder cuts p50 latency by roughly 5x versus the Hugging Face tokenizers crate, ~2x versus SentencePiece (C++), and ~1.5x versus IREE's tokenizer (C), with zero steady-state heap allocations.

In production, the tokenizer reduced CPU utilization in Perplexity's inference stack by 5-6x and shaved double-digit milliseconds off reranker latency. LLM inference cost is typically framed around GPU work: KV caches, attention kernels, expert routing. However, smaller models, such as embedding models, classifiers, and rerankers, tell a different story.

These models are two to three orders of magnitude smaller than frontier transformers. A reranker scoring hundreds of candidate documents per request is a clear example. With a small model, GPU compute often finishes in single-digit milliseconds, yet every input still passes through CPU-side tokenization first. When batch sizes are large, tokenization becomes a meaningful fraction of total request latency.

Perplexity's work targets XLM-RoBERTa, a model with a 250K-token Unigram vocabulary trained with SentencePiece. Fine-tuned RoBERTa-family encoders are a common production choice for ranking, retrieval, and similarity tasks. The Unigram tokenization algorithm was introduced by Kudo in 2018 and is implemented in SentencePiece. It frames segmentation as a most-probable-path problem: among all ways of splitting the input into vocabulary pieces, pick the one with the highest total probability.

That best path is found with the Viterbi algorithm, a dynamic programming technique from 1967. Perplexity used the Hugging Face tokenizers crate as the benchmark reference and optimized from there in stages. To isolate how much of the cost came from unnecessary work alone, the team first wrote a zero-allocation port of the reference. That port by itself cut p50 latency to 155 µs at 514 tokens, down from 326 µs in the reference.

Further optimizations, including replacing the HashMap-based trie with a double-array trie and adding a per-node bitmap, resulted in a 4.8x wall-clock reduction from the original reference. The final Perplexity encoder produces token-exact output against the reference and uses rayon to parallelize across cores in production.
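
To make the most-probable-path framing concrete, here is a minimal Rust sketch of Viterbi segmentation over a Unigram vocabulary. It is an illustration under stated assumptions, not Perplexity's code: the names `toy_vocab` and `viterbi_segment` and the log-probabilities are hypothetical, lookups go through a plain `HashMap` rather than a double-array trie, and it allocates freely. Unknown-token handling and normalization are left out.

```rust
use std::collections::HashMap;

/// Hypothetical toy vocabulary: piece -> log-probability (made-up values).
fn toy_vocab() -> HashMap<&'static str, f64> {
    HashMap::from([
        ("un", -2.0),
        ("uni", -3.0),
        ("i", -3.0),
        ("gram", -2.5),
        ("g", -4.0),
        ("r", -4.0),
        ("a", -4.0),
        ("m", -4.0),
    ])
}

/// Returns the segmentation of `text` with the highest total log-probability,
/// or `None` if no sequence of vocabulary pieces covers the whole text.
fn viterbi_segment(text: &str, vocab: &HashMap<&str, f64>) -> Option<Vec<String>> {
    // Byte offsets of every character boundary, including the end of the text.
    let mut bounds: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    bounds.push(text.len());
    let n = bounds.len() - 1; // number of characters

    // best[i]: best score of any segmentation of the first i characters.
    let mut best = vec![f64::NEG_INFINITY; n + 1];
    // back[i]: character index where the last piece on that best path starts.
    let mut back = vec![0usize; n + 1];
    best[0] = 0.0;

    for end in 1..=n {
        for start in 0..end {
            if !best[start].is_finite() {
                continue; // no valid segmentation reaches `start`
            }
            let piece = &text[bounds[start]..bounds[end]];
            if let Some(&logp) = vocab.get(piece) {
                let score = best[start] + logp;
                if score > best[end] {
                    best[end] = score;
                    back[end] = start;
                }
            }
        }
    }

    if !best[n].is_finite() {
        return None;
    }

    // Walk the back-pointers from the end to recover the pieces.
    let mut pieces = Vec::new();
    let mut end = n;
    while end > 0 {
        let start = back[end];
        pieces.push(text[bounds[start]..bounds[end]].to_string());
        end = start;
    }
    pieces.reverse();
    Some(pieces)
}

fn main() {
    let vocab = toy_vocab();
    println!("{:?}", viterbi_segment("unigram", &vocab));
}
```

With this toy vocabulary, "unigram" segments as "uni" + "gram" (total score -5.5), which beats "un" + "i" + "gram" (-7.5). The inner loop here tries every start position and hashes each candidate substring. This sketch assumes, without the source spelling it out, that this per-candidate lookup is the kind of work a trie-based prefix search is meant to replace.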
Source: MarkTechPost