LightSeek Foundation Unveils TokenSpeed, an Open-Source LLM Inference Engine
The LightSeek Foundation has released TokenSpeed, an open-source LLM inference engine designed to match TensorRT-LLM performance for agentic workloads.

As AI deployment continues to expand, inference efficiency has become a critical bottleneck. The increasing demand on inference engines powering agentic coding systems, such as Claude Code, Codex, and Cursor, has led researchers at the LightSeek Foundation to develop TokenSpeed, an open-source LLM inference engine released under the MIT license. TokenSpeed is designed to meet the unique demands of agentic workloads, which differ significantly from typical chatbot interactions.
Unlike chatbot sessions, coding-agent sessions involve contexts exceeding 50K tokens and conversations spanning dozens of turns. This places simultaneous pressure on two key metrics: per-GPU TPM (tokens per minute), a measure of aggregate throughput, and per-user TPS (tokens per second), a measure of interactivity. TokenSpeed's goal is to maximize per-GPU TPM while holding a per-user TPS floor, typically 70 TPS and in some deployments 200 TPS or higher.
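To make the trade-off concrete, the following sketch (illustrative only, not TokenSpeed code) shows how the two metrics relate when a GPU's decode batch is shared across users: growing the batch raises per-GPU TPM but spreads generation bandwidth across more users, pulling per-user TPS down toward the floor.

```python
from dataclasses import dataclass

@dataclass
class BatchStats:
    """Decode statistics for one GPU over a sampling window (hypothetical)."""
    tokens_generated: int   # total tokens produced on this GPU in the window
    window_seconds: float   # length of the sampling window
    active_users: int       # concurrent requests sharing the GPU

def per_gpu_tpm(s: BatchStats) -> float:
    """Aggregate throughput: tokens per minute produced by one GPU."""
    return s.tokens_generated / s.window_seconds * 60.0

def per_user_tps(s: BatchStats) -> float:
    """Interactivity: tokens per second each user sees, assuming the
    batch's output is shared evenly across active requests."""
    return s.tokens_generated / s.window_seconds / s.active_users

# Example: 48 concurrent users, 60,000 tokens generated in 15 seconds.
stats = BatchStats(tokens_generated=60_000, window_seconds=15.0, active_users=48)
print(f"per-GPU TPM:  {per_gpu_tpm(stats):,.0f}")   # 240,000 TPM
print(f"per-user TPS: {per_user_tps(stats):.1f}")   # ~83 TPS, above a 70 TPS floor
```

Adding more users to this batch would raise TPM further but push per-user TPS below 70, which is exactly the constraint TokenSpeed's scheduler is built to respect.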
TokenSpeed's architecture is built around five design pillars: a compiler-backed modeling mechanism for parallelism, a high-performance scheduler, safety restrictions on KV resource reuse, a pluggable layered kernel system supporting heterogeneous accelerators, and SMG integration for a low-overhead CPU-side request entrypoint. The modeling layer uses a local SPMD (Single Program, Multiple Data) approach: developers specify I/O placement annotations at module boundaries, and a lightweight static compiler automatically generates the required collective operations during model construction.
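The article does not show this API, but a toy version of such a placement-annotation compiler pass might look like the following. The Placement names, module annotations, and collective-insertion rule are all assumptions for illustration, not TokenSpeed's actual interface.

```python
from dataclasses import dataclass
from enum import Enum

class Placement(Enum):
    REPLICATED = "replicated"  # every rank holds the full tensor
    SHARDED = "sharded"        # tensor split along one dimension across ranks
    PARTIAL = "partial"        # every rank holds a partial sum

@dataclass(frozen=True)
class Module:
    name: str
    in_placement: Placement    # placement the module expects for its input
    out_placement: Placement   # placement it produces for its output

def collective_for(produced: Placement, expected: Placement) -> str:
    """Pick the collective that reconciles a producer/consumer mismatch."""
    if produced is expected:
        return "no_op"
    if produced is Placement.PARTIAL and expected is Placement.REPLICATED:
        return "all_reduce"    # sum partial results across ranks
    if produced is Placement.SHARDED and expected is Placement.REPLICATED:
        return "all_gather"    # reassemble the full tensor on every rank
    raise ValueError(f"cannot reconcile {produced} -> {expected}")

def compile_model(modules: list[Module]) -> list[str]:
    """Static pass over module boundaries, run once at model construction."""
    return [
        f"{a.name} -> {b.name}: {collective_for(a.out_placement, b.in_placement)}"
        for a, b in zip(modules, modules[1:])
    ]

# Classic tensor-parallel MLP: a column-parallel up-projection feeds a
# row-parallel down-projection, whose partial sums need one all-reduce.
mlp = [
    Module("up_proj", Placement.REPLICATED, Placement.SHARDED),
    Module("down_proj", Placement.SHARDED, Placement.PARTIAL),
    Module("norm", Placement.REPLICATED, Placement.REPLICATED),
]
print(compile_model(mlp))
# ['up_proj -> down_proj: no_op', 'down_proj -> norm: all_reduce']
```

The appeal of this style is that model authors declare only local placement facts; the communication schedule falls out mechanically rather than being hand-written per parallelism configuration.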
The scheduler in TokenSpeed makes a structural split between a control plane and an execution plane. The control plane, implemented in C++, enforces safe resource management through a finite-state machine expressed in the type system, so errors in KV cache management are caught earlier, before they can surface as runtime failures.
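A rough rendering of that typestate idea, shown in Python for brevity (the real control plane is C++, where the invalid transitions below would fail to compile; all names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FreeBlock:
    """A KV cache block not owned by any request."""
    block_id: int

    def allocate(self, request_id: str) -> "ActiveBlock":
        return ActiveBlock(self.block_id, request_id)

@dataclass(frozen=True)
class ActiveBlock:
    """A KV cache block owned by exactly one in-flight request."""
    block_id: int
    request_id: str

    def release(self) -> "FreeBlock":
        return FreeBlock(self.block_id)

block = FreeBlock(block_id=7).allocate("req-42")  # FreeBlock -> ActiveBlock
block = block.release()                           # ActiveBlock -> FreeBlock
# block.release()  # rejected by a static checker: FreeBlock has no release()
```

Because each state is a distinct type, a double-free or use-after-free in KV management becomes a type error caught by static analysis rather than a corruption bug discovered mid-request.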
The execution plane, implemented in Python, allows for faster feature iteration and lower cognitive load for developers. TokenSpeed's kernel layer treats GPU kernels as a first-class modular subsystem, providing a portable public API, a centralized registry and selection model, and an extensible plugin mechanism, making the engine usable on heterogeneous accelerators rather than NVIDIA hardware alone.
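A registry-plus-selection design of this kind is straightforward to sketch; the decorator API, backend names, and priority policy below are assumptions for illustration, not TokenSpeed's actual interface.

```python
from typing import Callable, Dict, Tuple

# (op name, backend name) -> (priority, kernel implementation)
_REGISTRY: Dict[Tuple[str, str], Tuple[int, Callable]] = {}

def register_kernel(op: str, backend: str, priority: int = 0):
    """Plugin entry point: vendors register kernels for an (op, backend) pair."""
    def decorator(fn: Callable) -> Callable:
        key = (op, backend)
        if key not in _REGISTRY or priority > _REGISTRY[key][0]:
            _REGISTRY[key] = (priority, fn)
        return fn
    return decorator

def select_kernel(op: str, backend: str) -> Callable:
    """Central selection: prefer a backend-specific kernel, else fall back
    to a portable reference implementation."""
    for key in ((op, backend), (op, "reference")):
        if key in _REGISTRY:
            return _REGISTRY[key][1]
    raise KeyError(f"no kernel registered for {op!r}")

@register_kernel("paged_attention", "reference")
def paged_attention_ref(*args):
    ...  # portable fallback that runs on any backend

@register_kernel("paged_attention", "cuda", priority=10)
def paged_attention_cuda(*args):
    ...  # vendor-tuned implementation for NVIDIA GPUs

kernel = select_kernel("paged_attention", "cuda")    # picks the CUDA kernel
fallback = select_kernel("paged_attention", "rocm")  # falls back to reference
```

Keeping registration and selection centralized is what lets a new accelerator vendor ship kernels as an out-of-tree plugin without modifying the engine core.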
The team has also developed one of the fastest MLA (Multi-head Latent Attention) kernels for agentic workloads on NVIDIA Blackwell. Evaluations conducted with the EvalScope team compared TokenSpeed against TensorRT-LLM, the current state of the art on NVIDIA Blackwell, using SWE-smith traces that mirror production coding-agent traffic. The test model used was Kimi K2.5.
For coding agents running above 70 TPS per user, TokenSpeed's best configuration, Attention TP4 + MoE TP4, dominated TensorRT-LLM across the entire Pareto frontier: roughly 9% faster in the min-latency case (batch size 1) and roughly 11% higher throughput at around 100 TPS per user.
Source: MarkTechPost