Meet Flash-KMeans: An IO-Aware, Exact K-Means That Runs Over 200× Faster Than FAISS on GPUs
k-means has been an offline tool for decades.

k-means has been an offline tool for decades. You run it once to preprocess data, then move on. A team of researchers from UC Berkeley and UT Austin released Flash-KMeans, a new open-source library that targets a different setting. Modern AI pipelines now call k-means inside training and inference loops. At that frequency, latency per call matters more than theoretical FLOPs.
Flash-KMeans is an IO-aware implementation of standard Lloyd’s k-means. It does not change the math, and it does not approximate. It only restructures how the algorithm moves data on a GPU. On an NVIDIA H200, the research team reported up to 17.9× end-to-end speedup over the best baseline. Against NVIDIA cuML they report 33×. Against FAISS they report over 200×.
Flash-KMeans is a batched k-means library written in Triton GPU kernels. It ships under Apache 2.0 and installs with pip install flash-kmeans .
The output is mathematically identical to standard Lloyd’s k-means. The speedup comes from kernel-level dataflow, not from skipping work. That separates it from algorithmic methods like triangle-inequality pruning or coreset sampling.
A standard Lloyd iteration has two stages. The assignment stage computes each point’s distance to every centroid, then picks the nearest. The update stage averages the points in each cluster to form new centroids. Both stages are simple arithmetic. On GPUs, both are bottlenecked by memory, not compute.
The first bottleneck is the assignment stage. Standard code builds a full distance matrix D of shape N×K in High Bandwidth Memory (HBM). It writes the matrix, then reads it back to run argmin. For N=65536, K=1024, d=128, B=32, the distance math takes 2.6ms. Writing and consuming D takes about 23ms. The matrix is the cost, not the arithmetic.
Flash-KMeans replaces this with FlashAssign. The design borrows from FlashAttention. FlashAssign streams tiles of points and centroids from HBM into on-chip SRAM. It fuses distance computation with an online argmin. The full N×K matrix is never materialized. This cuts the dominant IO complexity from O(NK) to O(Nd + Kd). At the kernel level, FlashAssign reaches up to 21.2×. In one case it cut assignment from 122.5ms to 5.8ms.
The second bottleneck is the centroid update stage. Standard code uses scatter-style atomic adds. Each thread adds its point into a shared sum buffer keyed by cluster id. Many threads hit the same ‘hot’ cluster at once. That causes atomic contention and hardware serialization. The research team measured only 50 GB/s effective bandwidth here on an H200.
Flash-KMeans replaces this with Sort-Inverse Update. It sorts the 1D assignment vector by cluster id using argsort. Identical cluster ids then form contiguous segments. Each thread block reduces a segment on-chip, then issues one atomic add per segment. The heavy point matrix is never physically permuted. Atomic operations drop from ( O ( ( K + N B N ) d ) ) (O((K + \frac{N}{B_N})d)) . The kernel reaches up to 6.3×.
Source: MarkTechPost