Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication
GPU communication overhead is a measurable bottleneck in production AI workloads.

GPU communication overhead is a significant bottleneck in AI workloads. According to data cited by the mKernel project, it can consume 43.6% of the forward pass and 32% of end-to-end training time. The problem is more pronounced in Mixture-of-Experts (MoE) models, where inter-device communication can account for up to 47% of total execution time.
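To put those percentages in perspective, here is a back-of-the-envelope sketch in Python. It applies Amdahl's law under an idealized assumption that is not from the mKernel project: that the quoted communication share could be hidden completely. The function name `max_speedup` is purely illustrative, and the numbers are ceilings, not measured results.

```python
def max_speedup(comm_fraction: float) -> float:
    """Upper bound on speedup if `comm_fraction` of run time vanished entirely."""
    if not 0.0 <= comm_fraction < 1.0:
        raise ValueError("comm_fraction must be in [0, 1)")
    return 1.0 / (1.0 - comm_fraction)


# Fractions quoted in the article; the bound assumes perfect hiding (idealized).
cases = [
    ("forward pass", 0.436),
    ("end-to-end training", 0.32),
    ("MoE execution", 0.47),
]
for label, fraction in cases:
    print(f"{label}: at most {max_speedup(fraction):.2f}x")
# forward pass: at most 1.77x
# end-to-end training: at most 1.47x
# MoE execution: at most 1.89x
```

Real gains would be smaller, since some communication cannot be overlapped with compute.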
To address this challenge, researchers from UC Berkeley's UCCL project have developed mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute into a single kernel. The approach is GPU-driven: the GPU itself triggers transfers, with no host in the control path. By contrast, the standard model for multi-GPU communication is host-driven, meaning the CPU runs the control path and calls into libraries such as NCCL or NVSHMEM.
The host-driven model has two major limitations. First, CPUs are not scaling with GPU compute, which leaves microsecond-scale host orchestration overhead. Second, host-driven systems overlap compute and communication only at coarse kernel boundaries, which rules out finer-grained overlap at the tile or chunk level.
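The second limitation can be illustrated with a toy timing model in Python. This is a minimal sketch, not a description of how mKernel or NCCL schedule work. It assumes uniform per-tile compute and transfer times, a single transfer channel, and a worst-case baseline with no overlap at all. The function names and the microsecond figures are hypothetical.

```python
def serialized_time(num_tiles: int, compute_us: int, comm_us: int) -> int:
    """Baseline: all compute finishes, then all data is sent (no overlap)."""
    return num_tiles * (compute_us + comm_us)


def tile_pipelined_time(num_tiles: int, compute_us: int, comm_us: int) -> int:
    """Tile-level overlap: a tile is sent as soon as it is computed,
    while the next tile is already being computed."""
    compute_done = 0  # time at which the current tile's compute finishes
    comm_done = 0     # time at which the current tile's transfer finishes
    for _ in range(num_tiles):
        compute_done += compute_us
        # A transfer waits for its tile and for the previous transfer.
        comm_done = max(comm_done, compute_done) + comm_us
    return comm_done


# Hypothetical numbers: 8 tiles, 10 us of compute and 6 us of transfer each.
print(serialized_time(8, 10, 6))      # 128
print(tile_pipelined_time(8, 10, 6))  # 86
```

In this model, all but the last tile's transfer is hidden behind compute whenever per-tile compute takes at least as long as per-tile communication.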
Because communication and dense compute run inside one kernel, mKernel can overlap them at fine granularity within that kernel, and its persistent kernels use SM specialization. The library supports multi-GPU and multi-node configurations. The research team evaluated mKernel on two 2-node × 8-H200 clusters, benchmarking it against existing libraries including NCCL, Triton-distributed, and Flux.
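The general idea behind a persistent kernel with SM specialization can be sketched in plain Python. This is a conceptual model, not mKernel's actual scheduling: it assumes each resident thread block is pinned to one role, communication or compute, and loops over tiles with a fixed stride instead of exiting after one tile. The block counts, role split, and function name are all hypothetical.

```python
def persistent_tile_schedule(num_blocks: int, num_comm_blocks: int, num_tiles: int):
    """Map each persistent block to a role and the tiles it loops over."""
    if not 0 < num_comm_blocks < num_blocks:
        raise ValueError("need at least one comm block and one compute block")
    num_compute_blocks = num_blocks - num_comm_blocks
    schedule = {}
    for block_id in range(num_blocks):
        if block_id < num_comm_blocks:
            # Specialized for communication: moves tiles between devices.
            role, rank, stride = "comm", block_id, num_comm_blocks
        else:
            # Specialized for compute: produces the tiles.
            role, rank, stride = "compute", block_id - num_comm_blocks, num_compute_blocks
        # Persistent loop: the block stays resident and strides over its tiles.
        schedule[block_id] = (role, list(range(rank, num_tiles, stride)))
    return schedule


# Hypothetical split: 6 blocks, 2 of them dedicated to communication, 8 tiles.
for block_id, (role, tiles) in persistent_tile_schedule(6, 2, 8).items():
    print(block_id, role, tiles)
# 0 comm [0, 2, 4, 6]
# 1 comm [1, 3, 5, 7]
# 2 compute [0, 4]
# 3 compute [1, 5]
# 4 compute [2, 6]
# 5 compute [3, 7]
```

Every tile appears once per role, so each computed tile has a communication block responsible for moving it. A real kernel would also need a signal between the two roles so that a tile is sent only after it has been computed.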
The mKernel library is open source and supports two networking backends. Requirements include NVIDIA Hopper GPUs, CUDA 12.9, and Python with PyTorch. The research team notes that benchmarking at larger scales is still in progress, but the initial results are promising.
With mKernel, the UCCL project aims to make GPU-driven communication practical for AI workloads.
Source: MarkTechPost