Unlocking Asynchronicity in Continuous Batching
Separating CPU and GPU workloads can lead to a massive performance boost for inference by eliminating idle gaps and maximizing GPU utilization.

Efficient LLM inference is a pressing concern, especially for long generation tasks. In this series of posts, we explore techniques for optimizing inference performance. This article focuses on unlocking asynchronicity in continuous batching, a method that can significantly improve GPU utilization and throughput.

To put things into perspective, an H200 GPU costs around $5 an hour on Inference Endpoints. That may seem affordable for an hour, but the expense adds up quickly, reaching $120 per day. Maximizing GPU utilization is therefore crucial to getting the most out of your investment. Continuous batching is a technique that improves GPU utilization by scheduling tightly packed batches, eliminating the compute wasted on padding.

However, continuous batching has a limitation: it is typically synchronous, meaning the CPU and GPU take turns. The CPU prepares a batch, transfers the inputs to the GPU, and then waits while the GPU computes. Once the GPU finishes, it waits in turn for the CPU to prepare the next batch. These idle gaps can account for nearly a quarter of the total runtime.

To overcome this limitation, we can adopt asynchronous batching, where CPU batch preparation and GPU batch computation run in parallel.
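
To make the difference between the two schedules concrete, here is a small timing model in plain Python. It is only a sketch under stated assumptions: the function names are hypothetical, the per-batch durations are made up rather than measured, and preparation of batch N+1 is assumed to start as soon as batch N has been launched on the GPU.

```python
def synchronous_time(cpu_ms, gpu_ms):
    """CPU and GPU take turns, so every step pays for both phases."""
    return sum(c + g for c, g in zip(cpu_ms, gpu_ms))


def overlapped_time(cpu_ms, gpu_ms):
    """CPU prepares batch N+1 while the GPU computes batch N.

    Assumption of this toy model: preparation of a batch starts when the
    previous batch is launched, and the GPU runs one batch at a time.
    """
    gpu_start = 0.0  # launch time of the previous batch
    gpu_end = 0.0    # time at which the GPU becomes free again
    for c, g in zip(cpu_ms, gpu_ms):
        ready = gpu_start + c              # inputs for this batch are ready
        gpu_start = max(ready, gpu_end)    # wait for inputs and for a free GPU
        gpu_end = gpu_start + g
    return gpu_end


if __name__ == "__main__":
    # Illustrative numbers only: 1 ms of CPU prep and 3 ms of GPU compute per batch.
    cpu = [1.0] * 10
    gpu = [3.0] * 10
    sync = synchronous_time(cpu, gpu)
    overlap = overlapped_time(cpu, gpu)
    print(f"synchronous: {sync} ms, overlapped: {overlap} ms")
    print(f"GPU idle share when synchronous: {sum(cpu) / sync:.0%}")
```

With these made-up numbers the synchronous schedule takes 40 ms and the overlapped one 31 ms, because only the very first preparation step remains exposed. When preparation is slower than compute, the model shows the opposite regime: the GPU still waits, and overlapping hides the compute time instead.
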
This approach raises a technical challenge: we need to run batch preparation for batch N+1 while batch N is still computing. CUDA streams let us separate GPU operations into independent queues, so that CPU work and GPU work can proceed concurrently.

The key to achieving asynchronicity lies in understanding CUDA streams and events. A stream is an ordered queue of GPU operations: work within a stream executes in the order it was enqueued. Operations on different non-default streams can execute concurrently, so the streams must be synchronized wherever one operation depends on another. This is where CUDA events come in: they let us mark a specific point in one stream and make other work wait until that point has been reached.

By applying these techniques, we can build a pipeline with explicit ordering, where the CPU enqueues all GPU work and then moves on, while the GPU enforces the ordering through events. This significantly reduces idle gaps and increases GPU utilization. In our experiment, we observed a 22% speedup, with the GPU active for 99.4% of the total runtime.

The full implementation of asynchronous batching is available in the transformers library.
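
For a feel of the mechanism at a much smaller scale, the stream-and-event pattern described above can be sketched with PyTorch's CUDA API. This is not the transformers implementation. The function name `overlapped_steps`, the tensor sizes, and the matrix multiply standing in for a forward pass are all assumptions made for illustration, and the function simply returns `None` when PyTorch or a CUDA device is unavailable.

```python
try:
    import torch
except ImportError:  # lets the file load on machines without PyTorch
    torch = None


def overlapped_steps(num_batches: int = 4, size: int = 512):
    """Enqueue copies and compute on two streams, ordered by events.

    Returns one checksum per batch, or None when no CUDA device is usable.
    """
    if torch is None or not torch.cuda.is_available():
        return None

    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()     # host-to-device transfers
    compute_stream = torch.cuda.Stream()  # stand-in for the forward pass
    weight = torch.randn(size, size, device=device)
    torch.cuda.synchronize()  # weight was created on the default stream

    outputs = []
    for _ in range(num_batches):
        # Hypothetical "batch preparation": build inputs in pinned host
        # memory so that the copy below can run asynchronously.
        host_batch = torch.randn(size, size).pin_memory()

        with torch.cuda.stream(copy_stream):
            gpu_batch = host_batch.to(device, non_blocking=True)
        copied = torch.cuda.Event()
        copied.record(copy_stream)  # marks "inputs are on the GPU"

        # GPU-side wait: compute_stream stalls until the event is reached,
        # while the CPU continues straight to the next loop iteration.
        compute_stream.wait_event(copied)
        with torch.cuda.stream(compute_stream):
            outputs.append(gpu_batch @ weight)
        # Tell the caching allocator that gpu_batch is also used on
        # compute_stream, so its memory is not reused too early.
        gpu_batch.record_stream(compute_stream)

    torch.cuda.synchronize()  # the only CPU-side wait, after all work is enqueued
    return [out.sum().item() for out in outputs]


if __name__ == "__main__":
    print(overlapped_steps())
```

The design point to notice is that the CPU never blocks inside the loop: ordering between the copy and the compute is enforced on the GPU by `wait_event`, and the single `torch.cuda.synchronize()` comes only once everything has been enqueued.
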
As we continue to push the boundaries of efficient LLM inference, we'll explore other techniques to further optimize performance, such as offloading requests and decode-specific kernels. Stay tuned for more insights and innovations in the field of AI research.
Source: Hugging Face