Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler
Mastering PyTorch profiling is crucial for optimizing models; this guide helps beginners understand and utilize torch.profiler.

What you cannot profile, you cannot optimize. Whether you're trying to boost tokens per second in a Large Language Model (LLM), shave milliseconds off inference, or simply understand why your training loop runs slower than expected, the path inevitably leads to profiling. However, profiling comes with a steep learning curve, and despite its importance, opening a trace can feel daunting.

This series, "Profiling in PyTorch," aims to lower that barrier. We document the journey from a beginner's point of view, with no prerequisites apart from basic PyTorch knowledge. It is a leisurely read packed with "Aha!" moments, structured around the questions that arise when you open a trace and go looking for answers. By the end, you should be able to read profiler traces and use them to drive optimization.

To start, we introduce GPU kernels, the GPU functions that PyTorch operations are automatically translated into, and explain why understanding them matters for optimization. We use a simple script, 01_matmul_add.py, which mimics what a layer of neurons computes: a matrix multiplication followed by an addition. The same script helps us understand how compilation works later on.

Profiling with torch.profiler involves several steps: setting up the profiler, running the code, and analyzing the artifacts it produces, namely a table and a JSON file. The table summarizes the recorded events, including their duration on the CPU and the GPU, while the JSON file lets us inspect the dispatch chain visually.

Analyzing the initial trace, we notice that the GPU stays idle most of the time, which indicates an overhead-bound workload. This changes when we increase the size of the matrix multiplications, which shifts the workload toward a compute-bound scenario. Visualizing the dispatch chain shows what the CPU and the GPU are doing at each moment, including the periods they spend waiting or idle.

Further investigation of the trace uncovers details such as the "dead window" between entering a record function and PyTorch dispatching the operation, and the impact of warmup stages on profiling. We also explore how PyTorch issues different CUDA runtime calls depending on the operation, such as matrix multiplication versus addition.

The article then examines PyTorch's compilation with torch.compile, showing how it captures tensor-heavy regions, optimizes them, and runs the generated code. This results in fusion at the dispatcher level but not necessarily at the kernel level. The CPU-side hierarchy in the trace reflects torch.compile's runtime architecture, including steps such as the cache lookup and the AOTDispatcher Runtime Wrapper.

Finally, comparing the eager and compiled runs reveals that while compilation can optimize certain aspects, it also introduces overhead of its own. The article concludes with a reference guide to common patterns in traces and a look ahead to future parts of the series, which will tackle more complex scenarios and real models.
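
The profiling steps summarized above (set up the profiler, run the code, inspect the table and the JSON trace) can be sketched roughly as follows. This is a minimal illustration and not the contents of 01_matmul_add.py: the function names matmul_add and profile_matmul_add, the tensor size, the warmup count, and the trace file name are all assumptions made for this example. It falls back to the CPU when no GPU is available, in which case no GPU kernels appear in the output.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile, record_function


def matmul_add(x, w, b):
    # One matrix multiplication followed by one addition.
    return torch.matmul(x, w) + b


def profile_matmul_add(size=256, warmup_iters=3, trace_path=None):
    """Profile matmul_add once and return (summary_table, trace_path).

    All names and default values here are illustrative assumptions.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(size, size, device=device)
    w = torch.randn(size, size, device=device)
    b = torch.randn(size, device=device)

    # Warm up outside the profiler so one-time setup costs stay out of the trace.
    for _ in range(warmup_iters):
        matmul_add(x, w, b)
    if device == "cuda":
        torch.cuda.synchronize()

    # Always record CPU activity; record GPU activity only when a GPU is in use.
    activities = [ProfilerActivity.CPU]
    if device == "cuda":
        activities.append(ProfilerActivity.CUDA)

    with profile(activities=activities) as prof:
        # record_function adds a labelled range that is easy to find in the trace.
        with record_function("matmul_add"):
            matmul_add(x, w, b)
        if device == "cuda":
            # Wait for queued GPU work so its kernels fall inside the profile.
            torch.cuda.synchronize()

    # Artifact 1: a text table summarizing the recorded events.
    table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)

    # Artifact 2: a JSON trace that a trace viewer can display.
    if trace_path is None:
        trace_path = os.path.join(tempfile.mkdtemp(), "matmul_add_trace.json")
    prof.export_chrome_trace(trace_path)
    return table, trace_path


if __name__ == "__main__":
    summary, path = profile_matmul_add()
    print(summary)
    print(f"Trace written to {path}")
```

The exported JSON file can be opened in a trace viewer such as Perfetto or chrome://tracing. Rerunning the sketch on a GPU with a larger size is one way to watch the shift from overhead-bound toward compute-bound that the article describes.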
Source: Hugging Face