Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

AI News Desk

Hugging Face

Jun 11, 2026

12 min read

In the first part of this series, "Profiling in PyTorch", we used torch.add(torch.matmul(x, w), b) to learn how to read PyTorch profiler traces. We also covered several related topics along the way: the CPU dispatch chain, launch overhead, the difference between an overhead-bound and a compute-bound regime, and some internals of torch.compile.

In this second post, we climb one rung up the ladder. We replace the hand-written matmul-add pair with an nn.Linear (with bias=True), a building block found in almost every deep learning model. We then stack three of them (a choice specific to our example), with an activation in between, to form a Multilayer Perceptron (MLP) block.

The scripts for this blog post live here: 02_linear.py, 03_simple_mlp.py, and 03_kernels_mlp.py. Like before, it helps to open them in a separate tab and walk through the code as you read. We use an NVIDIA A100-SXM4-80GB GPU to run the scripts. It is easy to set up a GPU on the Hugging Face infrastructure and experiment with the scripts using Dev Mode with Spaces. One could also run the scripts with the Hugging Face Jobs pipeline.

Before we begin, a quick recap of two ideas we will lean on repeatedly:

nn.Linear is a module wrapper around the same matrix multiplication and addition we already profiled in Part 1 . The only difference is that it owns its weight and bias as parameters and exposes a forward method that PyTorch users have grown familiar with.

For a two-dimensional input, nn.Linear computes y = x @ w.T + b, where x is the input, w is the weight and b is the bias. Let's run 02_linear.py and check the profile.
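To make the equivalence concrete, here is a minimal sketch that compares an nn.Linear forward call with the hand-written matmul-add pair from Part 1. The tensor sizes are illustrative assumptions and are not taken from 02_linear.py; the sketch runs on the CPU so that it works without a GPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not the ones used in 02_linear.py).
batch, in_features, out_features = 4, 8, 16

linear = nn.Linear(in_features, out_features, bias=True)
x = torch.randn(batch, in_features)

# nn.Linear stores its weight as (out_features, in_features),
# so the hand-written version has to transpose it before the matmul.
w, b = linear.weight, linear.bias
manual = torch.add(torch.matmul(x, w.t()), b)
module_out = linear(x)

print(module_out.shape)
print(torch.allclose(module_out, manual, atol=1e-5))
```

The only structural difference from Part 1 is the explicit w.t(): the module keeps its weight in (out_features, in_features) layout, which is worth keeping in mind when reading the trace.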

trace-util is a utility that syncs your traces to a Hugging Face bucket and then prints the Perfetto URLs in your terminal.

Figure 1 shows the profiler trace of a forward call of the linear layer. We trace it with the same kind of schedule as the previous traces: wait=1, warmup=1 and active=3. The three active steps are why we see three Profile Steps in the CPU and GPU lanes.
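As a small sketch of how this schedule maps step indices to profiler actions, the snippet below calls the function returned by torch.profiler.schedule directly. The repeat=1 argument is an assumption added here so that the cycle runs only once; the post itself only states wait, warmup and active.

```python
from torch.profiler import ProfilerAction, schedule

# wait/warmup/active match the text; repeat=1 is an assumption for this sketch.
sched = schedule(wait=1, warmup=1, active=3, repeat=1)

# The schedule is a callable: step index -> ProfilerAction.
actions = [sched(step) for step in range(5)]
for step, action in enumerate(actions):
    print(step, action.name)

# Steps that are actually recorded into the trace.
recorded = [
    a for a in actions
    if a in (ProfilerAction.RECORD, ProfilerAction.RECORD_AND_SAVE)
]
print(len(recorded))  # 3
```

Step 0 is skipped, step 1 is a warmup, and steps 2 to 4 are recorded, which matches the three Profile Steps in the trace.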

If we zoom into the profiler trace, as we do in Figure 2, we notice an aten::t (transpose) op before the aten::addmm (multiplication and addition) op. This tells us that nn.Linear transposes its weight parameter before multiplying it with the input, which is why aten::t shows up in the trace.
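The same ops can be listed without opening a trace viewer. The following is a minimal sketch, assuming a CPU-only run with small illustrative shapes (not the setup of 02_linear.py), that collects the operator names recorded during a single forward call.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile

# Illustrative shapes (assumptions); a 2D input with bias=True.
linear = nn.Linear(8, 16, bias=True)
x = torch.randn(4, 8)

# CPU-only profiling so the sketch runs without a GPU.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        linear(x)

# Each averaged event is keyed by its operator name.
op_names = {event.key for event in prof.key_averages()}
print(sorted(name for name in op_names if name.startswith("aten::")))
```

On a GPU run, adding ProfilerActivity.CUDA to the activities list would also capture the kernels launched by these ops.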

An important thing to notice is that aten::t does not copy or reorganize data: it only rewrites tensor metadata (shape and stride) on the CPU to represent the transposed matrix. It does not launch a kernel on the GPU. One can verify this in two ways: by looking at the GPU lane in the trace, or by checking the CUDA time in the aten::t row of the profiler table.
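A third, profiler-free check is to inspect the tensor metadata directly. This is a minimal sketch with an illustrative weight shape (an assumption); it shows that the transposed tensor shares storage with the original and differs only in shape and stride.

```python
import torch

# Same layout as nn.Linear.weight: (out_features, in_features).
w = torch.randn(16, 8)
w_t = w.t()

print(w.shape, w.stride())      # torch.Size([16, 8]) (8, 1)
print(w_t.shape, w_t.stride())  # torch.Size([8, 16]) (1, 8)

# Both tensors point at the same memory: no data was copied.
print(w_t.data_ptr() == w.data_ptr())  # True

# The transposed view is no longer contiguous in memory.
print(w_t.is_contiguous())  # False
```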

Source: Hugging Face