Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

June 11, 2026 00:00

Key Summary

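Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP, published June 11, 2026. Written by Aritra Roy Gosthipaty, Rémi Ouazan Reboul, Sergio Paniego, Pedro Cuenca, and Sayak Paul. In Part 1 of this series, "Profiling in PyTorch", the authors used torch.add(torch.matmul(x, w), b) to learn how to read PyTorch profiler traces, and discussed the CPU dispatch chain, launch overhead, the difference between overhead-bound and compute-bound regimes, and the internals of torch.compile. Part 2 (this post) goes one level up and replaces the hand-written matmul-add pair with nn.Linear (with bias=True).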

On-site AI-curated draft

Published June 11, 2026

By Aritra Roy Gosthipaty (ariG23498), Rémi Ouazan Reboul (ror), Sergio Paniego (sergiopaniego), Pedro Cuenca (pcuenq), and Sayak Paul (sayakpaul)

In the first part of this series, "Profiling in PyTorch", we used torch.add(torch.matmul(x, w), b) to learn how to read PyTorch profiler traces. Along the way we discussed the CPU dispatch chain, launch overhead, the difference between an overhead-bound and a compute-bound regime, and some internals of torch.compile.

In this second installment, we climb one rung up the ladder. We replace the hand-written matmul-add pair with an nn.Linear (with bias=True), the building block every deep learning model uses. We then stack three of them (specific to our example), with an activation in between, to form a Multilayer Perceptron (MLP) block.

The scripts for this blog post are 02_linear.py, 03_simple_mlp.py, and 03_kernels_mlp.py. As before, it helps to open them in a separate tab and walk through the code as you read.

We use an NVIDIA A100-SXM4-80GB GPU to run the scripts. It is easy to set up a GPU on the Hugging Face infrastructure and experiment with the scripts using Dev Mode with Spaces. One could also run the scripts with the Hugging Face Jobs pipeline.

Before we begin, a quick recap of two ideas we will lean on repeatedly:

- A GPU kernel is a program that runs in parallel on many threads of the GPU.
- The CPU schedules and launches these kernels. Most of the PyTorch overhead you see in a profiler trace is this scheduling work.

From matmul-add to Linear

nn.Linear is a module wrapper around the same matrix multiplication and addition we already profiled in Part 1. The only difference is that it owns its weight and bias as parameters and exposes the forward method that PyTorch users are familiar with.
    # bias=True reproduces the multiplication and addition
    # operations we saw in Part 1 of the series
    linear_layer = nn.Linear(in_dim, out_dim, bias=True)
    y = linear_layer(x)

The operation at hand can be written as:

    y = x @ w.T + b

where x is the input, w is the weight, and b is the bias.

Let's run 02_linear.py and check the profile.

    uv run 02_linear.py --batch 1024 --in_dim 32 --out_dim 64
    uvx trace-util traces -b traces

trace-util is a utility that syncs your traces to a Hugging Face bucket and then prints the Perfetto URLs in your terminal.

Figure 1: Profiler trace of nn.Linear

Figure 1 shows the profiler trace of a forward call of the linear layer. We trace the forward call with the same schedule setup as the previous traces: wait=1, warmup=1, and active=3. This is why we see three Profile Steps in the CPU and GPU lanes.

What is the transpose doing?

Figure 2: The transpose CPU row

If we zoom into the profiler trace, as we do in Figure 2, we notice an aten::t (transpose) op before the aten::addmm (multiplication and addition) op. nn.Linear transposes the weight parameter and then multiplies it with the input, which is why we see an aten::t op.

An important thing to notice is that aten::t does not copy or reorganize data: it only rewrites tensor metadata (shape and stride) on the CPU to represent the transposed matrix. It does not launch a kernel on the GPU. One can verify this in two ways: by looking at the GPU lane in the trace, or by checking the aten::t row in the profiler table and the time it took on CUDA.

Why are there no separate mul and add kernels?

Figure 3: No aten::add in the profile of a linear layer

There is no aten::add (the bias addition) in the dispatch chain of the linear layer, as seen in Figure 3. This is because the bias addition has been folded into the matrix multiplication kernel, using what is called an epilogue.
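The dispatch chain described above can be reproduced on a small scale. What follows is a minimal CPU-only sketch, not the contents of 02_linear.py: the tensor shapes, the collect_ops handler, and the op_names set are assumptions introduced for illustration. It reuses the wait=1, warmup=1, active=3 schedule and records which ATen op names show up in the active window. It cannot show the GPU lane, so it only speaks to the CPU side of the trace.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, schedule

torch.manual_seed(0)
# Shapes mirror the command-line flags used above (assumed for illustration).
x = torch.randn(1024, 32)
linear_layer = nn.Linear(32, 64, bias=True)

op_names = set()


def collect_ops(prof):
    # Called when the active window ends; keep every op name that was recorded.
    op_names.update(evt.key for evt in prof.key_averages())


with torch.no_grad(), profile(
    activities=[ProfilerActivity.CPU],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=collect_ops,
) as prof:
    for _ in range(5):  # 1 wait step + 1 warmup step + 3 active steps
        linear_layer(x)
        prof.step()  # tell the profiler that one step has finished

for name in ("aten::linear", "aten::t", "aten::addmm", "aten::add"):
    print(f"{name:<13} recorded: {name in op_names}")
```

If the behavior matches the trace discussed here, aten::t and aten::addmm should be reported as recorded, and one would expect aten::add to be missing, consistent with Figure 3.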
An epilogue is a small computation that a GEMM (GEneral Matrix Multiply) kernel does at the very end, just before it writes its result back to HBM (High Bandwidth Memory, the GPU's main memory). Adding a bias, applying an activation, or scaling by a constant are all classic epilogues. The point of an epilogue is to avoid reading from or writing to HBM a second time, since memory traffic is what makes an operation expensive.

nn.Linear calls torch.nn.functional.linear, which in turn calls aten::linear. aten::linear looks at the inputs, notices that a bias was passed, and dispatches aten::addmm(bias, x, weight) instead of doing a matmul and an add separately. addmm computes:

    out = x @ weight.T + bias

The cuBLAS GEMM kernel that runs on the GPU has a bias-add variant built in, and that is the kernel aten::addmm picks. The add never appears as a separate kernel because it is part of the matmul kernel's writeback, which is exactly what an epilogue is.

This is the moment to notice something subtle. The kernel you saw in Part 1 under --compile (addmm) is the kernel that eager nn.Linear already uses. There is nothing left for torch.compile to fuse here, which is the next thing we will verify.

Can --compile help a single Linear?

Let's compile the forward call and look at the profiler trace. (The profiler trace is visualized in the next section.)

    uv run 02_linear.py --batch 1024 --in_dim 32 --out_dim 64 --compile
    uvx trace-util traces -b traces

If you compare the eager and compiled traces for a single nn.Linear's forward, you will find:

- The same cuBLAS GEMM kernel on the GPU.
- The same aten::addmm op on the CPU.
- A few extra rows on the CPU lane unique to compile.

This is worth internalizing. A common reflex is to reach for torch.compile whenever a model feels slow. For a single GEMM-with-bias, compile has very little to do. This is not a bug: compile needs more than one operation before it has anything to fuse. Let's prove that by looking at an MLP.

Where did the transpose go? Kernel layouts and pre-ops

A careful reader of the two traces (eager vs compile) will notice that the eager CPU dispatch chain has more in it than the compiled one.

Figure 4: Eager dispatch chain where aten::linear walks through aten::t (transpose) and then aten::addmm

Figure 5: Compiled dispatch chain where aten::addmm is called directly, with no transpose

The eager CPU dispatch chain inside aten::linear is aten::t followed by aten::addmm (Figure 4). To understand what aten::t actually does, we need a quick detour into strides and views.

A tensor stores its data as one flat, contiguous run of numbers in memory. The shape and stride are metadata that sit on top of that run and tell PyTorch how to walk it: a stride of (s0, s1) means "step s0 elements to move one row, step s1 to move one column". Change the metadata and you get a different view of the same raw data, with no copy:

    >>> M = torch.tensor([[0, 1],
    ...                   [2, 3],
    ...                   [4, 5]])
    >>> M.shape, M.stride()
    (torch.Size([3, 2]), (2, 1))  # two steps per row, one step per column
    >>> T = M.t()  # transpose
    >>> T.shape, T.stride()
    (torch.Size([2, 3]), (1, 2))  # shape and stride swapped, data untouched
    >>> T
    tensor([[0, 2, 4],
            [1, 3, 5]])
    >>> T.flatten()  # forced to materialize, so the data is reordered
    tensor([0, 2, 4, 1, 3, 5])

M.t() did not move a single number. It returned a new view whose strides are swapped, so reading it row-by-row now walks the original buffer 0, 1, 2, 3, 4, 5 in transposed order. The underlying data is identical; only the metadata differs.

This is exactly what aten::t does inside the linear layer: it does not allocate a new tensor or copy any data, it produces a view of the weight with rewritten strides.

As we can see in Figure 5, compile did not remove a GPU kernel: it removed the CPU overhead of dispatching that view. Inductor traced through the view chain at compile time, computed the resulting strides once, and emitted a direct aten::addmm call with those strides hard-coded. A few microseconds of CPU work disappear while the GPU does identical math. As one
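The view behavior and the addmm form described in this section can be checked with a small CPU-only sketch. The shapes and variable names below are assumptions chosen for illustration, and the calls are made by hand rather than captured from the dispatcher, so this only confirms that the three formulations agree numerically and that the transposed weight shares storage with the original.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(8, 32)
layer = nn.Linear(32, 64, bias=True)
w, b = layer.weight, layer.bias  # w has shape (64, 32)

with torch.no_grad():
    w_t = w.t()
    # A view shares storage with its base: same data pointer, swapped strides.
    same_storage = w_t.data_ptr() == w.data_ptr()
    strides = (w.stride(), w_t.stride())

    # contiguous() must materialize the transposed layout, so it allocates.
    copied = w_t.contiguous().data_ptr() != w.data_ptr()

    y_linear = F.linear(x, w, b)      # what nn.Linear.forward calls
    y_addmm = torch.addmm(b, x, w_t)  # bias + x @ w_t in one op
    y_manual = x @ w.T + b            # separate matmul and add

agree = (
    torch.allclose(y_linear, y_addmm, rtol=1e-4, atol=1e-5)
    and torch.allclose(y_linear, y_manual, rtol=1e-4, atol=1e-5)
)
print("transpose shares storage:", same_storage)
print("strides (weight, weight.t()):", strides)
print("contiguous() made a copy:", copied)
print("linear, addmm and matmul+add agree:", agree)
```

Comparing data_ptr() values is a simple way to tell a view from a copy; it is used here only as a quick check, not as a description of how the profiler or Inductor reasons about views.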

Original source: Hugging Face Blog
