Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler
Published May 29, 2026

By Aritra Roy Gosthipaty (ariG23498), Sayak Paul (sayakpaul), Sergio Paniego (sergiopaniego), Rémi Ouazan Reboul (ror), Pedro Cuenca (pcuenq)

What you cannot profile, you cannot optimize. Whether you are trying to squeeze more tokens per second out of a Large Language Model (LLM), shave milliseconds off inference, or just understand why your training loop runs slower than the spec sheet promises, the path eventually runs through profiling.

The catch is that profiling has a steep on-ramp. The traces are dense walls of colored rectangles. The events carry intimidating names. Most tutorials assume you can already read them. So even when we know we should be profiling, opening a trace can feel like a chore best left for later (or for someone else). This post, and the series it kicks off, is our attempt to lower that on-ramp.

This is the opening post of Profiling in PyTorch, a series where we slowly build the skill of reading profiler traces and use it to drive optimization. The plan:

- Part 1 (this post): start with the simplest possible operation, a matrix multiplication followed by a bias add, and learn how to read what the profiler hands back.
- Part 2: scale up to nn.Linear and a small MLP, use the traces to motivate optimizations, and peek at the kernels underneath.
- Part 3: put it all together on Large Language Models with transformers.

We document the journey from a beginner's point of view. No prerequisites apart from basic PyTorch. Treat this as a leisurely read with some "Aha!" moments. The structure of the post is intentionally question-led: we open a trace, ask "wait, why is that happening?", and chase the answer until something clicks.
By the end you should know:

- how to set up torch.profiler and what it actually hands back,
- how to read the profiler table and the trace (CPU lane, GPU lane, and the suspicious gaps in between),
- the chain of events from a Python call all the way down to a CUDA kernel,
- what changes (and, more interestingly, what does not change) when you slap torch.compile on top.

Before we begin, two definitions that will make everything below read better:

- A GPU kernel is a program that runs in parallel on many threads of the GPU.
- The CPU schedules and launches these kernels.

You don't usually have to write GPU kernels yourself; when you use a PyTorch operation, it is automatically translated to one or more kernels that do the job on the GPU. With those two ideas in your back pocket, let's start asking questions.

Here is the entire script that we use for the post: 01_matmul_add.py. We recommend opening this script in a separate tab and walking through the code step by step. We use the NVIDIA A100-SXM4-80GB GPU to run the scripts.

The matrix multiplication and addition operation

As correctly quipped by Dr. Sara Hooker, just as we are primarily made up of water, Deep Neural Networks are primarily made up of matrix multiplies. As fundamental as they are, it would be a shame to start our profiling journey with anything else.

    import torch

    def fn(x, w, b):
        return torch.add(torch.matmul(x, w), b)

The matrix addition along with the matrix multiplication mimics how weights and biases interact in a neuron. This addition (pun intended) also paves the way for the compilation discussion later in the post.

To profile, we will be using the torch.profiler module. The steps involved are:

- Have the code to profile ready (here def fn, which wraps the matrix multiplication and matrix addition).
- Annotate the algorithm. While this is completely optional, we recommend doing this.
The record_function context manager annotates our function as matmul_add, which makes it easy to find in the traces (as we note later).

    def step():
        with torch.profiler.record_function("matmul_add"):
            return fn(x, w, b)

- Wrap the code with the torch.profiler.profile context manager.

    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,   # the cpu activities
            torch.profiler.ProfilerActivity.CUDA,  # the gpu activities
        ],
    ) as prof:
        # it is recommended to run events multiple times to warm up the GPUs
        for _ in range(5):
            step()
            prof.step()

- Export the profile.

    # the profiler table
    prof.key_averages().table(sort_by="cuda_time_total", row_limit=15)
    # the profiler trace
    prof.export_chrome_trace(trace_path)

The profiler exports two distinct artifacts:

- The profiler table: provides the statistical summary of the algorithm. It answers "What is taking the most time?". This is really helpful for finding hotspots. A hotspot is an event that takes the most time, is a bottleneck of the pipeline, or is triggered many times.
- The profiler trace: provides the temporal execution view. It answers "When and why did an operation happen?", depicting the activities taking place on the CPU and the GPU. This is helpful when we want to investigate the kernel(s) that were launched, any delays in launching them, any overlap between CPU and GPU activities, etc.

Let's see the two in action with our first execution. (Here is the entire 01_matmul_add.py script.) It is recommended to run this script on a machine with a GPU.

    uv run 01_matmul_add.py --size 64

If you run the above script (on a GPU machine) you will find a folder traces/01_matmul_add with the two artifacts:

- 64_bf16_cold_eager.json
- 64_bf16_cold_eager.txt

Figure 1: Profiler table for matmul add on 64 sized matrices

The .txt file holds the profiler table.
Upon opening the file, as shown in Figure 1, one is greeted with a big table whose first column lists the events that were triggered inside the scope of profile. The other columns report the time each event takes on the CPU, the GPU, or any other device(s) specified in activities within torch.profiler.profile. Look at which events take the most time, and try to judge intuitively whether each event should in fact take that long. It is also important to look at the column "# of Calls", which shows how many times the event was triggered.

While we are at it, let's also talk about "Self CPU/CUDA" vs "CPU/CUDA total". The "Self" columns measure time spent only inside the event itself, excluding its children. The "total" columns include the event and all of its children together. So the "CPU total" of matmul_add consists of the time spent in matmul_add itself plus the time spent in the child events it triggered. This is an important nuance to note.

The last two lines of the table read:

    Self CPU time total: 2.314ms
    Self CUDA time total: 23.104us

The CPU time is in ms while the GPU time is in us. To put things in perspective, the time spent on the GPU (the kernel ampere_bf16_s16816gemm...) is less than 1% of the time spent on the CPU (the matmul_add operation). The GPU stays idle most of the time, which is an immediate red flag.

The reason this happens is that the GPU can compute a small matmul very quickly, so our code spends most of the time preparing the kernels, launching them on the GPU, sending the data to multiply, and gathering the results. Such a workload is said to be overhead-bound. The easiest way to move out of this regime is to use bigger matrix multiplications.
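To make the "less than 1%" figure concrete, here is the arithmetic on the two totals quoted above. This is only a sketch: the helper name gpu_share is ours and does not come from the script, and the inputs are simply the two numbers reported in the table for the 64-sized run.

```python
def gpu_share(cpu_total_us: float, cuda_total_us: float) -> float:
    """Return GPU time as a fraction of CPU time (both in microseconds)."""
    return cuda_total_us / cpu_total_us


# Totals from the table, converted to the same unit (microseconds).
cpu_total_us = 2.314 * 1_000  # Self CPU time total: 2.314ms
cuda_total_us = 23.104        # Self CUDA time total: 23.104us

share = gpu_share(cpu_total_us, cuda_total_us)
print(f"GPU time is {share:.3%} of CPU time")
# prints: GPU time is 0.998% of CPU time
```

Converting both totals to the same unit before dividing is the step that is easy to miss, since the table prints one in ms and the other in us.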
Let's rerun the script with 4096-sized matrices:

    uv run 01_matmul_add.py --size 4096

Figure 2: Profiler table for matmul add on 4096 sized matrices

The last two lines in Figure 2 are:

    Self CPU time total: 4.908ms
    Self CUDA time total: 4.495ms

Both times are now in ms, which means the GPU is kept busy for far longer simply because we increased the size of the matrix multiplication. Figure 2 also shows that the most CUDA time is now taken by the GPU kernel (ampere_bf16_s16816gemm_..) and not by the CPU operation that launched it (matmul_add). This means that we were indeed able to move from overhead bound to compute bound.

We now move into visualising the dispatch chain, which lives inside the .json artifacts.
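For readers who want to reproduce the tables without the full script, the steps from this post can be strung together roughly as follows. This is a minimal sketch, not the 01_matmul_add.py script itself: the CPU fallback, the float32 dtype on CPU, the size variable (standing in for the --size flag), the synchronize call, and the trace file name are all our assumptions.

```python
from pathlib import Path

import torch
from torch.profiler import ProfilerActivity, profile, record_function

# Assumption: fall back to CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
size = 64  # hypothetical stand-in for the script's --size flag

x = torch.randn(size, size, device=device, dtype=dtype)
w = torch.randn(size, size, device=device, dtype=dtype)
b = torch.randn(size, device=device, dtype=dtype)


def fn(x, w, b):
    return torch.add(torch.matmul(x, w), b)


def step():
    # Label the region so it is easy to spot in the table and the trace.
    with record_function("matmul_add"):
        return fn(x, w, b)


activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    for _ in range(5):
        step()
        prof.step()
    if device == "cuda":
        # Precaution (our addition): wait for queued kernels before the profiler stops.
        torch.cuda.synchronize()

# Sort by GPU time when a GPU was profiled, otherwise by CPU time.
sort_key = "cuda_time_total" if device == "cuda" else "cpu_time_total"
table = prof.key_averages().table(sort_by=sort_key, row_limit=15)
print(table)

trace_path = Path("matmul_add_trace.json")  # hypothetical output path
prof.export_chrome_trace(str(trace_path))
```

Changing size from 64 to 4096 on a GPU machine should show the same shift discussed above, although the exact timings will depend on the hardware.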