Hugging Face Blog · AI Hardware

Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel

June 24, 2026, 16:00

Key Takeaways

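HuggingFace Transformers has become the cornerstone of the open-source AI ecosystem. The newly released Transformers v5 further strengthens its native support for Mixture-of-Experts (MoE) models, which are now the mainstream architecture for frontier models. v5 ships built-in MoE building blocks, including expert backends, dynamic weight loading, and distributed execution, which make MoE extensible and easy to build on. NVIDIA NeMo AutoModel accelerates fine-tuning on top of this foundation.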

AI-compiled article text (generated on this site)

Published June 24, 2026

Authors (NVIDIA): Adil Asif (adil-asif), Alexandros Koumparoulis (akoumpa), Wenwen Gao (wgao2021), Sylendran Arunagiri (Sylendran95), David Messina (davidsalmessina70), Bernard Nguyen (bernardwin)

HuggingFace Transformers has become the foundation of the open-source AI ecosystem, and the recent Transformers v5 release strengthened it with first-class support for Mixture-of-Experts (MoE) models, now the dominant architecture for frontier models. v5 ships the MoE foundations (expert backends, dynamic weight loading, and distributed execution) that make MoE extensible and easy to build on.

NVIDIA NeMo AutoModel is an open library, part of the NVIDIA NeMo framework, for building custom generative AI models at scale. NeMo AutoModel builds cleanly on top of v5, adding Expert Parallelism, DeepEP fused all-to-all dispatch, and TransformerEngine kernels, and it leans on v5's dynamic weight loading to bring those optimizations to a broad and growing set of model families.

The payoff is 3.4-3.7x higher training throughput and 29-32% less GPU memory when fine-tuning MoE models, compared with native Transformers v5, using the same from_pretrained() API: a single import line changes, with no other code changes. This blog details how the combination works and how users can fine-tune MoE models faster without changing their APIs.

Background

The rise of MoE models has introduced new challenges to efficient training: routing tokens across hundreds of experts, fusing expert matmuls into a single kernel, sharding weights across GPUs, and overlapping communication with computation all require infrastructure beyond what a general-purpose library provides out of the box. Transformers v5 ("v5") introduced first-class MoE support, including expert backends, dynamic weight loading, and tensor parallel plans for distributed execution.
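To make the first two challenges concrete, here is a minimal pure-Python sketch of what "routing tokens across experts" and grouping them for a batched expert computation involve. It is an illustration only, not code from Transformers or NeMo AutoModel: the function names, the toy sizes, and the choice to softmax over just the selected top-k logits are all assumptions made for this example.

```python
import math
import random


def top_k_route(logits, k):
    """Pick the k highest-scoring experts for one token and normalize their weights.

    Assumption: weights are a softmax over the selected logits only.
    Real routers differ in how they normalize.
    """
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]
    total = sum(exps)
    return top, [x / total for x in exps]


def group_tokens_by_expert(routes, num_experts):
    """Invert token -> experts into expert -> tokens.

    Grouping gives each expert one contiguous batch, which is the
    precondition for running all experts as a single grouped matmul.
    """
    groups = [[] for _ in range(num_experts)]
    for token_id, (experts, weights) in enumerate(routes):
        for expert_id, weight in zip(experts, weights):
            groups[expert_id].append((token_id, weight))
    return groups


random.seed(0)
NUM_EXPERTS, TOP_K, NUM_TOKENS = 8, 2, 16  # toy sizes, not from any real model

routes = [
    top_k_route([random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)], TOP_K)
    for _ in range(NUM_TOKENS)
]
groups = group_tokens_by_expert(routes, NUM_EXPERTS)

for expert_id, group in enumerate(groups):
    print(f"expert {expert_id}: {len(group)} token slots")
```

The per-expert group sizes are generally uneven, which is why a naive per-expert loop wastes time on small batches and why fused or grouped kernels are attractive.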
v5 also made distributed training first-class by integrating PyTorch's DeviceMesh directly into from_pretrained(). NeMo AutoModel builds on top of v5 by subclassing AutoModelForCausalLM and adding Expert Parallelism (EP), DeepEP fused all-to-all dispatch, and TransformerEngine kernels. DeepEP is the piece v5 doesn't have yet: it overlaps communication with expert compute. And because NeMo AutoModel relies on v5's reversible weight conversion to load each model, it can focus its engineering on these reusable core ops instead of per-model checkpoint plumbing, while save_pretrained() still emits standard HF checkpoints that tools like vLLM and SGLang can load.

The next section walks through how the two work together and the performance gains we measured, from full fine-tuning NVIDIA Nemotron 3 Ultra 550B A55B across 16 nodes down to single-node models such as Qwen3-30B-A3B and Nemotron 3 Nano 30B A3B.

NeMo AutoModel: Same API, More Performance

One of NeMo AutoModel's goals is API compatibility with HuggingFace Transformers, so that it serves the open-source community. NeMoAutoModelForCausalLM subclasses AutoModelForCausalLM, so any code that works with HF models works with AutoModel too. Loading a model looks the same in both; only the import changes: NeMoAutoModelForCausalLM is imported from nemo_automodel in place of AutoModelForCausalLM from transformers.

That single import does a lot of work. For popular MoE architectures like Qwen3, NVIDIA Nemotron, GPT-OSS, and DeepSeek V3, NeMo AutoModel ships hand-tuned implementations with TransformerEngine attention, fused linear layers, and custom expert kernels. For everything else, it falls back to vanilla HF while still applying optimizations such as Liger kernel patching. And whichever path it takes, the resulting model is ready to scale: pass a device_mesh and you have multi-GPU training without further rewrites.

Where NeMo AutoModel really shines is scaling MoE models to multi-GPU training.
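As a rough picture of what Expert Parallelism means for expert placement, the toy sketch below assigns experts to ranks and counts how many token slots one rank would send to each peer in an all-to-all dispatch. This is not NeMo AutoModel or DeepEP code. The contiguous block placement, the helper names, and the sample routing are assumptions for illustration; the 128-expert count matches the Qwen3 MoE figure cited in the benchmark discussion, and ep_size=8 matches the single-node setup.

```python
from collections import Counter


def owner_rank(expert_id, num_experts, ep_size):
    """Rank holding expert_id, assuming experts are split into contiguous blocks."""
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank


def dispatch_counts(token_experts, num_experts, ep_size):
    """Count the token slots this rank would send to each peer rank."""
    counts = Counter()
    for experts in token_experts:
        for expert_id in experts:
            counts[owner_rank(expert_id, num_experts, ep_size)] += 1
    return [counts.get(rank, 0) for rank in range(ep_size)]


NUM_EXPERTS, EP_SIZE = 128, 8

# Hypothetical top-2 routing decisions for four local tokens.
local_tokens = [[0, 17], [16, 127], [5, 64], [33, 34]]

send = dispatch_counts(local_tokens, NUM_EXPERTS, EP_SIZE)
print("experts per rank:", NUM_EXPERTS // EP_SIZE)
print("token slots sent to each rank:", send)
```

Each rank stores only 16 of the 128 experts here, which is where the memory saving comes from. The price is the all-to-all exchange counted in send, and that exchange is the communication a fused dispatch aims to overlap with expert compute.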
To train Nemotron 3 Nano 30B A3B with Expert Parallelism across 8 GPUs, one adds the distributed mesh configuration:

    import os

    import torch
    import torch.distributed as dist

    from nemo_automodel import NeMoAutoModelForCausalLM
    from nemo_automodel.recipes._dist_utils import create_distributed_setup_from_config

    # Join the process group and bind this process to its local GPU.
    dist.init_process_group(backend="nccl")
    torch.manual_seed(0)
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    # FSDP2 sharding, with the experts split across 8 GPUs.
    dist_setup = create_distributed_setup_from_config(
        {
            "strategy": "fsdp2",
            "ep_size": 8,
        },
    )

    model = NeMoAutoModelForCausalLM.from_pretrained(
        "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
        dtype=torch.bfloat16,
        distributed_setup=dist_setup,
    )

    dist.destroy_process_group()

This gives speed, scalability, and memory optimizations with FSDP2, Expert Parallelism, TransformerEngine kernels, and DeepEP dispatch, all from a single from_pretrained() call.

Performance Comparison

We evaluated NeMo AutoModel in two regimes: full fine-tuning of a frontier-scale 550B model across 16 nodes, and training two 30B MoE models on a single node. The 550B result shows why Expert Parallelism is essential at scale; the 30B results quantify the per-GPU speedup over Transformers v5.

Nemotron 3 Ultra 550B A55B (full fine-tune, multi-node)

Nemotron 3 Ultra 550B A55B is a 550B-parameter hybrid model that ships with Mamba2, LatentMoE, and Multi-Token Prediction (MTP). We benchmark a full fine-tune: every parameter is updated and the Adam optimizer state is materialized, which at this scale spans 16 H100 nodes (128 GPUs).

Methodology:

| Parameter | Value |
|---|---|
| Hardware | 16 nodes of H100 80GB (128 GPUs) |
| Expert Parallelism | EP=64 |
| Local batch size | 2 |
| Sequence length | 4,096 |
| Features | MTP, activation checkpointing, fused linear cross-entropy |
| Kernels | DeepEP dispatch + torch_mm experts + TransformerEngine |

| Metric | NeMo AutoModel (EP=64) |
|---|---|
| TPS/GPU (avg) | 815 |
| TFLOP/s/GPU | ~293 |
| Peak Memory | 58.2 GiB |

Why there is no Transformers v5 column. Transformers v5 runs out of memory at this scale, so there is no v5 number to report here.
AutoModel's Expert Parallelism shards the experts across GPUs to bring the footprint within budget, which is what lets the full fine-tune run. The 30B comparisons below show the same advantage where v5 fits.

Single-node 30B MoE benchmarks

We benchmarked three approaches on a single node with 8x H100 80GB GPUs: HF Transformers v4 (hub code), HF Transformers v5 (with best available optimizations), and NeMo AutoModel (EP=8 + custom kernels).

Methodology:

| Parameter | Value |
|---|---|
| Hardware | 8x H100 80GB (single node) |
| Sequence length | 4,096 |
| Local batch size | 1 |

A note on the routing gate. The NeMo AutoModel numbers below use a balanced routing gate, which forces tokens to be distributed uniformly across experts. This emulates the ideal operating point an MoE is trained toward: a well-trained model's load-balancing loss drives expert utilization to near-uniform, so balanced routing reflects the steady state a real workload converges to (and removes the straggler noise that random dummy tokens otherwise inject into expert parallelism). v4/v5 run their native router on the same dummy tokens. The balanced gate therefore measures NeMo AutoModel at its target MoE operating point, and the v4/v5 columns reflect their out-of-the-box behavior.

Qwen3-30B-A3B

| Metric | v4 | v5 (FA2 + grouped_mm) | NeMo AutoModel (EP=8) | v5 → NeMo AutoModel |
|---|---|---|---|---|
| TPS/GPU (avg) | deadlock | 3,075 | 11,340 | 3.69x |
| Peak Memory | n/a | 68.2 GiB | 48.1 GiB | -29% |
| Avg Forward+Loss | n/a | 582 ms | 194 ms | 3.00x |
| Avg Backward | n/a | 758 ms | 178 ms | 4.26x |

Why v4 deadlocks: Transformers v4 stores Qwen3 MoE experts as a ModuleList of 128 individual MLP modules, each separately FSDP-wrapped. The forward pass uses a data-dependent loop that only iterates over experts that received tokens. With different data per rank, different ranks skip different experts, causing mismatched FSDP AllGather/ReduceScatter collectives and an indefinite hang. Transformers v5 fixes this by storing experts as fused 3D parameter tensors (no per-expert modules, no per-expert FSDP collectives).
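The deadlock mechanism described above can be simulated without any GPUs. The sketch below models only the order in which each rank would launch per-expert collectives; it does not call FSDP or torch.distributed, and the function names and per-expert token counts are hypothetical values chosen for illustration.

```python
def collectives_issued(tokens_per_expert):
    """Collectives a rank launches when its loop skips experts with no tokens."""
    return [
        f"all_gather(expert_{expert_id})"
        for expert_id, count in enumerate(tokens_per_expert)
        if count > 0
    ]


def first_mismatch(seq_a, seq_b):
    """First step at which two ranks disagree on the collective, else None."""
    for step, (call_a, call_b) in enumerate(zip(seq_a, seq_b)):
        if call_a != call_b:
            return step
    if len(seq_a) != len(seq_b):
        return min(len(seq_a), len(seq_b))
    return None


# Hypothetical token counts per expert; each rank saw different data.
rank0_load = [3, 0, 2, 1]  # expert 1 is idle on rank 0
rank1_load = [1, 4, 0, 1]  # expert 2 is idle on rank 1

rank0_calls = collectives_issued(rank0_load)
rank1_calls = collectives_issued(rank1_load)

print("rank 0:", rank0_calls)
print("rank 1:", rank1_calls)
print("first mismatched step:", first_mismatch(rank0_calls, rank1_calls))

# With a fused expert tensor, every rank issues the same single collective
# regardless of which experts received tokens.
fused_calls = ["all_gather(fused_experts)"]
print("fused mismatch:", first_mismatch(fused_calls, fused_calls))
```

At the mismatched step, rank 0 waits on expert 2 while rank 1 waits on expert 1. A collective only completes when all ranks enter the same call, so neither proceeds. The fused layout removes the data dependence from the collective schedule.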
Nemotron 3 Nano 30B A3B

| Metric | v4 (hub code) | v5 (FA2 + grouped_mm + Mamba CUDA) | NeMo AutoModel (EP=8) | v5 → NeMo AutoModel |
|---|---|---|---|---|
| TPS/GPU (avg

Original source: Hugging Face Blog
