As organizations strive to maximize the value of their generative AI investments, access to the latest model developments is crucial for continued success. By using state-of-the-art models on Day-0, teams can harness these innovations efficiently, maintain relevance, and be competitive.
The past year has seen a flurry of exciting model series releases in the open-source community, including Meta Llama; Google Gemma; Mistral's Codestral, Codestral Mamba, Large 2, and Mixtral; Qwen 2, 2.5, and 3; DeepSeek-R1; NVIDIA Nemotron; and NVIDIA Llama Nemotron. These models are often made available on the Hugging Face Hub, providing the broader community with easy access.
Shortly after release, many users focus on evaluating model capabilities and exploring potential applications. Fine-tuning for specific use cases often becomes a key priority to gain an understanding of the models' potential and to identify opportunities for innovation.
The NVIDIA NeMo Framework uses the NVIDIA Megatron-Core and Transformer Engine (TE) backends to achieve high throughput and Model FLOPs Utilization (MFU) on thousands of NVIDIA GPUs, driving exceptional performance. However, integrating new model architectures into the NeMo Framework requires multi-stage model conversion using Megatron-Core primitives, followed by validation of different phases, including supervised and parameter-efficient fine-tuning, model evaluation, and Hugging Face to NeMo conversion. This introduces a time delay between model release and optimal training/post-training recipe development.
To ensure Day-0 support for the latest models, the NeMo Framework introduces the Automatic Model (AutoModel) feature.
Introducing AutoModel in NVIDIA NeMo Framework
Figure 1. AutoModel provides seamless integration with Hugging Face models and the broader ecosystem; models can be fine-tuned, deployed, and evaluated right away.
AutoModel is a library within NeMo Framework (i.e. NeMo AutoModel) that offers GPU-accelerated, Hugging Face-native and PyTorch-native training to provide Day-0 fine-tuning for every LLM and most VLMs available on the Hugging Face Hub.
In addition, the Hugging Face-native integration bypasses the need for checkpoint conversion, allowing users to leverage any Hugging Face-compatible software, such as vLLM and SGLang.
With a lightweight dependency footprint, AutoModel is available on GitHub and PyPI, ready to install with a single command or integrate effortlessly into your existing Hugging Face workflows:
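```bash
# Install from GitHub (the package will also be available on PyPI soon)
pip3 install git+https://github.com/NVIDIA-NeMo/Automodel.git
```

AutoModel also ships inside the NeMo Framework container:

```bash
docker run nvcr.io/nvidia/nemo:25.07
```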
AutoModel Core Principles
AutoModel is built around four core principles that guide our decision-making process:
Day-0 Hugging Face support: Works out-of-the-box with every LLM and most VLMs on the Hugging Face Hub. Models scale transparently from a single GPU to multi-node clusters with FSDP-2 and TP/CP.
Flexibility & Modularity: Every component (data module, optimizer, scheduler, checkpoint I/O, distributed backend) is swappable. Keep the pieces you like, replace the ones you don't.
Trainer-Free: No rigid "trainer" god-class in the style of Lightning or the HF Trainer. Your training script remains linear Python, so all logic is visible, debuggable, and hackable.
Lightweight & Torch-Native: Built entirely on PyTorch 2 primitives (DDP, FSDP, TorchCompile, torch.distributed), delivering maximum compatibility and performance with minimal extra dependencies.
Key Features of AutoModel
AutoModel currently offers the following key capabilities:
Model Parallelism: Supports Fully Sharded Data Parallel 2 (FSDP2) and Distributed Data Parallel (DDP). In addition, it supports Tensor Parallelism (TP) and Context Parallelism (CP) (engineering preview).
PyTorch JIT Compilation: Enhanced performance via Just-In-Time compilation.
Custom Triton Kernels: To support key operations (e.g., LoRA) and enhance performance.
Training Recipes: Use optimized recipes to fine-tune your model or modify one to use your data.
Export Support: Export to vLLM; TensorRT-LLM and SGLang export coming soon.
Checkpoint Interoperability (Hugging Face-native): AutoModel keeps checkpoints in the standard Hugging Face format, so fine-tuned weights can be loaded directly by any library, service, or runtime that understands the HF format—no conversion required.
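Because checkpoints stay in the standard Hugging Face format, a fine-tuned AutoModel checkpoint can be loaded straight back with the stock transformers autoclasses. A minimal sketch (the checkpoint path is the illustrative one used later in this post):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative path to an AutoModel fine-tuned checkpoint, saved in standard HF format
ckpt_path = "checkpoints/epoch_0_step_10/model/"

# No conversion step needed: plain transformers loading works directly
tokenizer = AutoTokenizer.from_pretrained(ckpt_path)
model = AutoModelForCausalLM.from_pretrained(ckpt_path)
```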
By using the Hugging Face ecosystem, AutoModel enables effortless integration of a vast array of LLMs, without requiring explicit checkpoint rewrites. All models are natively supported, with a subset of the most popular also receiving optimized Megatron-Core support for speed-of-light throughput. Figure 2 shows the two options a user has, while Table 1 gives a detailed comparison of the two options. Users can start on the AutoModel path and seamlessly transition to the Megatron-Core path as their experiments scale.
Figure 2. NeMo framework training workflow with the new AutoModel path that offers Day 0 support
| Feature | Megatron-Core Backend | AutoModel Backend |
|---|---|---|
| Coverage | Most popular LLMs with recipes tuned by experts | All models supported in Hugging Face on Day-0 |
| Training Throughput Performance | Optimal throughput with Megatron-Core | Good performance with Liger kernels, Cut Cross-Entropy, and PyTorch JIT |
| Scalability | Thousands of GPUs with full 4-D parallelism (TP, PP, CP, EP) | Comparable scalability using PyTorch-native TP, CP, and FSDP2 at slightly reduced training throughput |
| Inference Path | Export to TensorRT-LLM, vLLM, or directly to NVIDIA NIM | Export to vLLM |
Table 1. Comparison of the two backends in the NeMo framework: Megatron-Core and AutoModel
How to Use AutoModel
NeMo AutoModel provides a range of recipe configs for supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT).
Ready-to-Use Recipes
The following recipes demonstrate how to fine-tune an LLM (meta-llama/Llama-3.2-1B on SQuAD) and a VLM (google/gemma-3-4b-it) with various setups, from single-GPU to multi-node.
A recipe defines an end-to-end workflow (data → training → eval) for a variety of tasks, such as training, fine-tuning, and knowledge distillation, and can be configured with YAML files.
The [introductory guide](https://docs.nvidia.com/nemo/automodel/latest/repository-structure.html) provides more information about recipes and the repository/code structure.
Example Recipe Configuration
Below is a minimal YAML configuration that showcases the key fields understood by the LLM fine-tune recipe. Feel free to copy it as a starting point and tweak any value.
```yaml
model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B  # The Hugging Face Hub model ID; adjust as needed.

distributed:
  _target_: nemo_automodel.components.distributed.fsdp2.FSDP2Manager
  dp_size: none            # Will use all available GPUs.
  tp_size: 1               # Set this to greater than one to enable tensor parallelism.
  cp_size: 1
  sequence_parallel: false

loss_fn:
  _target_: nemo_automodel.components.loss.linear_ce.FusedLinearCrossEntropy

dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad  # We will use the SQuAD dataset.
  split: train

# We will use PEFT (LoRA) for fine-tuning.
peft:
  _target_: nemo_automodel.components._peft.lora.PeftConfig
  match_all_linear: True
  dim: 8
  alpha: 32
  use_triton: True  # The custom Triton kernel offers improved performance.
```
The snippet above is broken down as follows:
model: selects the causal-language-model autoclass and points to the pretrained checkpoint (pretrained_model_name_or_path). The pretrained checkpoint can be a model id from the Hugging Face Hub or a local path.
distributed: sets up parallelism using the FSDP2Manager. dp_size: none automatically uses all visible GPUs, while tp_size controls tensor parallelism, and sequence_parallel toggles sequence-level sharding.
loss_fn: chooses the training criterion; here we use FusedLinearCrossEntropy for reduced memory usage.
dataset: tells AutoModel how to fetch and preprocess data. This example downloads the SQuAD corpus from the Hugging Face Hub and preprocesses it with the make_squad_dataset function.
peft: enables parameter-efficient fine-tuning with LoRA, applying adapters to every linear layer and leveraging the Triton kernel for additional speed-ups.
You can extend this configuration with evaluation blocks, schedulers, or loggers, or override any hyperparameter directly on the command line using the --my_section.my_param=value syntax, which overwrites (or appends) the my_param parameter in the my_section section of the config.
Run a Recipe
To run a NeMo AutoModel recipe, you need a recipe script (e.g., LLM, VLM) and a YAML config file (e.g., LLM, VLM):
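A minimal sketch of a single-node launch, assuming an LLM recipe script finetune.py and a Llama 3.2 SQuAD config from the repository (the exact paths and the --config flag are illustrative placeholders; check the recipe documentation for the real ones):

```bash
# Fine-tune on 8 GPUs; any config value can be overridden from the CLI
torchrun --nproc-per-node=8 recipes/llm/finetune.py \
  --config recipes/llm/llama_3_2_1b_squad.yaml \
  --peft.dim=16   # optional command-line override of a config value
```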
Using NeMo AutoModel as a Library
NeMo AutoModel introduces two autoclasses, NeMoAutoModelForCausalLM and NeMoAutoModelForImageTextToText, with more autoclasses upcoming.
You can leverage these optimized implementations, as they become available, directly in your own codebase. For example:
```python
from nemo_automodel import NeMoAutoModelForCausalLM

# Load a pretrained model and move it to GPU (remove `.to("cuda")` if using CPU only)
model = NeMoAutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B").to("cuda")

# Generate text
output = model.generate(["Hello, my name is"])
print(output[0])
```
This allows you to use NeMo AutoModel's optimizations in your own software.
Use AutoModel checkpoints with vLLM
If you prefer serving your model with vLLM, simply point vLLM to the fine-tuned checkpoint directory that AutoModel produced (or any Hugging Face model ID).
After fine-tuning completes you can use the checkpoint right away with vLLM:
```python
from vllm import LLM, SamplingParams

# Path to your AutoModel fine-tuned checkpoint (HF format)
ckpt_path = "checkpoints/epoch_0_step_10/model/"

# Initialise vLLM with the HF-format weights
llm = LLM(model=ckpt_path, dtype="float16")

# Generate text
outputs = llm.generate(
    ["Tell me a short story about AI assistants."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```
Because AutoModel preserves the standard Hugging Face format, no conversion step is required—vLLM (or any other HF-compatible runtime) can load the weights directly.
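The same checkpoint can also be served rather than used offline. For instance, vLLM's OpenAI-compatible server can point at the checkpoint directory directly (the dtype flag is optional and shown only as one reasonable choice):

```bash
# Serve the fine-tuned checkpoint with vLLM's OpenAI-compatible API server
vllm serve checkpoints/epoch_0_step_10/model/ --dtype float16
```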
From AutoModel to the Megatron Path
Sometimes you outgrow the small-scale experiments or lightweight training loop that AutoModel provides. When you need massive scale (thousands of GPUs), pipeline, tensor, expert, and sequence parallelism, or bleeding-edge features, you can hand the very same weights over to the Megatron path within the NeMo Framework.
Megatron Bridge is a thin compatibility layer on top of NVIDIA Megatron-Core. It converts a standard Hugging Face (or AutoModel) checkpoint into Megatron's internal representation and exposes helper utilities to run Megatron's highly-optimized pre-training / fine-tuning loops. Megatron Bridge is available in the NeMo 25.07 docker container.
The migration path is intentionally frictionless:
1. Fine-tune or run quick experiments with AutoModel.
2. Point Megatron Bridge to the resulting HF-format checkpoint directory.
3. Launch Megatron-Core training with your preferred parallelism scheme.
```python
from megatron.bridge import AutoBridge

# Example: continue training the checkpoint we produced earlier.
# ckpt_path can also be any Hugging Face model id such as "meta-llama/Llama-3.2-1B"
ckpt_path = "checkpoints/epoch_0_step_10/model/"  # HF-format directory from AutoModel

# Wrap the weights with a Bridge instance
bridge = AutoBridge.from_hf_pretrained(ckpt_path)

# Export a Megatron-Core provider that plugs into existing recipes
model_provider = bridge.to_megatron_provider()

# Import a ready-made Megatron recipe (or craft your own)
import megatron.bridge.recipes.llama.llama32_1b as llama32_1b
from megatron.bridge.training.pretrain import pretrain
from megatron.bridge.training.gpt_step import forward_step

cfg = llama32_1b.pretrain_config()
cfg.model = model_provider
cfg.train.train_iters = 10  # quick sanity run

# Align sequence length & vocab size with the provider
cfg.dataset.sequence_length = cfg.model.seq_length
cfg.tokenizer.vocab_size = cfg.model.vocab_size

pretrain(cfg, forward_step)
```
Conclusion
NeMo AutoModel enables rapid experimentation with a performant implementation, natively supporting Hugging Face models without requiring model conversions. Additionally, it provides a seamless “opt-in” to the high-performance Megatron-Core path, allowing users to easily switch to optimized training with minimal code changes.
To get started, refer to the AutoModel recipes for PEFT LoRA and SFT. We also invite the developer community to share feedback and contribute code to help shape the future development of AutoModel.
Authors: @akoumpa, @shashank3959, @snowmanwwg, @bernardwin