Tiny-Megatron is a minimalistic, educational re-implementation of the Megatron-LM library for distributed deep learning. This project provides clean, understandable implementations of various parallelism strategies used in large-scale language model training.
- Tensor Parallelism (TP): Split individual layers across multiple devices
- Data Parallelism (DP): Replicate model across devices, shard data batches
- 2D Hybrid Parallelism: Combine TP and DP to scale across more GPUs
- Custom Neural Network Modules: Optimized implementations of Linear, Embedding, LayerNorm
- Automatic Kernel Selection: Runtime auto-tuner for optimal performance
- Flexible Parallel Context: Easy configuration of multi-dimensional parallelism
- Wrapper-Based Design: Non-intrusive parallelization of existing models
- Clean, Readable Code: Well-documented implementations for learning
- Modular Architecture: Each parallelism strategy is independently implemented
- Complete Examples: Full training scripts demonstrating each approach
```
Tiny-Megatron/
├── tiny_megatron/core/          # 🏗️ Core Library
│   ├── dist/                    # Distributed Parallelism
│   │   ├── tp/                  # • Tensor Parallelism (TP)
│   │   ├── dp/                  # • Data Parallelism (DP)
│   │   ├── hybrid/              # • 2D Hybrid Parallelism (TP + DP)
│   │   └── utils/               # • Communication utilities
│   ├── module/                  # Custom NN Modules
│   │   ├── linear.py            # • Optimized Linear layers
│   │   ├── embedding.py         # • Embedding layers
│   │   ├── normalization.py     # • LayerNorm implementation
│   │   └── ops/                 # • Low-level operations
│   └── autotuner/               # Performance Optimization
│       └── runtime_tuner.py     # • Automatic kernel selection
│
├── example/                     # 🚀 Training Examples
│   ├── model.py                 # • GPT-2 model implementation
│   ├── tp/train.py              # • Tensor parallelism demo
│   ├── dp/train.py              # • Data parallelism demo
│   └── hybrid/train.py          # • 2D hybrid parallelism demo
```
| Component | Purpose | Key Files |
|---|---|---|
| Distributed Parallelism | Core parallel strategies | dist/{tp,dp,hybrid}/ |
| Custom Modules | Optimized NN building blocks | module/{linear,embedding}.py |
| ParallelContext | Multi-dimensional coordination | dist/utils/comm.py |
| Auto-tuner | Performance optimization | autotuner/runtime_tuner.py |
| Examples | Complete training demos | example/{tp,dp,hybrid}/ |
- Python 3.8+
- PyTorch 2.0+ with CUDA support
- NCCL for multi-GPU communication
```bash
git clone https://github.com/liangyuwang/Tiny-Megatron.git
cd Tiny-Megatron
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install tqdm
```

```bash
# Split model layers across 2 GPUs
torchrun --nproc_per_node=2 example/tp/train.py

# Replicate model, distribute data batches
torchrun --nproc_per_node=2 example/dp/train.py

# Combine TP and DP: TP=2 x DP=2
torchrun --nproc_per_node=4 example/hybrid/train.py
```

To apply tensor parallelism to an existing model in your own script:

```python
import torch
from tiny_megatron.core import ParallelContext, apply_tensor_parallel
from example.model import GPT2Model, GPTConfig
# Initialize distributed environment
# ... (distribution setup code)
# Create model and parallel context
config = GPTConfig()
model = GPT2Model(config).cuda()
# Configure parallelism
parallel_config = {"tp": 2} # Use 2 GPUs for tensor parallelism
context = ParallelContext(parallel_config)
# Apply tensor parallelism
tp_config = {
"column_linear_names": ["attn.c_attn", "mlp.c_fc"],
"row_linear_names": ["attn.c_proj", "mlp.c_proj"]
}
model = apply_tensor_parallel(
model=model,
parallel_context=context,
tp_config=tp_config
)
# Train normally
optimizer = torch.optim.AdamW(model.parameters())
for batch in dataloader:
loss = model(batch)
loss.backward()
optimizer.step()from tiny_megatron.core import ParallelContext, apply_hybrid_parallel
# Configure 2D parallelism for 4 GPUs
parallel_config = {
"tp": 2, # 2-way tensor parallelism
"dp": 2 # 2-way data parallelism
}
context = ParallelContext(parallel_config)
# Apply 2D hybrid parallelism
tp_config = {
"column_linear_names": ["attn.c_attn", "mlp.c_fc"],
"row_linear_names": ["attn.c_proj", "mlp.c_proj"]
}
model = apply_hybrid_parallel(
model=model,
parallel_context=context,
tp_config=tp_config
)- Column Parallel: Split weight matrices column-wise (e.g., attention projections)
- Row Parallel: Split weight matrices row-wise (e.g., MLP layers)
- Communication: All-gather for activations, all-reduce for gradients
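The following is a minimal single-process sketch of the math behind column- and row-parallel linear layers, written in plain PyTorch for illustration (it is not Tiny-Megatron's actual modules); the per-rank weight shards and the final sum stand in for the distributed shards and the all-reduce:

```python
import torch

# Single-process illustration of column/row parallel linear layers.
torch.manual_seed(0)
tp_size = 2
x = torch.randn(4, 8)       # [batch, hidden]
w1 = torch.randn(16, 8)     # first weight (e.g., mlp.c_fc), shape [out, in]
w2 = torch.randn(8, 16)     # second weight (e.g., mlp.c_proj)

# Column parallel: shard the output dimension of the first weight,
# so each "rank" computes a slice of the intermediate activation.
w1_shards = w1.chunk(tp_size, dim=0)
partial_h = [x @ w.t() for w in w1_shards]

# Row parallel: shard the input dimension of the second weight; the
# per-rank partial outputs are summed -- the role played by all-reduce.
w2_shards = w2.chunk(tp_size, dim=1)
y = sum(h @ w.t() for h, w in zip(partial_h, w2_shards))

# Matches the unsharded computation.
reference = (x @ w1.t()) @ w2.t()
print(torch.allclose(y, reference, atol=1e-5))  # True
```

In the real modules each shard lives on a different GPU, and the concatenation/summation above is carried out with NCCL collectives instead of Python lists.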
Data parallelism replicates the model and shards the data (a minimal sketch follows this list):

- Model Replication: Same model on each device
- Data Sharding: Different data batches per device
- Gradient Synchronization: All-reduce after backward pass
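Below is a generic sketch of this pattern in plain torch.distributed, for illustration only (it is not the Tiny-Megatron DP wrapper; the gloo backend and manual all-reduce are just to keep the example self-contained):

```python
import torch
import torch.distributed as dist

# Illustration of data-parallel training with manual gradient all-reduce.
# Launch with: torchrun --nproc_per_node=2 <this_file>.py
dist.init_process_group(backend="gloo")       # use "nccl" for multi-GPU runs
rank, world_size = dist.get_rank(), dist.get_world_size()

torch.manual_seed(0)                          # identical replica on every rank
model = torch.nn.Linear(8, 2)
data = torch.randn(16, 8)
shard = data.chunk(world_size)[rank]          # each rank sees its own batch shard

loss = model(shard).sum()
loss.backward()

# After backward: average gradients across ranks so the replicas stay in sync.
for p in model.parameters():
    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
    p.grad /= world_size
```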
2D hybrid parallelism combines both strategies (a rank-layout sketch follows this list):

- Combined Strategy: Tensor Parallelism (TP) + Data Parallelism (DP)
- Flexible Configuration: Support various TP and DP combinations
- Efficient Scaling: Optimal resource utilization for medium-scale training
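For intuition, here is one common way to lay 4 ranks out on a 2x2 grid; this ordering is an assumption for illustration, the actual mapping is owned by ParallelContext:

```python
# Hypothetical rank layout for tp=2, dp=2 (illustration only).
tp_size, dp_size = 2, 2

for rank in range(tp_size * dp_size):
    tp_rank = rank % tp_size    # which shard of each layer this rank holds
    dp_rank = rank // tp_size   # which slice of the data this rank trains on
    print(f"global rank {rank} -> tp_rank {tp_rank}, dp_rank {dp_rank}")
# global rank 0 -> tp_rank 0, dp_rank 0
# global rank 1 -> tp_rank 1, dp_rank 0
# global rank 2 -> tp_rank 0, dp_rank 1
# global rank 3 -> tp_rank 1, dp_rank 1
```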
The ParallelContext provides central coordination for multi-dimensional parallelism:

```python
context = ParallelContext({
    "tp": tensor_parallel_size,
    "dp": data_parallel_size
})
```

The custom NN modules are optimized implementations with built-in parallelism support:

- Linear: Matrix multiplication with automatic kernel selection (sketched below)
- Embedding: Token/position embeddings
- LayerNorm: Layer normalization
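As a rough illustration of what "automatic kernel selection" means, here is a generic sketch with made-up candidate kernels (not the actual logic in autotuner/runtime_tuner.py): benchmark each candidate once, then keep using the fastest.

```python
import time
import torch

# Generic runtime kernel selection (illustration only).
def pick_fastest(candidates, *args, warmup=10, iters=100):
    timings = {}
    for name, fn in candidates.items():
        for _ in range(warmup):            # warm up caches / lazy init
            fn(*args)
        start = time.perf_counter()
        for _ in range(iters):             # measure steady-state latency
            fn(*args)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get)   # reuse the winner for later calls

x, w = torch.randn(32, 128), torch.randn(256, 128)
candidates = {
    "matmul": lambda a, b: a @ b.t(),
    "functional_linear": lambda a, b: torch.nn.functional.linear(a, b),
}
print(pick_fastest(candidates, x, w))
```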
The runtime auto-tuner automates this kind of kernel selection:

```python
tuner = RuntimeAutoTuner(
    warmup_iterations=10,
    measure_iterations=100
)
```

Distributed runs use the standard PyTorch environment variables (torchrun sets these for you):

```bash
export MASTER_ADDR=localhost
export MASTER_PORT=29500
export WORLD_SIZE=4
export LOCAL_RANK=0
```

Parallelism is configured with a plain dictionary:

```python
parallel_config = {
    "tp": 2,  # Tensor parallel size
    "dp": 2,  # Data parallel size
}
```

Each parallelism strategy includes a complete training example:

- example/tp/train.py: Tensor parallelism with GPT-2
- example/dp/train.py: Data parallelism training
- example/hybrid/train.py: 2D hybrid parallelism demo
- ✅ Tensor Parallelism (TP): Column and row parallelism for linear layers
- ✅ Data Parallelism (DP): Standard gradient synchronization
- ✅ 2D Hybrid Parallelism: TP + DP combinations
To maintain code simplicity and readability, we are currently focusing on TP and DP implementations. Future releases will include:
- 🔄 Pipeline Parallelism (PP): Layer-wise model partitioning
- 🔄 ZeRO Optimizer States: Memory-efficient optimizer state sharding
- 🔄 Expert Parallelism (EP): Mixture-of-experts model scaling
- 🔄 Sequence Parallelism (SP): Sequence dimension parallelism for long contexts
- 🔄 5D Hybrid Parallelism: TP + EP + SP + DP (ZeRO) + PP combinations
These advanced strategies will be added incrementally while maintaining the educational and minimalistic nature of the codebase.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Megatron-LM: Original Megatron library
- Tiny-FSDP: Minimalistic PyTorch FSDP re-implementation
- Tiny-DeepSpeed: Minimalistic DeepSpeed re-implementation
If you use Tiny-Megatron in your research, please cite:
```bibtex
@misc{tiny-megatron,
  title={Tiny-Megatron: A Minimalistic Re-implementation of Megatron-LM},
  author={Liangyu Wang},
  year={2024},
  url={https://github.com/liangyuwang/Tiny-Megatron}
}
```