Training Pipeline Engineering
Training pipelines are infrastructure problems with a math hat.
The Pipeline Components
A training pipeline has four core components. Each is a potential failure point:
"""
Training pipeline components:
1. Data Management (DVC, LakeFS, Delta Lake)
- Version-control your data
- Track which data was used for which model
- Reproduce any experiment exactly
- Without version control: "I'm not sure what data was used"
- With version control: "Run dvc checkout dataset-v12.3 and rebuild"
2. Experiment Tracking (W&B, MLflow, Neptune)
- Log hyperparameters, metrics, artifacts
- Compare experiments visually
- Track GPU hours per experiment
- Without tracking: "I can't find which hyperparameters worked"
- With tracking: "Compare experiments by loss, lr, epoch, GPU hours
3. Distributed Training (FSDP, DeepSpeed, Megatron)
- Split computation across multiple GPUs
- Synchronize gradients
- Handle communication overhead
- Without distributed: one GPU = slow training = wasted compute
- With distributed: training 4x faster, communication cost < training cost
4. Checkpointing (save/restore)
- Save model state periodically
- Resume training from last checkpoint
- Save the best model based on validation performance
- Without checkpointing: crash = restart = lost training time
- With checkpointing: crash = resume = 0% training time lost
"""
Data Versioning (Why You Need DVC)
"""
Data versioning is like git for datasets.
Without DVC:
data/raw/
data/processed/
data/final/ ← who knows which is the right one?
With DVC:
data/
├── raw.dvc → hash of raw data (unchanged between versions)
├── processed.dvc → hash of processed data
└── final.dvc → hash of training data
git diff: shows which data changed
dvc checkout: reproduces the exact dataset used for this commit
"""
Experiment Tracking (Why You Need MLflow/W&B)
"""
Experiment tracking lets you:
1. Reproduce any experiment (hyperparameters + metrics = reproducibility)
2. Compare experiments (which lr worked best?)
3. Track GPU hours (which experiments were too expensive?)
4. Find the best model (which checkpoint had lowest validation loss?)
MLflow example:
import mlflow
with mlflow.start_run() as run:
mlflow.log_param("lr", 3e-4)
mlflow.log_param("epochs", 3)
mlflow.log_metric("train_loss", 2.5)
mlflow.log_metric("val_loss", 2.8)
mlflow.log_artifact("best_model.pt")
W&B example:
import wandb
wandb.init(project="my-model", config={"lr": 3e-4, "epochs": 3})
wandb.log({"train_loss": 2.5, "val_loss": 2.8})
wandb.run.log_model("best_model.pt")
Choose:
- MLflow: open-source, self-hosted, more flexible
- W&B: cloud-native, beautiful UI, better visualization
- Neptune: enterprise-grade, good for teams
"""
Distributed Training Patterns
"""
Distributed training patterns, ranked by complexity:
1. Data Parallelism (DistributedDataParallel — DDP/FSDP)
- Each GPU has the full model, processed different data batches
- Synchronize gradients after each step
- Best for: most models, moderate scale (up to 16 GPUs easily)
- Complexity: easy
2. Tensor Parallelism
- Split model weights across GPUs
- Each GPU has a piece of the matrix
- Best for: large models (70B+) on multiple GPUs
- Complexity: moderate
3. Pipeline Parallelism
- Split layers across GPUs (layer 1-10 on GPU 1, layer 11-20 on GPU 2)
- Each GPU processes different layer at different time
- Best for: very deep models on limited GPU memory
- Complexity: high
4. Hybrid (Tensor + Pipeline + Data)
- Combine all three
- Best for: training massive models (100B+ tokens)
- Complexity: extreme
- Example: PaLM used 256 GPUs with tensor-pipeline-data parallelism
For most teams: Data Parallelism (FSDP) + Tensor Parallelism (Megatron)
This gets you to 70B models with 8-16 GPUs. That's where most people stop.
"""
Checkpoint Strategy (Save What Matters)
"""
Best practices for checkpointing:
1. Save every N steps (configurable, typically 500-2000)
- Frequent saves: less data loss on crash
- Infrequent saves: less disk I/O
- Sweet spot: every 1000 steps for normal training, every 500 for unstable
2. Save best model (lowest validation loss) separately
- Always keep the best checkpoint
- Name it clearly: "best_model_step_5000.pt"
3. Use incremental checkpointing
- Don't overwrite previous checkpoints
- Keep last N checkpoints (or use checkpointing to save space
4. Use checkpoint compression (save as .tar.gz or bfloat16)
- Reduces disk space by 2-4x
- Minimal impact on training speed
5. Store checkpoints in object storage (S3, GCS, etc.)
- Not on local disk
- Local disk can fill up, corrupt, or disappear
- Object storage survives hardware failure
Checkpointing checklist:
- [ ] Automatic checkpointing every N steps
- [ ] Best model saved separately
- [ ] Checkpoints stored in object storage
- [ ] Checkpoint compression enabled
- [ ] Resume from checkpoint works (test this!)
"""
Practical Pipeline Example
"""
Minimal training pipeline (what you need to build):
class TrainingPipeline:
def __init__(self):
self.data_manager = DVCManager() # Version control
self.experiment_tracker = MLflowTracker() # Track experiments
self.checkpoint_saver = CheckpointManager() # Save checkpoints
self.distributed_trainer = DDPTrainer() # Train on multiple GPUs
def run(self, config):
# 1. Setup data
train_data = self.data_manager.load("train-v12.3")
val_data = self.data_manager.load("val-v12.3")
# 2. Start experiment
run = self.experiment_tracker.start_run(config)
# 3. Train
for epoch in range(config.epochs):
train_loss = self.distributed_trainer.train(train_data)
val_loss = self.distributed_trainer.validate(val_data)
# Log metrics
self.experiment_tracker.log_metrics(epoch, train_loss, val_loss)
# Save checkpoint
if epoch % 10 == 0:
self.checkpoint_saver.save(
model=self.distributed_trainer.model,
epoch=epoch,
val_loss=val_loss
)
# 4. Final metrics
final_metrics = self.distributed_trainer.evaluate(val_data)
self.experiment_tracker.finish_run(final_metrics)
"""
Summary
Training pipelines are infrastructure:
- Data management: Version control your data. DVC is the standard.
- Experiment tracking: MLflow for open-source, W&B for cloud.
- Distributed training: FSDP for most cases. Tensor + Pipeline for large models.
- Checkpointing: Every N steps, best model, object storage.
Build with reproducibility as the first priority. If you can’t reproduce it, it didn’t happen.