Reproducibility
Reproducibility in ML means being able to recreate the exact results of a training run or experiment. It is a prerequisite for debugging, scientific rigor, and safe model deployment.
Sources of Non-Reproducibility
Random seeds: weight initialization, data shuffling, dropout, data augmentation, and batch sampling all use random number generators. Without fixing seeds, results differ across runs.
Non-deterministic GPU operations: some CUDA kernels are non-deterministic. cuDNN may select a different convolution algorithm at runtime, and parallel reductions built on atomic operations accumulate values in a varying order. Results may therefore differ even with fixed seeds.
Floating-point non-associativity: because every floating-point operation rounds its result, $(a + b) + c$ is not always equal to $a + (b + c)$. Parallel reductions that add the same values in a different thread order therefore give slightly different results.
Framework/library versions: different versions of PyTorch, cuDNN, or CUDA can produce different numerical results. A gradient computation bug fixed in one version may change outputs.
Data pipeline: shuffling order, preprocessing randomness, augmentation stochasticity.
Distributed training: the order of gradient reductions and parameter updates may vary with the number of GPUs and the communication pattern.
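The floating-point item above can be checked directly in Python. This is a minimal sketch using only the standard library; the helper `naive_sum` and the generated values are illustrative assumptions, not taken from any real training run.

```python
import math
import random

# Grouping changes the rounded result even for three small numbers.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False


def naive_sum(xs):
    """Accumulate left to right, rounding after every addition."""
    total = 0.0
    for x in xs:
        total += x
    return total


# The same values in a different order, as a parallel reduction might see them.
rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-8, 8) for _ in range(10_000)]
shuffled = values[:]
rng.shuffle(shuffled)

# Naive accumulation depends on the order; math.fsum tracks the exact sum
# and rounds once, so it does not.
print(naive_sum(values) - naive_sum(shuffled))
print(math.fsum(values) == math.fsum(shuffled))  # True
```

The difference printed by the naive sums is typically tiny, but it is enough to break bit-for-bit comparisons between runs.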
Fixing Random Seeds
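import os
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch (CPU and CUDA) random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    # Only affects subprocesses started from here. To fix str hashing in this
    # process, PYTHONHASHSEED must be set before the interpreter starts.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Force deterministic cuDNN operations (may reduce speed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False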
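torch.backends.cudnn.deterministic = True restricts cuDNN to deterministic algorithms, and torch.backends.cudnn.benchmark = False stops the autotuner from picking a different algorithm on each run. Both can slow training; a penalty of roughly 10-30% is possible, depending on the model and hardware.
torch.use_deterministic_algorithms(True) goes further: PyTorch uses deterministic implementations where they exist and raises a RuntimeError when an operation has none. It is available since PyTorch 1.8; the warn_only=True option, which downgrades the error to a warning, was added in 1.11. On CUDA 10.2 or later, some cuBLAS operations also require the environment variable CUBLAS_WORKSPACE_CONFIG to be set to :4096:8 or :16:8.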
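One detail of seeding that is easy to miss is that PYTHONHASHSEED is read when the interpreter starts, so assigning it inside a running process only affects child processes. The sketch below illustrates this with the standard library alone; the helper name `hash_in_subprocess` is hypothetical and introduced only for this example.

```python
import os
import subprocess
import sys


def hash_in_subprocess(text: str, hash_seed: str) -> int:
    """Return hash(text) as computed by a fresh interpreter that was
    started with PYTHONHASHSEED=hash_seed in its environment."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Two separate interpreter launches with the same hash seed agree.
first = hash_in_subprocess("reproducibility", "0")
second = hash_in_subprocess("reproducibility", "0")
print(first == second)  # True
```

In practice this means exporting PYTHONHASHSEED in the launch script or job definition rather than relying on the assignment inside the training code.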
Environment Pinning
Pin every dependency to an exact version.
requirements.txt with exact versions:
torch==2.3.0
transformers==4.40.0
numpy==1.26.4
conda environment.yml: pin package versions and Python version.
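Docker: an image captures the OS userland, CUDA toolkit, Python version, and all installed packages. The host's GPU driver and kernel are not part of the image, and a base image tag can later point to different contents, so pin the base image by digest in addition to pinning package versions.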
Poetry / uv: modern Python dependency managers with lockfiles that pin transitive dependencies.
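Pinning says what should be installed; it can also help to record what actually was installed when a run started. The following is a minimal sketch using the standard library's `importlib.metadata`; the function name `snapshot_environment` and the layout of the returned dictionary are assumptions made for illustration.

```python
import importlib.metadata
import json
import platform


def snapshot_environment() -> dict:
    """Collect the interpreter version, platform string, and the version
    of every installed distribution visible to this interpreter."""
    packages = {}
    for dist in importlib.metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = snapshot_environment()
# The snapshot is plain JSON, so it can be logged next to the run's metrics.
print(json.dumps(snapshot, indent=2)[:300])
```

Comparing two such snapshots is a quick way to spot a dependency that drifted between runs.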
Data Reproducibility
Version the dataset: use DVC or a data catalog to pin exact dataset versions to each experiment.
Store the preprocessing script + version: not just the processed data, but the code that produced it.
Log data statistics: number of examples per split, class distribution, mean/std of features. Detect accidental data changes.
Deterministic splits: use a fixed random seed for splitting. Better: use a hash of a stable key (user ID, item ID) to assign records to splits deterministically.
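Hash-based split assignment, as described in the last item, can be sketched as follows. The function name `assign_split`, the 80/10/10 proportions, and the key format are illustrative assumptions; the important property is that a record's split depends only on its stable key, not on dataset order or size.

```python
import hashlib


def assign_split(key: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a stable key (for example a user ID) to 'train', 'val', or 'test'.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


# The assignment is stable across calls, processes, and machines.
print(assign_split("user_123") == assign_split("user_123"))  # True
```

Because the assignment ignores the rest of the dataset, adding new records never moves an existing record between splits, which a seeded shuffle does not guarantee.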
Code Reproducibility
Git commit hash: log the exact commit hash (and ideally a git diff for uncommitted changes) with every experiment.
No uncommitted changes in production runs: enforce via CI that model training runs only on clean commits.
Experiment tracking: log commit hash, config file, and code snapshot (MLflow, W&B) with every run.
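Capturing the commit hash and the dirty state at the start of a run could look like the sketch below. It shells out to the `git` command line; the function names and the shape of the returned dictionary are assumptions for illustration, and the values are `None` when `git` is missing or the working directory is not a repository.

```python
import subprocess
from typing import Optional


def _git(*args: str) -> Optional[str]:
    """Run a git command and return its stripped stdout, or None on failure."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def git_provenance() -> dict:
    """Return the current commit hash and whether the working tree is dirty."""
    commit = _git("rev-parse", "HEAD")
    status = _git("status", "--porcelain")  # empty output means a clean tree
    dirty = None if status is None else bool(status)
    return {"commit": commit, "dirty": dirty}


provenance = git_provenance()
print(provenance)
```

A training entry point could refuse to start when `dirty` is true, or store the output of `git diff` alongside the run instead.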
Checkpoint Reproducibility
Save optimizer state: a checkpoint that includes the model weights, optimizer state ($m$, $v$ in Adam), learning rate schedule state, and epoch/step count allows resuming a run exactly.
Random state: save and restore the state of all random number generators at the checkpoint.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
    "epoch": epoch,
    "py_rng_state": random.getstate(),  # Python's built-in RNG
    "np_rng_state": np.random.get_state(),
    "rng_state": torch.get_rng_state(),  # PyTorch CPU RNG
    "cuda_rng_state": torch.cuda.get_rng_state_all(),  # one state per GPU
}
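Saving the generator state is only half of the pattern; the state must also be restored before training continues. The sketch below shows the round trip with Python's built-in `random` module so that it runs without extra dependencies. The same idea applies to the other generators through `np.random.set_state`, `torch.set_rng_state`, and `torch.cuda.set_rng_state_all`. The seeds and draw counts are arbitrary illustrative values.

```python
import random

random.seed(1234)
_ = [random.random() for _ in range(5)]  # draws made before the checkpoint

saved_state = random.getstate()  # what would be stored in the checkpoint
expected = [random.random() for _ in range(3)]  # next draws of an uninterrupted run

# Simulate a fresh process whose generator is in an unrelated state.
random.seed(999)

# Restore the saved state; the resumed run continues the original sequence.
random.setstate(saved_state)
resumed = [random.random() for _ in range(3)]

print(resumed == expected)  # True
```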
Levels of Reproducibility
| Level | Description | Requirement |
|---|---|---|
| Numerical | Bit-for-bit identical outputs | Fixed seeds + deterministic ops + identical hardware and library versions |
| Statistical | Same metrics within noise | Fixed seeds + same code/data |
| Scientific | Same conclusions | Same methodology, data, and evaluation |
| Functional | Same production behavior | Equivalent behavior on acceptance tests; sufficient for deployment |
Bit-for-bit reproducibility is often impractical across different hardware or library versions. Statistical reproducibility (same metrics ± small variance) is the realistic target.
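One way to make "same metrics within noise" concrete is to estimate the spread across several seeds and check whether a new run falls inside it. The sketch below assumes a band of the mean plus or minus two sample standard deviations; the function names, the factor `k`, and the accuracy values are hypothetical placeholders, not measured results.

```python
import statistics


def noise_band(metrics, k=2.0):
    """Return (low, high) as mean +/- k sample standard deviations."""
    mean = statistics.mean(metrics)
    std = statistics.stdev(metrics)
    return mean - k * std, mean + k * std


def is_statistically_consistent(new_metric, reference_metrics, k=2.0):
    """Check whether a new run's metric lies inside the reference noise band."""
    low, high = noise_band(reference_metrics, k)
    return low <= new_metric <= high


# Hypothetical accuracies from five runs that differ only in their seed.
reference_accuracies = [0.912, 0.915, 0.909, 0.914, 0.911]

print(is_statistically_consistent(0.913, reference_accuracies))  # True
print(is_statistically_consistent(0.890, reference_accuracies))  # False
```

With only a handful of seeds the standard deviation is itself a rough estimate, so the band is better treated as a screening check than as a formal test.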
Continuous Integration for Reproducibility
Run short smoke-test training jobs in CI on every commit and check that the metrics match a baseline within tolerance. This catches regressions from code or dependency changes early.
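A smoke test of this kind might be structured as in the sketch below. The toy job `smoke_train` (stochastic gradient descent on a synthetic line fit), the tolerance values, and the test name are all assumptions chosen so the example runs in well under a second with the standard library; a real job would train a small slice of the actual model.

```python
import math
import random


def smoke_train(seed: int, steps: int = 200) -> float:
    """Fit y = 2x + 1 on noisy synthetic data with plain SGD and
    return the final mean squared error on that data."""
    rng = random.Random(seed)  # private generator, independent of global state
    data = []
    for _ in range(64):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)))
    w, b, lr = 0.0, 0.0, 0.1
    for _ in range(steps):
        x, y = rng.choice(data)
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


def within_tolerance(metric, baseline, rel_tol=0.05, abs_tol=1e-3):
    """True when the metric is close enough to the stored baseline."""
    return math.isclose(metric, baseline, rel_tol=rel_tol, abs_tol=abs_tol)


def test_smoke_training_matches_baseline():
    # Here the baseline is recomputed for brevity; in practice it would be
    # a stored value committed alongside the code.
    baseline = smoke_train(seed=0)
    assert within_tolerance(smoke_train(seed=0), baseline)
```

The test function follows the naming convention that pytest collects automatically, so it could be dropped into an existing test suite.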