Foundation Models
A foundation model is a large model trained on broad data at scale that can be adapted to a wide range of downstream tasks. The term was coined by Bommasani et al. (2021) in the Stanford CRFM report.
What Makes a Foundation Model?
Scale: models with billions to trillions of parameters, trained on massive datasets (hundreds of billions to trillions of tokens).
Self-supervised pretraining: no task-specific labels required. The model learns from raw data via objectives like next-token prediction, masked token prediction, or contrastive image-text alignment.
Emergent capabilities: abilities that are absent or weak in smaller models appear at scale, such as in-context learning, chain-of-thought reasoning, code generation, and instruction following.
Transfer: a single pretrained model serves as the starting point for many downstream tasks via fine-tuning, prompting, or retrieval augmentation, so each task needs far fewer task-specific parameters than a model trained from scratch.
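As a rough illustration of adaptation by prompting, the sketch below assembles a few-shot prompt from labeled examples. The pretrained weights stay untouched; the task is specified entirely in the input text. The function name `build_few_shot_prompt`, the `Input:`/`Output:` template, and the sentiment examples are hypothetical choices made for illustration, not a format any particular model requires.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the query."""
    parts = [instruction.strip(), ""]
    for text, label in examples:
        parts.append(f"Input: {text}")
        parts.append(f"Output: {label}")
        parts.append("")  # blank line between examples
    # The final item has no label: the model is expected to continue from here.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


# Hypothetical sentiment-classification examples.
examples = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    examples,
    "A forgettable film with one good scene.",
)
print(prompt)
```

The resulting string would be sent to a text-generation model as is. Changing the task means changing the instruction and examples, not the model.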
Paradigm Shift
Before foundation models (pre-2018): each task required a separate model, trained from scratch or with limited transfer, and task-specific feature engineering played a major role.
After (2018-present): pretrain one large model, then adapt it to many tasks. The “pretrain then fine-tune” paradigm came first, later joined by “pretrain then prompt”.
Examples by Modality
| Modality | Foundation Model | Organization |
|---|---|---|
| Text | GPT-4, Claude 3, LLaMA-3, Gemini | OpenAI, Anthropic, Meta, Google |
| Code | Codex, DeepSeek-Coder, StarCoder | OpenAI, DeepSeek, BigCode (Hugging Face, ServiceNow) |
| Image | CLIP, DINO, SAM | OpenAI, Meta |
| Image generation | Stable Diffusion, DALL-E 3, Flux | Stability AI, OpenAI, Black Forest Labs |
| Vision-Language | LLaVA, GPT-4V, Gemini 1.5 | Academia, OpenAI, Google |
| Audio | Whisper, AudioPaLM, MusicGen | OpenAI, Google, Meta |
| Video | Sora, Stable Video Diffusion | OpenAI, Stability AI |
| Multimodal | GPT-4o, Gemini 1.5 Pro, Claude 3 | OpenAI, Google, Anthropic |
| Science (bio) | AlphaFold, ESMFold, Geneformer | DeepMind, Meta, Broad |
| Science (chem) | ChemBERTa, MolBERT | Various |
Pretraining Objectives
Causal language modeling (CLM): predict the next token. Autoregressive; enables generation. GPT family.
Masked language modeling (MLM): predict masked tokens. Bidirectional context; strong representations. BERT family.
Contrastive image-text: align image and text embeddings. CLIP, SigLIP.
Masked image modeling: predict pixels or discrete visual tokens for masked patches. MAE, BEiT.
Denoising: reconstruct the original input from a corrupted version, such as text with masked or shuffled spans (T5, BART) or corrupted image regions.
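The first three objectives above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how any of the named models is implemented: tokens are whole words, the masking step simply replaces a token with `[MASK]` (BERT's actual recipe is more involved), and the temperature of 0.07 is an illustrative default. All function names are hypothetical.

```python
import math
import random


def causal_lm_pairs(tokens):
    """Causal LM: the target at position t is the token at position t + 1."""
    return list(zip(tokens[:-1], tokens[1:]))


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Masked LM: hide a random subset; targets exist only at hidden positions."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)    # the model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)   # no loss at unmasked positions
    return inputs, targets


def cross_entropy(logits, target_index):
    """Negative log-probability of the target under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def contrastive_loss(sim, temperature=0.07):
    """Symmetric image-text contrastive loss.

    sim[i][j] is the similarity of image i and text j; matched pairs
    sit on the diagonal.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Each image should pick out its own text among all texts (rows) ...
    img_to_txt = sum(cross_entropy(scaled[i], i) for i in range(n)) / n
    # ... and each text its own image among all images (columns).
    txt_to_img = sum(
        cross_entropy([scaled[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (img_to_txt + txt_to_img)


tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(causal_lm_pairs(tokens))
print(mask_tokens(tokens, mask_prob=0.5))

aligned = [[1.0, 0.0], [0.0, 1.0]]    # matched pairs are most similar
shuffled = [[0.0, 1.0], [1.0, 0.0]]   # mismatched pairs are most similar
print(contrastive_loss(aligned) < contrastive_loss(shuffled))
```

In a real training loop the logits and similarities would come from the model, and the causal and masked objectives would apply `cross_entropy` over the vocabulary at each target position.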
Emergent Abilities
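Emergent abilities are those that appear once models pass a certain scale but are absent in smaller models. The thresholds listed here are rough: they vary with training data, training procedure, and how the ability is measured.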
| Ability | Approx. scale threshold |
|---|---|
| In-context learning | ~1B parameters |
| Chain-of-thought reasoning | ~100B parameters |
| Instruction following (zero-shot) | ~10B+ with RLHF |
| Multi-step arithmetic | ~100B parameters |
| Code synthesis | ~10B+ |
Debate: some “emergent” abilities may be artifacts of the evaluation metric. A discontinuous metric such as exact match can show a sharp transition where a continuous metric shows smooth improvement. Even so, qualitative jumps in capability are widely reported.
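The metric argument can be made concrete with a small calculation. Suppose, as a simplifying assumption, that a model gets each token of an answer right independently with probability p. Then the exact-match rate on an L-token answer is p^L, which stays near zero for a long stretch and then climbs steeply even though p itself improves smoothly. The function name and the answer length of 10 are illustrative choices.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token is correct, assuming independent tokens."""
    return per_token_accuracy ** answer_length


answer_length = 10  # illustrative: e.g. a ten-token arithmetic answer
for p in [0.5, 0.7, 0.8, 0.9, 0.95, 0.99]:
    em = exact_match_rate(p, answer_length)
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

With ten tokens, a per-token accuracy of 0.5 gives an exact-match rate of about 0.001, while 0.9 gives about 0.35. The same underlying progress looks gradual under one metric and abrupt under the other.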
Foundation Model Risks
Homogenization: if everyone builds on the same few foundation models, failures and biases in those models propagate everywhere.
Opacity: large foundation models are poorly understood. It is hard to predict what they will do in novel situations.
Misuse: capable models can be used to generate misinformation, phishing content, or cyberattacks.
Data issues: training data often contains copyrighted content, personal data, and harmful content, which raises legal and ethical challenges.
Concentration of power: training frontier models is estimated to cost from hundreds of millions to billions of dollars, so only a few organizations can do it.
These risks motivate AI safety research, alignment techniques, and regulatory frameworks.