Foundation Models

A foundation model is a large model trained on broad data at scale that can be adapted to a wide range of downstream tasks. The term was coined by Bommasani et al. (2021) in the Stanford CRFM report “On the Opportunities and Risks of Foundation Models”.

What Makes a Foundation Model?

Scale: trained on massive datasets (hundreds of billions to trillions of tokens) with billions to trillions of parameters.

Self-supervised pretraining: no task-specific labels required. The model learns from raw data via objectives like next-token prediction, masked token prediction, or contrastive image-text alignment.

Emergent capabilities: some abilities are weak or absent in smaller models and appear only at scale, including in-context learning, chain-of-thought reasoning, code generation, and instruction following.

Transfer: a single pretrained model serves as the starting point for many downstream tasks via fine-tuning, prompting, or retrieval augmentation, so each task needs few or no task-specific parameters.
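Adaptation by prompting can be sketched without any model code, since the "task-specific" part is only text. The helper below is a hypothetical illustration (the function name and the "Input:/Label:" layout are assumptions, not a standard format): it assembles a few-shot prompt that a pretrained model would be asked to continue, with no parameters updated.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the query.

    The layout is an illustrative assumption; real prompts vary by model.
    """
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The final label is left blank for the model to fill in.
    lines.append(f"Input: {query}")
    lines.append("Label:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Great battery life.", "positive"), ("Broke after a week.", "negative")],
    "Works exactly as described.",
)
print(prompt)
```

Swapping the instruction and examples retargets the same pretrained model to a different task, which is the sense in which prompting needs no task-specific parameters.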

Paradigm Shift

Before foundation models (pre-2018): each task typically required its own model, trained from scratch or with limited transfer, and hand-crafted feature engineering carried much of the performance.

After (2018-present): pretrain one large model, then adapt it to many tasks. The dominant recipe moved from “pretrain then fine-tune” to “pretrain then prompt”.

Examples by Modality

Modality	Foundation Model	Organization
Text	GPT-4, Claude 3, LLaMA-3, Gemini	OpenAI, Anthropic, Meta, Google
Code	Codex, DeepSeek-Coder, StarCoder	OpenAI, DeepSeek, BigCode (Hugging Face, ServiceNow)
Image	CLIP, DINO, SAM	OpenAI, Meta, Meta
Image generation	Stable Diffusion, DALL-E 3, Flux	Stability AI, OpenAI, Black Forest Labs
Vision-Language	LLaVA, GPT-4V, Gemini 1.5	Academia, OpenAI, Google
Audio	Whisper, AudioPaLM, MusicGen	OpenAI, Google, Meta
Video	Sora, Stable Video Diffusion	OpenAI, Stability AI
Multimodal	GPT-4o, Gemini 1.5 Pro, Claude 3	OpenAI, Google, Anthropic
Science (bio)	AlphaFold, ESMFold, Geneformer	DeepMind, Meta, Broad Institute
Science (chem)	ChemBERTa, MolBERT	Various

Pretraining Objectives

Causal language modeling (CLM): predict the next token. Autoregressive; enables generation. GPT family.
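As a rough illustration of the causal objective, the sketch below computes the average next-token cross-entropy for one sequence in plain Python. The `logits_fn` callable and `uniform_model` are hypothetical stand-ins for a real network; the point is only the shift between inputs and targets.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def causal_lm_loss(token_ids, logits_fn):
    """Average next-token cross-entropy for one sequence.

    logits_fn(prefix) returns one score per vocabulary entry. It is a
    hypothetical stand-in for a real model's forward pass.
    """
    inputs, targets = token_ids[:-1], token_ids[1:]  # targets are inputs shifted by one
    total = 0.0
    for t, target in enumerate(targets):
        # The model only sees tokens up to position t (no look-ahead).
        probs = softmax(logits_fn(inputs[: t + 1]))
        total += -math.log(probs[target])
    return total / len(targets)


def uniform_model(prefix, vocab_size=4):
    """Toy model that assigns equal score to every token."""
    return [0.0] * vocab_size


loss = causal_lm_loss([0, 2, 1, 3], uniform_model)
print(loss)  # equals log(vocab_size) for a uniform model
```

A model that has learned nothing scores log(V) for a vocabulary of size V; training pushes the loss below that baseline.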

Masked language modeling (MLM): predict masked tokens. Bidirectional context; strong representations. BERT family.
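A minimal sketch of the input corruption used for masked language modeling, following the recipe described for BERT (select about 15% of tokens; of those, 80% become a mask token, 10% a random token, 10% stay unchanged). The function name is an assumption, and selecting each token independently is a simplification.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token where a prediction is required and
    None elsewhere (positions the loss would ignore).
    """
    if rng is None:
        rng = random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)                # 80%: replace with mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))   # 10%: replace with random token
            else:
                inputs.append(tok)                 # 10%: keep the original token
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels


tokens = "the cat sat on the mat".split()
inputs, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5, rng=random.Random(0))
print(inputs)
print(labels)
```

Because the model sees tokens on both sides of each masked position, the prediction uses bidirectional context, unlike the causal objective.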

Contrastive image-text: align image and text embeddings. CLIP, SigLIP.
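The contrastive objective can be sketched as a symmetric cross-entropy over an image-text similarity matrix, in the style described for CLIP. This is an illustrative plain-Python version with toy embeddings. The function names and the fixed temperature of 0.07 are assumptions (CLIP learns its temperature), and SigLIP replaces the softmax with a pairwise sigmoid loss that is not shown here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: matched pairs share the same batch index."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)  # the aligned batch gives the lower loss
```

Every other item in the batch acts as a negative, which is why this family of objectives tends to benefit from large batch sizes.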

Masked image modeling: predict pixels (MAE) or discrete visual tokens (BEiT) for masked patches.
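For masked image modeling, a simplified MAE-flavoured sketch: hide a random subset of patches and score the reconstruction only on the hidden ones. The 75% mask ratio follows the value reported for MAE; the function names and the toy patch data are assumptions.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, rng=None):
    """Return a list of booleans, True where a patch is hidden from the encoder."""
    if rng is None:
        rng = random.Random()
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_mse(pred_patches, true_patches, mask):
    """Mean squared pixel error computed over masked patches only."""
    total, count = 0.0, 0
    for pred, true, is_masked in zip(pred_patches, true_patches, mask):
        if is_masked:
            total += sum((p - t) ** 2 for p, t in zip(pred, true))
            count += len(true)
    return total / count if count else 0.0


mask = random_patch_mask(16, rng=random.Random(0))
true_patches = [[float(i)] * 4 for i in range(16)]  # 16 toy patches of 4 pixels each
zero_pred = [[0.0] * 4 for _ in range(16)]          # a trivial all-zero prediction
print(sum(mask), masked_mse(zero_pred, true_patches, mask))
```

Visible patches contribute nothing to the loss, so the model is rewarded only for inferring content it could not see.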

Denoising: reconstruct the original from a corrupted input, such as text with spans masked or removed. T5, BART (text); the same idea applies to corrupted image regions.
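Span corruption in the style of T5 is one concrete form of text denoising. In the sketch below the spans are passed in explicitly so the example is deterministic; in actual pretraining they are sampled at random. The function name is an assumption, and the `<extra_id_N>` sentinel naming follows the convention of the Hugging Face T5 tokenizer.

```python
def span_corrupt(tokens, spans):
    """Replace each span with a sentinel and build the matching target.

    spans: sorted, non-overlapping (start, end) pairs with end exclusive.
    Returns (inputs, targets) as token lists.
    """
    inputs, targets = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])  # keep uncorrupted text
        inputs.append(sentinel)              # one sentinel stands in for the span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the model must produce the dropped span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

The target contains only the dropped spans, which keeps the decoder sequence short compared with reconstructing the full input.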

Emergent Abilities

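Emergent abilities are capabilities that appear once models pass a certain scale but are absent or near chance in smaller ones. The thresholds below are rough, commonly cited figures; they depend on the model family, training data, and benchmark.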

Ability	Approx. scale threshold
In-context learning	~1B parameters
Chain-of-thought reasoning	~100B parameters
Instruction following (zero-shot)	~10B+ with RLHF
Multi-step arithmetic	~100B parameters
Code synthesis	~10B+

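Debate: some “emergent” abilities may be artifacts of the evaluation metric. Discontinuous metrics such as exact match can make gradual improvement look like a sharp transition, and the same results often look smooth under continuous metrics such as per-token accuracy. Qualitative jumps in capability are nonetheless widely reported.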
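The metric argument can be illustrated with a toy calculation. Assuming, purely for illustration, that each token of an answer is correct independently with the same probability, exact-match accuracy on an answer of length n is that probability raised to the power n. The function name and the chosen values are hypothetical.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token is right, assuming independent token errors."""
    return per_token_accuracy ** answer_length


# Per-token accuracy rises gradually; exact match on 10-token answers does not.
for acc in (0.5, 0.7, 0.9, 0.99):
    print(f"per-token {acc:.2f} -> exact match {exact_match_rate(acc, 10):.3f}")
```

Under this assumption, going from 0.5 to 0.9 per-token accuracy moves exact match from roughly 0.001 to roughly 0.35, so a steady underlying improvement can read as a sudden jump on the stricter metric.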

Foundation Model Risks

Homogenization: if everyone builds on the same few foundation models, failures and biases in those models propagate everywhere.

Opacity: large foundation models are poorly understood. It is hard to predict what they will do in novel situations.

Misuse: capable models can be used to generate misinformation and phishing content or to assist cyberattacks.

Data issues: training data often contains copyrighted content, personal data, and harmful content, which raises legal and ethical challenges.

Concentration of power: training frontier models requires compute, data, and engineering budgets estimated in the hundreds of millions to billions of dollars, so only a few organizations can do it.

These risks motivate AI safety research, alignment techniques, and regulatory frameworks.