Large Language Models
Large language models (LLMs) are Transformer-based language models trained on massive text corpora at scales where qualitatively new capabilities emerge: in-context learning, chain-of-thought reasoning, instruction following, and tool use.
What Makes a Model “Large”?
Scale in three axes:
| Axis | Range (modern LLMs) |
|---|---|
| Parameters $N$ | 7B to 1T+ |
| Training tokens $D$ | 1T to 15T tokens |
| Compute $C$ | $10^{23}$ to $10^{25}$ FLOPs |
The combination of scale and transformer architecture produces capabilities absent in smaller models: step-by-step reasoning, code generation, complex instruction following.
Pretraining
Autoregressive language modeling on a curated mix of web text, books, code, and scientific articles.
Data sources: Common Crawl (filtered), books, Wikipedia, GitHub, arXiv, StackExchange.
Data quality filtering: deduplication (MinHash), quality filters (perplexity-based, heuristic), PII removal.
Tokenizer: BPE or SentencePiece; vocabulary 32k-128k.
Architecture: decoder-only Transformer with pre-RMSNorm, SwiGLU FFN, RoPE positional encoding, grouped query attention.
Optimization: AdamW with cosine learning rate schedule and warmup. Mixed precision (bf16). Gradient checkpointing. ZeRO sharding / tensor parallelism / pipeline parallelism across thousands of GPUs.
Scaling Laws
Kaplan et al. (2020): loss follows a power law in $N$, $D$, and $C$:
\[L(N) \propto N^{-\alpha_N}, \quad L(D) \propto D^{-\alpha_D}, \quad L(C) \propto C^{-\alpha_C}\]Optimal compute allocation: spend roughly equally on model size and data.
Chinchilla (Hoffmann et al. 2022): revised analysis; optimal ratio is $D \approx 20N$. A 70B model needs ~1.4T tokens for optimal training. Earlier large models (Gopher, GPT-3) were significantly undertrained.
Instruction Tuning
Pretrained LLMs generate plausible text continuations, not helpful responses to questions. Instruction tuning aligns the model to follow instructions.
Supervised Fine-Tuning (SFT): train on a curated dataset of (instruction, ideal response) pairs covering diverse tasks: summarization, coding, math, Q&A, roleplay.
RLHF (Reinforcement Learning from Human Feedback):
- Collect human preference comparisons between pairs of model outputs.
- Train a reward model $r(x, y)$ to predict the preferred response.
- Fine-tune the LLM with PPO to maximize $r(x, y)$ subject to a KL penalty.
DPO (Direct Preference Optimization): analytically eliminates the reward model; directly optimizes the LLM on preference pairs:
\[\mathcal{L}_\text{DPO}(\pi_\theta) = -\mathbb{E}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}\right)\right]\]Simpler and more stable than RLHF in practice.
In-Context Learning
The ability to perform a new task by conditioning on a few examples in the prompt, without any gradient updates.
Zero-shot: describe the task in natural language. No examples.
Few-shot: provide $k$ (input, output) examples before the test input. GPT-3 demonstrated strong few-shot capabilities at 175B parameters.
Mechanism: the Transformer forward pass “simulates” gradient descent over the in-context examples via attention. The model has implicitly learned to learn from demonstrations during pretraining.
Chain-of-Thought Prompting
Wei et al. (2022). Including step-by-step reasoning in the examples (or prompting the model to “think step by step”) dramatically improves performance on multi-step reasoning tasks.
Standard prompting: Q → A directly.
Chain-of-thought: Q → reasoning steps → A.
Zero-shot CoT: append “Let’s think step by step.” to the prompt. Effective even without demonstrations.
Self-consistency: sample multiple reasoning paths; take the majority vote answer. Reduces variance from random sampling.
Tool Use and Agents
LLMs can use external tools by generating structured calls that are executed and whose results are returned to the model.
Tool types: web search, code interpreter, calculator, calendar, database queries, API calls.
ReAct (Reason + Act): interleave reasoning traces with tool call actions. The model reasons about what tool to use, generates a call, observes the result, and continues reasoning.
Function calling (OpenAI API): the model outputs a JSON object specifying the function name and arguments. The runtime executes the function and returns the result.
Multi-step agents: complex tasks are decomposed into sequences of reasoning and tool use steps. Examples: code debugging agents, research assistants, data analysis agents.
Efficient Inference
LLM inference is memory-bound (loading weights and KV cache).
KV cache: store key and value tensors for all past tokens to avoid recomputation. Memory: $O(n \cdot d \cdot L)$ for $n$ tokens, $d$ hidden size, $L$ layers.
Quantization: reduce precision from FP16 to INT8 or INT4. GPTQ (post-training), QLoRA (fine-tuning). 4-bit quantization reduces memory by $4\times$ with minor quality loss.
Speculative decoding: a small draft model generates $k$ candidate tokens; the large model verifies them in a single forward pass. Speeds up generation by 2-3$\times$ with no quality loss.
Continuous batching (vLLM): dynamically batch requests at the token level rather than the sequence level. Dramatically improves GPU utilization for serving.
Parameter-Efficient Fine-Tuning (PEFT)
Full fine-tuning of a 70B model requires ~140GB GPU memory. PEFT methods update only a small fraction of parameters.
LoRA (Low-Rank Adaptation): add low-rank matrices $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times d}$ to weight matrices; only $A$ and $B$ are trained.
\[W' = W + \Delta W = W + BA, \quad r \ll d\]Merges into the original weights at inference: no added latency.
QLoRA: quantize the base model to 4-bit; train LoRA adapters in 16-bit. Enables fine-tuning a 70B model on a single 48GB GPU.
Prompt tuning / prefix tuning: prepend a small number of learned tokens to the input. Only the prefix parameters are trained.
Prominent LLMs (as of early 2025)
| Model | Org | Params | Context | Open weights |
|---|---|---|---|---|
| GPT-4o | OpenAI | Unknown | 128k | No |
| Claude 3.5 Sonnet | Anthropic | Unknown | 200k | No |
| Gemini 1.5 Pro | Unknown | 1M | No | |
| LLaMA-3.1 405B | Meta | 405B | 128k | Yes |
| Mistral Large 2 | Mistral | 123B | 128k | No |
| Qwen2.5 72B | Alibaba | 72B | 128k | Yes |
| DeepSeek-V3 | DeepSeek | 671B (MoE) | 128k | Yes |
Alignment and Safety
Hallucination: LLMs generate plausible-sounding but false statements. Especially problematic for low-frequency facts. Mitigated by retrieval augmentation, citation, and fine-tuning on factual datasets.
Jailbreaking: adversarial prompts that bypass safety guardrails. Active area of red-teaming research.
Constitutional AI (Anthropic): uses a set of principles to critique and revise model outputs; reduces reliance on human labelers for safety feedback.
Bias: LLMs reflect biases in pretraining data (gender, racial, cultural). Evaluated with bias benchmarks (WinoBias, BBQ) and mitigated with balanced training data and RLHF.