Evaluation of Generative Models
Evaluating generative models is challenging because outputs are open-ended: there is no single correct answer. Evaluation must measure fluency, coherence, factuality, helpfulness, safety, and task-specific quality.
The Evaluation Challenge
Reference-based metrics (BLEU, ROUGE) compare output to a small set of human references. They fail when a model produces a correct but differently worded response, and they correlate poorly with human judgment for high-quality models.
Human evaluation is the gold standard but is expensive, slow, and hard to reproduce.
LLM-as-judge is emerging as a scalable proxy for human evaluation.
Automated Metrics
Text Generation
| Metric | Measures | Limitations |
|---|---|---|
| BLEU | n-gram precision vs. references | Insensitive to meaning; poor for diverse outputs |
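| ROUGE | n-gram recall vs. references | Same as BLEU: insensitive to meaning; poor for diverse outputs |
| METEOR | Unigram matching with stems and synonyms; recall-weighted | Better than BLEU; still reference-dependent |
| BERTScore | Semantic similarity via BERT embeddings | Requires references; more robust to paraphrase than n-gram metrics but costlier to compute |
| MAUVE | Distribution-level quality | Requires many samples; gives no per-item scores |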
| Perplexity | Fluency under a reference LM | Measures likelihood; doesn’t capture factuality |
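To make the "insensitive to meaning" limitation concrete, here is a minimal sketch of clipped n-gram precision (the core of BLEU) and n-gram recall (the core of ROUGE-N). It is a simplification: it assumes whitespace tokenization and a single reference, and it omits BLEU's brevity penalty and its averaging over n-gram orders. The function names and example sentences are illustrative, not taken from any library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, reference, n=1):
    """Clipped n-gram precision (BLEU-style) and recall (ROUGE-N-style)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


reference = "the cat sat on the mat"
verbatim = "the cat sat on the mat"
paraphrase = "a feline was resting on the rug"

print(ngram_overlap(verbatim, reference))    # (1.0, 1.0)
print(ngram_overlap(paraphrase, reference))  # low scores despite similar meaning
```

The paraphrase shares only "on" and "the" with the reference, so both scores drop sharply even though a human reader would accept it as equivalent.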
Image Generation
| Metric | Measures |
|---|---|
| FID (Fréchet Inception Distance) | Distribution similarity (real vs. generated) |
| IS (Inception Score) | Quality and diversity |
| CLIP score | Image-text alignment |
| Precision / Recall | Fidelity (precision) and diversity (recall) separately |
| LPIPS | Perceptual similarity to reference images |
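FID is computed by passing real and generated images (commonly 50k of each) through an InceptionV3 network and comparing the mean and covariance of the activations of the final pooling layer, just before the classifier:
\[\text{FID} = \|\mu_r - \mu_g\|^2 + \text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})\]
Here \(\mu_r, \Sigma_r\) and \(\mu_g, \Sigma_g\) are the feature mean and covariance for real and generated images. Lower FID means more similar distributions. The estimate is sensitive to the number of samples used, so scores should only be compared at matching sample sizes.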
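The formula above can be sketched in a few lines of NumPy. This is a hedged illustration, not a reference implementation: it assumes the features have already been extracted (random arrays stand in for InceptionV3 activations), and it obtains the trace of the matrix square root from the eigenvalues of \(\Sigma_r \Sigma_g\) rather than computing the square root itself. The function name `frechet_distance` is an assumption of this sketch.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input is an array of shape (num_samples, feature_dim).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # tr((Sigma_r Sigma_g)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of Sigma_r Sigma_g, which are real and non-negative in
    # exact arithmetic; clip to guard against small numerical noise.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = float(np.sum((mu_r - mu_g) ** 2))
    return mean_term + float(np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))        # placeholder for real-image features
shifted = real + np.array([1.0, 0.0, 0.0, 2.0])  # same covariance, shifted mean

print(frechet_distance(real, real))     # approximately 0
print(frechet_distance(real, shifted))  # approximately 1^2 + 2^2 = 5
```

Because the shifted set has the same covariance as the original, the trace term vanishes and only the squared mean difference remains, which makes the result easy to check by hand.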
LLM-as-Judge
Use a capable LLM (GPT-4, Claude) to evaluate another model’s output. This scales better than human evaluation.
Pointwise scoring: rate a single response on a Likert scale.
Rate the following response for helpfulness and accuracy on a scale of 1-5.
Question: [question]
Response: [model response]
Rating:
Pairwise comparison: choose the better of two responses.
Which of the following responses to the question is more helpful and accurate?
Question: [question]
Response A: [model A]
Response B: [model B]
Better response: A or B
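The pairwise template above can be wrapped in a small harness. The sketch below is one possible arrangement under stated assumptions: `judge` is a hypothetical callable that sends a prompt to some LLM and returns its text reply (no real API is called here), and the comparison is run twice with the response order swapped, a common precaution against a judge that favors one position. That precaution is an addition of this sketch, not something the template itself requires.

```python
from typing import Callable, Optional

PAIRWISE_TEMPLATE = (
    "Which of the following responses to the question is more helpful and accurate?\n"
    "Question: {question}\n"
    "Response A: {a}\n"
    "Response B: {b}\n"
    "Better response: A or B\n"
)


def parse_verdict(reply: str) -> Optional[str]:
    """Return 'A' or 'B' if the reply's first token is that letter, else None."""
    tokens = reply.strip().upper().split()
    first = tokens[0].strip(".:,()") if tokens else ""
    return first if first in ("A", "B") else None


def pairwise_judge(judge: Callable[[str], str], question: str,
                   first: str, second: str) -> str:
    """Return 'first', 'second', or 'tie'.

    A response wins only if the judge prefers it in both presentation orders.
    """
    v1 = parse_verdict(judge(PAIRWISE_TEMPLATE.format(question=question, a=first, b=second)))
    v2 = parse_verdict(judge(PAIRWISE_TEMPLATE.format(question=question, a=second, b=first)))
    if v1 == "A" and v2 == "B":
        return "first"
    if v1 == "B" and v2 == "A":
        return "second"
    return "tie"


# Stub judge standing in for a real LLM call: it always answers "A",
# i.e. it is purely position-biased.
def position_biased_judge(prompt: str) -> str:
    return "A"


print(pairwise_judge(position_biased_judge, "What is 2 + 2?", "4", "5"))  # tie
```

With the order swap, a judge that always picks the first slot yields a tie rather than a spurious win.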
MT-Bench: 80 challenging multi-turn questions; GPT-4 as judge. Standard for assistant model evaluation.
AlpacaEval: automatically evaluates instruction-following quality as an LLM-judged win rate against a reference model (text-davinci-003 in the original version).
Arena Hard: 500 challenging user prompts; pairwise GPT-4 judging.
Chatbot Arena (lmarena.ai): crowdsourced human pairwise comparisons aggregated into Elo-style ratings. Widely regarded as one of the most trusted LLM rankings.
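To illustrate how pairwise outcomes become ratings, here is a minimal sketch of the classic online Elo update. The starting rating of 1000 and the step size `k=32` are assumptions chosen for the example, and the leaderboard's actual fitting procedure may differ from this sequential update.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(r_a, r_b, score_a, k=32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * (e_a - score_a)  # B's change mirrors A's
    return new_a, new_b


def rate(battles, initial=1000.0, k=32.0):
    """Process (model_a, model_b, score_a) tuples in order and return ratings."""
    ratings = {}
    for model_a, model_b, score_a in battles:
        r_a = ratings.get(model_a, initial)
        r_b = ratings.get(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(r_a, r_b, score_a, k)
    return ratings


print(update_elo(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

Note that this sequential form depends on the order in which comparisons arrive, which is one reason a batch fit over all comparisons is sometimes preferred.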
Calibration Evaluation
Expected Calibration Error (ECE): see ML Evaluation Systems. Measures how well predicted confidence matches empirical accuracy.
TriviaQA calibration: ask factual questions; measure ECE of verbalized confidence (“I’m 90% sure…”).
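As a rough sketch of the calculation, ECE can be estimated by grouping predictions into confidence bins and averaging the gap between each bin's accuracy and its mean confidence, weighted by bin size. The equal-width binning and `n_bins=10` below are common choices but are assumptions of this example; the confidences would come from parsing statements such as "I'm 90% sure".

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(o for _, o in items) / len(items)
        ece += (len(items) / total) * abs(accuracy - avg_conf)
    return ece


# A model that says "100% sure" four times but is right only twice.
print(expected_calibration_error([1.0] * 4, [True, True, False, False]))  # 0.5
```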
Factuality and Hallucination Evaluation
TruthfulQA: 817 questions where humans frequently give wrong answers due to misconceptions. Tests if models generate truthful responses.
FACTOR: turns a factual corpus into a benchmark that checks whether the model prefers a true statement over similar but incorrect variants.
FActScore: decomposes a generated response into atomic claims and verifies each one against a knowledge source using an LLM-based checker; originally applied to person biographies. The score is the fraction of atomic claims that are supported, i.e. a precision over claims.
FEVEROUS / HaluEval: FEVEROUS tests verification of claims against Wikipedia text and tables; HaluEval measures hallucination in specific settings (dialogue, summarization, QA).
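The atomic-claim scoring described above reduces to a simple ratio once the claims are extracted and a verifier is available. The sketch below assumes both steps exist: the claim list is supplied by hand (in practice an LLM would produce it) and `is_supported` is a hypothetical verifier, here a toy lookup in a set of known facts.

```python
from typing import Callable, Iterable


def claim_precision(claims: Iterable[str],
                    is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic claims that the verifier marks as supported.

    Returns 0.0 when there are no claims (a convention of this sketch).
    """
    claims = list(claims)
    if not claims:
        return 0.0
    return sum(1 for claim in claims if is_supported(claim)) / len(claims)


# Toy knowledge base standing in for a real retrieval-plus-verification step.
knowledge_base = {
    "Marie Curie was born in Warsaw.",
    "Marie Curie won two Nobel Prizes.",
}

# Hand-written atomic claims; the third one is false.
atomic_claims = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie won two Nobel Prizes.",
    "Marie Curie was born in 1901.",
]

score = claim_precision(atomic_claims, lambda claim: claim in knowledge_base)
print(round(score, 3))  # 0.667
```

Exact string lookup is far too strict for real use; it only serves to show where the verifier plugs in.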
Safety Evaluation
ToxiGen: measure the probability that the model generates toxic content about demographic groups.
BBQ (Bias Benchmark for QA): evaluate social bias in question answering.
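WinoBias / Winogender: evaluate gender bias in coreference resolution. (WinoGrande, despite the similar name, tests commonsense reasoning rather than bias.)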
HarmBench: standardized red-teaming benchmark. Attack success rate of adversarial jailbreaks.
SORRY-Bench: evaluate refusal quality. Does the model refuse appropriately without over-refusal of benign requests?
Capability Benchmarks
| Benchmark | Capability tested |
|---|---|
| MMLU | World knowledge (57 subjects) |
| HellaSwag | Commonsense reasoning |
| ARC-E / ARC-C | Science reasoning |
| GSM8K | Grade school math |
| MATH | Competition math |
| HumanEval | Python code generation |
| MBPP | Basic Python programming problems |
| DROP | Discrete reasoning over paragraphs |
| GPQA | PhD-level science questions |
Benchmark saturation: as models improve, benchmarks become saturated (near-100% accuracy). New harder benchmarks are continually needed.
Data contamination: if benchmark questions appear in training data, scores are inflated. Hard to detect with certainty; decontamination by fuzzy matching is imperfect.
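A minimal sketch of overlap-based contamination checking follows, assuming a simple normalization (lowercasing, stripping punctuation) and word-level n-grams. The window size `n=8` and the example texts are assumptions of this sketch, not values from any particular decontamination pipeline.

```python
import re


def normalize(text):
    """Lowercase, replace non-alphanumeric runs with spaces, and tokenize."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def ngram_set(tokens, n):
    """Set of word-level n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(benchmark_item, training_doc, n=8):
    """Fraction of the item's n-grams that also occur in the training document."""
    item_ngrams = ngram_set(normalize(benchmark_item), n)
    if not item_ngrams:
        return 0.0  # item shorter than n tokens: nothing to match
    doc_ngrams = ngram_set(normalize(training_doc), n)
    return len(item_ngrams & doc_ngrams) / len(item_ngrams)


item = "A train travels 60 miles in 2 hours. What is its average speed?"
verbatim_doc = ("Forum post: a TRAIN travels 60 miles in 2 hours -- "
                "what is its average speed? Answer: 30 mph")
paraphrased_doc = ("If a train covers sixty miles over two hours, "
                   "how fast is it going on average?")

print(contamination_ratio(item, verbatim_doc))     # 1.0
print(contamination_ratio(item, paraphrased_doc))  # 0.0
```

The second result shows why this kind of matching is imperfect: a paraphrased copy of the question leaks the same content but shares no long n-gram with the benchmark item.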
Evaluation Best Practices
Use multiple benchmarks: no single benchmark captures all relevant dimensions.
Report error bars: evaluation has variance; report confidence intervals.
Diverse prompts: test with different phrasings; models can be sensitive to prompt wording.
Failure analysis: categorize failures by type. Random errors, consistent failures on a specific topic, and format failures have different remedies.
Human evaluation for final assessment: automated metrics correlate imperfectly with human judgment. Human side-by-side evaluation is required for final product decisions.
Avoid overfitting to benchmarks: optimizing solely for benchmark performance may improve numbers without improving real-world utility.
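For the "report error bars" practice above, one common option is a percentile bootstrap over per-example outcomes. The sketch below assumes binary correct/incorrect outcomes and uses 2000 resamples with a fixed seed; these settings, and the function name, are illustrative choices rather than a prescribed procedure.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean accuracy.

    outcomes: list of 1 (correct) / 0 (incorrect), one entry per example.
    Returns (accuracy, lower, upper).
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the evaluation set with replacement and record each mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


# 80 correct out of 100 examples.
accuracy, lower, upper = bootstrap_ci([1] * 80 + [0] * 20)
print(f"accuracy={accuracy:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

With only 100 examples the interval spans several points of accuracy, which is a useful reminder that small differences between models on small benchmarks may not be meaningful.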