Text Summarization

Text summarization condenses a document (or set of documents) into a shorter text that preserves the most important information.

Summarization Types

Extractive summarization: select and concatenate sentences or phrases directly from the source document. The output is always a subset of the input text. Preserves fluency of the original; may lack coherence when stitched together.

Abstractive summarization: generate a new text that may contain words and phrases not in the source. More flexible and fluent; harder to train and more prone to hallucination.

Single-document vs. multi-document: summarize one document or multiple related documents into a single summary.

Query-focused summarization: the summary is conditioned on a specific question or topic, rather than summarizing the whole document.

Aspect-based summarization: produce summaries for specific aspects (e.g., “pros and cons” from product reviews).

Evaluation: ROUGE

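ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures n-gram overlap between a candidate summary and one or more reference summaries. ROUGE-N is the recall over reference n-grams:

\[\text{ROUGE-N} = \frac{\sum_{\text{ref}} \sum_{\text{ngram} \in \text{ref}} \text{Count}_\text{match}(\text{ngram})}{\sum_{\text{ref}} \sum_{\text{ngram} \in \text{ref}} \text{Count}(\text{ngram})}\]

where $\text{Count}_\text{match}(\text{ngram})$ is the number of times the n-gram occurs in both the candidate and the reference (clipped to the smaller of the two counts), and $\text{Count}(\text{ngram})$ is its count in the reference.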

ROUGE-1: unigram recall.

ROUGE-2: bigram recall.
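The ROUGE-N recall above can be sketched in a few lines. This is a minimal illustration that assumes lowercased whitespace tokenization and no stemming; the function names `ngrams` and `rouge_n_recall` are chosen here for illustration and are not from any particular library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(references, candidate, n=1):
    """ROUGE-N recall: clipped n-gram matches over total reference n-grams."""
    cand_counts = ngrams(candidate.lower().split(), n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        # Count_match: an n-gram matches at most as often as it occurs in the candidate.
        matched += sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
r1 = rouge_n_recall([reference], candidate, n=1)  # 5 of 6 reference unigrams matched
r2 = rouge_n_recall([reference], candidate, n=2)  # 3 of 5 reference bigrams matched
print(round(r1, 3), round(r2, 3))
```

Changing a single word costs one unigram but two bigrams, which is why ROUGE-2 is usually the stricter of the two scores.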

ROUGE-L: longest common subsequence (LCS); captures sentence-level structure without requiring consecutive matches.

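\[\text{ROUGE-L} = \frac{\text{LCS}(r, c)}{|r|} \quad \text{(recall component)}\]

where $r$ is the reference summary, $c$ is the candidate summary, and $|r|$ is the reference length in tokens. The precision component divides by $|c|$ instead, and the two are usually combined into an F-measure.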
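As a rough sketch of ROUGE-L, the LCS length can be computed with standard dynamic programming and then normalized by each length. Whitespace tokenization and the helper names (`lcs_length`, `rouge_l`) are assumptions made for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, candidate, beta=1.0):
    """Return (recall, precision, F-measure) based on LCS length."""
    r, c = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(r, c)
    recall = lcs / len(r) if r else 0.0
    precision = lcs / len(c) if c else 0.0
    if recall == 0.0 or precision == 0.0:
        return recall, precision, 0.0
    f = (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
    return recall, precision, f

# "police", "the", "gunman" form a common subsequence of length 3.
recall, precision, f = rouge_l("police killed the gunman", "police kill the gunman")
print(recall, precision, f)
```

Because the LCS only needs tokens to appear in the same order, the reordered candidate "the gunman kill police" scores lower (LCS of 2) even though it shares the same three words with the reference.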

ROUGE correlates reasonably with human judgments for extractive summaries; less so for abstractive summaries.

BERTScore: computes semantic similarity using contextual BERT embeddings. More robust to paraphrases than ROUGE.

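G-Eval: LLM-based evaluation in which the model is prompted with a scoring rubric (e.g., coherence, consistency, fluency, relevance) and returns a rating. Reported to correlate more strongly with human judgments than ROUGE or BERTScore, though results depend on the underlying LLM and prompt.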

Extractive Methods

Lead-3 baseline: take the first 3 sentences. Strong for news articles where critical information leads.
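A Lead-3 baseline fits in a few lines. The sketch below assumes a naive regex sentence splitter (split after `.`, `!` or `?` followed by whitespace), which will mis-split abbreviations; a real system would use a proper sentence tokenizer. The name `lead_k` is illustrative.

```python
import re

def lead_k(document, k=3):
    """Return the first k sentences of a document as the summary."""
    # Naive splitter: break after sentence-final punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s.strip()]
    return " ".join(sentences[:k])

doc = "A happened. B followed! Did C occur? D was last."
summary = lead_k(doc)
print(summary)
```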

TextRank: graph-based algorithm. Build a sentence similarity graph; run PageRank to find the most central sentences. Unsupervised; no training required.
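The following is a minimal sketch of the TextRank idea, not a reference implementation. It assumes word-overlap similarity normalized by log sentence lengths (one common choice), a damping factor of 0.85, and a fixed number of power iterations; all function names are illustrative.

```python
import math
import re

def similarity(a, b):
    """Word overlap between two token sets, normalized by log lengths."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) = 0 would make the denominator degenerate
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))

def textrank_scores(sentences, damping=0.85, iterations=50):
    """Run weighted PageRank over the sentence similarity graph."""
    tokens = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores

def textrank_summary(sentences, k=2):
    """Pick the k highest-scoring sentences, kept in document order."""
    scores = textrank_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

sentences = [
    "The storm closed the coastal road.",
    "Crews reopened the coastal road after the storm.",
    "The storm also cut power to the coastal town.",
    "A bakery won a pie contest.",
]
scores = textrank_scores(sentences)
print(textrank_summary(sentences, k=2))
```

The bakery sentence shares no words with the others, so it receives no incoming weight and ends with the minimum score of `1 - damping`.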

Neural extractive (SummaRuNNer, BertSum): encode the document and score each sentence for inclusion, using context from the surrounding sentences. Trained with binary labels (include/exclude) derived from ROUGE-greedy oracle selection.

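Oracle selection: given $k$ sentences to select, find the subset maximizing ROUGE against the reference. Exhaustive search over subsets is combinatorial, so a greedy approximation is typical. Used to generate training labels; the oracle's ROUGE serves as an approximate upper bound on extractive system performance.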
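Greedy oracle labeling can be sketched as follows. The example assumes ROUGE-1 F1 as the selection score and whitespace tokenization; published systems differ in which ROUGE variant they combine, so treat `rouge1_f` and `greedy_oracle` as illustrative names and choices.

```python
from collections import Counter

def rouge1_f(reference_tokens, candidate_tokens):
    """Unigram F1 with clipped counts."""
    overlap = sum((Counter(reference_tokens) & Counter(candidate_tokens)).values())
    if overlap == 0:
        return 0.0
    return 2 * overlap / (len(reference_tokens) + len(candidate_tokens))

def greedy_oracle(sentences, reference, k=3):
    """Greedily add the sentence that most improves ROUGE; stop when nothing helps."""
    ref = reference.lower().split()
    selected, best = [], 0.0
    for _ in range(k):
        gains = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Score the candidate summary with sentences kept in document order.
            cand = " ".join(sentences[j] for j in sorted(selected + [i])).lower().split()
            gains.append((rouge1_f(ref, cand), i))
        if not gains:
            break
        score, idx = max(gains)
        if score <= best:
            break
        selected.append(idx)
        best = score
    # Binary include/exclude labels for training an extractive model.
    return [1 if i in selected else 0 for i in range(len(sentences))]

sentences = [
    "the mayor opened a new bridge",
    "traffic was heavy on monday",
    "the bridge cost ten million dollars",
    "weather was mild",
]
reference = "the mayor opened a bridge that cost ten million dollars"
labels = greedy_oracle(sentences, reference, k=3)
print(labels)
```

Here the search stops after two sentences even though `k=3`, because adding either remaining sentence lowers the F1 score.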

Abstractive Methods

Sequence-to-Sequence with Attention

Encoder reads the document; decoder generates the summary token by token with attention over encoder states.

Copy mechanism (pointer-generator network, See et al. 2017): at each step, the decoder can either generate from the vocabulary or copy a token from the source. Critical for handling rare words (names, numbers, technical terms).

\[p(w) = p_\text{gen} \cdot p_\text{vocab}(w) + (1 - p_\text{gen}) \sum_{i: x_i = w} \alpha_i\]

where $p_\text{gen} \in [0,1]$ is a learned switch and $\alpha_i$ are attention weights over source tokens.
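A small worked example of the mixture above, using made-up numbers. The toy vocabulary, attention weights, and the function name `final_distribution` are all assumptions for illustration; in a real model these come from the decoder at each step.

```python
def final_distribution(p_gen, p_vocab, attention, source_tokens):
    """Mix the vocabulary distribution with the copy (attention) distribution."""
    # Generation part: p_gen * p_vocab(w)
    p = {w: p_gen * prob for w, prob in p_vocab.items()}
    # Copy part: (1 - p_gen) * sum of attention over source positions holding w
    for alpha, token in zip(attention, source_tokens):
        p[token] = p.get(token, 0.0) + (1 - p_gen) * alpha
    return p

source_tokens = ["zyxel", "shares", "rose", "shares"]
attention = [0.5, 0.2, 0.1, 0.2]                       # sums to 1 over source positions
p_vocab = {"shares": 0.3, "rose": 0.3, "fell": 0.4}    # "zyxel" is out of vocabulary
p = final_distribution(0.6, p_vocab, attention, source_tokens)
print(p)
```

The out-of-vocabulary token "zyxel" receives probability only through the copy term, while "shares" accumulates attention from both of its source positions.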

Coverage mechanism: penalizes attending to the same source tokens repeatedly to reduce repetition in the output.
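One way to realize this penalty, assumed here to follow the formulation used with pointer-generator networks (See et al. 2017), keeps a coverage vector equal to the sum of past attention distributions and penalizes the overlap between it and the current attention. The sketch below uses plain lists; the function name is illustrative.

```python
def coverage_loss(attention_steps):
    """Sum over decoder steps of sum_i min(attention_i, coverage_i)."""
    coverage = [0.0] * len(attention_steps[0])
    total = 0.0
    for alpha in attention_steps:
        # Penalize attention mass placed on already-covered source positions.
        total += sum(min(a, c) for a, c in zip(alpha, coverage))
        # Update coverage with the attention just used.
        coverage = [c + a for a, c in zip(alpha, coverage)]
    return total

repeated = [[0.9, 0.1, 0.0], [0.9, 0.1, 0.0]]  # attends to the same token twice
spread = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]    # moves on to a new token
loss_repeated = coverage_loss(repeated)
loss_spread = coverage_loss(spread)
print(loss_repeated, loss_spread)
```

Attending to the same position twice incurs a much larger penalty than moving attention elsewhere, which is the behavior the mechanism is meant to encourage.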

Pretrained Seq2Seq Models

BART (Lewis et al. 2020): denoising autoencoder pretraining. Encoder is bidirectional; decoder is causal. The input is corrupted with noising functions (token masking, token deletion, text infilling, sentence permutation, and document rotation) and the model is trained to reconstruct the original text. Strong out-of-the-box for summarization.

T5: text-to-text format. Fine-tune with input "summarize: <document>" and output as the reference summary.

Pegasus (Zhang et al. 2020): pretraining task specifically designed for summarization: Gap Sentence Generation (GSG). Important sentences are selected (scored by ROUGE against the rest of the document, rather than at random) and masked; the model learns to regenerate them from the remaining text. Achieves strong ROUGE on multiple summarization benchmarks.
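A rough sketch of how a GSG training pair might be built. It assumes each sentence is scored independently by ROUGE-1 F1 against the rest of the document and that a fixed fraction of sentences is masked; the mask string, the ratio, and the function names are placeholders for illustration, not the actual Pegasus preprocessing code.

```python
from collections import Counter

MASK = "<mask_1>"  # placeholder; the real mask token depends on the tokenizer

def rouge1_f(a, b):
    """Unigram F1 with clipped counts between two token lists."""
    overlap = sum((Counter(a) & Counter(b)).values())
    return 2 * overlap / (len(a) + len(b)) if overlap else 0.0

def gap_sentence_example(sentences, ratio=0.3):
    """Mask the highest-scoring sentences; return (masked source, target)."""
    m = max(1, round(len(sentences) * ratio))
    scores = []
    for i, sent in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scores.append((rouge1_f(sent.lower().split(), rest.lower().split()), i))
    chosen = sorted(i for _, i in sorted(scores, reverse=True)[:m])
    source = " ".join(MASK if i in chosen else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in chosen)
    return source, target

sentences = [
    "the city council approved the new transit plan",
    "the transit plan adds two bus lines to the city",
    "a local bakery sold out of bread",
    "council members said the new bus lines open in may",
]
source, target = gap_sentence_example(sentences)
print(source)
print(target)
```

The off-topic bakery sentence shares no words with the rest of the document, so it scores zero and is never chosen as a gap sentence.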

Long-document models: standard Transformers are limited by context length.

  • Longformer: sliding window + global attention; released pretrained checkpoints handle up to 4,096 tokens.
  • LED (Longformer Encoder-Decoder): seq2seq version; handles up to 16k tokens.
  • BigBird: sparse attention for long inputs.
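  • LLM-based: long-context LLMs such as GPT-4 or Claude (context windows of 100k+ tokens in some versions) can take very long documents, in some cases book-length, in a single prompt.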
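To make the sliding window plus global attention pattern concrete, the sketch below builds a boolean attention mask and counts how many query-key pairs it allows. It is only an illustration of the sparsity pattern with hypothetical parameter names, not how Longformer implements attention internally.

```python
def sliding_window_mask(seq_len, window, global_positions=()):
    """mask[i][j] is True when position i may attend to position j."""
    half = window // 2
    global_set = set(global_positions)
    # Local band: each token sees neighbors within half a window on each side.
    mask = [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]
    # Global tokens attend everywhere and are attended to by every token.
    for i in range(seq_len):
        for j in range(seq_len):
            if i in global_set or j in global_set:
                mask[i][j] = True
    return mask

mask = sliding_window_mask(seq_len=8, window=2, global_positions=[0])
allowed = sum(sum(row) for row in mask)
print(allowed, "of", 8 * 8, "pairs allowed")
```

With a fixed window, the number of allowed pairs grows linearly with sequence length rather than quadratically, which is what makes long inputs tractable.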

Training Data

CNN/DailyMail: ~300k news articles with human-written highlights. Most studied benchmark.

XSum (Extreme Summarization): single-sentence summaries from BBC articles. Requires more abstraction.

PubMed / ArXiv: scientific papers with abstracts as reference summaries. Long documents; domain-specific vocabulary.

SAMSum: messenger-style chat dialogues with manually written summaries. Tests conversation summarization.

Challenges

Hallucination: the model generates facts not supported by the source. Critical issue in summarization; measured with FactCC, QAFactEval, or NLI-based faithfulness classifiers.

Faithfulness vs. ROUGE: a summary can score well on ROUGE through n-gram overlap with the reference while still misstating the source; swapping an entity or negating a claim changes only a few n-grams. ROUGE is a poor proxy for faithfulness.

Long document summarization: a full book or lengthy report can exceed a model's context window. Hierarchical approaches (summarize chapters, then summarize the chapter-level summaries) are a workaround.
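The hierarchical workaround can be sketched as a short recursion. The `summarize` argument is a hypothetical callable standing in for any model; the word-based chunking and the `first_words` stand-in summarizer are assumptions made so the example runs on its own.

```python
def chunk_words(text, max_words):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def hierarchical_summarize(text, summarize, max_words=500):
    """Summarize chunks, then summarize the joined summaries, until the text fits."""
    if len(text.split()) <= max_words:
        return summarize(text)
    partial = [summarize(chunk) for chunk in chunk_words(text, max_words)]
    # Assumes summarize() shortens its input; otherwise the recursion will not terminate.
    return hierarchical_summarize(" ".join(partial), summarize, max_words)

def first_words(text, n=5):
    """Stand-in summarizer for the demo: keeps the first n words."""
    return " ".join(text.split()[:n])

long_text = " ".join(f"w{i}" for i in range(100))
result = hierarchical_summarize(long_text, first_words, max_words=20)
print(result)
```

In practice chunks would follow chapter or section boundaries rather than fixed word counts, and information lost at a lower level cannot be recovered higher up.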

Multi-document summarization: merging information from multiple sources while resolving redundancy and contradiction.

Domain adaptation: biomedical, legal, and financial texts require specialized vocabulary and writing styles different from news.