Text Preprocessing

Raw text is messy, inconsistent, and not directly consumable by ML models. Preprocessing transforms raw text into a clean, normalized form before tokenization and modeling.

Text Cleaning

Lowercasing: convert all text to lowercase. Simple, but discards case-based information (proper nouns, acronyms such as “US” versus “us”). Often unnecessary for neural models whose subword vocabularies are cased.

Punctuation removal: strip punctuation for bag-of-words or TF-IDF pipelines. Avoid for models where punctuation carries syntactic or semantic meaning.

Whitespace normalization: collapse multiple spaces, tabs, newlines into a single space. Strip leading/trailing whitespace.

HTML/markup stripping: remove tags and decode entities (&amp;, &lt;) in web-scraped text. Use a parser (BeautifulSoup) rather than regex for robustness.

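Encoding normalization: decode bytes to Unicode with the correct codec and repair mojibake (text garbled by decoding with the wrong encoding). Normalize to NFC to merge canonically equivalent sequences (e.g., precomposed “é” versus “e” plus a combining accent), or to NFKC to additionally fold compatibility characters such as ligatures and full-width forms.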
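The cleaning steps above can be combined into one small function. The following is a minimal sketch that uses only the Python standard library; `clean_text` and `_TextExtractor` are illustrative names, and the standard-library `HTMLParser` stands in for a more robust parser such as BeautifulSoup (for instance, this sketch does not drop the contents of script or style elements).

```python
import html  # noqa: F401  (HTMLParser decodes entities itself; kept for clarity)
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; and &lt;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_text(raw, lowercase=False):
    # 1. Strip markup with a parser rather than a regex.
    extractor = _TextExtractor()
    extractor.feed(raw)
    extractor.close()
    # Joining with a space keeps text from adjacent block elements apart.
    text = " ".join(extractor.parts)
    # 2. NFKC merges equivalent sequences and folds compatibility forms.
    text = unicodedata.normalize("NFKC", text)
    # 3. Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower() if lowercase else text
```

Lowercasing is left as an opt-in flag because, as noted above, it is not always desirable.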

Sentence Segmentation

Split a document into individual sentences. Non-trivial due to abbreviations (“Dr.”, “U.S.”), ellipses, and quoted speech.

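Unsupervised: Punkt (NLTK). Learns abbreviations, collocations, and frequent sentence starters from unlabeled text rather than relying on a hand-written abbreviation list; handles common cases well.

Rule-based: spaCy’s sentencizer. Splits on sentence-final punctuation; fast, but brittle around abbreviations.

Neural/statistical: spaCy’s dependency parser or senter component, stanza. More robust on diverse domains.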
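To see why abbreviations make segmentation non-trivial, here is a deliberately naive splitter. It is a sketch only: `ABBREVIATIONS` is a tiny hypothetical list, `split_sentences` is an illustrative name, and this is neither Punkt nor any library's algorithm.

```python
import re

# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "u.s.", "e.g.", "i.e."}

# Candidate boundary: ., ! or ? (plus optional closing quote/bracket), then whitespace.
_BOUNDARY = re.compile(r"""[.!?]+["'”’)]*\s+""")


def split_sentences(text):
    sentences, start = [], 0
    for match in _BOUNDARY.finditer(text):
        candidate = text[start:match.end()].strip()
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # the period belongs to the abbreviation, keep scanning
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

The sketch never splits after a listed abbreviation, so it fails on text like “He moved to the U.S. She stayed.” Learned segmenters address this kind of ambiguity by looking at the following context.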

Stopword Removal

Remove high-frequency function words (“the”, “is”, “at”) that carry little discriminative information for bag-of-words models.

When to use: TF-IDF, topic models, classical IR.

When not to use: sentiment analysis (negation words like “not” are critical), neural models (handle stopwords implicitly), tasks where function words carry grammatical meaning.

Caution: stopword lists are language-specific and domain-specific.
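A short sketch of stopword filtering, including the negation problem mentioned above. The `STOPWORDS` and `NEGATIONS` sets are tiny illustrative lists, not taken from any library, and `remove_stopwords` is a hypothetical helper name.

```python
# Illustrative lists only; real lists are language- and domain-specific.
STOPWORDS = {"the", "is", "at", "a", "an", "of", "and", "not", "was"}
NEGATIONS = {"not", "no", "never"}


def remove_stopwords(tokens, keep_negations=False):
    # Optionally protect negation words, which matter for sentiment tasks.
    stop = STOPWORDS - NEGATIONS if keep_negations else STOPWORDS
    return [t for t in tokens if t.lower() not in stop]


tokens = "The movie was not good".split()
plain = remove_stopwords(tokens)                          # negation is lost
careful = remove_stopwords(tokens, keep_negations=True)   # negation survives
```

With the default settings the review collapses to “movie good”, which reverses its sentiment; keeping negations preserves “movie not good”.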

Stemming

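Reduces a word to a stem by stripping affixes with heuristic rules, without regard for linguistic correctness.

  • “running” → “run”
  • “studies” → “studi” (the stem need not be a real word)
  • “universe”, “university” → “univers” (over-stemming: unrelated words are conflated)
  • “better” → “better” (under-stemming: the irregular form is never linked to “good”)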

Porter Stemmer: fast, simple, English-specific. Applies a cascade of suffix-stripping rules.

Snowball Stemmer: a stemming framework that includes an improved English stemmer (often called Porter2) and stemmers for many other languages.

Output may not be a real word. Useful for increasing recall in IR.
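The rule-cascade idea can be illustrated with a toy stemmer. This sketch is only loosely modeled on the first steps of the Porter algorithm and is not the Porter or Snowball stemmer; `toy_stem` is a hypothetical name and the rule set is intentionally incomplete.

```python
import re

_VOWEL = re.compile(r"[aeiou]")


def toy_stem(word):
    w = word.lower()
    # Plural-like endings.
    if w.endswith("sses"):
        w = w[:-2]            # "caresses" -> "caress"
    elif w.endswith("ies"):
        w = w[:-2]            # "studies" -> "studi"
    elif w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]            # "cats" -> "cat"
    # Strip -ing / -ed, but only if the remaining stem contains a vowel.
    for suffix in ("ing", "ed"):
        stem = w[: -len(suffix)]
        if w.endswith(suffix) and _VOWEL.search(stem):
            w = stem
            # Undouble a final consonant ("runn" -> "run"), keeping ll, ss, zz.
            if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeioulsz":
                w = w[:-1]
            break
    return w
```

The vowel check keeps short words such as “sing” intact, and the “studi” output shows that the result is a stem rather than a dictionary word.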

Lemmatization

Maps a word to its canonical dictionary form (lemma) using morphological analysis and POS tags.

  • “running” → “run”
  • “better” → “good”
  • “studies” → “study”

Requires POS tagging for disambiguation (“saw” as verb → “see”; “saw” as noun → “saw”).

Tools: NLTK WordNetLemmatizer, spaCy, stanza. Slower than stemming, but the output is a valid dictionary form.

| Method        | Speed  | Accuracy | Real-word output |
|---------------|--------|----------|------------------|
| Stemming      | Fast   | Lower    | No               |
| Lemmatization | Slower | Higher   | Yes              |
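The role of POS tags in lemmatization can be sketched with a toy lookup-based lemmatizer. It follows the general pattern of an exception table plus suffix rules checked against a vocabulary; `EXCEPTIONS`, `RULES`, and `VOCAB` are miniature hypothetical resources, and `toy_lemmatize` is not the API of any of the tools listed above.

```python
# Hypothetical miniature resources: (surface form, POS) -> lemma.
EXCEPTIONS = {
    ("saw", "VERB"): "see",
    ("better", "ADJ"): "good",
    ("running", "VERB"): "run",
}
# Suffix rewrite rules per POS: (suffix to remove, replacement).
RULES = {
    "NOUN": [("ies", "y"), ("s", "")],
    "VERB": [("ies", "y"), ("ing", ""), ("ing", "e"), ("ed", ""), ("ed", "e"), ("s", "")],
}
VOCAB = {"study", "run", "see", "saw", "good", "walk", "make"}


def toy_lemmatize(word, pos):
    w = word.lower()
    # 1. Irregular forms come from the exception table.
    if (w, pos) in EXCEPTIONS:
        return EXCEPTIONS[(w, pos)]
    # 2. Otherwise try suffix rules, accepting only known dictionary words.
    for suffix, replacement in RULES.get(pos, []):
        if w.endswith(suffix):
            candidate = w[: len(w) - len(suffix)] + replacement
            if candidate in VOCAB:
                return candidate
    # 3. Fall back to the surface form.
    return w
```

The same surface form “saw” yields “see” as a verb and “saw” as a noun, which is the disambiguation that a stemmer cannot perform.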

Part-of-Speech (POS) Tagging

Labels each token with its grammatical role: noun, verb, adjective, adverb, etc.

Universal POS tags (Universal Dependencies v2): ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X.

Methods:

  • HMM-based (Viterbi decoding over tag sequence).
  • CRF-based (discriminative sequence labeling).
  • Neural (BiLSTM or Transformer encoder; state of the art).

Uses: lemmatization disambiguation, dependency parsing, feature engineering.
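The HMM approach listed above can be sketched with a small Viterbi decoder. All probabilities below (`START`, `TRANS`, `EMIT`) are invented toy values rather than corpus estimates, the tag set is reduced to three tags, and `UNSEEN` is a crude stand-in for proper smoothing.

```python
import math

TAGS = ["DET", "NOUN", "VERB"]
# Hypothetical toy parameters, not estimated from any corpus.
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.4, "saw": 0.2, "cat": 0.4},
    "VERB": {"saw": 0.7, "barks": 0.3},
}
UNSEEN = 1e-6  # probability assigned to words a tag never emits


def viterbi(words):
    if not words:
        return []
    # best[i][t]: log-probability of the best tag path ending in tag t at position i
    best = [{t: math.log(START[t]) + math.log(EMIT[t].get(words[0], UNSEEN))
             for t in TAGS}]
    back = []  # back[i][t]: best previous tag for tag t at position i + 1
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + math.log(TRANS[p][t]))
            scores[t] = (best[-1][prev] + math.log(TRANS[prev][t])
                         + math.log(EMIT[t].get(word, UNSEEN)))
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]
```

Under these toy parameters, “saw” is tagged as a verb in “the dog saw the cat” but as a noun in “the saw”, because the transition probabilities favor a noun after a determiner.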

Named Entity Recognition (NER)

Identifies and classifies named entities in text: persons, organizations, locations, dates, etc.

See Information Extraction for detailed treatment.

Dependency Parsing

Identifies syntactic relationships between words as a directed tree (dependency tree). Every word except the root has exactly one head word and a dependency relation label (subject, object, modifier, etc.).

Transition-based parsing (arc-eager): $O(n)$ transitions; fast. Used in spaCy.

Graph-based parsing (biaffine): scores all $O(n^2)$ candidate head-dependent arcs and then selects the highest-scoring tree; slower than transition-based parsing but highly accurate. Used in stanza.

Uses: relation extraction, semantic role labeling, question answering.
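The arc-eager transition system can be illustrated by replaying a hand-written transition sequence. In a real parser a trained classifier predicts each transition; here the sequence is supplied manually, and `run_arc_eager` is an illustrative helper, not spaCy's implementation.

```python
def run_arc_eager(n_words, actions):
    """Apply arc-eager transitions; return {dependent: (head, label)}.

    Token indices start at 1; index 0 is the artificial ROOT node.
    """
    stack, buffer, arcs = [0], list(range(1, n_words + 1)), {}
    for action, label in actions:
        if action == "SHIFT":        # move the buffer front onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":   # buffer front becomes head of stack top
            if stack[-1] == 0 or stack[-1] in arcs:
                raise ValueError("LEFT-ARC needs a headless, non-ROOT stack top")
            arcs[stack.pop()] = (buffer[0], label)
        elif action == "RIGHT-ARC":  # stack top becomes head of buffer front
            dependent = buffer.pop(0)
            arcs[dependent] = (stack[-1], label)
            stack.append(dependent)
        elif action == "REDUCE":     # pop a word that already has a head
            if stack[-1] not in arcs:
                raise ValueError("REDUCE needs a stack top that has a head")
            stack.pop()
        else:
            raise ValueError(f"unknown transition: {action}")
    return arcs


# "She saw the cat": 1=She, 2=saw, 3=the, 4=cat
actions = [
    ("SHIFT", None),
    ("LEFT-ARC", "nsubj"),   # She <- saw
    ("RIGHT-ARC", "root"),   # ROOT -> saw
    ("SHIFT", None),
    ("LEFT-ARC", "det"),     # the <- cat
    ("RIGHT-ARC", "obj"),    # saw -> cat
]
arcs = run_arc_eager(4, actions)
```

Six transitions suffice for four words, which illustrates the linear number of transitions: each word is pushed at most once and popped at most once.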

Text Normalization for Specific Domains

Social media: expand contractions (“don’t” → “do not”), handle hashtags, mentions, emojis, slang.

Medical text: expand abbreviations, normalize drug names, and detect negation (“no fever” means the finding is absent, not present).

Numbers and dates: normalize “3rd”, “three”, “3” to a consistent form where needed.
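A few of these domain-specific normalizations can be sketched with regular expressions. The lookup tables are tiny hypothetical examples, the `<user>` placeholder is an arbitrary choice, and both function names are illustrative.

```python
import re

# Hypothetical lookup tables; real systems use far larger resources.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "won't": "will not", "i'm": "i am"}
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}


def normalize_social(text):
    text = text.replace("’", "'").lower()        # unify apostrophes, then lowercase
    text = re.sub(r"@\w+", "<user>", text)        # replace mentions with a placeholder
    text = re.sub(r"#(\w+)", r"\1", text)         # keep hashtag text, drop the '#'
    for short, full in CONTRACTIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    return text


def normalize_numbers(text):
    # Drop ordinal suffixes: "3rd" -> "3" (this discards ordinality).
    text = re.sub(r"\b(\d+)(st|nd|rd|th)\b", r"\1", text)
    # Map spelled-out numbers to digits, leaving other words untouched.
    return re.sub(
        r"\b[A-Za-z]+\b",
        lambda m: NUMBER_WORDS.get(m.group().lower(), m.group()),
        text,
    )
```

Blind number mapping can be wrong in context (for example “no one” would become “no 1”), which is why such normalization should be applied only where the task needs it.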

Preprocessing Pipeline Summary

A typical classical NLP pipeline:

raw text
  → clean (strip HTML, fix encoding)
  → sentence segment
  → tokenize
  → lowercase
  → remove stopwords (optional)
  → stem / lemmatize (optional)
  → POS tag (optional)
  → feature extraction (TF-IDF, n-grams)
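As a rough end-to-end illustration of the classical pipeline, the sketch below tokenizes, lowercases, filters stopwords, and computes TF-IDF weights using only the standard library. The stopword list is illustrative, and the weighting (relative term frequency times log(N / df), without smoothing) is one common variant chosen here as an assumption.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "is", "a", "on", "and"}  # illustrative only


def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # lowercase + tokenize
    return [t for t in tokens if t not in STOPWORDS]   # optional stopword filter


def tfidf(docs):
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["The cat sat on the mat.", "The dog sat."]
vectors = tfidf(docs)
```

A term that appears in every document, such as “sat” here, receives weight zero, while terms unique to one document receive the highest weights.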

For neural models, most of these steps are replaced by a learned subword tokenizer and the model itself. Cleaning (HTML stripping, encoding normalization) and sentence segmentation remain useful.