Feature Engineering

Feature engineering transforms raw input data into representations that are more informative for a learning algorithm. Even powerful models benefit from well-engineered features; poor representations can make a problem unlearnable regardless of model capacity.

“Applied machine learning is basically feature engineering.” — Andrew Ng

Feature Types

| Type | Description | Examples |
|---|---|---|
| Numerical | Continuous or discrete numbers | Age, price, count |
| Categorical | Discrete unordered values | Country, product category |
| Ordinal | Discrete ordered values | Rating (1-5), education level |
| Text | Sequences of tokens | Reviews, documents |
| Temporal | Time-indexed values | Timestamps, time series |
| Geospatial | Coordinates or regions | Lat/lon, postal codes |
| Image / Audio | Raw tensors | Pixel arrays, waveforms |

Numerical Feature Transformations

Scaling

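Scaling matters for distance-based methods (k-NN, k-means, SVMs with RBF kernels) and for models trained by gradient descent, where features on very different scales distort distances and slow convergence. Tree-based models are invariant to monotone transformations of individual features, so they do not need scaling.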

Standard scaling (z-score):

\[x' = \frac{x - \mu}{\sigma}\]

Results in zero mean and unit variance. Sensitive to outliers.

Min-max scaling:

\[x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}\]

Maps to $[0, 1]$. Sensitive to outliers.

Robust scaling:

\[x' = \frac{x - \text{median}(x)}{\text{IQR}(x)}\]

Uses interquartile range; robust to outliers.
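The three scalers above share the form $x' = (x - \text{center}) / \text{spread}$ and differ only in how the two parameters are estimated. The following is a minimal standard-library sketch of that idea, not a reference implementation; the function names (`fit_standard`, `fit_min_max`, `fit_robust`, `scale`) and the toy data are illustrative assumptions, and the quartile convention (`method="inclusive"`) is one of several in common use.

```python
import statistics

def fit_standard(train):
    """Center = mean, spread = population standard deviation."""
    return statistics.fmean(train), statistics.pstdev(train)

def fit_min_max(train):
    """Center = minimum, spread = range (max - min)."""
    lo, hi = min(train), max(train)
    return lo, hi - lo

def fit_robust(train):
    """Center = median, spread = interquartile range (Q3 - Q1)."""
    q1, q2, q3 = statistics.quantiles(train, n=4, method="inclusive")
    return q2, q3 - q1

def scale(values, center, spread):
    """Apply x' = (x - center) / spread with already-fitted parameters."""
    return [(x - center) / spread for x in values]

train = [1.0, 2.0, 3.0, 4.0, 100.0]  # toy data; 100.0 plays the outlier

z_scores = scale(train, *fit_standard(train))
min_max = scale(train, *fit_min_max(train))
robust = scale(train, *fit_robust(train))
```

On this toy data, min-max scaling squeezes the four ordinary values into roughly the bottom 3% of $[0, 1]$ because the outlier sets the range, while robust scaling keeps them spread between $-1$ and $0.5$. Keeping the fit step separate from `scale` makes it straightforward to reuse training-set parameters on new data.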

Log transform: $x' = \log(x + 1)$. Reduces right skew; requires $x \geq 0$.

Box-Cox transform:

\[x'(\lambda) = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \log x & \lambda = 0 \end{cases}\]

The parameter $\lambda$ is chosen, typically by maximum likelihood, so that the transformed values are as close to normally distributed as possible. Requires $x > 0$.

Yeo-Johnson: generalization of Box-Cox that handles $x \leq 0$.
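The log and Box-Cox formulas above can be written directly as functions. This sketch takes $\lambda$ as a given argument rather than estimating it, which is a simplification; the function names are illustrative assumptions.

```python
import math

def log1p_transform(x):
    """x' = log(x + 1); defined here for x >= 0 as in the text."""
    if x < 0:
        raise ValueError("log(x + 1) transform expects x >= 0")
    return math.log1p(x)

def box_cox(x, lam):
    """Box-Cox transform for a fixed lambda; requires x > 0."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

# lambda = 1 only shifts the data; lambda = 0 is the plain log.
shifted = box_cox(5.0, 1)
logged = box_cox(5.0, 0)
```

The $\lambda = 0$ branch is the limit of the general formula as $\lambda \to 0$, so values of $\lambda$ near zero behave almost like a log transform.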

Binning / Discretization

Converts a continuous feature into ordinal categories. This reduces sensitivity to outliers and lets linear models capture non-linear relationships, since each bin can receive its own weight.

  • Fixed-width bins: equal-size intervals.
  • Quantile bins: equal-frequency intervals (handles skew better).
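The difference between the two binning strategies shows up clearly on skewed data. Below is a small sketch using only the standard library; the helper names and the toy values are assumptions for illustration, and bin edges are computed on training data and then reused.

```python
import bisect
import statistics

def fixed_width_edges(train, n_bins):
    """Interior edges that split [min, max] into equal-size intervals."""
    lo, hi = min(train), max(train)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def quantile_edges(train, n_bins):
    """Interior edges at the 1/n, 2/n, ... quantiles of the training data."""
    return statistics.quantiles(train, n=n_bins, method="inclusive")

def assign_bin(x, edges):
    """Return the ordinal bin index (0 .. len(edges)) for a value."""
    return bisect.bisect_right(edges, x)

train = [1, 2, 3, 4, 5, 6, 7, 1000]  # right-skewed toy data

fixed = [assign_bin(x, fixed_width_edges(train, 4)) for x in train]
quantile = [assign_bin(x, quantile_edges(train, 4)) for x in train]
```

With fixed-width bins the single large value stretches the range, so seven of the eight points fall into bin 0; quantile bins place two points in each of the four bins.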

Categorical Feature Encoding

One-Hot Encoding

Creates a binary indicator for each category. For $K$ categories: $K$ binary columns (or $K-1$ to avoid multicollinearity).

\[\text{cat} \in \{A, B, C\} \to [1, 0, 0], [0, 1, 0], [0, 0, 1]\]

Appropriate when categories have no natural order. Explodes dimensionality for high-cardinality features.
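A minimal sketch of one-hot encoding follows. The vocabulary is learned from training values; mapping an unseen category to an all-zero row is one possible design choice assumed here, not the only one, and the function names are illustrative.

```python
def fit_categories(train_values):
    """Learn a stable, sorted category vocabulary from training data."""
    return sorted(set(train_values))

def one_hot(value, categories, drop_first=False):
    """Return K indicators (or K-1 when drop_first=True) for one value."""
    row = [1 if value == c else 0 for c in categories]
    return row[1:] if drop_first else row

categories = fit_categories(["B", "A", "C", "A"])  # -> ["A", "B", "C"]

full = [one_hot(v, categories) for v in ["A", "B", "C"]]
reduced = one_hot("A", categories, drop_first=True)  # "A" becomes the baseline
unseen = one_hot("D", categories)                    # not in the vocabulary
```

With `drop_first=True` the first category is represented implicitly by all zeros, which removes the linear dependence among the $K$ columns.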

Ordinal Encoding

Maps categories to integers preserving order. Only use when order is meaningful.

Target Encoding (Mean Encoding)

Replace each category with the mean target value in that category:

\[\text{enc}(c) = \frac{\sum_{i: x_i = c} y_i}{|\{i: x_i = c\}|}\]

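Risk of target leakage: each row's encoding is computed from a mean that includes that row's own target. Mitigate this by computing encodings out-of-fold (cross-validation) or by smoothing toward the global mean:

\[\text{enc}(c) = \frac{n_c \cdot \bar{y}_c + \lambda \cdot \bar{y}_{\text{global}}}{n_c + \lambda}\]

Here $n_c$ is the number of training rows in category $c$, $\bar{y}_c$ is the mean target in that category, $\bar{y}_{\text{global}}$ is the overall mean target, and $\lambda \geq 0$ controls the strength of the smoothing.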
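The smoothed formula above can be sketched as follows. The function names, the default `lam`, and the fallback to the global mean for unseen categories are assumptions for illustration; the sketch covers smoothing only and does not implement out-of-fold encoding.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, lam=10.0):
    """Return ({category: smoothed mean}, global mean) from training data."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y          # sums[c] equals n_c * mean_c
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    encoding = {
        c: (sums[c] + lam * global_mean) / (counts[c] + lam)
        for c in counts
    }
    return encoding, global_mean

def encode(category, encoding, global_mean):
    """Look up a category; unseen categories fall back to the global mean."""
    return encoding.get(category, global_mean)

cats = ["a", "a", "a", "a", "b"]
ys = [1, 1, 1, 0, 1]

raw, g = fit_target_encoding(cats, ys, lam=0.0)       # plain category means
smoothed, _ = fit_target_encoding(cats, ys, lam=1.0)  # pulled toward g = 0.8
```

The single-row category `"b"` moves from 1.0 to 0.9 under smoothing, while the four-row category `"a"` barely changes (0.75 to 0.76), which is the intended effect: rare categories are shrunk hardest.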

Embedding

Learned dense vectors for high-cardinality categoricals (e.g., user ID, product ID). Standard in deep learning; can be pre-trained (word2vec, GloVe) or learned end-to-end.

Hashing Trick

Maps categories to a fixed-size vector of length $m$ via a hash function. No vocabulary needed; handles unseen categories. Risk of collisions.
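A rough sketch of the hashing trick is shown below. It uses MD5 from `hashlib` purely as a stable hash (Python's built-in `hash` for strings is salted per process, so it is not reproducible across runs); the choice of hash, the function names, and the example IDs are assumptions.

```python
import hashlib

def hash_index(category, m):
    """Map a category string to a bucket in [0, m) with a stable hash."""
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m

def hashed_vector(categories, m):
    """Count how many of the given categories land in each of m buckets."""
    vec = [0] * m
    for c in categories:
        vec[hash_index(c, m)] += 1
    return vec

vec = hashed_vector(["user_42", "user_7", "never_seen_before"], m=16)
```

No vocabulary is stored, so previously unseen values are handled the same way as known ones. Two different categories can share a bucket; a larger `m` lowers the collision rate at the cost of a wider vector.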

Temporal Features

From timestamps, extract:

  • Calendar features: hour, day-of-week, month, quarter, year, is_holiday
  • Cyclical encoding: encode periodic features as $(\sin, \cos)$ pairs to preserve continuity
\[\text{hour\_sin} = \sin\!\left(\frac{2\pi \cdot \text{hour}}{24}\right), \quad \text{hour\_cos} = \cos\!\left(\frac{2\pi \cdot \text{hour}}{24}\right)\]
  • Lag features: $x_{t-1}, x_{t-2}, \ldots$ for time series.
  • Rolling statistics: rolling mean, std, min, max over a window.
  • Time since event: days since last purchase, etc.
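Several of the items above can be sketched in a few lines of standard-library Python. The helper names are illustrative assumptions, and missing history is marked with `None` here rather than imputed.

```python
import math

def cyclical(value, period):
    """Encode a periodic value as a (sin, cos) pair on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def lag(series, k):
    """Return x_{t-k} for each t; the first k positions have no history."""
    return ([None] * k + list(series))[:len(series)]

def rolling_mean(series, window):
    """Mean over the window ending at t (inclusive); None until it is full."""
    out = []
    for t in range(len(series)):
        if t + 1 < window:
            out.append(None)
        else:
            out.append(sum(series[t + 1 - window:t + 1]) / window)
    return out

# Hour 23 and hour 0 are adjacent on the clock and stay close after encoding.
late_vs_midnight = math.dist(cyclical(23, 24), cyclical(0, 24))
noon_vs_midnight = math.dist(cyclical(12, 24), cyclical(0, 24))

series = [1, 2, 3, 4]
lag_1 = lag(series, 1)
roll_2 = rolling_mean(series, 2)
```

Note that this rolling window includes the current value $x_t$. When the target is $x_t$ itself or a later value, the window is usually applied to lagged values instead so that it does not leak the quantity being predicted.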

Feature Interactions

Polynomial features: expand $[x_1, x_2]$ to $[x_1, x_2, x_1^2, x_1 x_2, x_2^2]$. Allows linear models to capture nonlinear relationships.

Cross features: explicit products of two categorical or numerical features.

Ratio features: $x_1 / x_2$ can capture meaningful relationships (e.g., debt-to-income ratio).
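The polynomial expansion and ratio features described above can be sketched as follows. The function names are illustrative; returning a default value when the denominator is zero is an assumed convention, and other choices (such as adding a small constant) are equally plausible.

```python
import math
from itertools import combinations_with_replacement

def polynomial_features(x, degree=2):
    """All monomials of the inputs from degree 1 up to `degree` (no bias term)."""
    out = []
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            out.append(math.prod(x[i] for i in idx))
    return out

def safe_ratio(numerator, denominator, default=None):
    """x1 / x2, with a fallback when the denominator is zero."""
    return default if denominator == 0 else numerator / denominator

# [x1, x2] -> [x1, x2, x1^2, x1*x2, x2^2]
expanded = polynomial_features([2, 3])
debt_to_income = safe_ratio(20_000, 80_000)
```

The number of generated terms grows quickly with the number of inputs and the degree, so polynomial expansion is usually paired with regularization or feature selection.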

Feature Selection

Reduces dimensionality, removes irrelevant/noisy features, and speeds up training.

| Method | Approach |
|---|---|
| Filter (variance threshold) | Remove features with variance below threshold |
| Filter (correlation) | Remove features highly correlated with each other |
| Univariate statistical test | Select by $\chi^2$, ANOVA $F$-statistic, mutual information with target |
| Recursive Feature Elimination (RFE) | Repeatedly train and remove weakest features |
| L1 regularization (Lasso) | Drives irrelevant feature weights to zero |
| Tree-based importance | Rank by impurity decrease (Gini/entropy) or permutation importance |
| SHAP values | Model-agnostic, measures each feature’s marginal contribution |
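The two filter methods can be illustrated without fitting any model. This is a minimal sketch under assumed names and thresholds; the greedy keep-first rule in `correlation_filter` is one simple way to break ties between correlated features, not a standard prescribed by the text.

```python
import math
import statistics

def variance_filter(columns, threshold=0.0):
    """Keep feature names whose variance exceeds the threshold."""
    return [name for name, values in columns.items()
            if statistics.pvariance(values) > threshold]

def pearson(a, b):
    """Pearson correlation; assumes neither input is constant."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def correlation_filter(columns, names, max_abs_corr=0.95):
    """Greedily keep a feature only if it is not too correlated with any kept one."""
    kept = []
    for name in names:
        if all(abs(pearson(columns[name], columns[k])) <= max_abs_corr
               for k in kept):
            kept.append(name)
    return kept

columns = {
    "const": [1, 1, 1, 1],       # zero variance
    "x": [1, 2, 3, 4],
    "x_doubled": [2, 4, 6, 8],   # perfectly correlated with "x"
    "other": [3, 1, 4, 1],
}

non_constant = variance_filter(columns)
selected = correlation_filter(columns, non_constant)
```

Running the variance filter first also removes constant columns, for which the correlation would be undefined.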

Handling Missing Values

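| Strategy | When to Use |
|---|---|
| Mean / median imputation | Numerical features, under an MCAR/MAR assumption |
| Mode imputation | Categorical features |
| Indicator feature | When missingness itself may be informative; add a binary “was_missing” column alongside the imputed value |
| Model-based imputation | When other features can predict the missing value (e.g., MICE, KNN imputation) |
| Drop rows | Very few rows affected and values are missing completely at random (MCAR); dropping rows when data are missing not at random biases the sample |
| Drop feature | High fraction missing with no predictive value |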
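Median imputation combined with an indicator column can be sketched as below. Missing values are represented as `None`, and the function names are illustrative assumptions; the fill value is estimated on training data only and then reused.

```python
import statistics

def fit_median(train_values):
    """Median of the observed (non-missing) training values."""
    return statistics.median([v for v in train_values if v is not None])

def impute_with_indicator(values, fill):
    """Return (imputed value, was_missing flag) for each input value."""
    return [(fill if v is None else v, 1 if v is None else 0)
            for v in values]

train = [10.0, None, 30.0, 20.0]
fill_value = fit_median(train)  # fitted on training data only

train_imputed = impute_with_indicator(train, fill_value)
test_imputed = impute_with_indicator([None, 5.0], fill_value)
```

The `was_missing` flag lets a model treat imputed rows differently from rows where the filled-in value was actually observed.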

Text Features

Bag of Words (BoW): count matrix over vocabulary. Sparse, ignores order.

TF-IDF:

\[\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log \frac{N}{\text{df}(t)}\]

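Here $N$ is the number of documents and $\text{df}(t)$ is the number of documents that contain term $t$. The weighting downweights terms that appear in many documents and often works better than raw counts for retrieval and classification.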
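The TF-IDF formula above can be computed directly on a tiny corpus. In this sketch the term frequency is the raw count and tokenization is a lowercase whitespace split; both are simplifying assumptions, and library implementations often use smoothed or normalized variants that give different numbers.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

docs = ["the cat sat", "the dog sat", "the cat ran"]
weights = tf_idf(docs)
```

A term that appears in every document, such as "the" here, gets weight $\log(3/3) = 0$, while "dog", which appears in only one document, gets $\log 3$.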

N-grams: capture local word context by adding bigrams, trigrams, and longer token sequences to the vocabulary.
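Extending a bag-of-words count with n-grams takes only a few lines. The sketch below assumes whitespace tokenization and joins tokens with a space to form each n-gram key; the function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous runs of n tokens, joined into single strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(text, max_n=2):
    """Count unigrams up to max_n-grams in one Counter."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

features = bag_of_ngrams("not good not bad")
```

The unigram counts alone cannot distinguish this text from "good not bad not", whereas the bigrams "not good" and "not bad" retain the local ordering.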

Word embeddings: dense representations from word2vec, GloVe, fastText. Average or pool over tokens for document representation.

Sentence/document embeddings: BERT CLS token, Sentence-BERT, or bag of word-embeddings.

Feature Engineering Principles

  • Features should encode domain knowledge that the model cannot easily learn on its own from raw inputs.
  • Fit every transformation (scaler parameters, bin edges, encodings, imputation values) on the training data only, then apply the fitted transformation unchanged to validation, test, and production data.
  • Check for data leakage: no feature should encode information about the future or the target itself.
  • High-cardinality features require special treatment to avoid overfitting.