Feature Engineering
Feature engineering transforms raw input data into representations that are more informative for a learning algorithm. Even powerful models benefit from well-engineered features; poor representations can make a problem unlearnable regardless of model capacity.
“Applied machine learning is basically feature engineering.” — Andrew Ng
Feature Types
| Type | Description | Examples |
|---|---|---|
| Numerical | Continuous or discrete numbers | Age, price, count |
| Categorical | Discrete unordered values | Country, product category |
| Ordinal | Discrete ordered values | Rating (1-5), education level |
| Text | Sequences of tokens | Reviews, documents |
| Temporal | Time-indexed values | Timestamps, time series |
| Geospatial | Coordinates or regions | Lat/lon, postal codes |
| Image / Audio | Raw tensors | Pixel arrays, waveforms |
Numerical Feature Transformations
Scaling
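Important for distance-based methods (e.g., k-NN, k-means) and for gradient-based optimization, where features on very different scales distort distances or slow convergence. Tree-based models are invariant to monotone transformations of individual features, so they do not need scaling.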
Standard scaling (z-score):
\[x' = \frac{x - \mu}{\sigma}\]
Results in zero mean and unit variance. Sensitive to outliers.
Min-max scaling:
\[x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}\]
Maps to $[0, 1]$. Sensitive to outliers.
Robust scaling:
\[x' = \frac{x - \text{median}(x)}{\text{IQR}(x)}\]
Uses interquartile range; robust to outliers.
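The three scalers above can be sketched with the Python standard library alone. The function names and the toy `values` list are illustrative assumptions, and the quartiles use the `inclusive` convention of `statistics.quantiles`, which is only one of several common definitions.

```python
import statistics

def standard_scale(xs):
    """z-score: subtract the mean, divide by the (population) standard deviation."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)  # would be 0 for a constant column
    return [(x - mu) / sigma for x in xs]

def min_max_scale(xs):
    """Map the observed minimum to 0 and the observed maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def robust_scale(xs):
    """Center on the median and divide by the interquartile range."""
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - med) / (q3 - q1) for x in xs]

# Toy data (assumed for illustration) with one large outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
print(standard_scale(values))
print(min_max_scale(values))
print(robust_scale(values))
```

With the outlier present, min-max scaling squeezes the first four values into a narrow band near 0, while robust scaling keeps them evenly spread around 0. In a real pipeline the statistics would be computed on the training split only and reused for new data.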
Log transform: $x' = \log(x + 1)$. Reduces right skew; requires $x \geq 0$.
Box-Cox transform:
\[x'(\lambda) = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \log x & \lambda = 0 \end{cases}\]
Finds the $\lambda$ that makes the transformed distribution closest to normal (typically by maximum likelihood). Requires $x > 0$.
Yeo-Johnson: generalization of Box-Cox that handles $x \leq 0$.
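As a rough sketch of the log and Box-Cox transforms above: the code below applies Box-Cox for a $\lambda$ supplied by the caller and does not search for the optimal value. The function names are illustrative assumptions, and Yeo-Johnson is not implemented here.

```python
import math

def log1p_transform(xs):
    """log(x + 1), computed accurately for small x; requires x >= 0 here."""
    return [math.log1p(x) for x in xs]

def box_cox(xs, lam):
    """Box-Cox transform for a fixed, caller-supplied lambda."""
    if any(x <= 0 for x in xs):
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]

print(log1p_transform([0.0, 9.0, 99.0]))
print(box_cox([1.0, 4.0], 0.5))   # square-root-like compression
print(box_cox([1.0, 4.0], 0))     # plain logarithm
```

The $\lambda = 0$ branch is the limit of the general branch as $\lambda \to 0$, which is why the two cases join smoothly. Libraries such as SciPy (`scipy.stats.boxcox`) and scikit-learn (`PowerTransformer`) can estimate $\lambda$ from data.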
Binning / Discretization
Converts a continuous feature into ordinal categories. Reduces sensitivity to outliers and can expose non-linear relationships to linear models.
- Fixed-width bins: equal-size intervals.
- Quantile bins: equal-frequency intervals (handles skew better).
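The difference between the two binning strategies shows up clearly on skewed data. The following is a minimal sketch; the helper names and the toy `incomes` list are assumptions made for illustration.

```python
import bisect
import statistics

def fixed_width_edges(xs, n_bins):
    """Interior bin edges that split [min, max] into equal-width intervals."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]

def quantile_edges(xs, n_bins):
    """Interior bin edges placed at quantiles, so bins hold similar counts."""
    return statistics.quantiles(xs, n=n_bins, method="inclusive")

def assign_bins(xs, edges):
    """Bin index of each value: the number of edges less than or equal to it."""
    return [bisect.bisect_right(edges, x) for x in xs]

# Toy right-skewed data (assumed): one very large value.
incomes = [10, 12, 15, 18, 20, 25, 30, 200]
fixed_bins = assign_bins(incomes, fixed_width_edges(incomes, 4))
quantile_bins = assign_bins(incomes, quantile_edges(incomes, 4))
print(fixed_bins)
print(quantile_bins)
```

On this data the fixed-width bins put every value except the outlier into bin 0, while the quantile bins place two values in each of the four bins.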
Categorical Feature Encoding
One-Hot Encoding
Creates a binary indicator for each category. For $K$ categories: $K$ binary columns (or $K-1$ to avoid multicollinearity).
\[\text{cat} \in \{A, B, C\} \to [1, 0, 0], [0, 1, 0], [0, 0, 1]\]
Appropriate when categories have no natural order. Explodes dimensionality for high-cardinality features.
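A small sketch of one-hot encoding, including the $K-1$ variant. The function names and the `drop_first` flag are illustrative assumptions, as is the choice to encode unseen categories as an all-zero row.

```python
def fit_categories(values):
    """Learn a fixed, sorted vocabulary of categories from training data."""
    return sorted(set(values))

def one_hot(value, categories, drop_first=False):
    """Binary indicator row; with drop_first the first category is the baseline."""
    row = [1 if value == c else 0 for c in categories]
    return row[1:] if drop_first else row

cats = fit_categories(["B", "A", "C", "A"])
print(one_hot("A", cats))                    # K columns
print(one_hot("B", cats, drop_first=True))   # K-1 columns
print(one_hot("Z", cats))                    # unseen category: all zeros
```

With `drop_first=True` the baseline category is represented by all zeros, so in that mode it cannot be distinguished from an unseen category.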
Ordinal Encoding
Maps categories to integers preserving order. Only use when order is meaningful.
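For illustration, ordinal encoding amounts to a lookup in an explicitly ordered list. The `EDUCATION_ORDER` list below is a hypothetical ordering chosen for the example, not a standard.

```python
# Hypothetical category order, lowest to highest (assumed for illustration).
EDUCATION_ORDER = ["high school", "bachelor", "master", "doctorate"]

def ordinal_encode(values, order):
    """Map each category to its position in the given order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]  # raises KeyError on unknown categories

encoded = ordinal_encode(["master", "high school", "doctorate"], EDUCATION_ORDER)
print(encoded)
```

Writing the order out by hand, rather than relying on alphabetical sorting, is what makes the resulting integers meaningful.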
Target Encoding (Mean Encoding)
Replace each category with the mean target value in that category:
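\[\text{enc}(c) = \frac{\sum_{i: x_i = c} y_i}{|\{i: x_i = c\}|}\]
Risk of target leakage, because each row's encoding includes its own target value. Use cross-validation (out-of-fold encoding) or smoothing toward the global mean: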
\[\text{enc}(c) = \frac{n_c \cdot \bar{y}_c + \lambda \cdot \bar{y}_{\text{global}}}{n_c + \lambda}\]
Embedding
Learned dense vectors for high-cardinality categoricals (e.g., user ID, product ID). Standard in deep learning; can be pre-trained (word2vec, GloVe) or learned end-to-end.
Hashing Trick
Maps categories to a fixed-size vector of length $m$ via a hash function. No vocabulary needed; handles unseen categories. Risk of collisions.
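A minimal sketch of the hashing trick follows. It uses MD5 from `hashlib` purely as a stable hash (an assumption for this example; any deterministic hash would do), because Python's built-in `hash()` on strings is randomized per process by default. The function names are illustrative.

```python
import hashlib

def hash_bucket(category, m):
    """Deterministically map a category string to one of m buckets."""
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m

def hashed_one_hot(category, m):
    """Length-m indicator vector with a 1 in the category's bucket."""
    vec = [0] * m
    vec[hash_bucket(category, m)] = 1
    return vec

vec = hashed_one_hot("user_123", 8)
print(vec)

# 100 distinct IDs cannot fit into 8 buckets without collisions.
buckets = {hash_bucket(f"id_{i}", 8) for i in range(100)}
print(len(buckets))
```

Choosing $m$ is a trade-off: a larger vector reduces collisions but costs memory and adds columns.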
Temporal Features
From timestamps, extract:
- Calendar features: hour, day-of-week, month, quarter, year, is_holiday.
- Cyclical encoding: encode a periodic feature $x$ with period $P$ as the pair $(\sin(2\pi x / P), \cos(2\pi x / P))$ to preserve continuity (e.g., hour 23 stays close to hour 0).
- Lag features: $x_{t-1}, x_{t-2}, \ldots$ for time series.
- Rolling statistics: rolling mean, std, min, max over a window.
- Time since event: days since last purchase, etc.
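Several of the items above can be sketched with the standard library. The function names and the example timestamp are assumptions, `is_holiday` is omitted because it needs an external calendar, and the rolling window here includes the current value.

```python
import math
from datetime import datetime

def calendar_features(ts):
    """Basic calendar fields from a timestamp (weekday: Monday = 0)."""
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),
        "month": ts.month,
        "quarter": (ts.month - 1) // 3 + 1,
        "year": ts.year,
    }

def cyclical(value, period):
    """Encode a periodic value as a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def lag(series, k):
    """Shift the series by k steps; the first k entries have no history."""
    return [None] * k + series[: len(series) - k]

def rolling_mean(series, window):
    """Mean over the last `window` values, including the current one."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(series[i + 1 - window : i + 1]) / window)
    return out

features = calendar_features(datetime(2024, 3, 15, 23, 0))
print(features)
print(cyclical(23, 24), cyclical(0, 24))
print(lag([1, 2, 3, 4], 1))
print(rolling_mean([1, 2, 3, 4], 2))
```

When the rolled series is the prediction target itself, the window would typically be applied to lagged values so that the feature at time $t$ does not contain $x_t$.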
Feature Interactions
Polynomial features: expand $[x_1, x_2]$ to $[x_1, x_2, x_1^2, x_1 x_2, x_2^2]$. Allows linear models to capture nonlinear relationships.
Cross features: explicit products of two categorical or numerical features.
Ratio features: $x_1 / x_2$ can capture meaningful relationships (e.g., debt-to-income ratio).
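The three interaction types can be written out directly. This is a sketch with assumed helper names; in particular, the default value returned by `safe_ratio` for a zero denominator is an arbitrary choice for the example.

```python
def poly2(x1, x2):
    """Degree-2 polynomial expansion of two features (no bias term)."""
    return [x1, x2, x1 * x1, x1 * x2, x2 * x2]

def cross(a, b):
    """Cross two categorical values into one combined category."""
    return f"{a}_x_{b}"

def safe_ratio(numerator, denominator, default=0.0):
    """Ratio feature with an explicit fallback for a zero denominator."""
    return numerator / denominator if denominator != 0 else default

print(poly2(2.0, 3.0))
print(cross("US", "mobile"))
print(safe_ratio(500.0, 2000.0))   # e.g. debt-to-income
```

A crossed categorical feature has up to $K_1 \cdot K_2$ distinct values, so it usually needs one of the high-cardinality encodings described earlier.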
Feature Selection
Reduces dimensionality, removes irrelevant/noisy features, and speeds up training.
| Method | Approach |
|---|---|
| Filter (variance threshold) | Remove features with variance below threshold |
| Filter (correlation) | Remove features highly correlated with each other |
| Univariate statistical test | Select by $\chi^2$, ANOVA $F$-statistic, mutual information with target |
| Recursive Feature Elimination (RFE) | Repeatedly train and remove weakest features |
| L1 regularization (Lasso) | Drives irrelevant feature weights to zero |
| Tree-based importance | Rank by impurity decrease (Gini/entropy) or permutation importance |
| SHAP values | Model-agnostic, measures each feature’s marginal contribution |
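
The two filter methods in the table can be sketched as follows. The thresholds, helper names, and toy columns are assumptions for illustration, and the correlation filter greedily keeps the first feature of each highly correlated group, which is only one possible tie-breaking rule.

```python
import math
import statistics

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant columns."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def variance_filter(columns, threshold=0.0):
    """Keep columns whose population variance exceeds the threshold."""
    return {n: c for n, c in columns.items() if statistics.pvariance(c) > threshold}

def correlation_filter(columns, max_abs_corr=0.95):
    """Greedily keep a column only if it is not too correlated with any kept one."""
    kept = {}
    for name, col in columns.items():
        if all(abs(pearson(col, other)) <= max_abs_corr for other in kept.values()):
            kept[name] = col
    return kept

# Toy columns (assumed): one duplicate in other units, one constant.
columns = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "constant": [1, 1, 1, 1],
    "age": [30, 22, 41, 35],
}
selected = correlation_filter(variance_filter(columns))
print(list(selected))
```

The variance filter runs first because the correlation of a constant column is undefined. Neither filter looks at the target, so both can discard a feature that would have been useful in combination with others.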
Handling Missing Values
| Strategy | When to Use |
|---|---|
| Mean / median imputation | Numerical features, MCAR/MAR assumption |
| Mode imputation | Categorical features |
| Indicator feature | Add binary “was_missing” column alongside imputed value |
| Model-based imputation | MICE, KNN imputation |
| Drop rows | Very few rows affected and values missing completely at random (MCAR) |
| Drop feature | High fraction missing with no predictive value |
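
Median imputation combined with an indicator feature might look like the sketch below. Missing values are represented as `None`, and the function names are assumptions for the example.

```python
import statistics

def fit_median(train_col):
    """Median of the observed (non-missing) training values."""
    return statistics.median(x for x in train_col if x is not None)

def impute_with_indicator(col, fill):
    """Return the imputed column and a parallel was_missing indicator."""
    imputed = [fill if x is None else x for x in col]
    was_missing = [1 if x is None else 0 for x in col]
    return imputed, was_missing

train = [1.0, None, 3.0, 10.0]
fill = fit_median(train)                      # learned from training data only
new_imputed, new_flags = impute_with_indicator([None, 5.0], fill)
print(fill, new_imputed, new_flags)
```

The fill value is computed once on the training column and reused for new data. The indicator column lets a model pick up any signal carried by the missingness itself.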
Text Features
Bag of Words (BoW): count matrix over vocabulary. Sparse, ignores order.
TF-IDF:
\[\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log \frac{N}{\text{df}(t)}\]
Downweights common terms. Better than raw counts for retrieval and classification.
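The formula can be computed directly on tokenized documents. This sketch assumes raw counts for $\text{tf}$ and the unsmoothed $\log(N / \text{df})$ exactly as written above. The toy corpus and function name are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document {term: tf * log(N / df)} for a list of token lists."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
weights = tf_idf(docs)
print(weights[0])
print(weights[2])
```

A term that occurs in every document, such as "the" here, gets weight 0, while a term unique to one document gets the largest weight. Library implementations often differ in detail; scikit-learn's `TfidfVectorizer`, for example, smooths the IDF and normalizes each row by default.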
N-grams: capture local word context by adding bigrams, trigrams, and longer token sequences to the vocabulary.
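Extracting n-grams from a token list is a one-line sliding window. The function name is an assumption for this sketch.

```python
def ngrams(tokens, n):
    """All contiguous runs of n tokens, in order."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["not", "very", "good"]
print(ngrams(tokens, 1))
print(ngrams(tokens, 2))
```

The bigram `("not", "very")` keeps a negation that unigram counts alone would lose. The cost is a vocabulary that grows quickly with $n$.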
Word embeddings: dense representations from word2vec, GloVe, fastText. Average or pool over tokens for document representation.
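Averaging token vectors into a document vector can be sketched as below. The 2-dimensional `toy_embeddings` are invented for the example and are not real word2vec, GloVe, or fastText output. Skipping out-of-vocabulary tokens is one possible policy among several.

```python
def mean_pool(tokens, embeddings, dim):
    """Average the vectors of known tokens; zero vector if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Made-up vectors for illustration only.
toy_embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
doc_vector = mean_pool(["good", "movie", "unknownword"], toy_embeddings, 2)
print(doc_vector)
```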
Sentence/document embeddings: BERT [CLS] token, Sentence-BERT, or bag of word-embeddings.
Feature Engineering Principles
- Features should encode domain knowledge that the model cannot easily learn on its own from raw inputs.
- Fit every transformation (scaling parameters, encodings, imputation values) on the training data only, then apply the fitted transformation unchanged to validation, test, and production data.
- Check for data leakage: no feature should encode information about the future or the target itself.
- High-cardinality features require special treatment to avoid overfitting.
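The last three principles come together in smoothed target encoding, using the smoothed formula from the Target Encoding section. In the sketch below the encoder is fit on training data only, rare categories are shrunk toward the global mean, and unseen categories fall back to that mean. The function names, the `smoothing` parameter (the $\lambda$ of the formula), and the toy data are assumptions.

```python
from collections import defaultdict

def fit_target_encoder(categories, targets, smoothing=10.0):
    """Learn smoothed per-category target means from training data."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    # (n_c * mean_c + lambda * global_mean) / (n_c + lambda); n_c * mean_c = sums[c]
    table = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return table, global_mean

def apply_target_encoder(categories, table, global_mean):
    """Encode new data; unseen categories get the global training mean."""
    return [table.get(c, global_mean) for c in categories]

train_cats = ["a", "a", "a", "b"]
train_y = [1, 1, 0, 1]
table, global_mean = fit_target_encoder(train_cats, train_y, smoothing=2.0)
test_encoded = apply_target_encoder(["b", "never_seen"], table, global_mean)
print(table, test_encoded)
```

Category "b" appears once with target 1, so its raw mean would be 1.0, whereas the smoothed value is pulled toward the global mean of 0.75. Smoothing alone does not remove leakage on the training rows themselves, for which out-of-fold encoding is the usual complement.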