Model Interpretability

Model interpretability (also called explainability or XAI) is the degree to which a human can understand the cause of a model’s prediction. It is essential for debugging, building trust, satisfying regulatory requirements, and detecting bias.

Interpretability vs. accuracy: complex models (neural nets, gradient boosting) are often more accurate but harder to interpret. Interpretable models (linear regression, decision trees) trade some accuracy for transparency.

Scope of Explanation

Scope	Question
Global	How does the model behave overall? Which features matter most?
Local	Why did the model make this specific prediction for this instance?
Cohort	How does the model behave on a specific subgroup?

Intrinsically Interpretable Models

Linear Models

Coefficients $\mathbf{w}$ directly quantify the effect of each feature:

\[\hat{y} = \mathbf{w}^T \mathbf{x} + b\]

$w_j$ is the change in $\hat{y}$ per unit increase in $x_j$, holding all else constant.

Conditions for valid interpretation:

Features are on a comparable scale, e.g. standardized (otherwise coefficient magnitudes cannot be compared across features).
No strong multicollinearity (correlated features share credit arbitrarily).
Correct functional form (linear relationship holds).
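
The standardization condition can be illustrated with a small sketch. The weights, feature names, and data below are hypothetical; the point is only that multiplying a raw coefficient by its feature's standard deviation gives the effect of a one-standard-deviation increase, which is comparable across features.

```python
from statistics import pstdev

# Hypothetical fitted model y_hat = w . x + b (values invented for illustration).
weights = {"age_years": 0.5, "income_dollars": 0.0002}
bias = 1.0

# Toy data set, one dict per instance.
rows = [
    {"age_years": 25, "income_dollars": 30_000},
    {"age_years": 40, "income_dollars": 60_000},
    {"age_years": 55, "income_dollars": 90_000},
]

def predict(row):
    """Linear prediction: weighted sum of the features plus the bias."""
    return sum(weights[name] * row[name] for name in weights) + bias

def standardized_weights(rows):
    """Change in y_hat per one-standard-deviation increase in each feature."""
    return {
        name: weights[name] * pstdev([r[name] for r in rows])
        for name in weights
    }

std_w = standardized_weights(rows)
for name, value in std_w.items():
    print(f"{name}: raw={weights[name]}, per-std={value:.3f}")
```

The raw coefficients differ by more than three orders of magnitude, yet the per-standard-deviation effects are of similar size, which is why raw magnitudes alone should not be read as importance.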

Decision Trees

Provide natural rule-based explanations:

def predict(age, income):
    # Each root-to-leaf path is one rule.
    if age > 30:
        if income > 50_000:
            return "approved"
        return "denied"
    return "denied"

Each prediction path is a conjunction of conditions. Shallow trees (roughly depth $\leq 5$) are usually comprehensible to humans; deeper trees quickly lose interpretability because the number of paths can double with each level.
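
As a sketch of how a prediction path becomes an explanation, the tree can be stored as nested dictionaries and walked from root to leaf, collecting the condition taken at each split. The `TREE` layout and the `explain` helper are illustrative assumptions, not the format of any particular library.

```python
# Hypothetical tree mirroring the rule above: internal nodes test
# "feature > threshold", leaves carry a prediction.
TREE = {
    "feature": "age", "threshold": 30,
    "yes": {
        "feature": "income", "threshold": 50_000,
        "yes": {"predict": "approved"},
        "no": {"predict": "denied"},
    },
    "no": {"predict": "denied"},
}

def explain(tree, x):
    """Return (prediction, list of conditions on the path taken by x)."""
    conditions = []
    node = tree
    while "predict" not in node:
        name, thr = node["feature"], node["threshold"]
        if x[name] > thr:
            conditions.append(f"{name} > {thr}")
            node = node["yes"]
        else:
            conditions.append(f"{name} <= {thr}")
            node = node["no"]
    return node["predict"], conditions

prediction, path = explain(TREE, {"age": 42, "income": 38_000})
print(prediction, "because", " AND ".join(path))
```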

Rule-Based Models

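Decision rules: IF-THEN statements learned directly from data. For example, RuleFit extracts candidate rules from the splits of a tree ensemble and then fits a sparse (L1-regularized) linear model over the original features plus those rules.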

Scoring systems: additive point-based models (e.g., the FICO score) where each feature contributes a small integer number of points. A person can apply them by hand, adding up the points without a computer.
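
A point-based model can be sketched in a few lines. Every rule, point value, and the cut-off below is invented for illustration and is not taken from any real scoring system.

```python
# Hypothetical scorecard: (label, test, points). All values are made up.
SCORECARD = [
    ("age >= 60",           lambda p: p["age"] >= 60,           2),
    ("prior_defaults >= 1", lambda p: p["prior_defaults"] >= 1, 3),
    ("income < 20000",      lambda p: p["income"] < 20_000,     1),
]
THRESHOLD = 3  # hypothetical cut-off: 3 or more points means "high risk"

def score(person):
    """Return the total points and the rules that fired for this person."""
    fired = [(label, pts) for label, test, pts in SCORECARD if test(person)]
    return sum(pts for _, pts in fired), fired

total, fired = score({"age": 64, "prior_defaults": 0, "income": 15_000})
print(total, fired, "high risk" if total >= THRESHOLD else "low risk")
```

The list of fired rules is itself the explanation: each entry states which condition held and how many points it added.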

Post-hoc Explanation Methods

Applied after training any model.

Feature Importance

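Permutation importance: measure the drop in a validation score $m$ (higher is better) when the values of feature $j$ are randomly permuted:

\[\text{PI}_j = m(\hat{y}, y) - m(\hat{y}_{j\text{ permuted}}, y)\]

Model-agnostic, and it accounts for feature interactions, because permuting $j$ also destroys its interactions with other features (any drop in performance means $j$ was carrying information). It is not unbiased in general: with correlated features, permutation creates unrealistic instances, and the importance of a feature can be understated when a correlated feature carries the same information.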
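
The definition of $\text{PI}_j$ can be sketched without any library. The model, data, and the choice of accuracy as the metric $m$ are assumptions made for the example; the drop is averaged over several shuffles to reduce noise.

```python
import random

def accuracy(y_pred, y_true):
    """Fraction of matching labels (higher is better)."""
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

def permutation_importance(model, X, y, metric, n_repeats=20, seed=0):
    """Mean drop in `metric` when each column of X is shuffled in turn."""
    rng = random.Random(seed)
    baseline = metric([model(row) for row in X], y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - metric([model(r) for r in X_perm], y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical black box: uses feature 0 only and ignores feature 1.
def model(row):
    return int(row[0] > 0.5)

data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

pi = permutation_importance(model, X, y, accuracy)
print(pi)
```

Feature 1 gets an importance of exactly zero because the model never reads it, while shuffling feature 0 costs roughly half of the accuracy.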

SHAP-based importance: mean absolute SHAP value over the dataset. Consistent and theoretically grounded.

SHAP (SHapley Additive exPlanations)

Attributes the model’s prediction to each feature based on Shapley values from cooperative game theory.

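Shapley value: the average marginal contribution of feature $j$ across all possible orderings of features. Equivalently, it is a weighted sum over subsets $S$ of the other features, where $\mathcal{F}$ is the set of all features and $f(S)$ is the model's expected output when only the features in $S$ are known: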

\[\phi_j = \sum_{S \subseteq \mathcal{F} \setminus \{j\}} \frac{|S|!(|\mathcal{F}| - |S| - 1)!}{|\mathcal{F}|!} [f(S \cup \{j\}) - f(S)]\]

SHAP explanation: $\hat{f}(x) = \phi_0 + \sum_{j=1}^d \phi_j(x)$, where $\phi_0 = \mathbb{E}[\hat{f}(x)]$.

Properties:

Efficiency: $\sum_j \phi_j = \hat{f}(x) - \mathbb{E}[\hat{f}]$
Symmetry: features that contribute equally get equal values.
Dummy: features that never affect output get $\phi_j = 0$.
Additivity: SHAP values of a sum of models equal sum of individual SHAP values.
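
For a handful of features the Shapley formula can be evaluated exactly by enumerating subsets. The sketch below makes one simplifying assumption that should be stated loudly: absent features are replaced by a single reference point (`baseline`) instead of being averaged over a data distribution, so here $\phi_0$ is the model output at the baseline. The model is hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating every subset S of the other features."""
    d = len(x)

    def value(S):
        # Features in S take their value from x, the rest from the baseline
        # (an assumption standing in for the expectation over absent features).
        z = [x[i] if i in S else baseline[i] for i in range(d)]
        return model(z)

    phi = []
    for j in range(d):
        others = [i for i in range(d) if i != j]
        total = 0.0
        for size in range(d):
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for S in combinations(others, size):
                total += weight * (value(set(S) | {j}) - value(set(S)))
        phi.append(total)
    return phi

# Hypothetical model with an interaction term and an unused third feature.
def model(z):
    return 2.0 * z[0] + z[0] * z[1]

x = [1.0, 3.0, 5.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print(phi, sum(phi), model(x) - model(baseline))
```

The output illustrates two of the properties listed above: the values sum to the difference between the prediction and the baseline output (efficiency), and the unused feature receives zero (dummy). The cost grows as $2^d$, which is why approximations are used in practice.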

Algorithms:

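Variant	Target Model	Complexity and exactness
TreeSHAP	Tree-based models	$O(TLD^2)$; exact
KernelSHAP	Any model	$O(2^d)$ if computed exactly; approximated by sampling in practice
LinearSHAP	Linear models	$O(d)$; exact
DeepSHAP	Neural networks	Approximation via DeepLIFT

Here $T$ is the number of trees, $L$ the maximum number of leaves per tree, $D$ the maximum tree depth, and $d$ the number of features.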

LIME (Local Interpretable Model-agnostic Explanations)

Locally approximates the black-box model with a simple (linear) model around a specific instance $x'$.

Algorithm:

Sample perturbed instances around $x'$.
Get predictions from black-box model for each perturbation.
Weight by proximity to $x'$: $w_i = \exp(-d(x', z_i)^2 / \sigma^2)$.
Fit a sparse linear model to the weighted dataset.
Report linear coefficients as explanation.

Limitations: the local linearity assumption may not hold, and explanations can vary across runs because the perturbations are sampled randomly.
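
The algorithm above can be sketched in plain Python under some simplifications that are assumptions of this example, not part of LIME itself: perturbations are Gaussian, the distance is Euclidean, and the local model is an ordinary (dense) weighted least-squares fit instead of a sparse one. The black-box `model` and all parameter defaults are hypothetical.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def lime_explain(model, x0, n_samples=500, scale=0.5, sigma=1.0, seed=0):
    """Return [intercept, coef_1, ..., coef_d] of a proximity-weighted linear fit."""
    rng = random.Random(seed)
    d = len(x0)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x0]   # perturb around x0
        dist2 = sum((a - b) ** 2 for a, b in zip(x0, z))
        rows.append([1.0] + z)                          # leading 1.0 = intercept
        targets.append(model(z))                        # query the black box
        weights.append(math.exp(-dist2 / sigma ** 2))   # proximity weight
    # Weighted least squares via the normal equations (X^T W X) beta = X^T W y.
    A = [[sum(w * r[i] * r[k] for r, w in zip(rows, weights)) for k in range(d + 1)]
         for i in range(d + 1)]
    b = [sum(w * r[i] * t for r, t, w in zip(rows, targets, weights))
         for i in range(d + 1)]
    return solve(A, b)

# Hypothetical black box: nonlinear in feature 0, linear in feature 1.
def model(z):
    return z[0] ** 2 + 3.0 * z[1]

coefs = lime_explain(model, x0=[2.0, 1.0])
print(coefs)
```

Around the point (2, 1) the fitted slopes should land near the local gradient of the black box, about 4 for feature 0 and 3 for feature 1. Changing `seed` shifts the coefficients slightly, which is the instability noted in the limitations.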

Partial Dependence Plots (PDP)

Shows the marginal effect of one (or two) features on the predicted outcome, averaging over all other features:

\[\hat{f}_S(x_S) = \mathbb{E}_{x_C}[\hat{f}(x_S, x_C)] \approx \frac{1}{n}\sum_{i=1}^n \hat{f}(x_S, x_C^{(i)})\]

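Here $x_S$ is the feature (or pair of features) of interest and $x_C$ the remaining features. The averaging assumes $x_S$ and $x_C$ are independent; with correlated features it evaluates the model on unrealistic combinations and can be misleading.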
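
The sample average in the formula translates directly into code: force the feature of interest to each grid value for every row, predict, and average. The model and data are hypothetical.

```python
def partial_dependence(model, X, feature, grid):
    """Average prediction with column `feature` forced to each grid value."""
    curve = []
    for value in grid:
        preds = [model(row[:feature] + [value] + row[feature + 1:]) for row in X]
        curve.append(sum(preds) / len(preds))
    return curve

# Hypothetical model with an interaction between the two features.
def model(row):
    return 2.0 * row[0] + row[0] * row[1]

X = [[0.0, -1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 4.0]]
grid = [0.0, 1.0, 2.0]
pdp = partial_dependence(model, X, feature=0, grid=grid)
print(pdp)
```

Because the second feature averages to 1 in this data set, the curve has slope 3 in the first feature: 2 from the main effect plus the averaged interaction.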

Individual Conditional Expectation (ICE) Plots

One line per sample: shows how prediction changes as a single feature varies, holding all others at their observed values. PDP is the mean of ICE curves. Reveals heterogeneous effects that PDP averages away.

Centered ICE (c-ICE): subtract the value at a reference point to highlight differences in slopes.
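
A short sketch of ICE and c-ICE curves, again with a hypothetical model and data. It also checks the relationship stated above: the PDP is the pointwise mean of the ICE curves.

```python
def ice_curves(model, X, feature, grid):
    """One curve per row: predictions as `feature` sweeps the grid."""
    return [
        [model(row[:feature] + [value] + row[feature + 1:]) for value in grid]
        for row in X
    ]

def center(curves, ref_index=0):
    """c-ICE: subtract each curve's value at the reference grid point."""
    return [[v - curve[ref_index] for v in curve] for curve in curves]

# Hypothetical model where the effect of feature 0 flips sign with feature 1.
def model(row):
    return row[0] * row[1]

X = [[0.5, -1.0], [0.5, 1.0]]
grid = [0.0, 1.0, 2.0]

ice = ice_curves(model, X, feature=0, grid=grid)
pdp = [sum(col) / len(col) for col in zip(*ice)]   # PDP = mean of ICE curves
c_ice = center(ice)
print(ice, pdp)
```

The two ICE curves slope in opposite directions, yet their mean is flat. A PDP alone would suggest that the feature has no effect.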

Accumulated Local Effects (ALE)

Avoids the feature independence assumption of PDP. Averages local effects (partial derivatives) over the conditional distribution $P(X_{-j} \mid X_j)$ rather than the marginal, then accumulates them along $x_j$:

\[\hat{f}_{j,\text{ALE}}(x_j) = \int_{z_0}^{x_j} \mathbb{E}_{X_{-j}|X_j=z}\!\left[\frac{\partial \hat{f}}{\partial X_j}(z, X_{-j})\right] dz\]

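Unbiased even with correlated features, because the model is only evaluated near observed feature combinations. Here $z_0$ is a lower bound of the range of $x_j$ (typically its minimum observed value); in practice the curve is usually also centered to have zero mean.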
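
In practice the integral is estimated with finite differences over bins of $x_j$. The sketch below is a minimal, uncentered version of that estimator; the bin edges, model, and data are assumptions for the example, and the derivative is replaced by the prediction difference across each bin.

```python
def ale_curve(model, X, feature, edges):
    """Uncentered first-order ALE of `feature`, evaluated at each bin edge."""
    ale = [0.0]
    for k in range(1, len(edges)):
        lo, hi = edges[k - 1], edges[k]
        # Only rows whose feature value lies in this bin contribute, so the
        # model is evaluated close to feature combinations that were observed.
        in_bin = [r for r in X
                  if (lo < r[feature] <= hi) or (k == 1 and r[feature] == lo)]
        diffs = [
            model(r[:feature] + [hi] + r[feature + 1:])
            - model(r[:feature] + [lo] + r[feature + 1:])
            for r in in_bin
        ]
        local_effect = sum(diffs) / len(diffs) if diffs else 0.0
        ale.append(ale[-1] + local_effect)
    return ale

# Hypothetical model and perfectly correlated data (feature 1 equals feature 0).
def model(row):
    return row[0] * row[1]

X = [[0.5, 0.5], [1.5, 1.5], [2.5, 2.5]]
edges = [0.0, 1.0, 2.0, 3.0]
ale = ale_curve(model, X, feature=0, edges=edges)
print(ale)
```

With this data the conditional mean of the partial derivative at $x_0 = z$ is $z$, so the accumulated effect follows $z^2/2$ at the bin edges. A PDP on the same data would instead report a straight line with slope 1.5, the marginal mean of feature 1.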

Saliency and Attribution for Neural Networks

Method	Approach
Gradient (Saliency map)	$\lvert\partial \hat{y} / \partial x_j\rvert$; highlights input features by gradient magnitude
Integrated Gradients	Integrates gradients from baseline to input; satisfies completeness axiom
Grad-CAM	Gradient-weighted class activation maps; localizes image regions driving predictions
LIME for images	Superpixel perturbations; identifies key image regions
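
The completeness axiom of Integrated Gradients can be checked numerically on a small function. This sketch assumes a hand-written gradient for a hypothetical two-input model and approximates the path integral with a midpoint Riemann sum; a real network would supply the gradients through automatic differentiation.

```python
def integrated_gradients(grad_fn, x, baseline, steps=1000):
    """Midpoint Riemann approximation of Integrated Gradients."""
    d = len(x)
    avg_grad = [0.0] * d
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for j in range(d):
            avg_grad[j] += g[j] / steps
    # Attribution = (input - baseline) * average gradient along the path.
    return [(xi - b) * gj for xi, b, gj in zip(x, baseline, avg_grad)]

# Hypothetical differentiable model and its analytic gradient.
def f(z):
    return z[0] ** 2 + z[0] * z[1]

def grad_f(z):
    return [2.0 * z[0] + z[1], z[0]]

x, baseline = [1.0, 2.0], [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
print(attributions, sum(attributions), f(x) - f(baseline))
```

The attributions sum to the difference between the model output at the input and at the baseline, which is what completeness requires.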

Global Surrogate Models

Train an interpretable model (e.g., decision tree) to mimic the black-box model’s predictions over the entire input space. Explanation quality depends on how well the surrogate approximates the original.

\[R^2_{\text{surrogate}} = 1 - \frac{\sum_i (g(x_i) - \hat{f}(x_i))^2}{\sum_i (\hat{f}(x_i) - \bar{\hat{f}})^2}\]

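An $R^2$ close to 1 means the surrogate $g$ reproduces the black-box predictions $\hat{f}$ well on the evaluated data. A low $R^2$ means the surrogate's explanations should not be trusted as a description of the original model.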
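
The fidelity measure is computed against the black-box predictions, not the true labels. A minimal sketch, with a hypothetical black box and a hand-built one-split surrogate whose leaf values are the means of the black-box outputs on each side of the split:

```python
def surrogate_r2(black_box, surrogate, X):
    """R^2 of the surrogate g measured against black-box predictions f_hat."""
    f_hat = [black_box(x) for x in X]
    g = [surrogate(x) for x in X]
    mean_f = sum(f_hat) / len(f_hat)
    ss_res = sum((gi - fi) ** 2 for gi, fi in zip(g, f_hat))
    ss_tot = sum((fi - mean_f) ** 2 for fi in f_hat)
    return 1.0 - ss_res / ss_tot   # undefined if f_hat is constant

# Hypothetical black box on a single numeric input.
def black_box(x):
    return x ** 2

# Hypothetical surrogate: a decision stump splitting at x <= 2.
def stump(x):
    return 5.0 / 3.0 if x <= 2 else 12.5

X = [0, 1, 2, 3, 4]
r2 = surrogate_r2(black_box, stump, X)
print(round(r2, 3))
```

Even this single split captures roughly four fifths of the variance in the black-box output here, but the remaining gap is a reminder that the stump explains the surrogate, not the original model.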

Fairness and Bias

Interpretability enables fairness auditing:

Inspect SHAP values or PDP for sensitive attributes (age, race, gender).
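Disparate impact: $\frac{P(\hat{y}=1 \mid A=0)}{P(\hat{y}=1 \mid A=1)}$ should be $\geq 0.8$ (four-fifths rule), where $A=0$ denotes the unprivileged group, $A=1$ the privileged group, and $\hat{y}=1$ the favorable outcome.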
Equalized odds: equalize TPR and FPR across groups.
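
Both group metrics can be computed from predictions, labels, and a group indicator. The data below is invented, and group 0 is taken to be the unprivileged group as in the disparate impact ratio above.

```python
def rate(values):
    """Mean of a list of 0/1 values."""
    return sum(values) / len(values)

def disparate_impact(y_pred, group, unprivileged=0, privileged=1):
    """Ratio of positive-prediction rates: unprivileged over privileged."""
    p_unpriv = rate([p for p, a in zip(y_pred, group) if a == unprivileged])
    p_priv = rate([p for p, a in zip(y_pred, group) if a == privileged])
    return p_unpriv / p_priv

def equalized_odds_gaps(y_true, y_pred, group):
    """Absolute TPR and FPR differences between groups 0 and 1."""
    def tpr_fpr(a):
        pos = [p for t, p, g in zip(y_true, y_pred, group) if g == a and t == 1]
        neg = [p for t, p, g in zip(y_true, y_pred, group) if g == a and t == 0]
        return rate(pos), rate(neg)
    (tpr0, fpr0), (tpr1, fpr1) = tpr_fpr(0), tpr_fpr(1)
    return abs(tpr0 - tpr1), abs(fpr0 - fpr1)

# Hypothetical labels and predictions for eight applicants.
group  = [0, 0, 0, 0, 1, 1, 1, 1]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

di = disparate_impact(y_pred, group)
tpr_gap, fpr_gap = equalized_odds_gaps(y_true, y_pred, group)
print(di, tpr_gap, fpr_gap)
```

In this toy data the ratio is one third, well below the 0.8 threshold, and both the TPR and FPR differ by 0.5 between groups, so the predictions fail both checks.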

Choosing an Explanation Method

Scenario	Recommended Method
Tree model, any scope	TreeSHAP
Any model, local explanation	LIME or KernelSHAP
Feature ranking	Permutation importance or mean SHAP
Single feature effect	ALE or ICE plot
Neural network, image	Grad-CAM or Integrated Gradients
Regulatory requirement (actionable)	Decision rules, scoring systems
Debugging / data quality	Residual analysis + SHAP