Hypothesis Testing

Hypothesis testing provides a framework for making decisions about population parameters based on sample data. It answers: “Is the observed effect real, or could it be due to random chance?”

Basic Framework

Null hypothesis ($H_0$): Default assumption (no effect, no difference, status quo).

Alternative hypothesis ($H_1$ or $H_a$): What we want to find evidence for.

Test statistic: A function of the data used to decide between hypotheses.

Rejection region: Values of the test statistic that lead to rejecting $H_0$.

Types of Errors

Decision	$H_0$ True	$H_0$ False
Reject $H_0$	Type I error (probability $\alpha$)	Correct (probability $1 - \beta$, the power)
Fail to reject $H_0$	Correct (probability $1 - \alpha$)	Type II error (probability $\beta$)

Type I error: False positive (rejecting a true null hypothesis). Its probability is denoted $\alpha$.

Type II error: False negative (failing to reject a false null hypothesis). Its probability is denoted $\beta$.

Power ($1 - \beta$): Probability of correctly rejecting a false null hypothesis.

Significance Level

Significance level ($\alpha$): Maximum acceptable Type I error rate.

Common choices: $\alpha = 0.05$, $\alpha = 0.01$, $\alpha = 0.001$.

Decision rule: Reject $H_0$ if p-value $< \alpha$.

P-value

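The p-value is the probability, computed assuming $H_0$ is true, of observing a test statistic at least as extreme as the observed value. For a right-tailed test:

$$ \text{p-value} = P(T \geq t_{\text{obs}} \mid H_0) $$

For a left-tailed test the inequality is reversed, and for a two-tailed test with a symmetric null distribution the p-value is $P(|T| \geq |t_{\text{obs}}| \mid H_0)$.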

Interpretation:

Small p-value ($< \alpha$): Data is unlikely under $H_0$ → reject $H_0$
Large p-value ($\geq \alpha$): Data is consistent with $H_0$ → fail to reject $H_0$ (this is not evidence that $H_0$ is true)

Important: the p-value is NOT:

The probability that $H_0$ is true
The probability that the result is due to chance
A measure of effect size or practical significance

Common Tests

Z-test (One Sample)

Test if population mean equals a specified value $\mu_0$ (known variance $\sigma^2$).

Hypotheses:

$H_0: \mu = \mu_0$
$H_1: \mu \neq \mu_0$ (two-tailed), $\mu > \mu_0$ or $\mu < \mu_0$ (one-tailed)

Test statistic:

$$ Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} $$

Distribution: $Z \sim \mathcal{N}(0, 1)$ under $H_0$ (exact for normally distributed data, approximate for large $n$ by the central limit theorem).
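As a minimal sketch of the statistic above, the following computes $Z$ and a two-tailed p-value using only Python's standard library (`statistics.NormalDist`). The function name `z_test` and the numbers are illustrative assumptions, not data from any real study.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n):
    """Two-tailed one-sample z-test; returns (z, p_value)."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Two-tailed p-value: probability of |Z| >= |z| under the standard normal.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical inputs: n = 25, sample mean 103, H0 mean 100, known sigma 10.
z, p = z_test(sample_mean=103.0, mu0=100.0, sigma=10.0, n=25)
print(f"z = {z:.3f}, p = {p:.4f}")
```

With these made-up inputs the statistic is $z = 1.5$ and the p-value is about 0.13, so $H_0$ would not be rejected at $\alpha = 0.05$.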

T-test (One Sample)

Test if population mean equals $\mu_0$ (unknown variance, estimated from data).

Test statistic:

$$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} $$

Distribution: $t \sim t_{n-1}$ under $H_0$ (Student’s t with $n-1$ degrees of freedom), assuming normally distributed data. Here $s$ is the sample standard deviation.
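The statistic itself is easy to compute by hand. The sketch below uses only the standard library, which has no Student's t distribution, so it stops at the statistic and the degrees of freedom; the p-value would come from a t table or a statistics package. The name `one_sample_t` and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(data, mu0):
    """Return (t statistic, degrees of freedom) for H0: mu = mu0."""
    n = len(data)
    # statistics.stdev uses the n - 1 denominator, matching s in the formula.
    t = (mean(data) - mu0) / (stdev(data) / sqrt(n))
    return t, n - 1


# Hypothetical measurements.
data = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
t_stat, df = one_sample_t(data, mu0=5.0)
print(f"t = {t_stat:.3f}, df = {df}")
```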

T-test (Two Sample)

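Compare means from two independent groups ($H_0: \mu_1 = \mu_2$). The pooled form below assumes the two populations have equal variances; when that assumption is doubtful, Welch's t-test is the usual alternative.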

Test statistic:

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$

where $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}$ (pooled variance).

Distribution: $t \sim t_{n_1 + n_2 - 2}$.
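A short sketch of the pooled statistic, under the equal-variance assumption and using only the standard library. The function name `pooled_t` and both samples are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(x1, x2):
    """Return (t statistic, degrees of freedom) for the pooled two-sample t-test."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    t = (mean(x1) - mean(x2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical samples.
group_a = [12.0, 14.0, 16.0, 18.0]
group_b = [10.0, 11.0, 13.0, 14.0]
t_stat, df = pooled_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df}")
```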

Paired T-test

Compare means from matched pairs (before/after, same subjects).

Test statistic:

$$ t = \frac{\bar{d}}{s_d / \sqrt{n}} $$

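where $d_i = x_{i1} - x_{i2}$ are the within-pair differences, $\bar{d}$ and $s_d$ are their sample mean and standard deviation, and $n$ is the number of pairs.

Distribution: $t \sim t_{n-1}$ under $H_0$ (mean difference is zero).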
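The paired test is a one-sample t-test applied to the differences, which the following sketch makes explicit. The function name `paired_t` and the before/after values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t statistic, degrees of freedom) for matched pairs."""
    # Differences d_i = x_i1 - x_i2 for each matched pair.
    d = [a - b for a, b in zip(first, second)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n)), n - 1


# Hypothetical before/after scores for five subjects.
before = [80.0, 75.0, 90.0, 85.0, 70.0]
after = [78.0, 70.0, 88.0, 80.0, 69.0]
t_stat, df = paired_t(before, after)
print(f"t = {t_stat:.3f}, df = {df}")
```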

Chi-Squared Test (Goodness of Fit)

Test if observed frequencies match expected frequencies.

Test statistic:

$$ \chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} $$

Distribution: $\chi^2 \sim \chi^2_{k-1-p}$ (approximately, for large expected counts), where $k$ is the number of categories and $p$ is the number of parameters estimated from the data.
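A small worked example may help: a hypothetical die rolled 60 times, tested against the fair-die expectation of 10 per face. The function name `chi2_gof` and the counts are assumptions for illustration; no parameters are estimated, so $p = 0$.

```python
def chi2_gof(observed, expected):
    """Pearson chi-squared statistic for goodness of fit."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for faces 1..6 over 60 rolls.
observed = [8, 9, 12, 11, 6, 14]
expected = [10.0] * 6
stat = chi2_gof(observed, expected)
df = len(observed) - 1  # k - 1 - p with p = 0
print(f"chi2 = {stat:.2f}, df = {df}")
```

The statistic here is 4.2, well below the tabulated 5% critical value of about 11.07 for 5 degrees of freedom, so these counts give no reason to reject fairness.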

Chi-Squared Test (Independence)

Test if two categorical variables are independent.

Test statistic:

$$ \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$

where $E_{ij} = \frac{(\text{row total}_i)(\text{column total}_j)}{\text{grand total}}$.

Distribution: $\chi^2 \sim \chi^2_{(r-1)(c-1)}$.
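The expected counts and the statistic can be computed directly from a contingency table, as in this sketch. The function name `chi2_independence` and the 2×2 table are hypothetical.

```python
def chi2_independence(table):
    """Return (chi-squared statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence.
            exp = row_totals[i] * col_totals[j] / total
            stat += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical counts: rows are groups, columns are outcomes.
table = [[30, 10], [20, 40]]
stat, df = chi2_independence(table)
print(f"chi2 = {stat:.2f}, df = {df}")
```

For this made-up table the expected counts are 20, 20, 30, 30 and the statistic is about 16.67 with 1 degree of freedom, far above the tabulated 5% critical value of about 3.84.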

F-test (ANOVA)

Compare means across $k > 2$ groups.

Test statistic:

$$ F = \frac{\text{Between-group variance}}{\text{Within-group variance}} = \frac{MS_B}{MS_W} $$

Distribution: $F \sim F_{k-1, N-k}$ under $H_0$, where $N$ is the total number of observations across all groups.

Null hypothesis: All group means are equal.
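To make $MS_B$ and $MS_W$ concrete, the sketch below computes them from sums of squares for a one-way layout. The function name `one_way_anova_f` and the three groups are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, (df_between, df_within)) for one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within, (k - 1, n_total - k)


# Hypothetical data for three groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, dfs = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}, df = {dfs}")
```

Here the between-group sum of squares is 26 and the within-group sum is 6, giving $F = 13$ on $(2, 6)$ degrees of freedom.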

F-test (Variance Comparison)

Compare variances from two populations.

Test statistic:

$$ F = \frac{s_1^2}{s_2^2} $$

Distribution: $F \sim F_{n_1-1, n_2-1}$ under $H_0: \sigma_1^2 = \sigma_2^2$, assuming both populations are normal.

Confidence Intervals

A $100(1-\alpha)\%$ confidence interval is a range that, under repeated sampling, would contain the true parameter value in $100(1-\alpha)\%$ of cases.

For mean (known $\sigma$):

$$ \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} $$

For mean (unknown $\sigma$):

$$ \bar{x} \pm t_{\alpha/2, n-1} \frac{s}{\sqrt{n}} $$

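Interpretation (frequentist): The coverage probability describes the procedure, not any single interval. If we repeated the experiment many times, $100(1-\alpha)\%$ of the computed intervals would contain the true parameter; a particular computed interval either contains it or does not.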

Relationship Between Hypothesis Tests and Confidence Intervals

A two-tailed hypothesis test at level $\alpha$ rejects $H_0: \mu = \mu_0$ if and only if $\mu_0$ is outside the $100(1-\alpha)\%$ confidence interval.
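The known-$\sigma$ interval and its link to the two-tailed test can be sketched with the standard library's `NormalDist.inv_cdf`. The function name `z_confidence_interval` and the inputs are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, alpha=0.05):
    """Return (low, high) for a 100(1 - alpha)% interval with known sigma."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    half_width = z_crit * sigma / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Hypothetical inputs: sample mean 103, known sigma 10, n = 25.
low, high = z_confidence_interval(103.0, 10.0, 25)
print(f"95% CI: ({low:.2f}, {high:.2f})")  # roughly (99.08, 106.92)

# Duality with the test: mu0 = 100 lies inside the interval,
# so a two-tailed z-test at alpha = 0.05 would not reject H0: mu = 100.
mu0_inside = low <= 100.0 <= high
```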

Multiple Testing Problem

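When conducting $m$ independent hypothesis tests, each at level $\alpha$ and with every null hypothesis true, the probability of at least one false positive (the family-wise error rate) grows with $m$: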

$$ P(\text{at least one Type I error}) = 1 - (1 - \alpha)^m $$

For $m = 20$ tests at $\alpha = 0.05$: ~64% chance of at least one false positive.
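The growth is easy to tabulate. This sketch evaluates the formula above for a few values of $m$, under the same independence assumption; the function name `familywise_error_rate` is hypothetical.

```python
def familywise_error_rate(alpha, m):
    """P(at least one Type I error) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


for m in (1, 5, 20, 100):
    print(m, round(familywise_error_rate(0.05, m), 3))

fwer_20 = familywise_error_rate(0.05, 20)  # about 0.64, matching the text
```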

Correction Methods

Bonferroni correction: Divide significance level by number of tests.

$$ \alpha_{\text{adjusted}} = \frac{\alpha}{m} $$

Conservative but simple. It controls the family-wise error rate at $\alpha$ without requiring the tests to be independent.
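A minimal sketch of the correction, assuming a list of raw p-values; the function name `bonferroni_reject` and the p-values are made up for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where H0 is rejected."""
    threshold = alpha / len(p_values)  # alpha_adjusted = alpha / m
    return [p <= threshold for p in p_values]


# Hypothetical p-values from m = 5 tests; the threshold is 0.05 / 5 = 0.01.
p_values = [0.001, 0.008, 0.012, 0.041, 0.20]
bonf = bonferroni_reject(p_values)
print(bonf)
```

With these values only the two smallest p-values fall below the adjusted threshold.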

Holm-Bonferroni method: Step-down procedure (less conservative).

Sort p-values: $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m)}$
For $i = 1, 2, \ldots, m$ in order, reject $H_{(i)}$ if $p_{(i)} \leq \frac{\alpha}{m - i + 1}$
Stop at the first non-rejection and fail to reject all remaining hypotheses
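The step-down procedure can be sketched as follows. The function name `holm_reject` and the p-values are hypothetical; results are returned in the original input order.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down; returns booleans in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank 0 corresponds to i = 1
        if p_values[idx] <= alpha / (m - rank):  # alpha / (m - i + 1)
            reject[idx] = True
        else:
            break  # stop at the first non-rejection
    return reject


# Hypothetical p-values, deliberately unsorted.
p_values = [0.041, 0.001, 0.20, 0.012, 0.008]
holm = holm_reject(p_values)
print(holm)
```

For these values the sorted thresholds are 0.01, 0.0125, 0.0167, 0.025, 0.05, so the three smallest p-values are rejected. A plain Bonferroni threshold of 0.01 would reject only two of them, which illustrates why Holm is less conservative.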

Benjamini-Hochberg procedure: Controls the false discovery rate (FDR), the expected proportion of false positives among the rejected hypotheses.

Sort p-values: $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m)}$
Find largest $k$ such that $p_{(k)} \leq \frac{k}{m} \alpha$
Reject all $H_{(i)}$ for $i \leq k$
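A sketch of the step-up procedure, with the hypothetical function name `benjamini_hochberg_reject` and made-up p-values chosen so that the "largest $k$" rule matters.

```python
def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up; returns booleans in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k = rank  # keep the largest rank that satisfies the condition
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


# Hypothetical p-values, deliberately unsorted.
p_values = [0.038, 0.001, 0.20, 0.035, 0.008]
bh = benjamini_hochberg_reject(p_values)
print(bh)
```

Sorted, the p-values are 0.001, 0.008, 0.035, 0.038, 0.20 against thresholds 0.01, 0.02, 0.03, 0.04, 0.05. The third fails its own threshold, but the fourth passes, so $k = 4$ and all four smallest p-values are rejected.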

Power Analysis

Statistical power: $P(\text{reject } H_0 \mid H_1 \text{ is true}) = 1 - \beta$

Factors affecting power:

Sample size ($n$): larger $n$ → higher power
Effect size: larger effect → higher power
Significance level ($\alpha$): higher $\alpha$ → higher power (but more Type I errors)
Variance: lower variance → higher power

A priori power analysis: Determine the required sample size before conducting the study, given a target power, a significance level, and an assumed effect size.
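As one concrete case, for a two-tailed one-sample z-test the standard approximation is $n \approx \left( (z_{\alpha/2} + z_{\beta}) \, \sigma / \delta \right)^2$, where $\delta$ is the difference in means to be detected. This formula is not stated in the text above; it is the usual textbook approximation, and the function name `required_sample_size` and the inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n for a two-tailed one-sample z-test to detect a shift delta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # z_{beta}, with beta = 1 - power
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


# Hypothetical planning values: detect a shift of 5 units when sigma = 10.
n_needed = required_sample_size(delta=5.0, sigma=10.0)
print(n_needed)
```

Halving the detectable shift roughly quadruples the required sample size, which reflects the effect-size factor listed above.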

Non-parametric Tests

Non-parametric tests are used when distributional assumptions (e.g., normality) are violated or doubtful.

Mann-Whitney U Test (Wilcoxon Rank-Sum)

Non-parametric alternative to two-sample t-test.

Procedure: Rank all observations from both groups together, then sum the ranks within each group.
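The ranking step can be sketched as below. The conversion from the rank sum $R_1$ to the statistic, $U_1 = R_1 - n_1(n_1+1)/2$, is the standard definition and is not given in the text above; ties receive average ranks. The function name `mann_whitney_u` and the samples are hypothetical, and the sketch stops at the statistic rather than a p-value.

```python
def mann_whitney_u(x, y):
    """Return (U1, U2) using joint ranks, with average ranks for ties."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for pos in range(i, j + 1):
            ranks[pos] = avg_rank
        i = j + 1
    # Rank sum for the first group.
    r1 = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    n1, n2 = len(x), len(y)
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    return u1, u2


# Hypothetical samples.
u1, u2 = mann_whitney_u([1.1, 2.3, 3.0], [2.5, 4.2, 5.1, 6.0])
print(u1, u2)
```

Here the first group holds ranks 1, 2, and 4, so $R_1 = 7$, $U_1 = 1$, and $U_2 = 11$.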

Wilcoxon Signed-Rank Test

Non-parametric alternative to paired t-test.

Kruskal-Wallis Test

Non-parametric alternative to one-way ANOVA.