(ML 11.1) Estimators

Model the data as random variables: $D = (X_1, \ldots, X_n)$.

Definition. A statistic is a random variable $S = f(D)$ that is a function of the data $D$.

Terminology. An estimator is a statistic intended to approximate a parameter governing the distribution of $D$.

Notation.

  1. $\hat{\theta}$ denotes an estimator for a parameter $\theta$.
  2. $\hat{\theta}_n$ emphasizes the dependence on $n$.

Example. $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$, iid

  • Sample mean: $\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ (cf. $\sigma^2 = \mathbb{E}((X - \mu)^2)$)
  • “Biased” sample variance: $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(X_i -\bar{X})^2$
  • “Unbiased” sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i -\bar{X})^2$

Definition.

  1. The bias of an estimator $\hat{\theta}$ is $\,$ ${\rm bias}(\hat{\theta}) = \mathbb{E}(\hat{\theta}) - \theta$.
  2. An estimator $\hat{\theta}$ is unbiased if $\,$ ${\rm bias}(\hat{\theta}) = 0$.

Example.

  1. $\hat{\mu}$ is unbiased: $\mathbb{E}(\hat{\mu}) = \mathbb{E}(\frac{1}{n}\sum_{i=1}^nX_i) =\frac{1}{n}\sum \mathbb{E}(X_i) = \frac{1}{n}\sum \mu = \mu$
  2. $\hat{\sigma}^2$ is biased. (Exercise)
  3. $s^2$ is unbiased. (Exercise)
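A quick Monte Carlo check of claims 2 and 3 (a sketch in Python/NumPy, not part of the notes; the choices $\mu = 0$, $\sigma^2 = 4$, $n = 5$, and the random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, trials = 0.0, 4.0, 5, 200_000  # arbitrary illustrative choices

# Draw many iid datasets X_1, ..., X_n ~ N(mu, sigma2).
X = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
xbar = X.mean(axis=1, keepdims=True)

sigma_hat2 = ((X - xbar) ** 2).sum(axis=1) / n        # "biased" sample variance
s2         = ((X - xbar) ** 2).sum(axis=1) / (n - 1)  # "unbiased" sample variance

# E[sigma_hat2] should be close to (n-1)/n * sigma2 = 3.2; E[s2] close to sigma2 = 4.
print(np.mean(sigma_hat2), np.mean(s2))
```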

(ML 11.2) Decision theory terminology in different contexts

| General | Estimators | Regression/Classification (SL) |
|---|---|---|
| Decision rule $\delta$ | Estimator function $g$ | Prediction function $f$ |
| State $s$ (unknown) | Parameter $\theta$ (unknown) | Target value $Y$ (unknown) |
| Data $D$ (observed) | Data $D$ (observed) | Point $X$ (observed) |
| Action $a = \delta(D)$ | Estimator/estimate $\hat{\theta}=g(D)$ | Prediction $\hat{Y} = f(X)$ |
| Loss $L(s, a)$ | Loss $L(\theta, \hat{\theta})$ | Loss $L(Y, \hat{Y})$ |

Example. (Estimators)
An estimator is a random variable: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i$.
An estimate is a number: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i = 2.3$.
(In some situations, the procedure $g$ itself is referred to as the estimator!)
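The same distinction in code (a tiny Python/NumPy sketch; the observed values are made up):

```python
import numpy as np

def mu_hat(data):
    # The estimator, viewed as a procedure g: a function of the data.
    return np.mean(data)

x = np.array([2.1, 2.5, 2.3])  # hypothetical observed values x_1, ..., x_n
estimate = mu_hat(x)           # the estimate: a plain number (here 2.3)
print(estimate)
```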


(ML 11.3) Frequentist risk, Bayesian expected loss, and Bayes risk

Loss and Risk. Exciting session to clear up all the mud!

Data: $D = (X_1, \ldots, X_n)$, $D \sim p_\theta$
Parameter: $\theta \sim \pi$, i.e., the parameter $\theta$ is a random variable.
Estimator: $\hat{\theta} = f(D) = \delta(D)$

Everything begins with: Loss $= L(\theta, f(D))$.
We want to minimize the loss, but it's a random variable!
Two options to deal with it:

  1. Average over $\theta$ given the data: $\mathbb{E}(L(\theta, f(D)) \mid D) =: \rho(\pi, f(D))$, the Bayesian expected loss.
  2. Average over the data given $\theta$: $\mathbb{E}(L(\theta, f(D)) \mid \theta) =: R(\theta, f)$, the (frequentist) risk.

Averaging over both $\theta$ and the data gives the Bayes risk: $r(\pi, f) := \mathbb{E}(L(\theta, f(D))) = \mathbb{E}(\rho(\pi, f(D))) = \mathbb{E}(R(\theta, f))$.

Bayesians vs. frequentists: Bayesians treat $\theta$ as random and condition on the observed data, working with $\rho$ (and $r$); frequentists treat $\theta$ as fixed and average over the data, working with $R(\theta, f)$.
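To make the three quantities concrete, here is a Monte Carlo sketch (Python/NumPy, not from the notes). It assumes a conjugate setup: prior $\theta \sim N(0, 1)$, data $X_i \mid \theta \sim N(\theta, 1)$ iid, squared loss, and the (not necessarily optimal) estimator $f(D) = \bar{X}$; the values of $n$, the fixed $\theta_0$, the observed $\bar{x}$, and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 200_000  # arbitrary sample size and Monte Carlo size

# (1) Frequentist risk R(theta0, f) = E[(theta0 - Xbar)^2 | theta0]: fix theta0, average over data.
theta0 = 1.7
X = rng.normal(theta0, 1.0, size=(trials, n))
print("R(theta0, f)  ≈", np.mean((theta0 - X.mean(axis=1)) ** 2), " (closed form: 1/n =", 1 / n, ")")

# (2) Bayesian expected loss rho(pi, f(D)) = E[(theta - xbar)^2 | D]: fix the data, average over
# the posterior, which for this conjugate pair is theta | D ~ N(n*xbar/(n+1), 1/(n+1)).
xbar_obs = 0.4  # hypothetical observed sample mean
post = rng.normal(n * xbar_obs / (n + 1), np.sqrt(1 / (n + 1)), size=trials)
print("rho(pi, f(D)) ≈", np.mean((post - xbar_obs) ** 2),
      " (closed form:", 1 / (n + 1) + (xbar_obs / (n + 1)) ** 2, ")")

# (3) Bayes risk r(pi, f) = E[L(theta, f(D))]: average over theta AND the data.
theta = rng.normal(0.0, 1.0, size=trials)
X = rng.normal(theta[:, None], 1.0, size=(trials, n))
print("r(pi, f)      ≈", np.mean((theta - X.mean(axis=1)) ** 2), " (closed form: 1/n)")
```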


(ML 11.4) Choosing a decision rule - Bayesian and frequentist

How to choose $f$.

Bayesian: Assume $\pi$
Case 1. Know $D$. Choose $f(D)$ to minimize $\rho(\pi, f(D))$
Case 2. Don’t know $D$. Choose $f$ to minimize $r(\pi, f)$

Frequentist: Introduce a further principle to guide your choice, e.g.:
(a) Unbiasedness
(b) Admissibility
(c) Minimax
(d) Invariance


(ML 11.5) Bias-Variance decomposition (MSE $=$ bias$^2$ + var)

“A super important part of ML is what’s called model selection, and a tool for model selection is the bias-variance decomposition.”

Almost trivial identity but extremely handy.

Definition. Let $D$ be random data. The MSE of an estimator $\hat{\theta} = f(D)$ for $\theta$ is ${\rm MSE}(\theta) = \mathbb{E}((\hat{\theta} - \theta)^2 \vert \theta)$.

Put $\vert \theta$ to emphasize we’re not averaging over $\theta$ here (we don’t have a distribution over $\theta$).
We’re averaging over the data!

${\rm MSE}(\theta)$ is nothing but the risk $R(\theta, f)$ under squared loss, i.e., when the loss function is the square of the difference: $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$.

Recall. ${\rm bias}(\hat{\theta}) = \mathbb{E}(\hat{\theta}) -\theta$.

Proposition. MSE$(\theta) = {\rm bias}(\hat{\theta})^2 + {\rm var}(\hat{\theta})$

Proof: Write $m = \mathbb{E}(\hat{\theta})$ (all expectations here are conditional on $\theta$). Then
${\rm MSE}(\theta) = \mathbb{E}((\hat{\theta} - \theta)^2) = \mathbb{E}((\hat{\theta} - m + m - \theta)^2) = \mathbb{E}((\hat{\theta} - m)^2) + 2(m - \theta)\,\mathbb{E}(\hat{\theta} - m) + (m - \theta)^2$.
The cross term vanishes since $\mathbb{E}(\hat{\theta} - m) = 0$, leaving ${\rm var}(\hat{\theta}) + {\rm bias}(\hat{\theta})^2$. $\square$

Silly example. $X \sim N(\theta, 1)$ where $\theta$ is nonrandom and unknown, and $D = X$.

  • “Natural” estimate of $\theta$ : $\delta_1(D) = X \leadsto$ bias$^2 = 0$, var$ = 1$, MSE$ =1$
  • “Silly” estimate of $\theta$ : $\delta_0(D) = 0 \leadsto$ bias$^2 = \theta^2$, var$ = 0$, MSE$ = \theta^2$

cf. Shrinkage, Stein’s paradox
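A numerical check of MSE $=$ bias$^2$ + var on the silly example above (a Python/NumPy sketch, not from the notes; $\theta = 1.5$ and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, trials = 1.5, 500_000  # theta is fixed but "unknown"; values are arbitrary

X = rng.normal(theta, 1.0, size=trials)  # D = X, one observation per simulated dataset

for name, est in [("delta_1(D) = X", X), ("delta_0(D) = 0", np.zeros(trials))]:
    mse   = np.mean((est - theta) ** 2)
    bias2 = (np.mean(est) - theta) ** 2
    var   = np.var(est)
    print(f"{name}:  MSE ≈ {mse:.4f}   bias^2 + var ≈ {bias2 + var:.4f}")
```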


(ML 11.6) Inadmissibility

Recall. $R(\theta, \delta) = \mathbb{E}(L(\theta, \delta(D)) \mid \theta)$.

Example. Target: $N(\theta, 1)$ where $\theta \in \mathbb{R}$.
Data: $D = (X_1, \ldots, X_n)$ where $X_i \sim N(\theta, 1)$, iid.
Loss: $L(\theta, \delta) = (\theta - \delta)^2$.

  • “Natural” : $\delta^{(1)}_n(D) = \bar{X}_n \leadsto$ $R(\theta, \delta^{(1)}) = \frac{1}{n}$.
  • “Silly” : $\delta^{(0)}_n(D) = 0 \leadsto$ $R(\theta, \delta^{(0)}) = \theta^2$.
(Figure: the risk curves $R(\theta, \delta^{(1)}) = \frac{1}{n}$ and $R(\theta, \delta^{(0)}) = \theta^2$ plotted against $\theta$; they cross, so neither rule dominates the other.)

Definition. Given decision rules $\delta, \delta'$, we say $\delta$ dominates $\delta'$ if $R(\theta, \delta) \le R(\theta, \delta')$ for all $\theta \in \Theta$ and $R(\theta, \delta) < R(\theta, \delta')$ for some $\theta \in \Theta$.

Definition. A decision rule $\delta$ is inadmissible if there is another decision rule that dominates $\delta$. Otherwise, $\delta$ is admissible.

Fact. Both $\delta^{(1)}$ and $\delta^{(0)}$ are admissible.
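A Monte Carlo sketch of the two risk curves (Python/NumPy, not from the notes; the $\theta$ grid, $n = 5$, and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000  # arbitrary sample size and Monte Carlo size

# Estimate R(theta, delta) = E[(theta - delta(D))^2 | theta] on a grid of theta values.
for theta in [0.0, 0.25, 0.5, 1.0, 2.0]:
    X = rng.normal(theta, 1.0, size=(trials, n))
    r1 = np.mean((theta - X.mean(axis=1)) ** 2)  # delta^(1)_n(D) = Xbar_n; should be about 1/n
    r0 = theta ** 2                              # delta^(0)_n(D) = 0; risk is exactly theta^2
    print(f"theta = {theta:4.2f}   R(theta, delta1) ≈ {r1:.3f}   R(theta, delta0) = {r0:.3f}")

# Neither curve lies below the other for all theta (the silly rule wins near theta = 0,
# the natural rule wins for large |theta|), so neither rule dominates the other.
```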


(ML 11.7) A fun exercise on inadmissibility

Exercise. $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$.
Assume $\mu$ is known.
Loss: $L(\sigma_1^2, \sigma_2^2) = (\sigma_1^2 - \sigma_2^2)^2$.
$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2$.
$s^2 = \frac{n}{n-1}\hat{\sigma}^2$.
$s_c^2 = c\hat{\sigma}^2$, $c \ge 0$.

  1. Does $\hat{\sigma}^2$ dominate $s^2$, or vice versa?
  2. Are $\hat{\sigma}^2$ and $s^2$ admissible?
  3. Is there a “best” $s_c^2$ for some $c \ge 0$?
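To explore the exercise numerically before solving it, here is a Monte Carlo sketch of the risk of $s_c^2$ for a few values of $c$ (Python/NumPy, not from the notes; $\mu = 0$, $\sigma^2 = 1$, $n = 10$, the grid of $c$ values, and the seed are arbitrary; note $c = 1$ gives $\hat{\sigma}^2$ and $c = \frac{n}{n-1}$ gives $s^2$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, trials = 10, 1.0, 200_000  # arbitrary illustrative choices

# Simulate many datasets and compute sigma_hat^2 = (1/n) * sum_i (X_i - Xbar)^2 for each.
X = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
sigma_hat2 = ((X - X.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

# Monte Carlo risk of s_c^2 = c * sigma_hat^2 under the loss (sigma^2 - s_c^2)^2.
for c in [0.8, 0.9, 1.0, n / (n - 1), 1.2]:
    risk = np.mean((c * sigma_hat2 - sigma2) ** 2)
    print(f"c = {c:.3f}   estimated risk = {risk:.4f}")
```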

(ML 11.8) Bayesian decision theory

Terminology. (Informal)

  • A generalized Bayes rule is a decision rule $\delta$ that minimizes $\rho(\pi, \delta(D))$ for each $D$.
  • A Bayes rule is a decision rule $\delta$ that minimizes $r(\pi, \delta)$.
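Example. Under squared loss $L(\theta, a) = (\theta - a)^2$, the Bayesian expected loss $\rho(\pi, a) = \mathbb{E}((\theta - a)^2 \mid D)$ is minimized over $a$ by the posterior mean, so (whenever this expectation exists) the generalized Bayes rule is $\delta(D) = \mathbb{E}(\theta \mid D)$.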

Remark.

  • GBR $\Rightarrow$ BR, but BR $\not\Rightarrow$ GBR.
  • If $r(\pi, \delta) = \infty$ for all $\delta$, then any rule is a BR, but a GBR still makes sense.
  • On a set of $\pi$-measure $0$, a BR can behave arbitrarily, but a GBR is still sensible.

Complete Class Theorems. Under mild conditions:

  1. Every GBR (for a proper $\pi$) is admissible;
  2. Every admissible decision rule is a GBR for some (possibly improper) $\pi$.
