Notes on Machine Learning 4: Maximum Likelihood Estimation
by 장승환
(ML 4.1) (ML 4.2) Maximum Likelihood Estimation (MLE) (part 1, 2)
Setup. Given data $D = (x_1, \ldots, x_n)$ where $x_i \in \mathbb{R}^d$.
Assume a family of distributions $\{p_\theta : \theta \in \Theta\}$ on $\mathbb{R}^d$. $\,$ $p_\theta(x) = p(x \vert \theta) $
Assume $D$ is a sample from $X_1, \ldots, X_n \sim p_\theta$ iid for some $\theta \in \Theta$.
Goal. Estimate the true $\theta$ that $D$ comes from.
Definition. $\theta_{\rm MLE}$ is an MLE for $\theta$ if (informally) $\theta_{\rm MLE} = \arg\max_{\theta \in \Theta} p(D\vert \theta)$.
More precisely, $p(D \vert \theta_{\rm MLE}) = \max_{\theta\in \Theta} p(D\vert \theta)$,
where $p(D\vert \theta) = p(x_1, \ldots, x_n \vert \theta) = \prod_{i=1}^n p(x_i\vert \theta) = \prod_{i=1}^n P[X_i = x_i\vert \theta]$ (the last equality in the discrete case).
Note: $p(D\vert \theta)$, viewed as a function of $\theta$, is called the likelihood function.
Remark.
- MLE might not be unique.
- MLE may fail to exist.
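As a concrete illustration of the definition above, here is a minimal sketch (not from the original notes; it assumes NumPy and uses a made-up Bernoulli example with a brute-force grid over the parameter space) that evaluates the log-likelihood at each candidate $\theta$ and keeps the maximizer:

```python
import numpy as np

# Hypothetical example: data from a Bernoulli(theta) family, Theta = [0, 1].
# Brute-force MLE: evaluate the log-likelihood on a grid and keep the argmax.
D = np.array([1, 0, 1, 1, 0, 1, 1, 1])        # made-up observations
thetas = np.linspace(0.001, 0.999, 999)       # grid over the parameter space

# log p(D | theta) = (#ones) log theta + (#zeros) log(1 - theta)
log_lik = D.sum() * np.log(thetas) + (len(D) - D.sum()) * np.log(1 - thetas)

theta_mle = thetas[np.argmax(log_lik)]
print(theta_mle)                              # ~0.75, i.e. the sample mean D.mean()
```

Maximizing the log-likelihood instead of the likelihood itself changes nothing, since $\log$ is strictly increasing.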
Pros:
- Easy
- Interpretable
- Asymptotic properties
$\,\,\,$ - Consistent (converges to the true value with probability $1$ as $n \rightarrow \infty$)
$\,\,\,$ - Asymptotically normal
$\,\,\,$ - Efficient (lowest asymptotic variance)
- Invariant under reparametrization: $g(\theta_{\rm MLE})$ is an MLE for $g(\theta)$
Cons:
- Point estimate, so no representation of uncertainty. $p(x \vert D)\approx p(x \vert \theta_{\rm MLE})$
- Overfitting (severe problem)
$\,\,\,$ - Regression
$\,\,\,$ - Black swan “paradox” (see the sketch after this list)
- Wrong objective? (disregards the loss)
- Existence & uniqueness is not guaranteed.
(ML 4.3) MLE for univariate Gaussian mean
Suppose we want to estimate the distribution of a RV $X \sim N(\theta, \sigma^2)$, where the mean $\theta$ is unknown and $\sigma^2$ is known.
The only information we have about the parameter is that it lies in the parameter space $\Theta$, i.e., $\theta \in \Theta$.
We have the density $\,$ $p(x \vert \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x - \theta)^2 \right)$.
Assume that we have data $D = (x_1, \ldots, x_n)$ gotten via $X_i \sim N(\theta, \sigma^2)$ iid ($X_i = x_i$)
For MLE, we want to maximize, over $\theta$, the likelihood
$p(D \vert \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x_i - \theta)^2 \right)$.
Equivalently, we can maximize its log instead:
$\log p(D \vert \theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \theta)^2$.
Setting $\frac{\partial}{\partial\theta} \log p(D \vert \theta) = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \theta) = 0$
gives $\theta_{\rm MLE} = \frac{1}{n}\sum_{i=1}^n x_i$, which is indeed a maximum since $\frac{\partial^2}{\partial\theta^2} \log p(D \vert \theta) = -\frac{n}{\sigma^2} < 0$.
In sum, we target $\theta_{\rm MLE}$ for the family of distributions $(N(\theta, \sigma^2))_{\theta \in \Theta}$,
and the sample mean $\frac{1}{n}\sum_{i=1}^n X_i$ provides an effective estimator of it.
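A short numerical check of this derivation (a sketch assuming NumPy; the values of $\theta$, $\sigma$, and $n$ are arbitrary choices made only to simulate data):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                                       # known variance parameter
true_theta = 1.5                                  # unknown in practice; fixed here to simulate D
D = rng.normal(true_theta, sigma, size=1000)      # X_1, ..., X_n ~ N(theta, sigma^2) iid

def log_lik(theta):
    # log p(D | theta) = -(n/2) log(2 pi sigma^2) - (1 / (2 sigma^2)) sum_i (x_i - theta)^2
    return (-0.5 * len(D) * np.log(2 * np.pi * sigma**2)
            - ((D - theta) ** 2).sum() / (2 * sigma**2))

theta_mle = D.mean()                              # closed-form MLE derived above
for t in (theta_mle - 0.1, theta_mle, theta_mle + 0.1):
    print(round(t, 4), round(log_lik(t), 4))      # the middle value has the largest log-likelihood
```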
(ML 4.4) (ML 4.5) MLE for a PMF on a finite set
Suppose we want to estimate the distribution of a RV $X \sim p$, where the PMF $p: \{1, \ldots, m\} \rightarrow [0,1]$ is unknown.
Assume the data $D = (x_1, \ldots, x_n)$ gotten via $X_1, \ldots, X_n \sim p$ iid.
Set the parameter $\theta = (\theta_1, \ldots, \theta_m) = (p_\theta(1), \ldots, p_\theta(m)) \in \Theta$
where $\Theta=\{\theta = (\theta_1, \ldots, \theta_m) : \sum_{i=1}^m \theta_i =1\}$.
That is, $\,$ $p_\theta(i) = \theta_i = p(i\vert \theta) = P[X= i\vert \theta]$ $\,$ and $\,$ $\sum_{i=1}^m p_\theta(i) = 1$.
Now we need $\,$ $p(D \vert \theta_{\rm MLE}) \ge p(D \vert \theta)$ for all $\theta \in \Theta$, where
$p(D\vert \theta) = \prod_{i=1}^n p(x_i \vert \theta) = \prod_{j=1}^m \theta_j^{n_j}$
with $n_j =$ #$\{i: x_i =j\}$.
In other words, we are maximizing the function $p(D \vert \cdot): \mathbb{R}^m \rightarrow [0,1]$ given by $\theta \mapsto \prod_{j=1}^m \theta_j^{n_j}$
under the constraint $\sum_{i=1}^m \theta_i = 1$.
In sum,
- maximize: $\,\,\,$ $\prod_{j=1}^m \theta_j^{n_j}$
- subject to: $\,\,$ $\sum_{i=1}^m \theta_i = 1$
Equivalently, we can maximize its log:
- maximize: $\,\,\,$ $\log p(D\vert \theta) = \sum_{j=1}^m n_j \log \theta_j$
- subject to: $\,\,$ $\sum_{i=1}^m \theta_i = 1$
Let’s define a PMF $q$ by $q_j = \frac{n_j}{n}$; note $\sum_{j= 1}^m q_j = 1$.
Now we have
$\log p(D\vert \theta) = \sum_{j=1}^m n_j \log \theta_j = n\sum_{j=1}^m q_j \log \theta_j = n\sum_{j=1}^m q_j \log q_j - n\sum_{j=1}^m q_j \log \frac{q_j}{\theta_j} = n\sum_{j=1}^m q_j \log q_j - n\, D(q \Vert \theta)$.
Note that the first term does not depend on $\theta$, and $D(q \Vert \theta) \ge 0$ with equality if and only if $\theta = q$ (see the notes on relative entropy below).
Thus, $\theta_{\rm MLE} = \left(\frac{n_1}{n}, \ldots, \frac{n_m}{n}\right)$, the empirical distribution!
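A small sketch of this result (assuming NumPy, with made-up data on $\{1, \ldots, m\}$; not part of the original notes): the empirical distribution attains at least as high a log-likelihood as any other candidate PMF.

```python
import numpy as np

m = 4
D = np.array([1, 3, 2, 1, 1, 4, 2, 1, 3, 1])                   # made-up data in {1, ..., m}
counts = np.array([(D == j).sum() for j in range(1, m + 1)])   # n_j = #{i : x_i = j}
theta_mle = counts / len(D)                                    # empirical distribution n_j / n
print(theta_mle)                                               # [0.5 0.2 0.2 0.1]

def log_lik(theta):
    # log p(D | theta) = sum_j n_j log theta_j
    return (counts * np.log(theta)).sum()

other = np.array([0.25, 0.25, 0.25, 0.25])                     # some other PMF on {1, ..., m}
print(log_lik(theta_mle) >= log_lik(other))                    # True
```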
Notes on relative entropy:
- $D(q \Vert \theta) = \sum_{j=1}^m q_j \log \frac{q_j}{\theta_j}$
- $D(q \Vert \theta) \ge 0$
- $D(q \Vert \theta) = 0$ if and only if $q = \theta$
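These properties are easy to check numerically in the PMF case (a sketch assuming NumPy; the distributions below are made up):

```python
import numpy as np

def kl(q, theta):
    # D(q || theta) = sum_j q_j log(q_j / theta_j), with the convention 0 log 0 = 0
    q, theta = np.asarray(q, float), np.asarray(theta, float)
    mask = q > 0
    return (q[mask] * np.log(q[mask] / theta[mask])).sum()

q = np.array([0.5, 0.2, 0.2, 0.1])
print(kl(q, [0.25, 0.25, 0.25, 0.25]))   # positive whenever theta != q
print(kl(q, q))                          # exactly 0 when theta = q
```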