(ML 4.1) (ML 4.2) Maximum Likelihood Estimation (MLE) (part 1, 2)

Setup. Given data $D = (x_1, \ldots, x_n)$ where $x_i \in \mathbb{R}^d$.
Assume a family of distributions $\{p_\theta : \theta \in \Theta\}$ on $\mathbb{R}^d$. $\,$ $p_\theta(x) = p(x \vert \theta) $
Assume $D$ is a sample from $X_1, \ldots, X_n \sim p_\theta$ iid for some $\theta \in \Theta$.

Goal. Estimate the true $\theta$ that $D$ comes from.

Definition. $\theta_{\rm MLE}$ is an MLE for $\theta$ if $\theta_{\rm MLE}$ “$=$” $\arg\max_{\theta \in \Theta} p(D\vert \theta)$.
More precisely, $p(D \vert \theta_{\rm MLE}) = \max_{\theta\in \Theta} p(D\vert \theta)$,
where $p(D\vert \theta) = p(x_1, \ldots, x_n \vert \theta) = \prod_{i=1}^np(x_i\vert \theta)$ by independence; in the discrete case, $p(x_i\vert \theta) = P[X_i=x_i\vert \theta]$.

Note: $p(D\vert \theta)$, viewed as a function of $\theta$ for fixed $D$, is called the likelihood function.
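As a rough illustration of the definition, the sketch below evaluates $\log p(D\vert\theta) = \sum_i \log p(x_i\vert\theta)$ and picks the best $\theta$ from a finite list of candidates by brute force. The Gaussian density, the data values, the candidate grid, and all function names are assumptions made for this example and are not part of the notes.

```python
import math

def gaussian_log_density(x, theta, sigma=1.0):
    """Log of the N(theta, sigma^2) density at x (an illustrative choice of p_theta)."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - theta) ** 2 / (2 * sigma**2)

def log_likelihood(data, theta, log_density=gaussian_log_density):
    """log p(D | theta) = sum_i log p(x_i | theta) for iid data."""
    return sum(log_density(x, theta) for x in data)

def grid_mle(data, candidates, log_density=gaussian_log_density):
    """Return the candidate theta with the largest log-likelihood (brute force)."""
    return max(candidates, key=lambda theta: log_likelihood(data, theta, log_density))

data = [1.2, 0.7, 1.9, 1.0]                     # hypothetical observations
candidates = [i / 100 for i in range(0, 301)]   # theta in {0.00, 0.01, ..., 3.00}
theta_hat = grid_mle(data, candidates)
print(theta_hat)
```

A grid search only finds the best candidate on the grid, so it approximates an MLE rather than computing one exactly; it is used here because it mirrors the $\arg\max$ in the definition directly.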


Remarks.
  1. An MLE might not be unique.
  2. An MLE may fail to exist.
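To make the non-uniqueness remark concrete, here is a small sketch using a family that the notes do not discuss: iid data from the uniform distribution on $[\theta, \theta+1]$. This family, the data, and the candidate grid are assumptions chosen only for illustration. The likelihood equals $1$ whenever every observation lies in $[\theta, \theta+1]$ and $0$ otherwise, so a whole range of $\theta$ values attains the maximum.

```python
def uniform_likelihood(data, theta):
    """p(D | theta) for iid Uniform[theta, theta + 1] data: 1 if every point fits, else 0."""
    return 1.0 if all(theta <= x <= theta + 1 for x in data) else 0.0

data = [0.35, 0.5, 0.85]                        # hypothetical observations
candidates = [i / 10 for i in range(-10, 11)]   # theta in {-1.0, -0.9, ..., 1.0}

best = max(uniform_likelihood(data, t) for t in candidates)
maximizers = [t for t in candidates if uniform_likelihood(data, t) == best]
print(maximizers)   # several candidates tie for the maximum
```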


Pros.
  • Easy
  • Interpretable
  • Asymptotic properties
    $\,\,\,$ - Consistent (converges to the true value with probability $1$ when $n \rightarrow \infty$)
    $\,\,\,$ - Normal
    $\,\,\,$ - Efficient (lowest asymptotic variance)
  • Invariant under reparametrization: $g(\theta_{\rm MLE})$ is an MLE for $g(\theta)$
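The consistency property can be sketched with a small simulation. The example assumes a Gaussian model with known variance, for which the MLE of the mean is the sample mean; the true parameter value, the seed, and the sample sizes are arbitrary choices for this illustration.

```python
import random

def sample_mean_error(n, true_theta=2.0, sigma=1.0, seed=0):
    """Absolute error of the sample mean (the Gaussian-mean MLE) on n simulated draws."""
    rng = random.Random(seed)
    xs = [rng.gauss(true_theta, sigma) for _ in range(n)]
    return abs(sum(xs) / n - true_theta)

# Error of the estimate for increasing sample sizes.
errors = {n: sample_mean_error(n) for n in (10, 1000, 100000)}
for n, err in errors.items():
    print(n, err)
```

A single simulated run does not prove convergence, and the error need not shrink monotonically; it only shows the typical behaviour that the error becomes small for large $n$.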


Cons.
  • Point estimate, so no representation of uncertainty: the predictive distribution is approximated by the plug-in $p(x \vert D)\approx p(x \vert \theta_{\rm MLE})$
  • Overfitting (severe problem)
    $\,\,\,$ - Regression
    $\,\,\,$ - Black swan “paradox”
  • Wrong objective? - Disregards loss
  • Existence & uniqueness are not guaranteed.
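One way to read the black swan item, assumed here rather than stated in the notes, is that an MLE can assign probability zero to anything it has not seen. The sketch below uses the MLE for a PMF on $\{1, \ldots, m\}$, which is the vector of empirical frequencies $n_j / n$; the data and the helper name are hypothetical.

```python
from collections import Counter

def empirical_pmf(data, m):
    """MLE of a PMF on {1, ..., m}: the fraction of observations equal to each j."""
    counts = Counter(data)
    n = len(data)
    return [counts[j] / n for j in range(1, m + 1)]

data = [1, 1, 2, 1, 2, 2, 1]   # hypothetical sample in which outcome 3 never occurs
theta_hat = empirical_pmf(data, m=3)
print(theta_hat)               # the last entry is 0.0: outcome 3 is deemed impossible
```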

(ML 4.3) MLE for univariate Gaussian mean

Suppose we want to estimate an RV $X \sim N(\theta, \sigma^2)$, where the mean $\theta$ is unknown and the variance $\sigma^2$ is treated as known.

The only information we have about the parameter $\theta$ is that it lies in the parameter set $\Theta$, i.e., $\theta \in \Theta$ (for instance $\Theta = \mathbb{R}$).

We have the density $\,$ $p(x \vert \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x - \theta)^2 \right)$.

Assume that we have data $D = (x_1, \ldots, x_n)$ obtained as realizations of $X_i \sim N(\theta, \sigma^2)$ iid ($X_i = x_i$).

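For the MLE, we want to maximize, over $\theta$, the likelihood

$$p(D \vert \theta) = \prod_{i=1}^n p(x_i \vert \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x_i - \theta)^2 \right).$$

Equivalently, since $\log$ is strictly increasing, we can maximize its log instead:

$$\log p(D \vert \theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \theta)^2.$$

Setting the derivative to zero,

$$\frac{\partial}{\partial\theta} \log p(D \vert \theta) = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \theta) = 0,$$

gives $\theta_{\rm MLE}$ $= \frac{1}{n}\sum_{i=1}^n x_i$, which is indeed a maximum since $\frac{\partial^2}{\partial\theta^2} \log p(D \vert \theta) = -\frac{n}{\sigma^2} < 0$.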

In sum, for the family of distributions $(N(\theta, \sigma^2))_{\theta \in \Theta}$ with known $\sigma^2$,
the MLE of $\theta$ is the sample mean $\frac{1}{n}\sum_{i=1}^n X_i$.
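The result can be checked numerically. The sketch below assumes $\sigma = 1$ and a small made-up data set, computes the sample mean, and confirms that the derivative of the Gaussian log-likelihood vanishes there and that nearby values of $\theta$ give a lower log-likelihood. The function names are illustrative only.

```python
import math

def gaussian_log_likelihood(data, theta, sigma=1.0):
    """log p(D | theta) for iid N(theta, sigma^2) data with known sigma."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi * sigma**2)
            - sum((x - theta) ** 2 for x in data) / (2 * sigma**2))

def score(data, theta, sigma=1.0):
    """Derivative of the log-likelihood in theta: (1 / sigma^2) * sum_i (x_i - theta)."""
    return sum(x - theta for x in data) / sigma**2

data = [2.1, 1.4, 3.3, 2.6, 1.9]     # hypothetical observations
theta_mle = sum(data) / len(data)    # sample mean

print(theta_mle, score(data, theta_mle))
for delta in (-0.1, 0.1):
    # Moving away from the sample mean lowers the log-likelihood.
    print(delta, gaussian_log_likelihood(data, theta_mle + delta)
          < gaussian_log_likelihood(data, theta_mle))
```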

(ML 4.4) (ML 4.5) MLE for a PMF on a finite set

Suppose we want to estimate the distribution of an RV $X \sim p$, where the PMF $p: \{1, \ldots, m\} \rightarrow [0,1]$ is unknown.

Assume the data $D = (x_1, \ldots, x_n)$ are obtained as realizations of $X_1, \ldots, X_n \sim p$ iid.

Set the parameter $\theta = (\theta_1, \ldots, \theta_m) = (p_\theta(1), \ldots, p_\theta(m)) \in \Theta$
where $\Theta=\{\theta = (\theta_1, \ldots, \theta_m) : \theta_i \ge 0 \text{ for all } i, \ \sum_{i=1}^m \theta_i =1\}$.

That is, $\,$ $p_\theta(i) = \theta_i = p(i\vert \theta) = P[X= i\vert \theta]$ $\,$ and $\,$ $\sum_{i=1}^m p_\theta(i) = 1$.

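Now $\,$ $p(D \vert \theta_{\rm MLE}) \ge p(D \vert \theta)$ for all $\theta \in \Theta$, where

$$p(D \vert \theta) = \prod_{i=1}^n p(x_i \vert \theta) = \prod_{i=1}^n \theta_{x_i} = \prod_{j=1}^m \theta_j^{n_j},$$

and where we put $n_j = \#\{i: x_i =j\}$.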

In other words, we are maximizing the function $p(D \vert \cdot): \Theta \rightarrow [0,1]$ given by

$$\theta \mapsto \prod_{j=1}^m \theta_j^{n_j},$$

that is, maximizing over $\theta \in \mathbb{R}^m$ under the constraint $\sum_{i=1}^m \theta_i = 1$.

In sum,

  • maximize: $\,\,\,$ $\prod_{j=1}^m \theta_j^{n_j}$
  • subject to: $\,\,$ $\sum_{i=1}^m \theta_i = 1$

Equivalently, we can maximize its log:

  • maximize: $\,\,\,$ $\log p(D\vert \theta) = \sum_{j=1}^m n_j \log \theta_j$
  • subject to: $\,\,$ $\sum_{i=1}^m \theta_i = 1$

Let’s define a PMF $q$ by $q_j = \frac{n_j}{n}$; note $\sum_{j= 1}^m q_j = 1$.
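Now we have, with the convention $0 \log 0 = 0$,

$$\log p(D \vert \theta) = \sum_{j=1}^m n_j \log \theta_j = n\sum_{j=1}^m q_j \log \theta_j = n\sum_{j=1}^m q_j \log q_j - n\sum_{j=1}^m q_j \log \frac{q_j}{\theta_j} = n\sum_{j=1}^m q_j \log q_j - n\, D(q \Vert \theta),$$

where $D(q \Vert \theta)$ is the relative entropy of $q$ with respect to $\theta$ (see the notes below). The first term does not depend on $\theta$, and $D(q \Vert \theta) \ge 0$ with equality if and only if $\theta = q$, so the log-likelihood is maximized exactly at $\theta = q$.

Thus, $\theta_{\rm MLE} = \left(\frac{n_1}{n}, \ldots, \frac{n_m}{n}\right)$, the empirical distribution!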
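As a sanity check on this conclusion, the sketch below computes the empirical distribution for a small made-up sample on $\{1, 2, 3\}$ and compares its log-likelihood $\sum_j n_j \log \theta_j$ with that of a few other PMFs. The data, the alternative PMFs, and the helper name are assumptions for illustration only.

```python
import math
from collections import Counter

def log_likelihood(counts, theta):
    """log p(D | theta) = sum_j n_j * log(theta_j), with the convention 0 * log 0 = 0."""
    total = 0.0
    for n_j, t_j in zip(counts, theta):
        if n_j > 0:
            if t_j == 0:
                return -math.inf   # an observed outcome with probability 0
            total += n_j * math.log(t_j)
    return total

data = [1, 3, 3, 2, 3, 1, 3, 2]            # hypothetical sample, values in {1, 2, 3}
m = 3
tally = Counter(data)
counts = [tally[j] for j in range(1, m + 1)]   # n_j = #{i : x_i = j}
n = len(data)
theta_mle = [n_j / n for n_j in counts]        # empirical distribution

alternatives = [[1 / 3, 1 / 3, 1 / 3], [0.2, 0.3, 0.5], [0.5, 0.25, 0.25]]
print(theta_mle, log_likelihood(counts, theta_mle))
for theta in alternatives:
    print(theta, log_likelihood(counts, theta))   # each is lower than for theta_mle
```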

Notes on relative entropy:

  • $D(q \Vert \theta) = \sum_{j=1}^m q_j \log \frac{q_j}{\theta_j}$
  • $D(q \Vert \theta) \ge 0$
  • $D(q \Vert \theta) = 0$ if and only if $q = \theta$
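These properties can be spot-checked numerically. The sketch below implements $D(q \Vert \theta)$ with the convention $0 \log 0 = 0$, evaluates it on a few hypothetical PMFs, and also checks the identity $\sum_j n_j \log \theta_j = n \sum_j q_j \log q_j - n\, D(q \Vert \theta)$ for $n_j = n q_j$. A handful of test points is of course not a proof, and all names and values are illustrative assumptions.

```python
import math

def relative_entropy(q, theta):
    """D(q || theta) = sum_j q_j * log(q_j / theta_j), with 0 * log 0 = 0."""
    total = 0.0
    for q_j, t_j in zip(q, theta):
        if q_j > 0:
            if t_j == 0:
                return math.inf   # q puts mass where theta puts none
            total += q_j * math.log(q_j / t_j)
    return total

q = [0.25, 0.25, 0.5]                  # hypothetical empirical distribution
others = [[1 / 3, 1 / 3, 1 / 3], [0.2, 0.3, 0.5], [0.5, 0.25, 0.25]]

print(relative_entropy(q, q))          # zero when the two PMFs coincide
for theta in others:
    print(theta, relative_entropy(q, theta))   # strictly positive otherwise

# Identity linking the log-likelihood to the relative entropy, for n = 8 observations.
n = 8
counts = [n * q_j for q_j in q]
theta = [0.2, 0.3, 0.5]
lhs = sum(n_j * math.log(t_j) for n_j, t_j in zip(counts, theta))
rhs = n * sum(q_j * math.log(q_j) for q_j in q) - n * relative_entropy(q, theta)
print(lhs, rhs)
```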