(ML 5.1) (ML 5.2) Exponential families

Definition. Let $\Theta \subset \mathbb{R}^k$. An exponential family is a set $\{p_\theta : \theta \in \Theta\}$ of MPFs or MDFs on $\mathbb{R}$ s.t.

• $\theta \in \mathbb{R}^k$: parameter
• $x \in \mathbb{R}^d$: “distritution variable”
• $\eta_i : \Theta \rightarrow \mathbb{R}$
• $s_i : \mathbb{R}^d \rightarrow \mathbb{R}$ (sufficient statistics)
• $h : \mathbb{R}^d \rightarrow [0, \infty)$ (support & scaling)
• $z : \Theta \rightarrow [0, \infty)$ (partition function: sort of normaizing constant but having very special properties)

($\eta = (\eta_1, \ldots, \eta_m)$, $s = (s_1, \ldots, s_m)$)

Example (Exponential distribution). (PDF)
Exponential($\theta$), $\theta \in \Theta = [0, \infty)$

• $\eta(\theta) = \theta$
• $s(x) = -x$
• $h(x) = I(x\ge 0)$
• $z(\theta) = 1/\theta$
• $k = d= 1$

Example (Bernoulli distribution). (PMF)
Bernoulli($\theta$), $\theta \in \Theta = (0, 1)$

• $\eta_1(\theta) = \log \theta$
• $\eta_2(\theta) = \log (1-\theta)$
• $s_1(x) = I(x=1)$
• $s_2(x) = I(x=0)$
• $h(x) = I_{\{0, 1\}}(x) = I(x \in \{0,1\})$

Not necessarily unique representation! :

• $\eta(\theta) = \log \left(\frac{\theta}{1-\theta}\right)$
• s(x) = I(x=1)
• $z(\theta) \frac{1}{1-\theta}$

Natural/Canonical form

• $\eta(\theta) = \theta$ $\,$ ($m= k$)
• $\eta_1(\theta) = \theta_1, \ldots, \eta_k(\theta) = \theta_k$

Example.

1. (PDF) Exponential, Normal, Beta, Gamma, Chi-square
2. (PMF) Bernoulli, Binomial, Poisson, Geometric, Multinomial

Properties of exponential families:

• Conjugate prior
• Maximum entropy

Example (NOT an exponential )family.

• Uniform$(0, \theta)$

cf. Uniform$(0,a) = \frac{1}{a}I(x \in (0,a))$, for a fixed $a \in (0, \infty)$, gives an exponential family (trivially).

(ML 5.3) (ML 5.4) MLE for an exponential family

$p_\theta(x) = e^ h(x)/z(\theta)$

Given data $D = (x_1, \ldots, x_n)$ with $x_i \in \mathbb{R}^d$ gotten via $X_1, \ldots, X_n \sim p_\theta$ iid.

Wanna compute $\theta_{\rm MLE} = \arg\max_{\theta \in \Theta} p(D \vert \theta)$.

$\log p(D \vert \theta) = -n\log z(\theta) + \theta^T s(D) + \sum_{i=1}^n \log h(x_i)$

$0 = \frac{\partial}{\partial \theta_j} \log p(D|vert\theta) = -n\frac{\partial}{\partial \theta_j} \log z(\theta) + s_j(D)$

where $\theta^Ts(D = \sum_{j=1}^k \theta_j s_j(D)$, $s_j(D) = \sum_{i=1}^n s_j(x_i)$ and $s=(s_i,\ldots, s_k)$.

Note $z(\theta) = \int_{\mathbb{R}^d} e^{\theta^T s(x)}h(x)dx$.

Then

$\log p(D \vert \theta) = -n\log z(\theta) + \theta^T s(D) + \sum_{i=1}^n \log h(x_i) = -nE_\theta s_j(x) +s_j(D)$

implies

$n E_\theta s(X) = s(D) = \sum_{i=1}^n s(x_i)$

implies

$E_{\theta_{\rm MLE}} s(x) = \frac{1}{n}\sum_{i=1}^n s(x_i)$ if MLE exists and $\theta_{\rm MLE} \in {\rm Int} \Theta$.

Example. $p_\theta(x) = \theta e^{-\theta x} I(x \ge 0)$
$z(\theta) = \frac{1}{\theta}$, $s(x) = - x$, $\eta(\theta) = \theta$.

implies $E(X \frac{1}{\theta})$.

Now $-\frac{1}{\theta} = E_\theta s(X) = \frac{1}{n}\sum_{i=1}^n (-x_i)$ implies
$\frac{1}{\theta} = \frac{1}{n}\sum_{i=1}^n x_i$ implies

$\theta_{\rm MLE} =\frac{1}{\frac{1}{n}\sum_{i=1}^nx_i}$