(ML 7.1) Bayesian inference - A simple example

Thomas Bayes

Put distributions on everything, and then use rules of probability!

Bayes' rule

Example. $D= (x_1, x_2, x_3) = (101, 100.5, 101.5)$ ($n=3$)
$X_i \sim N(\theta, 1)$ iid given $\theta$

$\theta_{\rm MLE} = \overline{x} = \frac{1}{n}\sum_{i=1}^n x_i = 101$


$\theta \sim N(100,1)$ prior
$\theta_{\rm MAP} = 100.75$
To compute $P(\theta < 100 \vert D)$, one needs the posterior $p(\theta \vert D)$

$p(\theta \vert D) = \frac{p(D\vert \theta)p(\theta)}{p(D)} = \frac{\prod_{i=1}^np(x_i \vert \theta) p(\theta)}{p(D)}$
$\leadsto$ can be described as a normal distribution and so is analytically computable!

$p(x \vert D) = \int p(x, \theta \vert D) d \theta = \int p(x\vert\theta)p(\theta \vert D) d \theta$
$\leadsto$ analytically integrable!
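
The numbers in this example can be checked with a few lines of Python. This is a minimal sketch, assuming the standard Gaussian-Gaussian update (precisions add, and the posterior mean is the precision-weighted average of the prior mean and the data); the variable names and the `normal_cdf` helper are illustrative and not part of the lecture.

```python
import math

# Data and model from the example: x_i ~ N(theta, 1) iid, prior theta ~ N(100, 1).
data = [101.0, 100.5, 101.5]
sigma2 = 1.0                 # known likelihood variance
mu0, sigma0_2 = 100.0, 1.0   # prior mean and prior variance

n = len(data)
theta_mle = sum(data) / n    # sample mean

# Gaussian-Gaussian update (assumed standard result): precisions add,
# and the posterior mean is a precision-weighted average.
post_var = 1.0 / (1.0 / sigma0_2 + n / sigma2)
post_mean = post_var * (mu0 / sigma0_2 + sum(data) / sigma2)  # equals the MAP

def normal_cdf(x, mean, var):
    """CDF of N(mean, var) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

# Posterior probability that theta is below 100.
p_theta_below_100 = normal_cdf(100.0, post_mean, post_var)

# Posterior predictive for a new observation: N(post_mean, sigma2 + post_var).
pred_mean, pred_var = post_mean, sigma2 + post_var

print(theta_mle, post_mean, post_var)
print(round(p_theta_below_100, 4))
```

Under these assumptions the posterior is $N(100.75, 0.25)$, so $P(\theta < 100 \vert D) = \Phi(-1.5) \approx 0.067$, and the predictive distribution for a new $x$ is $N(100.75, 1.25)$.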

(ML 7.2) Aspects of Bayesian inference

  • Bayesian inference : Assume a prior distribution $p(\theta)$ and then use probability rules to work with $p(x\vert \theta)$ to answer questions.
  • Bayesian procedures : Minimize expected loss (averaging over $\theta$).

  • Objective Bayesian : use non-informative priors
  • Subjective Bayesian : use belief-based priors


Advantages:

  • Directly answer certain questions, e.g., can compute $P(99 < \theta < 101)$
  • Avoid some pathologies (associated with frequentist approach)
  • Avoid overfitting
  • Automatically do model selection (“Occam’s razor”)
  • Bayesian procedures are often admissible


Disadvantages:

  • Must assume a prior
  • Exact computation (of posterior) can be intractable ($\leadsto$ have to use approximation)


Types of priors:

  • Non-informative
  • Improper, e.g., $p(\theta) = 1$ on $\mathbb{R}$, which does not integrate to 1 and so is not a proper density
  • Conjugate

(ML 7.3) Proportionality

“Using proportionality is an extraordinarily handy trick (big time-saver) when doing Bayesian inference.”

Notation. $f \propto g$ if there exists $c \neq 0$ such that $g(x) = cf(x)$ for all $x$.

Claim. If $f$ is a PDF and $f \propto g$ then $g$ uniquely determines $f$, and $f(x) = \frac{g(x)}{\int g(x)dx}$.

Proof. Since $\int f(x)dx = 1$, we get $\frac{g(x)}{\int g(x)dx} = \frac{cf(x)}{c\int f(x)dx} = f(x)$.
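
The claim can be illustrated numerically: start from an unnormalized $g$, integrate it, and divide. The sketch below is only an illustration; the kernel $g(\theta) = \theta^2(1-\theta)^3$ (a Beta(3, 4) kernel) and the midpoint-rule `integrate` helper are assumptions chosen for this example.

```python
import math

def g(theta):
    """Unnormalized density on [0, 1]: a Beta(3, 4) kernel (illustrative choice)."""
    return theta**2 * (1.0 - theta)**3

def integrate(func, lo, hi, steps=10_000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (k + 0.5) * h) for k in range(steps))

# Normalizing constant: the integral of g over its support.
Z = integrate(g, 0.0, 1.0)

def f(theta):
    """The unique PDF proportional to g."""
    return g(theta) / Z

# Exact constant for comparison: B(3, 4) = Gamma(3) * Gamma(4) / Gamma(7) = 1/60.
exact_Z = math.gamma(3) * math.gamma(4) / math.gamma(7)

print(Z, exact_Z)
print(integrate(f, 0.0, 1.0))  # should be very close to 1
```

In practice this is why constants such as $p(D)$ can be dropped during a derivation: once the functional form in $\theta$ is recognized, the normalizer is determined.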

(ML 7.4) Conjugate priors

Definition. A family $\mathscr{F}$ of (prior) distributions $p(\theta)$ is conjugate to a likelihood $p(D\vert \theta)$ if, whenever the prior $p(\theta)$ is in $\mathscr{F}$, the posterior $p(\theta \vert D)$ is also in $\mathscr{F}$.


  • Beta is conjugate to Bernoulli.
  • Gaussian is conjugate to Gaussian (mean).
  • Any exponential family has a conjugate prior.

(ML 7.5) (ML 7.6) Beta-Bernoulli model

For example, we model a sequence of binary outcomes, like coin flips, as Bernoulli random variables with a Beta prior distribution on the probability of heads. As shown below, the posterior is then again a Beta distribution, i.e., Beta is a conjugate prior for Bernoulli.

Setup: $X_1, \ldots, X_n \sim {\rm Bern}(\theta)$ independent given $\theta$ with prior $\theta \sim {\rm Beta}(a,b)$.
Here the parameters $a, b$ of the prior are called hyperparameters.

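For generic $X \sim {\rm Bern}(\theta)$, we have

$p(x \vert \theta) = \theta^x (1-\theta)^{1-x}$ for $x \in \{0, 1\}$, and the prior density is $p(\theta) = {\rm Beta}(\theta \vert a, b) = \frac{1}{B(a,b)}\theta^{a-1}(1-\theta)^{b-1}$ for $\theta \in [0, 1]$.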


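From the setup we generate data $D = (x_1, \ldots, x_n)$. Then it’s easy to compute the posterior distribution:

$p(\theta \vert D) \propto p(D \vert \theta)p(\theta) \propto \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} \cdot \theta^{a-1}(1-\theta)^{b-1} = \theta^{a+n_1-1}(1-\theta)^{b+n_0-1}$,

where $n_1 := \sum I(x_i=1)$ and $n_0 := \sum I(x_i=0)$.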

Thus, $p(\theta\vert D) = {\rm Beta}(\theta\vert a+n_1, b+n_0)$, and so Beta is conjugate to Bernoulli.
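
Because the update only adds counts to the hyperparameters, it takes a few lines of code. The function name `beta_bernoulli_update`, the Beta(2, 2) prior, and the coin flips below are hypothetical, chosen only to illustrate the rule $(a, b) \mapsto (a+n_1, b+n_0)$.

```python
def beta_bernoulli_update(a, b, data):
    """Return the posterior Beta parameters (a + n1, b + n0) for binary data."""
    n1 = sum(1 for x in data if x == 1)  # number of ones (heads)
    n0 = sum(1 for x in data if x == 0)  # number of zeros (tails)
    return a + n1, b + n0

# Hypothetical coin flips (1 = heads) and a Beta(2, 2) prior.
flips = [1, 0, 1, 1, 0, 1, 1, 1]
a_post, b_post = beta_bernoulli_update(2, 2, flips)
print(a_post, b_post)  # 8 4

# Updating one observation at a time gives the same posterior,
# since only the counts n1 and n0 matter.
a_seq, b_seq = 2, 2
for x in flips:
    a_seq, b_seq = beta_bernoulli_update(a_seq, b_seq, [x])
print(a_seq, b_seq)  # 8 4
```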

Beta distributions

See Beta distribution - an introduction by Ox educ for the intuition behind the Beta prior.

If $\theta \sim {\rm Beta}(a, b)$, one has:

  • $\mathbb{E}(\theta) = \frac{a}{a+b}$;
  • $\sigma^2(\theta) = \frac{ab}{(a+b)^2(a+b+1)}$;
  • mode $=\frac{a-1}{a+b-2}$ (for $a, b > 1$).
    (the hyperparameters $a, b$ are called pseudocounts: in the posterior they are added to the observed counts $n_1, n_0$)

For $\theta \vert D \sim {\rm Beta}(\theta\vert a+n_1, b+n_0)$, one has:

  • $\mathbb{E}(\theta\vert D) = \frac{a+n_1}{a+b+n}$;
  • mode $=\frac{a+n_1-1}{a+b+n-2}$.

Now let’s make a few connections. We know that

  • $\theta_{\rm MLE} =$ empirical probability $= \frac{n_1}{n}$ (as a distribution over $\{0, 1\}$: $\left(\frac{n_0}{n}, \frac{n_1}{n}\right)$)
  • $\theta_{\rm MAP} = \frac{a+n_1-1}{a+b+n-2}$

posterior mean $=\frac{a+n_1}{a+b+n} = \frac{a+b}{a+b+n}\cdot \frac{a}{a+b} + \frac{n}{a+b+n}\cdot \frac{n_1}{n}$,
i.e., a convex combination of the prior mean and the MLE.

  • When $n \rightarrow \infty$, the posterior mean converges to the MLE.
  • When $n = 0$, i.e., we have no data, we get back the prior mean.
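
The relationships between the MLE, the MAP estimate, and the posterior mean can be checked numerically. This is a small sketch with made-up counts and a hypothetical helper `beta_posterior_summary`; it assumes $n > 0$ so that the MLE is defined.

```python
def beta_posterior_summary(a, b, n1, n0):
    """Point estimates for the Beta(a + n1, b + n0) posterior; assumes n1 + n0 > 0."""
    n = n1 + n0
    prior_mean = a / (a + b)
    mle = n1 / n
    post_mean = (a + n1) / (a + b + n)
    # The MAP is the posterior mode; this formula needs a + n1 > 1 and b + n0 > 1.
    theta_map = (a + n1 - 1) / (a + b + n - 2)
    # Weight that the convex combination places on the prior mean.
    weight_prior = (a + b) / (a + b + n)
    convex = weight_prior * prior_mean + (1 - weight_prior) * mle
    return {"mle": mle, "map": theta_map, "post_mean": post_mean,
            "convex": convex, "weight_prior": weight_prior}

# Same Beta(2, 2) prior, same head frequency 0.75, two sample sizes.
small = beta_posterior_summary(2, 2, n1=6, n0=2)
large = beta_posterior_summary(2, 2, n1=600, n0=200)
print(small)
print(large)
```

In both cases `post_mean` and `convex` agree, and with the larger sample the weight on the prior mean shrinks, so the posterior mean moves toward the MLE.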

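Let’s now compute the (posterior) predictive distribution:

$P(X = 1 \vert D) = \int P(X=1 \vert \theta)\, p(\theta \vert D)\, d\theta = \int \theta\, p(\theta \vert D)\, d\theta = \mathbb{E}(\theta \vert D) = \frac{a+n_1}{a+b+n}$.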

(ML 7.7.A1) Dirichlet distribution

Named after Dirichlet, one of the great mathematicians of the 1800s.

The Dirichlet distribution is a distribution on probability distributions.

$\theta = (\theta_1, \ldots, \theta_n) \sim {\rm Dir}(\alpha)$ means:

$p(\theta) = \frac{1}{B(\alpha)}\prod_{i=1}^n \theta_i^{\alpha_i - 1}\, I(\theta \in S)$,

where $\alpha = (\alpha_1, \ldots, \alpha_n)$, $\alpha_i >0$ are the parameters, $\alpha_0 = \sum_{i=1}^n \alpha_i$, $S = \{x \in \mathbb{R}^n: x_i \ge 0, \sum_{i=1}^n x_i =1 \}$ is the probability simplex, and $\frac{1}{B(\alpha)} =\frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_n)}$.

Probability simplex
Dirichlet distributions

$E(\theta_i) = \frac{\alpha_i}{\alpha_0}$

${\rm mode} = \left(\frac{\alpha_1-1}{\alpha_0-n}, \ldots, \frac{\alpha_n-1}{\alpha_0-n}\right)$ (for $\alpha_i > 1$)

$\sigma^2(\theta_i) = \frac{\alpha_i(\alpha_0-\alpha_i)}{\alpha_0^2(\alpha_0+1)}$
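
As a quick sanity check, these summaries can be evaluated for a concrete parameter vector. The helper `dirichlet_summaries` and the choice $\alpha = (2, 3, 5)$ are illustrative assumptions; $\alpha_0$ denotes $\sum_i \alpha_i$, and the mode uses the denominator $\alpha_0 - n$ in every component, which requires all $\alpha_i > 1$.

```python
def dirichlet_summaries(alpha):
    """Mean, mode, and per-component variance of Dir(alpha).

    The mode formula assumes every alpha_i > 1.
    """
    n = len(alpha)
    a0 = sum(alpha)  # alpha_0, the sum of all parameters
    mean = [a / a0 for a in alpha]
    mode = [(a - 1) / (a0 - n) for a in alpha]
    var = [a * (a0 - a) / (a0**2 * (a0 + 1)) for a in alpha]
    return mean, mode, var

mean, mode, var = dirichlet_summaries([2.0, 3.0, 5.0])
print(mean)  # components sum to 1
print(mode)  # components sum to 1
print(var)

# With n = 2 the Dirichlet reduces to a Beta distribution:
# the first component of Dir(a, b) has mean a / (a + b).
beta_mean, _, beta_var = dirichlet_summaries([3.0, 4.0])
print(beta_mean[0], beta_var[0])
```

Both the mean and the mode are themselves points of the probability simplex, and for $n = 2$ the variance matches the Beta formula $\frac{ab}{(a+b)^2(a+b+1)}$.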


(ML 7.9) (ML 7.10) Posterior distribution for univariate Gaussian

Given data $D = (x_1, \ldots, x_n)$ with $x_i \in \mathbb{R}$,
we want to know which distribution $D$ might have come from.
Model $D$ as $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ iid for some $\mu \in \mathbb{R}$, $\sigma > 0$,
i.e., $N(\mu, \sigma^2)$ is the “true” distribution we want.
Because we do not know $\mu$, we model it as a random variable $\theta$, which itself follows a normal distribution:

$\theta \sim N(\mu_0, \sigma_0^2)$, and $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$ independent given $\theta$.

We assume that $\sigma^2, \mu_0, \sigma_0^2$ are known to make the problem (more) tractable.

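Then the posterior distribution is given, up to normalization, by

$p(\theta \vert D) \propto p(D \vert \theta)p(\theta) \propto \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\theta)^2 - \frac{1}{2\sigma_0^2}(\theta-\mu_0)^2\right)$.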


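We have, after completing the square in $\theta$:

$p(\theta \vert D) = N(\theta \vert \mu_n, \sigma_n^2)$, where $\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}$ and $\mu_n = \sigma_n^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{n\overline{x}}{\sigma^2}\right)$.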



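The convergence behavior of the posterior mean (which is also the MAP estimate, since the posterior is Gaussian) when $n$ approaches $\infty$:

$\mathbb{E}(\theta \vert D) = \frac{\sigma^2}{\sigma^2 + n\sigma_0^2}\cdot \mu_0 + \frac{n\sigma_0^2}{\sigma^2 + n\sigma_0^2}\cdot \overline{x}$,
i.e., a convex combination of the prior mean $\mu_0$ and the MLE $\overline{x}$. As $n \rightarrow \infty$, the weight on $\mu_0$ vanishes and the posterior mean converges to $\overline{x}$; when $n = 0$, we get back the prior mean $\mu_0$.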
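
The shrinking influence of the prior can be seen numerically. The sketch below assumes the standard conjugate result for a Gaussian likelihood with known variance $\sigma^2$ and a $N(\mu_0, \sigma_0^2)$ prior on the mean; the function name `gaussian_posterior` is hypothetical, and the data are taken from the example in (ML 7.1).

```python
def gaussian_posterior(data, sigma2, mu0, sigma0_2):
    """Posterior mean, posterior variance, and prior weight for the Gaussian mean.

    Assumes x_i ~ N(theta, sigma2) iid with sigma2 known, and theta ~ N(mu0, sigma0_2).
    """
    n = len(data)
    xbar = sum(data) / n
    # Weight that the convex combination places on the prior mean mu0.
    w_prior = sigma2 / (sigma2 + n * sigma0_2)
    post_mean = w_prior * mu0 + (1.0 - w_prior) * xbar
    post_var = sigma2 * sigma0_2 / (sigma2 + n * sigma0_2)
    return post_mean, post_var, w_prior

data = [101.0, 100.5, 101.5]

# Three observations: the prior still carries a quarter of the weight.
mean_3, var_3, w_3 = gaussian_posterior(data, sigma2=1.0, mu0=100.0, sigma0_2=1.0)
print(mean_3, var_3, w_3)

# The same observations repeated 1000 times: the prior is almost irrelevant.
mean_big, var_big, w_big = gaussian_posterior(data * 1000, sigma2=1.0, mu0=100.0, sigma0_2=1.0)
print(mean_big, var_big, w_big)
```

With three observations this reproduces $\theta_{\rm MAP} = 100.75$ from the first example; with 3000 observations the posterior mean is within about $3 \times 10^{-4}$ of $\overline{x} = 101$ and the posterior variance is close to zero.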