Notes on Machine Learning 8: Naive Bayes
by 장승환
(ML 8.1) Naive Bayes classification
Naive Bayes is a family of models that is not necessarily a “Bayesian” method!
Setup: Given data $D = ((x^{(1)}, y_1), \ldots, (x^{(n)}, y_n))$
with $x^{(i)} = (x_1^{(i)}, \ldots, x_d^{(i)}) \in \mathbb{R}^d$ and $y_i \in \mathcal{Y} = \{1, \ldots, m\}$.
Assume a family of joint distributions $p_\theta$ with the "naive" property: if $(X, Y) \sim p_\theta$, then $X_1, \ldots, X_d$ are (conditionally) independent given $Y$.
Let $(X^{(1)}, Y_1), \ldots, (X^{(n)}, Y_n) \sim p_\theta$, iid, for some $\theta$.
Goal: For a new $x \in \mathbb{R}^d$, predict its $y$.
Algorithm:
- Estimate $\hat{\theta}$ from $D$.
- Then compute $\hat{y} = \underset{y}{\operatorname{argmax}}\, p_{\hat{\theta}}(y \vert x) = \underset{y}{\operatorname{argmax}}\, p_{\hat{\theta}}(y) \prod_{j=1}^d p_{\hat{\theta}}(x_j \vert y)$.
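As a concrete illustration (my own sketch, not from the lecture), here is a minimal version in Python assuming real-valued features with Gaussian class-conditionals and MLE for $\hat{\theta}$; the function names are my own:

```python
import numpy as np

def fit_gaussian_nb(X, y, n_classes):
    """MLE of theta: class priors pi_y and per-feature Gaussian parameters."""
    priors, means, vars_ = [], [], []
    for c in range(n_classes):
        Xc = X[y == c]
        priors.append(len(Xc) / len(X))
        means.append(Xc.mean(axis=0))
        vars_.append(Xc.var(axis=0) + 1e-9)  # small floor avoids division by zero
    return np.array(priors), np.array(means), np.array(vars_)

def predict(x, priors, means, vars_):
    """argmax_y [ log pi_y + sum_j log p(x_j | y) ]; logs avoid underflow."""
    log_post = np.log(priors) - 0.5 * np.sum(
        np.log(2 * np.pi * vars_) + (x - means) ** 2 / vars_, axis=1)
    return np.argmax(log_post)
```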
(ML 8.2) More about Naive Bayes
Assume that $(X, Y) \sim p_\theta$ with $Y \in \mathscr{Y} = \{1, \ldots, m\}.$
How to choose $p_\theta$?
- $p_\theta(y) = p_\theta(Y = y) := \pi_y$ with $\pi = (\pi_1, \ldots, \pi_m)$.
- $p_\theta(x_i\vert y) = p_\theta(X_i = x_i \vert Y = y)$
$\theta =$ (all parameters of the distributions)
If $X_i \in \{1, \ldots, N\}$ then e.g. $p_\theta(x_i\vert y) := q(x_i, y)$.
If $X_i \in \{1, 2, \ldots\}$ then e.g. Poisson or Geometric or whatever.
If $X_i \in \mathbb{R}$ then e.g. Gaussian, Gamma or whatever.
How to estimate $\theta$?
MLE, or MAP (assuming a prior on $\theta$)
Another option: “Bayesian” naive Bayes
Why make conditional independence assumption?
Can estimate $\theta$ more accurately with less data (e.g., for $d$ binary features, a full joint model needs roughly $2^d$ parameters per class, while naive Bayes needs only $d$).
“Wrong but simple model can be better than correct but complicated model!”
(ML 8.3) (ML 8.4) Bayesian Naive Bayes
Assume that $(X, Y) \sim p(x,y\vert \theta)$.
$D = ((x^{(1)}, y_1), \ldots, (x^{(n)}, y_n))$ where
$x^{(i)}= (x_1^{(i)}, \ldots, x_d^{(i)})$ and the $j$-th feature $x_j^{(i)} \in A_j$
and $y_i \in \mathscr{Y} = \{1, \ldots, m\}$
$(X^{(1)}, Y_1), \ldots, (X^{(n)}, Y_n) \sim p(x,y\vert \theta)$ independent given $\theta = (\pi, \{r_{jy}\})$
Naive Bayes assumption says that the features are independent given the class & the parameter
- $p(y\vert \theta) = \pi(y)$ where $\pi = (\pi(1), \ldots, \pi(m))$
- $p(x_j\vert y, \theta) = P(X_j = x_j\vert Y=y, \theta) = r_{jy}(x_j)$
($\sum_y \pi(y) = 1$ and $\sum_{k \in A_j} r_{jy}(k) = 1$ for all $j, y$)
Let’s put some prior: a natural choice is Dirichlet.
(Dirichlet is a conjugate prior for categorical distribution.)
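For reference, the standard conjugacy computation (not spelled out in the notes): if $\pi \sim {\rm Dir}(\alpha_1, \ldots, \alpha_m)$ a priori and we observe class counts $n_y = \#\{i : y_i = y\}$, then the posterior is $\pi \vert D \sim {\rm Dir}(\alpha_1 + n_1, \ldots, \alpha_m + n_m)$. The same count-update applies to each $r_{jy}$, with a Dirichlet prior over the values $k \in A_j$ and counts taken over the examples of class $y$.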
…
(ML 11.1) Estimators
Assume the data $D = (X_1, \ldots, X_n)$ are given as random variables.
Definition. A statistic is a random variable $S = f(D)$ that is a function of the data $D$.
Terminology. An estimator is a statistic intended to approximate a parameter governing the distribution of $D$.
Notation.
- $\hat{\theta}$ denotes an estimator of a parameter $\theta$.
- $\hat{\theta}_n$ emphasizes the dependence on $n$.
Example. $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ iid
(Sample mean) $\,\,$ $\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ $\,\,$ (cf. $\sigma^2 = \mathbb{E}((X - \mu)^2)$)
(“Biased” sample variance) $\,\,$ $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(X_i -\bar{X})^2$
(“unbiased” sample variance) $\,\,$ $s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i -\bar{X})^2$
Definition.
- The bias of an estimator $\hat{\theta}$ is $\,$ ${\rm bias}(\hat{\theta}) = \mathbb{E}(\hat{\theta}) - \theta$.
- An estimator $\hat{\theta}$ is unbiased if $\,$ ${\rm bias}(\hat{\theta}) = 0$.
Example.
- $\hat{\mu}$ is unbiased: $\mathbb{E}(\hat{\mu}) = \mathbb{E}(\frac{1}{n}\sum_{i=1}^nX_i) =\frac{1}{n}\sum \mathbb{E}(X_i) = \frac{1}{n}\sum \mu = \mu$
- $\hat{\sigma}^2$ is biased. (Exercise)
- $s^2$ is unbiased. (Exercise)
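A quick simulation (my own sketch, not in the notes) makes the bias visible: with $n = 5$ and $\sigma^2 = 4$, the biased estimator averages to about $\frac{n-1}{n}\sigma^2 = 3.2$ while the unbiased one averages to about $4$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, trials = 0.0, 4.0, 5, 100_000

X = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
xbar = X.mean(axis=1, keepdims=True)
ss = np.sum((X - xbar) ** 2, axis=1)      # sum of squared deviations, per trial
print(np.mean(ss / n))        # biased:   ~ (n-1)/n * sigma^2 = 3.2
print(np.mean(ss / (n - 1)))  # unbiased: ~ sigma^2 = 4.0
```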
(ML 11.2) Decision theory terminology in different contexts
| General | Estimators | $^*$Regression/Classification |
| --- | --- | --- |
| Decision rule $\delta$ | $^*$Estimator function $g$ | Prediction function $f$ |
| State $s$ (unknown) | Parameter $\theta$ (unknown) | Target value $Y$ (unknown) |
| $^*$Data $D$ (observed) | Data $D$ (observed) | Point $X$ (observed) |
| Action $a = \delta(D)$ | Estimator/Estimate $\hat{\theta}=g(D)$ | Prediction $\hat{Y} = f(X)$ |
| Loss $L(s, a)$ | Loss $L(\theta, \hat{\theta})$ | Loss $L(Y, \hat{Y})$ |
Example. (Estimators)
An estimator is a random variable: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i$.
An estimate is a number: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i = 2.3$.
(In some situations, the procedure $g$ is referred to as an estimator!)
(ML 11.3) Frequentist risk, Bayesian expected loss, and Bayes risk
Loss and Risk. Exciting session to clear up all the mud!
Data: $\,$ $D = (X_1, \ldots, X_n)$, $D \sim p_\theta$
Parameter: $\,$ $\theta \sim \pi$ $\,$, i.e., the parameter $\theta$ is a random variable.
Estimator: $\,$ $\hat{\theta} = f(D) = \delta(D)$
Everything begins with : $\,\,\,\,\,$ Loss $=L(\theta, f(D))$.
We wanna minimize the loss but it’s an RV!
Two options to deal with it, plus the combination of both:
- Averaging over $\theta$ given the data: $E(L(\theta, f(D)) \vert D) =: \rho(\pi, f(D))$, the Bayesian expected loss
- Averaging over the data given $\theta$: $E(L(\theta, f(D)) \vert \theta) =: R(\theta, f)$, the (frequentist) risk
- Averaging over both: $E(L(\theta, f(D))) = E(R(\theta, f)) =: r(\pi, f)$, the Bayes risk
(ML 11.4) Choosing a decision rule - Bayesian and frequentist
How to choose $f$.
Bayesian: Assume $\pi$
Case 1. Know $D$. Choose $f(D)$ to minimize $\rho(\pi, f(D))$
Case 2. Don’t know $D$. Choose $f$ to minimize $r(\pi, f)$
Frequentist: Introduce a further principle to guide your choice.
(a) Unbiasedness
(b) Admissibility
(c) Minimax
(d) Invariance
(ML 11.5) Bias-Variance decomposition (MSE $=$ bias$^2$ + var)
“A super important part of ML is what’s called model selection, and a tool for model selection is the bias-variance decomposition.”
Almost trivial identity but extremely handy.
Definition. Let $D$ be random data. The MSE of an estimator $\hat{\theta} = f(D)$ for $\theta$ is ${\rm MSE}(\hat{\theta}) = E((\hat{\theta} - \theta)^2 \vert \theta)$.
Put $\vert \theta$ to emphasize that we’re not averaging over $\theta$ here (we don’t have a distribution over $\theta$); we’re just averaging over the data.
MSE$(\hat{\theta})$ is nothing but the risk $R(\theta, f)$ under square loss, i.e., when the loss function is the square of the difference.
Recall. bias$(\hat{\theta}) = E(\hat{\theta}) -\theta$.
Proposition. ${\rm MSE}(\hat{\theta}) = {\rm bias}(\hat{\theta})^2 + {\rm var}(\hat{\theta})$
Proof: Write $\hat{\theta} - \theta = (\hat{\theta} - E(\hat{\theta})) + (E(\hat{\theta}) - \theta)$ and expand the square; the cross term vanishes since $E(\hat{\theta} - E(\hat{\theta})) = 0$, leaving ${\rm var}(\hat{\theta}) + {\rm bias}(\hat{\theta})^2$. $\square$
Silly example. $X \sim N(\theta, 1)$, $\theta$ nonrandom & unknown, $D = X$.
“Natural” estimate of $\theta$: $\delta_1(D) = X \leadsto$ bias$^2 = 0$, var$ = 1$, MSE$ =1$
“Silly” estimate of $\theta$: $\delta_0(D) = 0 \leadsto$ bias$^2 = \theta^2$, var$ = 0$, MSE$ = \theta^2$
cf. Shrinkage, Stein’s paradox
(ML 11.8) Bayesian decision theory
Terminology. (Informal)
- A generalized Bayes rule is a decision rule $\delta$ that minimizes $\rho(\pi, \delta(D))$ for each $D$.
- A Bayes rule is a decision rule $\delta$ that minimizes $r(\pi, \delta)$.
Remark
- GBR $\Rightarrow$ BR, but BR $\not\Rightarrow$ GBR.
- If $r(\pi, \delta) = \infty$ for all $\delta$, then anything is a BR, but a GBR still makes sense.
- On a set of $\pi$-measure $0$, a BR is arbitrary, but a GBR is still sensible.
Complete Class Theorems. Under mild conditions:
- Every GBR (for a proper $\pi$) is admissible;
- Every admissible decision rule is a GBR for some (possibly improper) $\pi$.
Check out:
- Complete Class Theorems for the Simplest Empirical Bayes Decision Problems
- A Complete Class Theorem for Statistical Problems with Finite Sample Spaces
(ML 12.1) Model selection - introduction and examples
“Model” selection really means “complexity” selection!
Here, complexity $\approx$ flexibility to fit/explain data
Example (Linear regression with MLE for $w$) $f(x) = w^T\varphi(x)$
Given data $x \in \mathbb{R}$, consider the polynomial basis $\varphi_k(x) = x^k$, $\varphi = (\varphi_0, \varphi_1, \ldots, \varphi_B)$
Turns out $B$ acts as the “complexity parameter”
Example (Bayesian linear regression or MAP)
Example ($k$NN)
$k$ “controls” the decision boundaries.
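A minimal sketch of the first example (toy data of my own choosing): training error keeps dropping as the complexity parameter $B$ grows, which is exactly why complexity selection is needed:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 30)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, 30)  # noisy toy targets

for B in (1, 3, 9):
    w = np.polyfit(x, y, deg=B)                       # least squares / MLE for w
    train_mse = np.mean((np.polyval(w, x) - y) ** 2)
    print(B, train_mse)  # falls as B grows; held-out error need not
```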
(ML 12.2) Bias-variance in model selection
Bias-variance trade-off, as they say.
MSE $=$ bias$^2 +$ var; in integrated form, $\int$MSE $= \int$bias$^2 + \int$var (both only apply for square loss)
(ML 15.1) Newton’s method (for optimization) - intuition
2nd order method!
(Gradient descent $x_{t+1} = x_t - \alpha_t \nabla f(x_t)$ : 1st order method)
Analogy (1D).
- zero-finding: $x_{t+1} = x_t - \frac{f(x_t)}{f'(x_t)}$
- minimizing/maximizing (apply zero-finding to $f'$): $x_{t+1} = x_t - \frac{f'(x_t)}{f''(x_t)}$
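A tiny illustration of the 1D update (toy function of my choosing): minimizing $f(x) = e^x - 2x$, whose minimum is at $x = \ln 2$:

```python
import math

x = 0.0
for _ in range(6):
    x = x - (math.exp(x) - 2) / math.exp(x)  # x_{t+1} = x_t - f'(x_t)/f''(x_t)
print(x, math.log(2))  # converges to ln 2 in a handful of steps
```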
(ML 15.2) Newton’s method (for optimization) in multiple dimensions
Idea: “Make a 2nd order approximation and minimize that.”
Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be (sufficiently) smooth.
Taylor’s theorem: for $x$ near $a$, letting $g = \nabla f(a)$ and $H = \nabla^2f(a) = \left(\frac{\partial^2}{\partial x_i \partial x_j}f(a)\right)_{ij}$,
$f(x) \approx q(x) := f(a) + g^T(x - a) + \frac{1}{2}(x - a)^T H (x - a)$.
Minimize: $0 = \nabla q = Hx + b$ with $b = g - Ha$ $\Rightarrow$ $x = -H^{-1}b = a -H^{-1}g$
Critical to check: $\nabla^2 q = H$ $\Rightarrow$ minimum if $H$ is PSD.
Algorithm.
- Initialize $x \in \mathbb{R}^n$
- Iterate: $x_{t+1} = x_t - H^{-1}g$ where $g = \nabla f(x_t), H = \nabla^2 f(x_t)$
Issues.
- $H$ may fail to be PSD. (Option: switch to gradient descent. A smart way to do it: the Levenberg–Marquardt algorithm.)
- Rather than inverting $H$, solve $Hy = g$ for $y$, then use $x_{t+1} = x_t - y$. (More robust approach.)
- $x_{t+1} = x_t - \alpha_t y$. (small “step size” $\alpha_t>0$)
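A minimal sketch of the iteration using the solve-instead-of-invert fix above (the quadratic test function is my own choice):

```python
import numpy as np

def newton(grad, hess, x0, steps=20, tol=1e-10):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        y = np.linalg.solve(hess(x), g)  # solve H y = g rather than inverting H
        x = x - y
    return x

# Example: f(x) = (1/2) x^T A x - b^T x, minimized where A x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])
xmin = newton(lambda x: A @ x - b, lambda x: A, np.zeros(2))
print(xmin, np.linalg.solve(A, b))        # both give the same minimizer
```

For a quadratic, Newton converges in a single step; the `steps`/`tol` loop matters for general smooth $f$.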
(ML 17.1) Sampling methods - why sampling, pros and cons
Why sampling?
- To approximate expectations (estimating statistics / posterior inference, i.e., computing probabilities)
- For visualization
Why expectations?
- Any probability is an expectation: $P[X \in A] = E[I(X \in A)]$.
- Approximation is needed for intractable sums/integrals (which can be expressed as expectations)
Pros.
- Easy (both to implement and understand)
- General purpose
Cons.
- Too easy, so it gets used inappropriately
- Slow
- Getting “good” samples may be difficult
- Difficult to assess
(ML 17.2) Monte Carlo methods - A little history
(ML 17.3) Monte Carlo approximation
Goal: Approximate $E[f(X)]$, when intractable.
Definition (Monte Carlo estimator): If $X_1, \ldots, X_n \sim p$ iid, then
$\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n f(X_i)$
is a (basic) Monte Carlo estimator of $E[f(X)]$ where $X \sim p$. (sample mean)
Remark
(1) $E[\hat{\mu}_n] = E[f(X)]$ (i.e. $\hat{\mu}_n$ is an unbiased estimator)
(2) ${\rm var}(\hat{\mu}_n) = \frac{1}{n}{\rm var}(f(X))$, so the MSE of $\hat{\mu}_n$ goes to $0$ as $n \to \infty$.
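A tiny sketch (example of my choosing): estimating $P[X > 1] = E[I(X > 1)]$ for $X \sim N(0, 1)$ by the sample mean of the indicator:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)
mu_hat = np.mean(X > 1.0)                       # (1/n) sum_i f(X_i), f = indicator
exact = 0.5 * (1 - math.erf(1 / math.sqrt(2)))  # 1 - Phi(1) ~ 0.1587
print(mu_hat, exact)
```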
(ML 17.5) Importance sampling - introduction
It’s not a sampling method but an estimation technique!
It can be thought of as a variant of MC estimation.
Recall. MC estimation (by sample mean): $E[f(X)] \approx \frac{1}{n}\sum_{i=1}^n f(X_i)$,
under the BIG assumption that $X \sim p$ and $X_i \sim p$.
Can we do something similar by drawing samples from an alternative distribution $q$?
Yes, and in some cases you can do much much better!
Density $p$ case. The importance sampling identity
$E_p[f(X)] = \int f(x)\, p(x)\, dx = \int f(x) \frac{p(x)}{q(x)}\, q(x)\, dx = E_q\!\left[f(X)\frac{p(X)}{q(X)}\right]$
holds for all pdfs $q$ with respect to which $p$ is absolutely continuous, i.e., $p(x) = 0$ whenever $q(x) = 0$.
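A minimal sketch of the identity above (example of my choosing): estimating the rare-event probability $P[X > 4]$ for $X \sim N(0,1)$ with a shifted proposal $q = N(4,1)$, where naive MC would almost never see a positive sample:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
Xq = rng.normal(4.0, 1.0, 100_000)               # samples from the proposal q = N(4,1)

def log_pdf(x, m):                               # log density of N(m, 1)
    return -0.5 * (x - m) ** 2 - 0.5 * math.log(2 * math.pi)

w = np.exp(log_pdf(Xq, 0.0) - log_pdf(Xq, 4.0))  # importance weights p(x)/q(x)
est = np.mean(w * (Xq > 4.0))                    # (1/n) sum_i f(X_i) p(X_i)/q(X_i)
exact = 0.5 * (1 - math.erf(4 / math.sqrt(2)))   # 1 - Phi(4) ~ 3.17e-5
print(est, exact)
```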
(ML 17.6) Importance sampling - intuition
What makes a good/bad choice of the proposal distribution $q$?
To be added..