#### (ML 8.1) Naive Bayes classification

Naive Bayes is a family of models that is not necessarily a “Bayesian” method!

Setup: Given data $D = ((x^{(1)}, y_1), \ldots, (x^{(n)}, y_n))$
with $x^{(i)} = (x_1^{(i)}, \ldots, x_d^{(i)}) \in \mathbb{R}^d$ and $y_i \in \mathcal{Y} = \{1, \ldots, m\}$.

Assume a family of joint distributions $p_\theta$ such that
$(X^{(1)}, Y_1), \ldots, (X^{(n)}, Y_n) \sim p_\theta$, iid, for some $\theta$.

(Naive Bayes assumption: if $(X, Y) \sim p_\theta$, then $X_1, \ldots, X_d$ are (conditionally) independent given $Y$.)

Goal: For new $x \in \mathbb{R}^d$, predict its $y$.

Algorithm:

• Estimate $\hat{\theta}$ from $D$.
• Then compute $\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} p_{\hat{\theta}}(y \vert x) = \operatorname{argmax}_{y \in \mathcal{Y}} p_{\hat{\theta}}(x, y) = \operatorname{argmax}_{y \in \mathcal{Y}} p_{\hat{\theta}}(y) \prod_{j=1}^d p_{\hat{\theta}}(x_j \vert y)$.
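The two steps above can be sketched in code. This is a minimal illustration, assuming Gaussian class-conditional densities (one of the choices discussed in ML 8.2); the function names are mine, not from the lecture:

```python
import numpy as np

def fit_gaussian_nb(X, y, n_classes):
    """Estimate theta-hat by MLE: class priors pi_y and, per class,
    a mean/variance for each feature (assumes nonzero per-class variances)."""
    priors = np.array([np.mean(y == c) for c in range(n_classes)])
    means = np.array([X[y == c].mean(axis=0) for c in range(n_classes)])
    variances = np.array([X[y == c].var(axis=0) for c in range(n_classes)])
    return priors, means, variances

def predict(x, priors, means, variances):
    """argmax_y p(y) * prod_j p(x_j | y), computed in log space for stability."""
    log_joint = np.log(priors) - 0.5 * np.sum(
        np.log(2 * np.pi * variances) + (x - means) ** 2 / variances, axis=1)
    return int(np.argmax(log_joint))
```

The product over features is exactly where the conditional-independence assumption enters: the joint $p_\theta(x, y)$ factorizes, so fitting reduces to $d$ separate one-dimensional problems per class.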

#### (ML 8.2) More about Naive Bayes

Assume that $(X, Y) \sim p_\theta$ with $Y \in \mathcal{Y} = \{1, \ldots, m\}$.

How to choose $p_\theta$?

• $p_\theta(y) = p_\theta(Y = y) := \pi_y$ with $\pi = (\pi_1, \ldots, \pi_m)$.
• $p_\theta(x_i \vert y) = p_\theta(X_i = x_i \vert Y = y)$

$\theta =$ (all parameters of the distributions)

If $X_i \in \{1, \ldots, N\}$ then e.g. $p_\theta(x_i\vert y) := q(x_i, y)$.

If $X_i \in \{1, 2, \ldots\}$ then e.g. Poisson or Geometric or whatever.

If $X_i \in \mathbb{R}$ then e.g. Gaussian, Gamma or whatever.

How to estimate $\theta$?
MLE, or MAP (assuming a prior on $\theta$).
Another option: “Bayesian” naive Bayes.
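For the discrete case ($X_i \in \{1, \ldots, N\}$ with $p_\theta(x_i \vert y) := q(x_i, y)$), the MLE is just normalized counts. A sketch under that assumption (values coded $0, \ldots, N{-}1$ here; the function name is illustrative):

```python
import numpy as np

def mle_categorical_nb(X, y, n_values, n_classes):
    """MLE for theta = (pi, q):
    pi_y = count(Y = y) / n,  q_i(v, y) = count(X_i = v, Y = y) / count(Y = y).
    X is an (n, d) integer array with entries in {0, ..., n_values - 1}."""
    n, d = X.shape
    pi = np.bincount(y, minlength=n_classes) / n
    q = np.zeros((d, n_values, n_classes))
    for c in range(n_classes):
        Xc = X[y == c]                       # assumes every class appears in the data
        for i in range(d):
            q[i, :, c] = np.bincount(Xc[:, i], minlength=n_values) / len(Xc)
    return pi, q
```

Note that a value never seen with some class gets probability exactly zero under the MLE, which is one motivation for the MAP/Bayesian variants below.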

Why make conditional independence assumption?

We can estimate $\theta$ more accurately with less data.

“Wrong but simple model can be better than correct but complicated model!”

#### (ML 8.3) (ML 8.4) Bayesian Naive Bayes

Assume that $(X, Y) \sim p(x, y \vert \theta)$.

$D = ((x^{(1)}, y_1), \ldots, (x^{(n)}, y_n))$ where
$x^{(i)} = (x_1^{(i)}, \ldots, x_d^{(i)})$, the $j$-th feature $x_j^{(i)} \in A_j$, and $y_i \in \mathscr{Y} = \{1, \ldots, m\}$

$(X^{(1)}, Y_1), \ldots, (X^{(n)}, Y_n) \sim p(x, y \vert \theta)$, independent given $\theta = (\pi, \{r_{jy}\})$.

The naive Bayes assumption says that the features are independent given the class and the parameter.

• $p(y\vert \theta) = \pi(y)$ where $\pi = (\pi(1), \ldots, \pi(m))$
• $p(x_j \vert y, \theta) = P(X_j = x_j \vert Y = y, \theta) = r_{jy}(x_j)$
($\sum_y \pi(y) = 1$ and $\sum_{k \in A_j} r_{jy}(k) = 1$ for all $j, y$)

Let’s put some prior: a natural choice is Dirichlet.
(Dirichlet is a conjugate prior for categorical distribution.)
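By conjugacy, the posterior over a categorical parameter under a symmetric Dirichlet$(\alpha, \ldots, \alpha)$ prior is again Dirichlet, and its mean is an "add-$\alpha$" smoothed count. A small sketch of that update (the function name is mine; $\alpha = 1$ gives Laplace smoothing):

```python
import numpy as np

def dirichlet_posterior_mean(counts, alpha=1.0):
    """Posterior mean of a categorical parameter with K outcomes under a
    Dirichlet(alpha, ..., alpha) prior: (count_k + alpha) / (n + K * alpha)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))
```

Unlike the MLE, this never assigns probability zero to an unseen value, and with no data it falls back to the uniform distribution.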

#### (ML 11.1) Estimators

Assume the data $D = (X_1, \ldots, X_n)$ are given as random variables.

Definition. A statistic is a random variable $S = f(D)$ that is a function of the data $D$.

Terminology. An estimator is a statistic intended to approximate a parameter governing the distribution of $D$.

Notation.

1. $\hat{\theta}$ denotes an estimator of a parameter $\theta$.
2. $\hat{\theta}_n$ emphasizes the dependence on $n$.

Example. $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ iid
(Sample mean) $\,\,$ $\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ $\,\,$ (cf. $\sigma^2 = \mathbb{E}((X - \mu)^2)$)
(“Biased” sample variance) $\,\,$ $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(X_i - \bar{X})^2$
(“Unbiased” sample variance) $\,\,$ $s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2$
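On a concrete sample, the three statistics above can be computed directly; the only difference between the two variance estimators is the divisor ($n$ vs. $n-1$, numpy's `ddof` parameter):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # n = 8
mu_hat = x.mean()                                # sample mean: 5.0
sigma2_hat = ((x - mu_hat) ** 2).mean()          # divides by n   (== x.var(ddof=0))
s2 = ((x - mu_hat) ** 2).sum() / (len(x) - 1)    # divides by n-1 (== x.var(ddof=1))
```

Here the squared deviations sum to $32$, so $\hat{\sigma}^2 = 32/8 = 4$ while $s^2 = 32/7 \approx 4.57$.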

Definition.

1. The bias of an estimator $\hat{\theta}$ is $\,$ ${\rm bias}(\hat{\theta}) = \mathbb{E}(\hat{\theta}) - \theta$.
2. An estimator $\hat{\theta}$ is unbiased if $\,$ ${\rm bias}(\hat{\theta}) = 0$.

Example.

1. $\hat{\mu}$ is unbiased: $\mathbb{E}(\hat{\mu}) = \mathbb{E}(\frac{1}{n}\sum_{i=1}^nX_i) =\frac{1}{n}\sum \mathbb{E}(X_i) = \frac{1}{n}\sum \mu = \mu$
2. $\hat{\sigma}^2$ is biased. (Exercise)
3. $s^2$ is unbiased. (Exercise)
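The two exercises can be checked by simulation: average each variance estimator over many datasets and compare with $\sigma^2$. Below, a sketch with $X_i \sim N(0, 1)$, so the theory says $\mathbb{E}(\hat{\sigma}^2) = \frac{n-1}{n}\sigma^2 = 0.8$ for $n = 5$ while $\mathbb{E}(s^2) = 1$:

```python
import numpy as np

# Simulate many datasets of size n from N(0, 1) and average each estimator.
rng = np.random.default_rng(1)
n, n_rep = 5, 200_000
X = rng.normal(0.0, 1.0, size=(n_rep, n))
mean_biased = X.var(axis=1, ddof=0).mean()    # approx (n-1)/n * sigma^2 = 0.8
mean_unbiased = X.var(axis=1, ddof=1).mean()  # approx sigma^2 = 1.0
```

This is only a numerical illustration, of course; the exercises ask for the exact expectations.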

#### (ML 11.2) Decision theory terminology in different contexts

| General | Estimators | $^*$Regression/Classification |
| --- | --- | --- |
| Decision rule $\delta$ | $^*$Estimator function $g$ | Prediction function $f$ |
| State $s$ (unknown) | Parameter $\theta$ (unknown) | Target value $Y$ (unknown) |
| $^*$Data $D$ (observed) | Data $D$ (observed) | Point $X$ (observed) |
| Action $a = \delta(D)$ | Estimator/Estimate $\hat{\theta} = g(D)$ | Prediction $\hat{Y} = f(X)$ |
| Loss $L(s, a)$ | Loss $L(\theta, \hat{\theta})$ | Loss $L(Y, \hat{Y})$ |

Example. (Estimators)
An estimator is a random variable: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i$.
An estimate is a number: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i = 2.3$.
(In some situations, the procedure $g$ itself is referred to as an estimator!)

#### (ML 11.3) Frequentist risk, Bayesian expected loss, and Bayes risk

Loss and Risk. Exciting session to clear up all the mud!

Data: $\,$ $D = (X_1, \ldots, X_n)$, $D \sim p_\theta$
Parameter: $\,$ $\theta \sim \pi$, i.e., the parameter $\theta$ is a random variable.
Estimator: $\,$ $\hat{\theta} = f(D) = \delta(D)$

Everything begins with : $\,\,\,\,\,$ Loss $=L(\theta, f(D))$.
We wanna minimize the loss, but it’s a random variable!
Two options to deal with it:

1. Averaging over $\theta$ given the data: $\,$ $E(L(\theta, f(D)) \vert D) =: \rho(\pi, f(D))$ $\,$ — Bayesian expected loss
2. Averaging over the data given $\theta$: $\,$ $E(L(\theta, f(D)) \vert \theta) =: R(\theta, f)$ $\,$ — (Frequentist) risk
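The frequentist risk can be illustrated by Monte Carlo: fix $\theta$, simulate many datasets, and average the loss. A sketch under square loss with $D \sim N(\theta, \sigma^2)^n$, where the sample mean has risk $R(\theta, \bar{X}) = \sigma^2 / n$ regardless of $\theta$ (the function name is mine):

```python
import numpy as np

def frequentist_risk(theta, f, n, sigma=1.0, n_rep=20_000, seed=0):
    """Monte Carlo estimate of R(theta, f) = E(L(theta, f(D)) | theta) under
    square loss, averaging over datasets D ~ N(theta, sigma^2)^n, theta fixed."""
    rng = np.random.default_rng(seed)
    D = rng.normal(theta, sigma, size=(n_rep, n))  # n_rep simulated datasets
    return np.mean((f(D) - theta) ** 2)
```

The Bayesian expected loss conditions the other way: it fixes the observed $D$ and averages over the posterior of $\theta$.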

Bayesians vs. frequentists

#### (ML 11.4) Choosing a decision rule - Bayesian and frequentist

How to choose $f$.

Bayesian: Assume $\pi$
Case 1. Know $D$. Choose $f(D)$ to minimize $\rho(\pi, f(D))$
Case 2. Don’t know $D$. Choose $f$ to minimize the Bayes risk $r(\pi, f) = E(L(\theta, f(D)))$

Frequentist: Introduce a further principle to guide your choice, e.g.:
(a) Unbiasedness
(d) Invariance

#### (ML 11.5) Bias-Variance decomposition (MSE $=$ bias$^2$ + var)

“A super important part of ML is what’s called model selection, and a tool for model selection is the bias-variance decomposition.”

Almost trivial identity but extremely handy.

Definition. Let $D$ be random data. The MSE of an estimator $\hat{\theta} = f(D)$ for $\theta$ is ${\rm MSE}(\hat{\theta}) = \mathbb{E}\big((\hat{\theta} - \theta)^2 \,\vert\, \theta\big)$.

We put $\vert\, \theta$ to emphasize that we’re not averaging over $\theta$ here (we don’t have a distribution over $\theta$); we’re just averaging over the data.

${\rm MSE}(\hat{\theta})$ is nothing but the risk $R(\theta, f)$ under square loss, i.e., when the loss function is the square of the difference.

Recall. bias$(\hat{\theta}) = E(\theta) -\theta$.

Proposition. MSE$(\theta) = bias(\hat{\theta})^2 + {\rm var}(\hat{\theta})$

Proof: Write $\mathbb{E}$ for $\mathbb{E}(\cdot \,\vert\, \theta)$ and let $m = \mathbb{E}(\hat{\theta})$. Then
${\rm MSE}(\hat{\theta}) = \mathbb{E}((\hat{\theta} - \theta)^2) = \mathbb{E}((\hat{\theta} - m + m - \theta)^2) = \mathbb{E}((\hat{\theta} - m)^2) + 2(m - \theta)\,\mathbb{E}(\hat{\theta} - m) + (m - \theta)^2$.
Since $\mathbb{E}(\hat{\theta} - m) = 0$, the cross term vanishes, leaving ${\rm var}(\hat{\theta}) + {\rm bias}(\hat{\theta})^2$. $\blacksquare$
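A quick numerical sanity check of the identity, using the biased variance estimator $\hat{\sigma}^2$ from ML 11.1 (whose bias is $-\sigma^2/n$); the setup below is my own choice of example:

```python
import numpy as np

# Monte Carlo check of MSE = bias^2 + var for sigma2_hat = (1/n) sum (X_i - Xbar)^2,
# with X_i ~ N(0, sigma^2 = 4) and n = 10.
rng = np.random.default_rng(0)
n, sigma2, n_rep = 10, 4.0, 200_000
X = rng.normal(0.0, np.sqrt(sigma2), size=(n_rep, n))
est = X.var(axis=1, ddof=0)            # one estimate per simulated dataset
mse = np.mean((est - sigma2) ** 2)     # empirical E((theta_hat - theta)^2 | theta)
bias = est.mean() - sigma2             # theory: E(est) - sigma2 = -sigma2/n = -0.4
var = est.var()                        # empirical var(theta_hat)
```

With the empirical moments, `mse` equals `bias**2 + var` up to floating-point error, since the proof above only uses linearity of expectation.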