(ML 3.1) Decision theory (Basic Framework)

Idea. “Minimize expected loss”

Example. Spam (classification): $x, y, \hat{y}$

Loss function $L(y, \hat{y}) \in \mathbb{R}$

Spam classification

Negated loss can be thought of as a reward or utility; the two framings are equivalent up to sign.

General framework:

  • State $s$ (unknown)
  • Observation (known), e.g., $x$
  • Action $a$
  • Loss $L(s, a)$

“$0$-$1$ loss”: $L(y, \hat{y}) = I(y \neq \hat{y})$

“Square loss”: $L(y, \hat{y}) = (y-\hat{y})^2$

(ML 3.2) Minimizing conditional expected loss

Decision theory for supervised learning: $D = ((x_1, y_1), \ldots, (x_n, y_n))$, $x, y, \hat{y}$

  1. Given $x$, minimize $L(y, \hat{y}) \ldots$ but we don’t know $y$!
  2. Choose $f$ (with $\hat{y} = f(x)$) to minimize $L(y, f(x)) \ldots$ but we don’t know $x$ or $y$.

Both are ill-posed!

Small loss on average (probability will come to the rescue!)
Put a probability distribution on these: $(X, Y) \sim p$
1. Minimizing

$E(L(Y, \hat{y}) \vert X = x)$

is now a well-posed problem!

“0-1” loss:
$E(L(Y, \hat{y})\vert X= x) = \sum_{y \neq \hat{y}} p(y\vert x) = 1 - p(\hat{y}\vert x)$

$\hat{y} = \arg\min_y E(L(Y, y)\vert X = x) = \arg\max_y p(y\vert x)$

$p(y\vert x)$ is the key quantity that we need to solve the problem!
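As a tiny sketch of this rule (the conditional distribution below is a made-up toy example, not from the notes), under 0-1 loss the optimal prediction is just the most probable label, and the conditional expected loss is $1 - p(\hat{y}\vert x)$:

```python
# Bayes-optimal prediction under 0-1 loss: pick the most probable label.
# The conditional distribution p(y | x) below is a made-up toy example.
p_y_given_x = {"spam": 0.8, "not_spam": 0.2}

# yhat = argmax_y p(y | x); the conditional expected loss is 1 - p(yhat | x).
y_hat = max(p_y_given_x, key=p_y_given_x.get)
expected_loss = 1 - p_y_given_x[y_hat]

print(y_hat, round(expected_loss, 3))
```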

(ML 3.3) Choosing $f$ to minimize expected loss

2. $\hat{Y} = f(X)$ (classification, discrete for now)

Want to minimize:

$E(L(Y, f(X))) = \sum_x \sum_y L(y, f(x)) p(x, y) = \sum_x g(x, f(x)) p(x) = E^X(g(X, f(X))),$

which is an expectation with respect to the marginal distribution of $X$ (by factoring $p(x, y) = p(y\vert x)p(x)$ and letting $g(x, a) = \sum_y L(y, a)p(y\vert x)$).

Suppose for some $x_0, t$ we have $g(x_0, f(x_0)) > g(x_0, t)$. Then $f$ could be improved by redefining $f(x_0) = t$, so an optimal $f$ must minimize $g(x, \cdot)$ at every $x$.

Now let’s just go ahead and choose $f^*$ to minimize $g(x, f(x))$ pointwise:

$f^*(x) = \arg\min_a g(x, a)$

Note that $g(x, f(x)) \ge g(x, f^*(x))$ for all $x$, so that $E^X(g(X, f(X))) \ge E^X(g(X, f^*(X)))$ for all $f$.

Again $p(y\vert x)$ is the key quantity.
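The pointwise argument above can be checked by brute force on a tiny discrete space (the joint distribution below is made up): enumerate every function $f : X \to Y$ and confirm that $f^*(x) = \arg\min_a g(x, a)$ attains the smallest expected loss.

```python
from itertools import product

# Toy joint distribution p(x, y) over X = {0, 1}, Y = {0, 1} (made-up numbers).
p_xy = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}
X, Y = [0, 1], [0, 1]

p_x = {x: sum(p_xy[(x, y)] for y in Y) for x in X}
p_y_given_x = {(y, x): p_xy[(x, y)] / p_x[x] for x in X for y in Y}

L = lambda y, a: int(y != a)  # 0-1 loss

# g(x, a) = sum_y L(y, a) p(y | x)
def g(x, a):
    return sum(L(y, a) * p_y_given_x[(y, x)] for y in Y)

# Pointwise minimizer f*(x) = argmin_a g(x, a)
f_star = {x: min(Y, key=lambda a: g(x, a)) for x in X}

# E(L(Y, f(X))) = E^X(g(X, f(X)))
def expected_loss(f):
    return sum(g(x, f[x]) * p_x[x] for x in X)

# Brute-force check: f* beats (or ties) every function f : X -> Y.
all_fs = [dict(zip(X, vals)) for vals in product(Y, repeat=len(X))]
assert all(expected_loss(f) >= expected_loss(f_star) for f in all_fs)
print(f_star, expected_loss(f_star))
```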

A nice observation that is called the law of iterated expectation (LIE):

$E(L(Y, f(X))) = E^X(E(L(Y, f(X)) \vert X)) = E^X(g(X, f(X)))$

For LIE in general, see: Law of Iterated Expectations.

(ML 3.4) Square loss

$L(y, \hat{y}) = (y-\hat{y})^2$, $(X, Y) \sim p$, $x$ (regression)

Want to minimize $E(L(Y, \hat{y})\vert X = x) = \int L(y, \hat{y}) p(y\vert x)\, dy = \int (y-\hat{y})^2 p(y\vert x)\, dy$

Assuming enough smoothness of the integrand, differentiating under the integral sign,

$\frac{\partial}{\partial \hat{y}} \int (y-\hat{y})^2 p(y\vert x)\, dy = \int -2(y-\hat{y}) p(y\vert x)\, dy = -2\big(E(Y \vert X = x) - \hat{y}\big) = 0$

gives the only critical point $\hat{y} = E(Y \vert X = x)$, and it turns out to give the minimum by looking at the second derivative, which is $2$.

Thus, $E(Y \vert X= x) = \arg\min_{\hat{y}} E(L(Y, \hat{y}) \vert X = x)$

With regard to the problem of (ML 3.3), under square loss the optimal predictor is the conditional mean: $f^*(x) = E(Y \vert X = x)$.
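A quick numeric sanity check (with a made-up conditional distribution for $Y$ given some fixed $x$): among a grid of candidate predictions, the one minimizing the Monte Carlo estimate of the expected squared loss should sit next to the sample mean, our estimate of $E(Y \vert X = x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Sample Y | X = x from a made-up conditional (an equal mixture of two
# normals with means 0 and 4, so E(Y | X = x) = 2).
y = np.concatenate([rng.normal(0.0, 1.0, 50_000), rng.normal(4.0, 1.0, 50_000)])

cond_mean = y.mean()  # estimate of E(Y | X = x), should be close to 2.0

# Monte Carlo estimate of E((Y - c)^2 | X = x) over a grid of candidates c.
candidates = np.linspace(0.0, 4.0, 81)
best_risk, best_c = min((np.mean((y - c) ** 2), c) for c in candidates)

# The minimizing candidate sits next to the conditional mean.
print(round(cond_mean, 3), best_c)
```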

(ML 3.5) The Big Picture (part 1)

Goal: Minimize

$EL = E(L(Y, f(X)))$

using the observed data, with $p(y\vert x)$ as the key quantity. (Decision-theoretic idea)

  • EL: Expected Loss
  • $y$: true value
  • $f(x)$: our prediction

Core concepts and methods in ML fall out naturally from trying to achieve this goal.

Data: $D = ((x_1, y_1), \ldots, (x_n, y_n))$

Discriminative: Estimate $p(y\vert x)$ directly using $D$.

  • $k$NN
  • Trees
  • SVM

Generative: Estimate $p(x, y)$ using $D$, and then recover $p(y\vert x) = \frac{p(x, y)}{p(x)}$.

Remark: While it is much easier to estimate $p(y \vert x)$ than $p(x, y)$ from a statistical perspective, $p(x, y)$ is a richer sort of model.
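The generative route can be sketched with a toy joint table (the events and numbers below are made up): estimate $p(x, y)$, marginalize to get $p(x)$, and recover the conditional as $p(y \vert x) = p(x, y) / p(x)$.

```python
# A generative model estimates the joint p(x, y); the conditional is then
# recovered as p(y | x) = p(x, y) / p(x).  Toy joint table, made-up numbers.
p_xy = {
    ("long_email", "spam"): 0.10, ("long_email", "ham"): 0.30,
    ("short_email", "spam"): 0.35, ("short_email", "ham"): 0.25,
}

def p_x(x):
    # Marginalize the joint over y.
    return sum(p for (xi, _), p in p_xy.items() if xi == x)

def p_y_given_x(y, x):
    # p(y | x) = p(x, y) / p(x)
    return p_xy[(x, y)] / p_x(x)

print(round(p_y_given_x("spam", "short_email"), 4))
```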

Parameters / Latent variables: introduce $\theta$, and then consider $p_\theta(x, y) = p(x, y \vert \theta)$, treating the data $D$ as RVs.

Big picture:

$p(y \vert x, D) = \int p(y \vert x, D, \theta)\, p(\theta \vert x, D)\, d\theta$

  • $p(y\vert x, D, \theta)$ : nice in the sense that a closed-form expression can be obtained
  • $p(\theta\vert x, D)$ : oftentimes nasty
  • $\int /\sum$ : nasty

As a whole, computing $p(y \vert x, D)$ is often intractable. Many of the techniques in ML arise as different ways of solving (parts of) this computational problem.

An interesting read I encountered while searching for related resources:
What landmarks are, and why they are important

(ML 3.6) The Big Picture (part 2)

Core ideas & methods of ML: (not necessarily disjoint)

  1. Exact inference (usually not possible)
    • Multivariate Gaussian (very nice) / Conjugate priors / Graphical models (use DP)
  2. Point estimates of $\theta$ (simplest)
    • MLE / MAP (Maximum A Posteriori)
      [$\theta_{\rm MAP} = \arg\max_\theta p(\theta \vert x, D), p(y \vert x, D) \approx p(y \vert D, \theta_{\rm MAP})$]
    • Optimization / EM (Expectation Maximization) / (Empirical Bayes)
  3. Deterministic Approximation
    • Laplace Approximation / Variational methods / Expectation Propagation
  4. Stochastic Approximation (very popular, typically very simple to implement)
    • MCMC (Gibbs sampling, MH) / Importance Sampling (Particle filter)
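As a minimal sketch of item 4 (not from the notes; the target and proposal are chosen for illustration), a random-walk Metropolis-Hastings sampler only needs the target density up to a normalizing constant, which is exactly why it helps when the posterior $p(\theta \vert D)$ is known only up to its nasty denominator:

```python
import math, random

random.seed(0)

# Metropolis-Hastings sketch: sample from an unnormalized target
# (here a standard normal, up to a constant) with a Gaussian random walk.
def unnormalized(theta):
    return math.exp(-0.5 * theta * theta)

theta, samples = 0.0, []
for _ in range(50_000):
    proposal = theta + random.gauss(0.0, 1.0)
    # Accept with probability min(1, target(proposal) / target(current));
    # the symmetric proposal density cancels in the MH ratio.
    if random.random() < min(1.0, unnormalized(proposal) / unnormalized(theta)):
        theta = proposal
    samples.append(theta)

burned = samples[5_000:]          # discard burn-in
mean = sum(burned) / len(burned)  # should be near 0 for this target
print(round(mean, 3))
```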

(ML 3.7) The Big Picture (part 3)

The problem of Density estimation (Unsupervised)

Data: $D = (X_1, \ldots, X_n), X_i \in \mathbb{R}^d$, iid.

Goal: Estimate the distribution.

Introduce parameters $\theta$ as RVs parametrizing $p_\theta$ ($p_\theta(x) = p(x \vert \theta)$)

(Figure: $\theta$ parametrizing the iid RVs)

Then $p(x \vert D) = \int p(x, \theta \vert D)d\theta = \int p(x \vert \theta, D) p(\theta \vert D)d\theta = \int p(x \vert \theta) p(\theta \vert D)d\theta$
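The predictive integral above has a closed form in conjugate cases. As a sketch (prior hyperparameters and data below are made up), for a Bernoulli model with a Beta prior, $p(x = 1 \vert D) = \frac{a + \#\text{heads}}{a + b + n}$, which we can confirm by numerically integrating $\int \theta\, p(\theta \vert D)\, d\theta$:

```python
import math

# Posterior predictive for a Bernoulli model with a conjugate Beta prior.
a, b = 2.0, 2.0                 # Beta prior hyperparameters (made up)
D = [1, 0, 1, 1, 0, 1]          # observed coin flips (made up)
heads, n = sum(D), len(D)

# Conjugacy gives the closed form p(x = 1 | D) = (a + heads) / (a + b + n).
closed_form = (a + heads) / (a + b + n)

# Numeric check of p(x = 1 | D) = ∫ θ p(θ | D) dθ, where the posterior
# p(θ | D) is Beta(a + heads, b + n - heads).
def beta_pdf(t, alpha, beta):
    B = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return t ** (alpha - 1) * (1.0 - t) ** (beta - 1) / B

step = 1e-4
grid = [(i + 0.5) * step for i in range(10_000)]    # midpoint rule on (0, 1)
numeric = step * sum(t * beta_pdf(t, a + heads, b + n - heads) for t in grid)

print(closed_form, round(numeric, 6))
```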

Density estimation