Notes on Machine Learning 3: Decision theory
by 장승환
(ML 3.1) Decision theory (Basic Framework)
Idea. “Minimize expected loss”
Example. Spam (classification): $x, y, \hat{y}$
Loss function $L(y, \hat{y}) \in \mathbb{R}$
Negating the loss turns it into a reward (or utility), so the two viewpoints are interchangeable.
General framework:
- State $s$ (unknown)
- Observation (known), e.g., $x$
- Action $a$
- Loss $L(s, a)$
“$0$-$1$ loss”: $L(y, \hat{y}) = \mathbb{1}(y \neq \hat{y})$, i.e., $1$ if the prediction is wrong and $0$ if it is correct.
“Square loss”: $L(y, \hat{y}) = (y-\hat{y})^2$
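These two losses can be written directly; a minimal Python sketch (function names are illustrative):

```python
def zero_one_loss(y, y_hat):
    """0-1 loss: 1 if the prediction is wrong, 0 if it is correct."""
    return float(y != y_hat)

def square_loss(y, y_hat):
    """Square loss: penalizes errors quadratically."""
    return (y - y_hat) ** 2
```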
(ML 3.2) Minimizing conditional expected loss
Decision theory for supervised learning: $D = ((x_1, y_1), \ldots, (x_n, y_n))$, $x, y, \hat{y}$
- Given $x$, minimize $L(y, \hat{y}) \ldots$ but we don’t know $y$!
- Choose $f$ ($\hat{y} = f(x)$) to minimize $L(y, f(x)) \ldots$ but we don’t know $x$ or $y$.
Both are ill-posed!
Small loss on average (Probability will come to the rescue!)
Put a probability distribution on these: $(X, Y) \sim p$.
1. Minimizing the conditional expected loss $E(L(Y, \hat{y}) \vert X = x)$ is now a well-posed problem!
“0-1” loss:
$E(L(Y, \hat{y})\vert X = x) = \sum_{y \neq \hat{y}} p(y\vert x) = 1 - p(\hat{y}\vert x)$
$\hat{y} = \arg\min_{y} E(L(Y, y)\vert X = x) = \arg\max_y p(y\vert x)$
$p(y\vert x)$ is the key quantity that we need to solve the problem!
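Under $0$-$1$ loss the optimal action is just the mode of $p(y\vert x)$; a minimal sketch, with a made-up conditional for the spam example:

```python
def predict_0_1(p_y_given_x):
    """Bayes-optimal prediction under 0-1 loss: the most probable label.

    p_y_given_x: dict mapping each label y to p(y | x).
    """
    return max(p_y_given_x, key=p_y_given_x.get)

# Toy conditional for the spam example (made-up numbers):
p = {"spam": 0.7, "ham": 0.3}
y_hat = predict_0_1(p)  # picks the arg-max label
```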
(ML 3.3) Choosing $f$ to minimize expected loss
2. $\hat{Y} = f(X)$ (classification, discrete for now)
Want to minimize: $E(L(Y, f(X))) = \sum_x \sum_y L(y, f(x))\, p(x, y) = \sum_x g(x, f(x))\, p(x) = E^X(g(X, f(X)))$,
which is an expectation under the marginal distribution of $X$ (by factoring $p(x, y) = p(y\vert x)p(x)$ and letting $g(x, a) = \sum_y L(y, a)\, p(y\vert x)$).
Suppose for some $x_0$ and $t$ we have $g(x_0, f(x_0)) > g(x_0, t)$.
Define $f_0(x) = f(x)$ for $x \neq x_0$ and $f_0(x_0) = t$.
Note that $g(x, f(x)) \ge g(x, f_0(x))$ for all $x$.
Then we have $E^X(g(X, f(X))) \ge E^X(g(X, f_0(X)))$: switching to a pointwise-better action never increases the expected loss.
Now let’s just go ahead and choose $f^*$ to minimize pointwise, $f^*(x) = \arg\min_a g(x, a)$,
so that $E^X(g(X, f(X))) \ge E^X(g(X, f^*(X)))$ for all $f$.
Again $p(y\vert x)$ is the key quantity.
A nice observation used above is the law of iterated expectation (LIE): $E(L(Y, f(X))) = E^X(E(L(Y, f(X)) \vert X))$; in general, $E(E(Z \vert X)) = E(Z)$.
For LIE in general, see: Law of Iterated Expectations.
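LIE is easy to check numerically on a small discrete joint (the probability table below is made up):

```python
import numpy as np

# Joint p(x, y) over x in {0, 1} (rows) and y in {0, 1, 2} (columns); made-up numbers.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.25, 0.20]])
ys = np.array([0.0, 1.0, 2.0])

p_x = p_xy.sum(axis=1)               # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]    # conditional p(y | x), rows sum to 1

inner = p_y_given_x @ ys             # E(Y | X = x) for each x
lhs = inner @ p_x                    # E(E(Y | X))
rhs = p_xy.sum(axis=0) @ ys          # E(Y) from the marginal p(y)
# lhs and rhs agree, as LIE says.
```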
(ML 3.4) Square loss
$L(y, \hat{y}) = (y-\hat{y})^2$, $(X, Y) \sim p$, $x$ (regression)
Want to minimize $E(L(Y, \hat{y})\vert X = x) = \int L(y, \hat{y})\, p(y\vert x)\, dy = \int (y-\hat{y})^2 p(y\vert x)\, dy$.
Assuming enough smoothness of the integrand, differentiating under the integral sign,
$\frac{\partial}{\partial \hat{y}} \int (y-\hat{y})^2 p(y\vert x)\, dy = \int -2(y-\hat{y})\, p(y\vert x)\, dy = 0$
gives the only critical point $\hat{y} = E(Y \vert X = x)$, and it turns out to give the minimum since the second derivative is $2 > 0$.
Thus, $E(Y \vert X = x) = \arg\min_{\hat{y}} E(L(Y, \hat{y}) \vert X = x)$.
With regard to the problem of (ML 3.3): under square loss, the minimizer is $f^*(x) = E(Y \vert X = x)$, the regression function.
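A quick numerical check that the conditional mean minimizes the expected square loss, using a made-up discrete conditional and a grid of candidate predictions:

```python
import numpy as np

# Toy conditional p(y | x) over a few y values (made-up numbers).
ys = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.1, 0.4, 0.3, 0.2])

cond_mean = ys @ p  # E(Y | X = x)

# Expected square loss for each candidate prediction on a grid.
candidates = np.linspace(-1.0, 4.0, 501)
exp_loss = np.array([((ys - c) ** 2) @ p for c in candidates])
best = candidates[np.argmin(exp_loss)]
# The grid minimizer lands on the conditional mean.
```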
(ML 3.5) The Big Picture (part 1)
Goal: Minimize the expected loss $EL = E(L(Y, f(X)))$,
using $p(y\vert x)$ as the key quantity. (Decision-theoretic idea)
- EL: Expected Loss
- $y$: true value
- $f(x)$: our prediction
Core concepts and methods in ML fall out naturally from trying to achieve this goal.
Data: $D = ((x_1, y_1), \ldots, (x_n, y_n))$
Discriminative: Estimate $p(y\vert x)$ directly using $D$.
- $k$NN
- Trees
- SVM
Generative: Estimate $p(x, y)$ using $D$, and then recover $p(y\vert x) = \frac{p(x, y)}{p(x)}$.
Remark While it is much easier to estimate $p(y \vert x)$ than $p(x, y)$ from a statistical perspective, $p(x, y)$ is a richer sort of model.
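The generative route can be sketched in a few lines: given a (toy, made-up) estimated joint table, recover $p(y\vert x)$ by normalizing each row:

```python
import numpy as np

# Toy estimated joint p(x, y): rows index x, columns index y (made-up numbers).
p_joint = np.array([[0.30, 0.10],
                    [0.20, 0.40]])

p_x = p_joint.sum(axis=1)              # marginal p(x)
p_y_given_x = p_joint / p_x[:, None]   # p(y | x) = p(x, y) / p(x), row-wise
```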
Parameters / latent variables: introduce $\theta$ and consider $p_\theta(x, y) = p(x, y \vert \theta)$, treating the data $D$ as RVs.
Decompose $p(y \vert x, D) = \int p(y \vert x, D, \theta)\, p(\theta \vert x, D)\, d\theta$ (a sum in the discrete case):
- $p(y\vert x, D, \theta)$: nice, in the sense that a closed-form expression can often be obtained
- $p(\theta\vert x, D)$: oftentimes nasty
- $\int / \sum$: nasty
As a whole, computing $p(y \vert x, D)$ is often intractable. Many of the techniques in ML arise as different ways to solve (parts of) this computational problem.
An interesting read I encountered while searching for related resources:
What landmarks are, and why they are important
(ML 3.6) The Big Picture (part 2)
Core ideas & methods of ML: (not necessarily disjoint)
- Exact inference (usually not possible)
- Multivariate Gaussian (very nice) / Conjugate priors / Graphical models (use DP)
- Point estimates of $\theta$ (simplest)
  - MLE / MAP (Maximum A Posteriori) [$\theta_{\rm MAP} = \arg\max_\theta p(\theta \vert x, D)$, $p(y \vert x, D) \approx p(y \vert x, D, \theta_{\rm MAP})$]
  - Optimization / EM (Expectation Maximization) / (Empirical Bayes)
- Deterministic Approximation
- Laplace Approximation / Variational methods / Expectation Propagation
- Stochastic Approximation (very popular, typically very simple to implement)
- MCMC (Gibbs sampling, MH) / Importance Sampling (Particle filter)
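As an illustration of the stochastic-approximation family, here is a minimal random-walk Metropolis(-Hastings) sketch; the target density, step size, and sample count are arbitrary choices for the demo:

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: sample from a density known only up to a constant.

    log_p: unnormalized log-density of the target.
    """
    rng = np.random.default_rng(seed)
    x, lp = x0, log_p(x0)
    samples = np.empty(n_samples)
    for i in range(n_samples):
        prop = x + step * rng.normal()            # symmetric Gaussian proposal
        lp_prop = log_p(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept with prob min(1, ratio)
            x, lp = prop, lp_prop
        samples[i] = x                            # on rejection, repeat current x
    return samples

# Target: standard normal, via its log-density up to a constant.
draws = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_samples=20000)
```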
(ML 3.7) The Big Picture (part 3)
The problem of Density estimation (Unsupervised)
Data: $D = (X_1, \ldots, X_n), X_i \in \mathbb{R}^d$, iid.
Goal: Estimate the distribution.
Introduce parameters $\theta$ as RVs parametrizing $p_\theta$ ($p_\theta(x) = p(x \vert \theta)$).
Then $p(x \vert D) = \int p(x, \theta \vert D)d\theta = \int p(x \vert \theta, D) p(\theta \vert D)d\theta = \int p(x \vert \theta) p(\theta \vert D)d\theta$
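The integral above is tractable in conjugate cases; a sketch with a hypothetical Beta-Bernoulli model, comparing the closed form against a Monte Carlo estimate of the same integral:

```python
import numpy as np

# Beta-Bernoulli illustration (hypothetical data): with a Beta(a, b) prior on
# theta and n coin flips with h heads, p(theta | D) = Beta(a + h, b + n - h).
a, b = 1.0, 1.0          # prior hyperparameters
h, n = 7, 10             # observed heads / flips (made-up data)

# Integrating theta out analytically: p(x = 1 | D) = (a + h) / (a + b + n).
exact = (a + h) / (a + b + n)

# Monte Carlo estimate of the same integral:
# draw theta ~ p(theta | D), then average p(x = 1 | theta) = theta.
rng = np.random.default_rng(0)
thetas = rng.beta(a + h, b + n - h, size=100_000)
mc = thetas.mean()
```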