Notes on Machine Learning 9: Linear regression

(ML 9.1) Linear regression - Nonlinearity via basis functions

“It’s truly a workhorse of statistics!”
“It’s not just about lines & planes!”

Setup. Given $D = ((x_1, y_1), \ldots, (x_n, y_n))$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.
Goal. Select “good” $f : \mathbb{R} \rightarrow \mathbb{R}$ for predicting $y$ for new $x$.

Basis functions.

The simplest class of functions $\mathbb{R}^d \rightarrow \mathbb{R}$ one can think of is linear ones. Those are precisely the functions $\mathbb{R}^d \rightarrow \mathbb{R}$ that is given by

$[w] : \mathbb{R}^d \rightarrow \mathbb{R}, \,\,[w](x) = w^Tx$

for some (fixed) vector $w \in \mathbb{R}^d.$

(ML 9.2) Linear regression - Definition & Motivation

Discrminative approach.

Instead of aiming to model the target function $f : \mathbb{R}^d \rightarrow \mathbb{R}$, we take probabilistic approach by modelling the conditional distribution $p(y\vert x)$. We start with a parametized family $p_\theta(y\vert x)$ whith $\theta \in \Theta$ and estimate $\theta$ using the data $D = ((x_1, y_1), \ldots, (x_n, y_n))$. In other words, we figure out what $theta$ the datat comes from.

But what parametrized family should we choose?

One natural choice would be a Gaussian family: We set $p_\theta(y\vert x) = N(y\vert \mu(x), \sigma^2(x))$, meaning that for fixed $x$, the random variable $Y$ corresponding to $x$ is such that $Y \sim N(\mu(x), \sigma^2(x))$. What remains is to decide the dpendency of $\mu(x)$ and $\sigma^2(x)$ upon $\theta$.

We set the parameter to be $\theta = (w, \sigma^2)$ with $w \in \mathbb{R}, \sigma^2 >0$ and take $\mu(x) = w^Tx, \sigma^2(x) = \sigma^2$, so that we have

$p_\theta(y\vert x) = N(y\vert w^Tx, \sigma^2).$

This is called the (Gaussian) linear regression.

In effect, we have modelled the target function $f$ as the random variable

$Y = w^Tx + \varepsilon$

wherer $\varepsilon \sim N(0, \sigma^2)$. Thus, the term “linear.”

(ML 9.3) Choosing f under linear regression

Why linear regression is such a natural model for regresession in some sense.

In other words, why does it make sense to take $\mu(x) = w^Tx$?

For th square loss $L(y, \hat{Y}) = (y-y)^2$, we have

$\begin{aligned} \arg\min_y \mathbb{E}_\theta (L(Y, y)\vert X = x) &= \mathbb{E}_\theta(w^Tx + \varepsilon\vert X=x) \\ &= w^Tx + \mathbb{E}_\theta(\varepsilon\vert X=x) \\ &= w^Tx \end{aligned}$

minimizes the expectred loss. (cf. Ch.3 Decision theory - 3.4 Square loss)

(ML 9.4)(ML 9.5)(ML 9.6) MLE for linear regression

(ML 9.7) Basis functions MLE

How to model nonlinearity using linear regression.