#### (ML 9.1) Linear regression - Nonlinearity via basis functions

“It’s truly a workhorse of statistics!”
“It’s not just about lines & planes!”

Setup. Given $D = ((x_1, y_1), \ldots, (x_n, y_n))$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.
Goal. Select a “good” $f : \mathbb{R}^d \rightarrow \mathbb{R}$ for predicting $y$ for a new $x$.

Basis functions.

The simplest class of functions $\mathbb{R}^d \rightarrow \mathbb{R}$ one can think of is the linear ones. These are precisely the functions $\mathbb{R}^d \rightarrow \mathbb{R}$ given by

$$f(x) = w^T x$$

for some (fixed) vector $w \in \mathbb{R}^d$.

#### (ML 9.2) Linear regression - Definition & Motivation

Discriminative approach.

Instead of aiming to model the target function $f : \mathbb{R}^d \rightarrow \mathbb{R}$ directly, we take a probabilistic approach by modelling the conditional distribution $p(y\vert x)$. We start with a parametrized family $p_\theta(y\vert x)$ with $\theta \in \Theta$ and estimate $\theta$ using the data $D = ((x_1, y_1), \ldots, (x_n, y_n))$. In other words, we figure out which $\theta$ the data comes from.

But what parametrized family should we choose?

One natural choice would be a Gaussian family: we set $p_\theta(y\vert x) = N(y\vert \mu(x), \sigma^2(x))$, meaning that for fixed $x$, the random variable $Y$ corresponding to $x$ satisfies $Y \sim N(\mu(x), \sigma^2(x))$. What remains is to decide how $\mu(x)$ and $\sigma^2(x)$ depend on $\theta$.

We set the parameter to be $\theta = (w, \sigma^2)$ with $w \in \mathbb{R}^d, \sigma^2 > 0$ and take $\mu(x) = w^Tx, \sigma^2(x) = \sigma^2$, so that we have

$$p_\theta(y\vert x) = N(y\vert w^Tx, \sigma^2).$$

This is called (Gaussian) linear regression.

In effect, we have modelled the target function $f$ as the random variable

$$Y = w^Tx + \varepsilon,$$

where $\varepsilon \sim N(0, \sigma^2)$. Hence the term “linear.”
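As a quick sanity check of this model, here is a minimal sketch that draws many samples of $y = w^Tx + \varepsilon$ for one fixed $x$ and compares the empirical mean and variance with $w^Tx$ and $\sigma^2$. The particular values of `w`, `sigma2`, and `x`, and the helper name `sample_y`, are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters theta = (w, sigma^2), chosen only for illustration.
w = np.array([2.0, -1.0, 0.5])
sigma2 = 0.25


def sample_y(x, size, rng):
    """Draw `size` samples of Y for a fixed input x under N(w^T x, sigma^2)."""
    mean = w @ x
    return mean + np.sqrt(sigma2) * rng.standard_normal(size)


# A fixed (hypothetical) input x in R^3.
x = np.array([1.0, 3.0, -2.0])
ys = sample_y(x, size=200_000, rng=rng)

print("w^T x          :", w @ x)
print("empirical mean :", ys.mean())
print("empirical var  :", ys.var())
```

With enough samples, the empirical mean should sit close to $w^Tx$ and the empirical variance close to $\sigma^2$, which is exactly what $Y \sim N(w^Tx, \sigma^2)$ says.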

#### (ML 9.3) Choosing f under linear regression

Why is linear regression, in some sense, such a natural model for regression?

In other words, why does it make sense to take $\mu(x) = w^Tx$?

For the square loss $L(y, \hat{y}) = (y-\hat{y})^2$, we have that

$$f(x) = \mathbb{E}[Y \vert X = x] = \mu(x)$$

minimizes the expected loss. (cf. Ch.3 Decision theory - 3.4 Square loss)
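The statement that the conditional mean minimizes the expected square loss can be checked numerically. The sketch below fixes one $x$, so that $Y \sim N(\mu, \sigma^2)$ for a single mean $\mu = \mu(x)$, estimates the expected loss of each candidate prediction on a grid by Monte Carlo, and picks the best one. The values of `mu` and `sigma` and the grid are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical conditional distribution of Y at one fixed x: N(mu, sigma^2).
mu, sigma = 1.5, 0.8
ys = mu + sigma * rng.standard_normal(100_000)

# Candidate predictions a, on a grid around mu with spacing 0.01.
candidates = np.linspace(mu - 2.0, mu + 2.0, 401)

# Monte Carlo estimate of the expected square loss E[(Y - a)^2] for each a.
risks = np.array([np.mean((ys - a) ** 2) for a in candidates])

best = candidates[np.argmin(risks)]
print("best prediction on grid :", best)
print("conditional mean mu     :", mu)
print("minimal estimated risk  :", risks.min())
print("sigma^2                 :", sigma ** 2)
```

The minimizer lands next to $\mu$, and the minimal risk is close to $\sigma^2$, the part of the loss that no prediction can remove.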

#### (ML 9.7) Basis functions MLE

How to model nonlinearity using linear regression.
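A minimal sketch of the idea, assuming the standard construction: replace $x$ by a vector of fixed basis functions $\varphi(x)$ and keep the model linear in $w$, so that $\mu(x) = w^T\varphi(x)$. Under Gaussian noise with constant $\sigma^2$, the maximum likelihood estimate of $w$ coincides with the least-squares solution, which is what the code computes. The polynomial basis, the target $\sin(3x)$, the noise level, and the helper names `poly_features` and `fit_and_max_error` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)


def poly_features(x, degree):
    """Map scalar inputs x to the basis vector (1, x, x^2, ..., x^degree)."""
    return np.vander(x, N=degree + 1, increasing=True)


# Synthetic data from a nonlinear target with Gaussian noise (hypothetical).
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(x.shape)

# Points at which the fitted mean is compared with the true target.
x_new = np.linspace(-1.0, 1.0, 50)


def fit_and_max_error(degree):
    """Least-squares fit in the polynomial basis; return (w_hat, max abs error)."""
    Phi = poly_features(x, degree)
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    y_pred = poly_features(x_new, degree) @ w_hat
    return w_hat, np.max(np.abs(y_pred - np.sin(3.0 * x_new)))


w_lin, err_lin = fit_and_max_error(degree=1)
w_poly, err_poly = fit_and_max_error(degree=5)

print("max error, degree 1 (plain linear):", err_lin)
print("max error, degree 5 (poly basis)  :", err_poly)
```

The model is still linear in the weights; only the inputs have been transformed. That is why the same fitting procedure applies unchanged, while the fitted function of $x$ can be nonlinear.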