Notes on Deep Learning (Book)

In this page I summarize in a succinct and straighforward fashion what I learn from the book Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, along with my own thoughts and related resources. I will update this page frequently, like every week, until it’s complete.

Acronyms

DL: Deep Learning
MSE: Mean Squared Error

Ch. 6 Deep Feedforward Networks

(6.1) Example: Learning XOR

Target function:

$y = f^*(x) = f^*([x_1, x_2]^T) = XOR(x_1, x_2)$

where $x = [x_1, x_2]^T \in \{0, 1\}^2$.

Linear Approximator:

$\hat{y} = f(x; \theta) = f(x; (w, b)) = w^Tx + b$

where $w = [w_1, w_2]^T \in \mathbb{R}^2$ and $b \in \mathbb{R}$.

MSE loss function:

$J(\theta) = \frac{1}{4}\sum_{x \in X}(f^*(x) - f(x;\theta))^2$

where $X = \{[0,0]^T, [0,1]^T, [1,0]^T, [1,1]^T\}.$

Optimization: by solving normal equations

$\cdots$

gives $w = [0, 0]^T, b = \frac{1}{2}$ as the optimizer (minimizer).
Meaning that linear model (all by itself) significantly lacks capacity in representing/approximating this particular function.

Need different approach.

Affine feature space transformation:

$\begin{aligned} x= \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto W^Tx +c &= \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 0 \\ -1 \end{bmatrix} \\ & = \begin{bmatrix} x_1 + x_2 \\ x_1 + x_2 - 1 \end{bmatrix} = \begin{bmatrix} h_1 \\ h_2 \end{bmatrix} = h \end{aligned}$

Note that the new feature sapce resides in the hidden layer ‘‘$h$.’’

Componetwise ReLU (Rectified Linear Unit) operation:

$h = \begin{bmatrix} h_1 \\ h_2 \end{bmatrix} \mapsto \begin{bmatrix} {\rm ReLU}(h_1) \\ {\rm ReLU}(h_2) \end{bmatrix} = \begin{bmatrix} \max(0, h_1) \\ \max(0, h_2) \end{bmatrix} = \begin{bmatrix} h_1^+ \\ h_2^+ \end{bmatrix} = h^\dagger$

Feed it forward further using the old linear layer:

$\begin{aligned} h^+ = \begin{bmatrix} h_1^+ \\ h_2^+ \end{bmatrix} \mapsto w^Th + b &= \begin{bmatrix} 1 & -2 \end{bmatrix} \begin{bmatrix} h_1^+ \\ h_2^+ \end{bmatrix} + 0 = h_1^+ -2h_2^+ \\ &= (x_1+x_2)^+ - 2(x_1 + x_2 -1)^+ = y \end{aligned}$

$\cdots$

(6.5) Back-Propagation and Other Differentiation Algorithms

Consider variables $x, y, z$ that are related by the functions $f, g$ :

$x \overset{f}{\mapsto} y \overset{g}{\mapsto} z$

If $x, y, x \in \mathbb{R}$, the chain rule says

$\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$

Now assume that $x \in \mathbb{R}^m, y \in \mathbb{R}^n, z \in \mathbb{R}$.

By the chain rule again, for all $j \in \{1, \ldots, m\}$ we have

$\frac{\partial z}{\partial x_j}= \sum_{i}\frac{\partial z}{\partial y_i} \frac{\partial y_i}{\partial x_j}$

In matrix form:

$\nabla_x z = \left(\frac{\partial y}{\partial x}\right)^T \nabla_y z$

where $\frac{\partial y}{\partial x} = \left( \frac{\partial y_i}{\partial x_j} \right) = $

In short, $\,\,\,$ gradient $=$ Jacobian$^T\cdot$ gradient

To be added..