# Notes on Deep Learning (Book)

by **장승환**

*On this page I summarize, in a succinct and straightforward fashion, what I learn from the book Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, along with my own thoughts and related resources.*
*I will update this page frequently, like every week, until it’s complete.*

**Acronyms**

- DL: Deep Learning
- MSE: Mean Squared Error

### Ch. 6 Deep Feedforward Networks

#### (6.1) Example: Learning XOR

**Target function:**

$$f^*(x) = \mathrm{XOR}(x_1, x_2),$$

where $x = [x_1, x_2]^T \in \{0, 1\}^2$.

**Linear Approximator:**

$$f(x; w, b) = x^T w + b,$$

where $w = [w_1, w_2]^T \in \mathbb{R}^2$ and $b \in \mathbb{R}$.

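**MSE loss function:**

$$J(w, b) = \frac{1}{4} \sum_{x \in X} \left( f^*(x) - f(x; w, b) \right)^2,$$

where $f^*$ is the target function, $f(\cdot\,; w, b)$ is the linear approximator, and $X = \{[0,0]^T, [0,1]^T, [1,0]^T, [1,1]^T\}.$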

**Optimization:** by solving normal equations

$\cdots$

gives $w = [0, 0]^T, b = \frac{1}{2}$ as the optimizer (minimizer).

This means the best linear fit outputs $\frac{1}{2}$ for every input, so a linear model (all by itself) lacks the capacity to represent this particular function.
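The minimizer can be checked numerically. What follows is a minimal sketch, assuming NumPy is available; the names `A` and `theta` and the trick of appending a column of ones for the bias are illustrative choices, not taken from the book.

```python
import numpy as np

# The four XOR inputs (one per row) and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the bias b is fitted as a third weight.
A = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equations: (A^T A) theta = A^T y, with theta = [w_1, w_2, b].
theta = np.linalg.solve(A.T @ A, A.T @ y)
w, b = theta[:2], theta[2]

# Predictions of the fitted linear model and the resulting MSE.
predictions = A @ theta
mse = np.mean((y - predictions) ** 2)

print("w =", w, "b =", b)
print("predictions =", predictions, "MSE =", mse)
```

The fitted model predicts $\frac{1}{2}$ on all four inputs, so the MSE stays at $\frac{1}{4}$ no matter how the linear model is tuned.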

**A different approach is needed.**

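**Affine feature space transformation:**

$$x \mapsto W^T x + c,$$

where $W \in \mathbb{R}^{2 \times 2}$ and $c \in \mathbb{R}^2$.

Note that the new feature space resides in the hidden layer "$h$."

**Componentwise ReLU (Rectified Linear Unit) operation:**

$$h = \max\{0, W^T x + c\}$$

**Feed it forward further using the old linear layer:**

$$f(x; W, c, w, b) = w^T h + b = w^T \max\{0, W^T x + c\} + b$$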

$\cdots$
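To make the forward pass concrete, here is a minimal NumPy sketch of the one-hidden-layer network $f(x) = w^T \max\{0, W^T x + c\} + b$. The parameter values are assumed to be the hand-picked solution from the book's worked example; they are not stated in these notes and are not learned here. The function names `relu` and `xor_net` are hypothetical.

```python
import numpy as np

def relu(a):
    """Componentwise rectified linear unit: max(0, a)."""
    return np.maximum(0.0, a)

def xor_net(X, W, c, w, b):
    """Forward pass for a batch X whose rows are the inputs x^T."""
    H = relu(X @ W + c)  # hidden layer: each row is max(0, W^T x + c)^T
    return H @ w + b     # linear output layer applied to the new features

# Hand-picked parameters (assumed from the book's worked example).
W = np.array([[1.0, 1.0], [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
outputs = xor_net(X, W, c, w, b)
print(outputs)
```

With these values the hidden layer maps the four inputs to $[0,0]^T, [1,0]^T, [1,0]^T, [2,1]^T$, and in that feature space a linear layer is enough to reproduce XOR exactly.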

#### (6.5) Back-Propagation and Other Differentiation Algorithms

Consider variables $x, y, z$ that are related by the functions $f, g$:

$$y = g(x), \qquad z = f(y) = f(g(x)).$$

If $x, y, z \in \mathbb{R}$, the chain rule says

$$\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}.$$

Now assume that $x \in \mathbb{R}^m, y \in \mathbb{R}^n, z \in \mathbb{R}$.

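By the chain rule again, for all $j \in \{1, \ldots, m\}$ we have

$$\frac{\partial z}{\partial x_j} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i} \frac{\partial y_i}{\partial x_j}.$$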

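In matrix form:

$$\nabla_x z = \left( \frac{\partial y}{\partial x} \right)^T \nabla_y z,$$

where

$$\frac{\partial y}{\partial x} = \left( \frac{\partial y_i}{\partial x_j} \right) = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_m} \end{bmatrix}$$

is the $n \times m$ Jacobian matrix of $g$.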

In short, $\,\,\,$ gradient $=$ Jacobian$^T\cdot$ gradient
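The identity gradient $=$ Jacobian$^T\cdot$ gradient can be sanity-checked against finite differences. Below is a minimal NumPy sketch; the particular choices $g(x) = \tanh(Ax)$ and $f(y) = \sum_i y_i^2$, the random matrix `A`, and all variable names are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.standard_normal((n, m))  # fixed matrix defining g (illustrative)
x = rng.standard_normal(m)

def g(x):
    """y = g(x) = tanh(A x), mapping R^m to R^n."""
    return np.tanh(A @ x)

def f(y):
    """z = f(y) = sum of squares, mapping R^n to R."""
    return np.sum(y ** 2)

y = g(x)
grad_y = 2.0 * y                        # gradient of z w.r.t. y, shape (n,)
jacobian = (1.0 - y ** 2)[:, None] * A  # entries dy_i/dx_j, shape (n, m)
grad_x = jacobian.T @ grad_y            # Jacobian^T times gradient, shape (m,)

# Central finite differences as an independent check of grad_x.
eps = 1e-6
grad_fd = np.array([
    (f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
    for e in np.eye(m)
])

print(np.allclose(grad_x, grad_fd, atol=1e-6))
```

Note that the full Jacobian is formed explicitly here only because the example is tiny; back-propagation implementations typically compute the Jacobian-transpose product without materializing the matrix.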

*To be added..*
