On this page I summarize, succinctly and straightforwardly, what I learn from the book Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, along with my own thoughts and related resources. I will update this page frequently, roughly every week, until it is complete.


  • DL: Deep Learning
  • MSE: Mean Squared Error

Ch. 6 Deep Feedforward Networks

(6.1) Example: Learning XOR

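Target function:

$$f^*(x) = \mathrm{XOR}(x_1, x_2),$$

where $x = [x_1, x_2]^T \in \{0, 1\}^2$.

Linear approximator:

$$f(x; w, b) = w^T x + b,$$

where $w = [w_1, w_2]^T \in \mathbb{R}^2$ and $b \in \mathbb{R}$.

MSE loss function:

$$J(w, b) = \frac{1}{4} \sum_{x \in X} \left( f^*(x) - f(x; w, b) \right)^2,$$

where $X = \{[0,0]^T, [0,1]^T, [1,0]^T, [1,1]^T\}.$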

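Optimization: setting the partial derivatives of the MSE loss with respect to $w_1$, $w_2$ and $b$ to zero (the normal equations) gives $w = [0, 0]^T, b = \frac{1}{2}$ as the minimizer.
The fitted linear model therefore outputs $\frac{1}{2}$ for every input, which means that a linear model on its own lacks the capacity to represent this particular function.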
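
As a sanity check on the minimizer quoted above, here is a minimal sketch that builds and solves the normal equations for the four XOR points in exact rational arithmetic. The helper `solve` and the variable names (`A`, `AtA`, `Aty`) are my own choices for illustration, not anything from the book.

```python
from fractions import Fraction

# The four XOR inputs and their targets.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]

# Design matrix with a trailing 1 for the bias: each row is [x1, x2, 1].
A = [[Fraction(x1), Fraction(x2), Fraction(1)] for x1, x2 in X]


def solve(M, v):
    """Solve M t = v by Gauss-Jordan elimination (M square and invertible)."""
    n = len(M)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        # Find a row with a non-zero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Normalize the pivot row, then clear this column in all other rows.
        aug[col] = [a / aug[col][col] for a in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


# Normal equations: (A^T A) theta = A^T y, with theta = [w1, w2, b].
AtA = [[sum(row[i] * row[j] for row in A) for j in range(3)] for i in range(3)]
Aty = [sum(row[i] * t for row, t in zip(A, y)) for i in range(3)]
w1, w2, b = solve(AtA, Aty)

# Evaluate the fitted linear model and its mean squared error.
predictions = [w1 * x1 + w2 * x2 + b for x1, x2 in X]
mse = sum((t - p) ** 2 for t, p in zip(y, predictions)) / 4

print(w1, w2, b, mse)  # prints: 0 0 1/2 1/4
```

The fitted model predicts $\frac{1}{2}$ on all four points, so the best achievable MSE for a linear model here is $\frac{1}{4}$.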

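We need a different approach.

Affine feature space transformation:

$$x \mapsto W^T x + c,$$

where $W \in \mathbb{R}^{2 \times 2}$ and $c \in \mathbb{R}^2$.

Componentwise ReLU (Rectified Linear Unit) operation:

$$h = \max\{0, W^T x + c\}.$$

Note that the new feature space resides in the hidden layer $h$.

Feed it forward further using the old linear layer:

$$f(x; W, c, w, b) = w^T \max\{0, W^T x + c\} + b.$$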
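
To see that one hidden ReLU layer is enough, here is a minimal sketch that evaluates the network on all four inputs using the exact solution the book gives in Section 6.1 ($W$ all ones, $c = [0, -1]^T$, $w = [1, -2]^T$, $b = 0$). The function names `relu` and `network` are my own, and the parameters are hard-coded rather than learned.

```python
# Parameters of the exact XOR solution from the book (Sec. 6.1).
W = [[1, 1], [1, 1]]  # hidden-layer weights, shape 2x2
c = [0, -1]           # hidden-layer bias
w = [1, -2]           # output-layer weights
b = 0                 # output-layer bias


def relu(v):
    """Componentwise max(0, v)."""
    return [max(0, a) for a in v]


def network(x):
    """Compute w^T max(0, W^T x + c) + b for a 2-dimensional input x."""
    # Affine map into the new feature space: W^T x + c.
    pre_activation = [sum(W[i][j] * x[i] for i in range(2)) + c[j] for j in range(2)]
    # Hidden layer h.
    h = relu(pre_activation)
    # Linear output layer applied to h.
    return sum(wj * hj for wj, hj in zip(w, h)) + b


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [network(x) for x in inputs]
print(outputs)  # prints: [0, 1, 1, 0]
```

In the hidden space the four inputs map to $[0,0]^T$, $[1,0]^T$, $[1,0]^T$ and $[2,1]^T$, which a linear layer can separate.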


(6.5) Back-Propagation and Other Differentiation Algorithms

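Consider variables $x, y, z$ that are related by the functions $f, g$:

$$y = g(x), \qquad z = f(y) = f(g(x)).$$

If $x, y, z \in \mathbb{R}$, the chain rule says

$$\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}.$$

Now assume that $x \in \mathbb{R}^m, y \in \mathbb{R}^n, z \in \mathbb{R}$.

By the chain rule again, for all $j \in \{1, \ldots, m\}$ we have

$$\frac{\partial z}{\partial x_j} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i} \frac{\partial y_i}{\partial x_j}.$$

In matrix form:

$$\nabla_x z = \left( \frac{\partial y}{\partial x} \right)^T \nabla_y z,$$

where $\frac{\partial y}{\partial x} = \left( \frac{\partial y_i}{\partial x_j} \right) = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_m} \end{bmatrix} \in \mathbb{R}^{n \times m}$ is the Jacobian of $g$.

In short, gradient $=$ Jacobian$^T \cdot$ gradient.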
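
The "Jacobian transpose times gradient" rule can be checked numerically. The sketch below uses a toy pair of functions that I made up for illustration ($g: \mathbb{R}^3 \to \mathbb{R}^2$ and $f: \mathbb{R}^2 \to \mathbb{R}$, so $m = 3$, $n = 2$) and compares the chain-rule gradient with a central finite-difference estimate.

```python
def g(x):
    """Toy map from R^3 to R^2 (an assumption chosen for illustration)."""
    x1, x2, x3 = x
    return [x1 * x2, x2 + x3 ** 2]


def f(y):
    """Toy scalar function on R^2 (an assumption chosen for illustration)."""
    y1, y2 = y
    return y1 ** 2 + 3 * y2


def jacobian_g(x):
    """Hand-derived n x m Jacobian of g: entry [i][j] is dy_i / dx_j."""
    x1, x2, x3 = x
    return [[x2, x1, 0.0],
            [0.0, 1.0, 2 * x3]]


def grad_f(y):
    """Hand-derived gradient of f with respect to y."""
    y1, _ = y
    return [2 * y1, 3.0]


def grad_x_chain(x):
    """Gradient of z = f(g(x)) w.r.t. x, computed as Jacobian^T times grad_y z."""
    J = jacobian_g(x)
    gy = grad_f(g(x))
    return [sum(J[i][j] * gy[i] for i in range(len(gy))) for j in range(len(x))]


def grad_x_numeric(x, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        grad.append((f(g(x_plus)) - f(g(x_minus))) / (2 * eps))
    return grad


x = [1.0, 2.0, 3.0]
analytic = grad_x_chain(x)
numeric = grad_x_numeric(x)
print(analytic)  # prints: [8.0, 7.0, 18.0]
print(numeric)   # approximately the same values
```

Back-propagation applies this product repeatedly, layer by layer, without ever forming $z$ as an explicit function of $x$.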

To be added.