Notes on Deep Learning (Book)
by 장승환
On this page I summarize, in a succinct and straightforward fashion, what I learn from the book Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, along with my own thoughts and related resources. I will update this page frequently, roughly every week, until it is complete.
Acronyms
- DL: Deep Learning
- MSE: Mean Squared Error
Ch. 6 Deep Feedforward Networks
(6.1) Example: Learning XOR
Target function:
$$f^*(x) = \mathrm{XOR}(x_1, x_2),$$
where $x = [x_1, x_2]^T \in \{0, 1\}^2$.
Linear approximator:
$$f(x; w, b) = x^T w + b,$$
where $w = [w_1, w_2]^T \in \mathbb{R}^2$ and $b \in \mathbb{R}$.
MSE loss function:
$$J(w, b) = \frac{1}{4} \sum_{x \in X} \left( f^*(x) - f(x; w, b) \right)^2,$$
where $X = \{[0,0]^T, [0,1]^T, [1,0]^T, [1,1]^T\}.$
Optimization: by solving normal equations
$\cdots$
gives $w = [0, 0]^T, b = \frac{1}{2}$ as the optimizer (minimizer).
This means that the linear model, all by itself, significantly lacks the capacity to represent/approximate this particular function.
We need a different approach.
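As a quick numerical check of the linear-model result above (a minimal NumPy sketch of my own, not from the book; the variable names are mine), solving the least-squares problem directly recovers $w = [0, 0]^T$, $b = \frac{1}{2}$ and shows the model predicting $\frac{1}{2}$ on every input:

```python
import numpy as np

# The four XOR inputs and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Augment with a column of ones so the bias b is learned jointly with w.
A = np.hstack([X, np.ones((4, 1))])

# Solve the least-squares problem min ||A theta - y||^2 (i.e., the normal equations).
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = theta[:2], theta[2]

print(w, b)       # w ~ [0, 0], b ~ 0.5
print(A @ theta)  # ~ [0.5 0.5 0.5 0.5]: the best linear fit predicts 1/2 everywhere
```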
Affine feature space transformation:
$$x \mapsto W^T x + c, \qquad W \in \mathbb{R}^{2 \times 2},\ c \in \mathbb{R}^2.$$
Note that the new feature space resides in the hidden layer ‘‘$h$.’’
Componentwise ReLU (Rectified Linear Unit) operation:
$$h = \max\{0,\, W^T x + c\},$$
where the $\max$ is applied to each component.
Feed it forward further through the original linear layer (now applied to $h$):
$\cdots$
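To check that the hidden ReLU layer solves the problem, here is a small sketch of my own that evaluates the complete network $f(x; W, c, w, b) = w^T \max\{0,\, W^T x + c\} + b$ on the four inputs, using the weight values given in the book's worked solution in Section 6.1:

```python
import numpy as np

# Weights of the solution given in Section 6.1 of the book.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

def f(x):
    """Two-layer network: f(x) = w^T max(0, W^T x + c) + b."""
    h = np.maximum(0.0, W.T @ x + c)  # affine map into feature space, then componentwise ReLU
    return w @ h + b                  # final linear layer

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(np.array([f(x) for x in X]))   # -> [0. 1. 1. 0.], i.e. XOR(x1, x2)
```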
(6.5) Back-Propagation and Other Differentiation Algorithms
Consider variables $x, y, z$ that are related by the functions $f, g$:
$$y = g(x), \qquad z = f(y) = f(g(x)).$$
If $x, y, z \in \mathbb{R}$, the chain rule says
$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}.$$
Now assume that $x \in \mathbb{R}^m, y \in \mathbb{R}^n, z \in \mathbb{R}$.
By the chain rule again, for all $j \in \{1, \ldots, m\}$ we have
$$\frac{\partial z}{\partial x_j} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x_j}.$$
In matrix form:
$$\nabla_x z = \left( \frac{\partial y}{\partial x} \right)^{T} \nabla_y z,$$
where
$$\frac{\partial y}{\partial x} = \left( \frac{\partial y_i}{\partial x_j} \right) = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_m} \\ \vdots & & \vdots \\ \frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_m} \end{pmatrix} \in \mathbb{R}^{n \times m}$$
is the Jacobian matrix of $g$.
In short, $\,\,\,$ gradient $=$ Jacobian$^T\cdot$ gradient
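A quick numerical illustration of this identity (my own example, not from the book): pick a concrete $g: \mathbb{R}^3 \to \mathbb{R}^2$ and $f: \mathbb{R}^2 \to \mathbb{R}$, form $\nabla_x z$ as Jacobian$^T \cdot$ gradient, and compare against finite differences of the composition $z = f(g(x))$:

```python
import numpy as np

def g(x):           # g : R^3 -> R^2
    return np.array([x[0] * x[1], x[2] ** 2])

def f(y):           # f : R^2 -> R
    return y[0] ** 2 + 3.0 * y[1]

def jacobian_g(x):  # dy/dx, shape (2, 3)
    return np.array([[x[1], x[0], 0.0],
                     [0.0,  0.0,  2.0 * x[2]]])

def grad_f(y):      # nabla_y z, shape (2,)
    return np.array([2.0 * y[0], 3.0])

x = np.array([1.0, 2.0, 3.0])

# Back-propagated gradient: nabla_x z = (dy/dx)^T nabla_y z
grad_bp = jacobian_g(x).T @ grad_f(g(x))

# Finite-difference check of nabla_x z on the composition z = f(g(x))
eps = 1e-6
grad_fd = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
                    for e in np.eye(3)])

print(grad_bp)  # analytic, via Jacobian^T . gradient  -> [ 8.  4. 18.]
print(grad_fd)  # numerical, should agree to ~1e-6
```

Reverse-mode automatic differentiation (back-propagation) applies exactly this Jacobian-transpose product at each node of the computational graph.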
To be added..