Backpropagation. A Peek into the Mathematics of Optimization
1 Motivation
In order to get a truly deep understanding of deep neural networks, one must
look at the mathematics behind them. As backpropagation is at the core of the
optimization process, we wanted to introduce you to it. This is definitely not a
necessary part of the course, as TensorFlow, scikit-learn, or any other machine
learning package (as opposed to plain NumPy) will have backpropagation methods
incorporated.
2 The specific net and notation we will examine
Here's our simple network:
Figure 1: Backpropagation
We have two inputs: $x_1$ and $x_2$. There is a single hidden layer with 3 units
(nodes): $h_1$, $h_2$, and $h_3$. Finally, there are two outputs: $y_1$ and $y_2$.
The arrows that connect them are the weights. There are two weight matrices: $w$
and $u$. The $w$ weights connect the input layer and the hidden layer, while the
$u$ weights connect the hidden layer and the output layer. We have employed the
letters $w$ and $u$ so the computation that follows is easier to track.
You can also see that we compare the outputs $y_1$ and $y_2$ with the targets
$t_1$ and $t_2$.
There is one last letter we need to introduce before we can get to the computations.
Let $a$ be the linear combination prior to activation. Thus, we have:

$$a^{(1)} = xw + b^{(1)} \quad \text{and} \quad a^{(2)} = hu + b^{(2)}.$$
Since we cannot exhaust all activation functions and all loss functions, we will
focus on two of the most common: a sigmoid activation and an L2-norm loss.
With this new information and the new notation, the output $y$ is equal to the
activated linear combination. Therefore, for the output layer we have
$y = \sigma(a^{(2)})$, while for the hidden layer $h = \sigma(a^{(1)})$.
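To make the notation concrete, here is a minimal NumPy sketch of the forward pass for this 2-3-2 network; the variable names, the input values, and the random initialization are illustrative assumptions, not part of the original figure.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Shapes: x is 1x2, w is 2x3, u is 3x2, so h is 1x3 and y is 1x2.
x = np.array([[0.5, -1.0]])          # two inputs: x1, x2 (illustrative values)
w = rng.normal(size=(2, 3))          # input -> hidden weights
b1 = np.zeros((1, 3))                # hidden bias
u = rng.normal(size=(3, 2))          # hidden -> output weights
b2 = np.zeros((1, 2))                # output bias

a1 = x @ w + b1                      # a(1) = xw + b(1)
h = sigmoid(a1)                      # h = sigma(a(1))
a2 = h @ u + b2                      # a(2) = hu + b(2)
y = sigmoid(a2)                      # y = sigma(a(2))
```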
We will examine backpropagation for the output layer and the hidden layer
separately, as the methodologies differ.
3 Useful formulas
I would like to remind you that:
L2-norm loss:

$$L = \frac{1}{2} \sum_i (y_i - t_i)^2$$

The sigmoid function is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

and its derivative is:

$$\sigma'(x) = \sigma(x)\left(1 - \sigma(x)\right)$$
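As a quick sanity check of these formulas, here is a small sketch (the toy numbers and the test point are arbitrary); it evaluates the loss on an output/target pair and compares the analytic sigmoid derivative with a finite-difference estimate:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def l2_loss(y, t):
    # L = 1/2 * sum_i (y_i - t_i)^2
    return 0.5 * np.sum((y - t) ** 2)

# Toy output/target pair: L = 0.5 * ((0.6 - 1)^2 + (0.3 - 0)^2) = 0.125
y = np.array([0.6, 0.3])
t = np.array([1.0, 0.0])
print(l2_loss(y, t))

# Verify sigma'(x) = sigma(x) * (1 - sigma(x)) with a central finite difference
x, eps = 0.3, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid(x) * (1 - sigmoid(x))
print(numeric, analytic)   # the two values agree to high precision
```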
4 Backpropagation for the output layer
In order to obtain the update rule:

$$u \leftarrow u - \eta \nabla_u L(u)$$

we must calculate $\nabla_u L(u)$.
Let's take a single weight $u_{ij}$. The partial derivative of the loss w.r.t. $u_{ij}$ equals:
$$\frac{\partial L}{\partial u_{ij}} = \frac{\partial L}{\partial y_j} \frac{\partial y_j}{\partial a^{(2)}_j} \frac{\partial a^{(2)}_j}{\partial u_{ij}}$$
where i corresponds to the previous layer (input layer for this transformation)
and j corresponds to the next layer (output layer of the transformation). The
partial derivatives were computed simply following the chain rule.
$$\frac{\partial L}{\partial y_j} = (y_j - t_j)$$

following the L2-norm loss derivative.
$$\frac{\partial y_j}{\partial a^{(2)}_j} = \sigma\left(a^{(2)}_j\right)\left(1 - \sigma\left(a^{(2)}_j\right)\right) = y_j(1 - y_j)$$
following the sigmoid derivative.
Finally, the third partial derivative is simply the derivative of
$a^{(2)} = hu + b^{(2)}$ with respect to $u_{ij}$. So,

$$\frac{\partial a^{(2)}_j}{\partial u_{ij}} = h_i$$
Replacing the partial derivatives in the expression above, we get:
$$\frac{\partial L}{\partial u_{ij}} = \frac{\partial L}{\partial y_j} \frac{\partial y_j}{\partial a^{(2)}_j} \frac{\partial a^{(2)}_j}{\partial u_{ij}} = (y_j - t_j)\, y_j (1 - y_j)\, h_i = \delta_j h_i$$
Therefore, the update rule for a single weight for the output layer is given by:
$$u_{ij} \leftarrow u_{ij} - \eta\, \delta_j h_i$$
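In vectorized form, the whole output-layer update can be written as an outer product of the hidden activations and the deltas. The sketch below uses illustrative values for $h$, $y$, $t$, and the learning rate; only the two formulas themselves come from the text.

```python
import numpy as np

eta = 0.1                             # learning rate (arbitrary choice)
h = np.array([[0.2, 0.7, 0.5]])       # hidden activations, shape 1x3 (illustrative)
y = np.array([[0.6, 0.3]])            # network outputs,    shape 1x2 (illustrative)
t = np.array([[1.0, 0.0]])            # targets,            shape 1x2 (illustrative)
u = np.zeros((3, 2))                  # hidden -> output weights

# delta_j = (y_j - t_j) * y_j * (1 - y_j), one value per output unit
delta = (y - t) * y * (1 - y)         # shape 1x2

# dL/du_ij = delta_j * h_i  ->  outer product of h and delta
grad_u = h.T @ delta                  # shape 3x2, matches u

u = u - eta * grad_u                  # u_ij <- u_ij - eta * delta_j * h_i
```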
5 Backpropagation of a hidden layer
Similarly to the backpropagation of the output layer, the update rule for a single
weight $w_{ij}$ would depend on:

$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial h_j} \frac{\partial h_j}{\partial a^{(1)}_j} \frac{\partial a^{(1)}_j}{\partial w_{ij}}$$
following the chain rule.
Taking advantage of the results we have so far for the transformation using the
sigmoid activation and the linear model, we get:

$$\frac{\partial h_j}{\partial a^{(1)}_j} = \sigma\left(a^{(1)}_j\right)\left(1 - \sigma\left(a^{(1)}_j\right)\right) = h_j(1 - h_j)$$

and

$$\frac{\partial a^{(1)}_j}{\partial w_{ij}} = x_i$$
The actual problem for backpropagation comes from the term $\frac{\partial L}{\partial h_j}$.
That's due to the fact that there is no hidden target. You can follow the solution
for the weight $w_{11}$ below. It is advisable to also check Figure 1 while going
through the computations.
$$\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial y_1} \frac{\partial y_1}{\partial a^{(2)}_1} \frac{\partial a^{(2)}_1}{\partial h_1} + \frac{\partial L}{\partial y_2} \frac{\partial y_2}{\partial a^{(2)}_2} \frac{\partial a^{(2)}_2}{\partial h_1} = (y_1 - t_1)\, y_1 (1 - y_1)\, u_{11} + (y_2 - t_2)\, y_2 (1 - y_2)\, u_{12}$$
From here, we can calculate $\frac{\partial L}{\partial w_{11}}$, which was what we
wanted. The final expression is:

$$\frac{\partial L}{\partial w_{11}} = \left[(y_1 - t_1)\, y_1 (1 - y_1)\, u_{11} + (y_2 - t_2)\, y_2 (1 - y_2)\, u_{12}\right] h_1 (1 - h_1)\, x_1$$
The generalized form of this equation is:
$$\frac{\partial L}{\partial w_{ij}} = \sum_k (y_k - t_k)\, y_k (1 - y_k)\, u_{jk}\, h_j (1 - h_j)\, x_i$$
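With NumPy broadcasting, the sum over $k$ becomes a single matrix product. The sketch below is an illustration under assumed shapes and randomly chosen numbers; it also checks the resulting $\frac{\partial L}{\partial w_{11}}$ against a finite-difference estimate, which is a useful habit when implementing backpropagation by hand.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w, b1, u, b2):
    # h = sigma(xw + b1), y = sigma(hu + b2)
    h = sigmoid(x @ w + b1)
    y = sigmoid(h @ u + b2)
    return h, y

def l2_loss(y, t):
    return 0.5 * np.sum((y - t) ** 2)

rng = np.random.default_rng(1)
x = np.array([[0.5, -1.0]])                  # inputs,  shape 1x2 (illustrative)
t = np.array([[1.0, 0.0]])                   # targets, shape 1x2 (illustrative)
w, b1 = rng.normal(size=(2, 3)), np.zeros((1, 3))
u, b2 = rng.normal(size=(3, 2)), np.zeros((1, 2))

h, y = forward(x, w, b1, u, b2)

# dL/dw_ij = sum_k (y_k - t_k) y_k (1 - y_k) u_jk * h_j (1 - h_j) * x_i
delta_out = (y - t) * y * (1 - y)            # shape 1x2
dL_dh = delta_out @ u.T                      # the sum over k, shape 1x3
grad_w = x.T @ (dL_dh * h * (1 - h))         # shape 2x3, matches w

# Finite-difference check of dL/dw_11 (Python index [0, 0])
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[0, 0] += eps
w_minus[0, 0] -= eps
numeric = (l2_loss(forward(x, w_plus, b1, u, b2)[1], t)
           - l2_loss(forward(x, w_minus, b1, u, b2)[1], t)) / (2 * eps)
print(grad_w[0, 0], numeric)                 # the two values should agree closely
```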
6 Backpropagation generalization
Using the results for backpropagation for the output layer and the hidden layer,
we can put them together in a single formula that summarizes backpropagation in
the presence of an L2-norm loss and sigmoid activations.
$$\frac{\partial L}{\partial w_{ij}} = \delta_j x_i$$

where for a hidden layer

$$\delta_j = \sum_k \delta_k w_{jk}\, y_j (1 - y_j)$$

Here $x_i$ stands for the inputs of the layer in question, $y_j$ for its outputs,
$w_{jk}$ for the weights of the following layer, and $\delta_k$ for the deltas of
that following layer; for the output layer itself, $\delta_j = (y_j - t_j)\, y_j (1 - y_j)$,
as derived above.
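Read recursively, this says that each layer's delta is the next layer's delta pushed back through that layer's weights and scaled by the local sigmoid derivative. The sketch below puts both update rules together into one full training step for the 2-3-2 network; the data, learning rate, iteration count, and the bias updates (whose gradients equal the deltas, though the text does not derive them) are our own illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
eta = 0.1                                    # learning rate (arbitrary)
x = np.array([[0.5, -1.0]])                  # inputs,  shape 1x2 (illustrative)
t = np.array([[1.0, 0.0]])                   # targets, shape 1x2 (illustrative)
w, b1 = rng.normal(size=(2, 3)), np.zeros((1, 3))
u, b2 = rng.normal(size=(3, 2)), np.zeros((1, 2))

for step in range(1000):
    # Forward pass
    h = sigmoid(x @ w + b1)                  # h = sigma(xw + b1)
    y = sigmoid(h @ u + b2)                  # y = sigma(hu + b2)

    # Output-layer deltas: (y - t) * y * (1 - y)
    delta_out = (y - t) * y * (1 - y)        # shape 1x2
    # Hidden-layer deltas: push delta_out back through u, scale by h(1 - h)
    delta_hid = (delta_out @ u.T) * h * (1 - h)   # shape 1x3

    # Gradient-descent updates: dL/d(weights) = (layer input)^T @ delta
    u -= eta * h.T @ delta_out
    b2 -= eta * delta_out                    # bias gradient equals the delta
    w -= eta * x.T @ delta_hid
    b1 -= eta * delta_hid

print(y, t)   # after training on this single example, y has moved toward t
```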
Kudos to those of you who got to the end.
Thanks for reading.