Backpropagation. A peek into the mathematics of optimization

Iliya Valchanov 2 May 2023 14 min read

1. Motivation

To get a truly deep understanding of deep neural networks (which is definitely a plus if you want to start a career in data science), one must look at the mathematics behind them. As backpropagation is at the core of the optimization process, we wanted to introduce you to it. You will rarely need to code it by hand, because TensorFlow, sklearn, or any other machine learning package (as opposed to plain NumPy) has backpropagation methods incorporated.

2. The specific net and notation we will examine

Here's our simple network:

Figure 1: Backpropagation

We have two inputs: x1 and x2. There is a single hidden layer with 3 units (nodes): h1, h2, and h3. Finally, there are two outputs: y1 and y2. The arrows that connect them are the weights. There are two weight matrices: w and u. The w weights connect the input layer and the hidden layer. The u weights connect the hidden layer and the output layer. We have used two different letters, w and u, so that the computations are easier to follow. You can also see that we compare the outputs y1 and y2 with the targets t1 and t2.

There is one last letter we need to introduce before we can get to the computations. Let a be the linear combination prior to activation. Thus, we have: a(1) = xw + b(1) and a(2) = hu + b(2), where b(1) and b(2) are the bias terms and the number in parentheses indexes the layer.

Since we cannot exhaust all activation functions and all loss functions, we will focus on two of the most common: a sigmoid activation and an L2-norm loss. With this new information and the new notation, the output y is equal to the activated linear combination. Therefore, for the output layer, we have y = σ(a(2)), while for the hidden layer, h = σ(a(1)).
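
The forward pass described above can be sketched in a few lines of plain Python. This is only an illustration: the function names (`sigmoid`, `forward`, `l2_loss`), the index layout of the weight lists, and all numeric values are our own assumptions, and the loss is assumed to take the common form of half the sum of squared errors.

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, b1, u, b2):
    """Forward pass through the 2-3-2 net.

    w[i][j] connects input i to hidden unit j;
    u[j][k] connects hidden unit j to output k.
    """
    # a(1) = xw + b(1), h = sigmoid(a(1))
    a1 = [sum(x[i] * w[i][j] for i in range(len(x))) + b1[j]
          for j in range(len(b1))]
    h = [sigmoid(a) for a in a1]
    # a(2) = hu + b(2), y = sigmoid(a(2))
    a2 = [sum(h[j] * u[j][k] for j in range(len(h))) + b2[k]
          for k in range(len(b2))]
    y = [sigmoid(a) for a in a2]
    return h, y

def l2_loss(y, t):
    """L2-norm loss, assumed here to be 1/2 * sum of squared errors."""
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))

# Hypothetical parameters and data, chosen only for illustration.
x = [0.5, -1.0]
w = [[0.1, -0.2, 0.3],
     [0.4, 0.5, -0.6]]
b1 = [0.0, 0.1, -0.1]
u = [[0.2, -0.3],
     [0.4, 0.1],
     [-0.5, 0.6]]
b2 = [0.05, -0.05]
t = [1.0, 0.0]

h, y = forward(x, w, b1, u, b2)
print("hidden:", h)
print("output:", y)
print("loss:", l2_loss(y, t))
```

Nested lists are used instead of NumPy arrays so that the indices in the code line up with the single-weight notation used in the derivations.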

We will examine backpropagation for the output layer and the hidden layer separately, as the methodologies differ.

3. Useful formulas

I would like to remind you that the L2-norm loss is:

$$L = \frac{1}{2}\sum_{j}(y_j - t_j)^2$$

The sigmoid function is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

and its derivative is:

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
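
If you want to convince yourself of the sigmoid derivative, one option is to compare the closed form σ(x)(1 − σ(x)) with a numerical estimate. The snippet below is a minimal sketch; `central_difference` is a hypothetical helper we introduce only for this check, and the test points are arbitrary.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Closed-form derivative: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def central_difference(f, z, eps=1e-6):
    """Numerical derivative of f at z (hypothetical checking helper)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

# Print the closed form next to the numerical estimate at a few points.
for z in (-2.0, 0.0, 1.5):
    print(z, sigmoid_prime(z), central_difference(sigmoid, z))
```

The two columns should agree to many decimal places, and at z = 0 the derivative reaches its maximum of 0.25.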

4. Backpropagation for the output layer

In order to obtain the update rule

$$u \leftarrow u - \eta\,\nabla_u L(u),$$

where η is the learning rate, we must calculate

$$\nabla_u L(u).$$

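Let's take a single weight uij. The partial derivative of the loss w.r.t. uij equals:

$$\frac{\partial L}{\partial u_{ij}} = \frac{\partial L}{\partial y_j}\,\frac{\partial y_j}{\partial a^{(2)}_j}\,\frac{\partial a^{(2)}_j}{\partial u_{ij}},$$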

where i indexes the units of the previous layer (here the hidden layer, which is the input to this transformation) and j indexes the units of the next layer (the output of the transformation). The decomposition follows directly from the chain rule. Its three factors are:

$$\frac{\partial L}{\partial y_j} = y_j - t_j$$

following the L2-norm loss derivative.

$$\frac{\partial y_j}{\partial a^{(2)}_j} = \sigma(a^{(2)}_j)\,(1 - \sigma(a^{(2)}_j)) = y_j(1 - y_j)$$

following the sigmoid derivative.

Finally, the third partial derivative is simply the derivative of a(2) = hu + b(2).


$$\frac{\partial a^{(2)}_j}{\partial u_{ij}} = h_i$$

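Replacing the partial derivatives in the expression above, we get:

$$\frac{\partial L}{\partial u_{ij}} = (y_j - t_j)\,y_j(1 - y_j)\,h_i = \delta_j h_i, \qquad \delta_j = (y_j - t_j)\,y_j(1 - y_j).$$

Therefore, with learning rate η, the update rule for a single weight of the output layer is given by:

$$u_{ij} \leftarrow u_{ij} - \eta\,\delta_j h_i$$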
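
The output-layer result can be sketched in code as follows. The snippet computes the gradient (y_j − t_j) y_j (1 − y_j) h_i for every weight u_ij and applies one gradient-descent update. The hidden activations, weights, biases, targets, and learning rate `eta` are hypothetical values we picked for illustration, and the helper names are our own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_layer(h, u, b2):
    """y_j = sigmoid(sum_i h_i * u_ij + b2_j)."""
    return [sigmoid(sum(h[i] * u[i][j] for i in range(len(h))) + b2[j])
            for j in range(len(b2))]

def l2_loss(y, t):
    """L2-norm loss, assumed to be 1/2 * sum of squared errors."""
    return 0.5 * sum((yj - tj) ** 2 for yj, tj in zip(y, t))

def output_gradient(h, y, t):
    """dL/du_ij = delta_j * h_i, delta_j = (y_j - t_j) * y_j * (1 - y_j)."""
    delta = [(y[j] - t[j]) * y[j] * (1.0 - y[j]) for j in range(len(y))]
    return [[delta[j] * h[i] for j in range(len(y))] for i in range(len(h))]

# Hypothetical hidden activations, weights, biases, targets, learning rate.
h = [0.3, 0.6, 0.9]
u = [[0.2, -0.3],
     [0.4, 0.1],
     [-0.5, 0.6]]
b2 = [0.05, -0.05]
t = [1.0, 0.0]
eta = 0.5

y = output_layer(h, u, b2)
grad_u = output_gradient(h, y, t)

# One gradient-descent step: u_ij <- u_ij - eta * dL/du_ij
u_new = [[u[i][j] - eta * grad_u[i][j] for j in range(len(y))]
         for i in range(len(h))]

loss_before = l2_loss(y, t)
loss_after = l2_loss(output_layer(h, u_new, b2), t)
print("loss before:", loss_before)
print("loss after: ", loss_after)
```

For a reasonably small learning rate, the loss after the update should be lower than before it, which is a quick sanity check on the sign of the gradient.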

5. Backpropagation of a hidden layer

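Similarly to the backpropagation of the output layer, the update rule for a single weight, wij, would depend on:

$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial h_j}\,\frac{\partial h_j}{\partial a^{(1)}_j}\,\frac{\partial a^{(1)}_j}{\partial w_{ij}},$$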

following the chain rule. Taking advantage of the results we have so far for transformation using the sigmoid activation and the linear model, we get:

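$$\frac{\partial h_j}{\partial a^{(1)}_j} = \sigma(a^{(1)}_j)\,(1 - \sigma(a^{(1)}_j)) = h_j(1 - h_j)$$

$$\frac{\partial a^{(1)}_j}{\partial w_{ij}} = x_i$$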

The actual problem for backpropagation comes from the term

$$\frac{\partial L}{\partial h_j}$$

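That's due to the fact that there is no "hidden" target. You can follow the solution for weight w11 below. It is advisable to also check Figure 1 while going through the computations. Since h1 feeds into both outputs, both contribute to the derivative:

$$\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial y_1}\,\frac{\partial y_1}{\partial a^{(2)}_1}\,\frac{\partial a^{(2)}_1}{\partial h_1} + \frac{\partial L}{\partial y_2}\,\frac{\partial y_2}{\partial a^{(2)}_2}\,\frac{\partial a^{(2)}_2}{\partial h_1} = (y_1 - t_1)\,y_1(1 - y_1)\,u_{11} + (y_2 - t_2)\,y_2(1 - y_2)\,u_{12}$$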

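From here, we can calculate the derivative of the loss w.r.t. w11, which is what we wanted. The final expression is:

$$\frac{\partial L}{\partial w_{11}} = \left[(y_1 - t_1)\,y_1(1 - y_1)\,u_{11} + (y_2 - t_2)\,y_2(1 - y_2)\,u_{12}\right] h_1(1 - h_1)\,x_1$$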

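The generalized form of this equation is:

$$\frac{\partial L}{\partial w_{ij}} = \left[\sum_{k}(y_k - t_k)\,y_k(1 - y_k)\,u_{jk}\right] h_j(1 - h_j)\,x_i$$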

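A hedged way to check the hidden-layer result is to compute the gradient for every wij as the sum over outputs k of (y_k − t_k) y_k (1 − y_k) u_jk, multiplied by h_j (1 − h_j) x_i, and compare it with a numerical estimate. The sketch below does that, and also writes out the w11 case explicitly (index 0 in the code). All parameter values and helper names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, b1, u, b2):
    """w[i][j]: input i -> hidden j; u[j][k]: hidden j -> output k."""
    h = [sigmoid(sum(x[i] * w[i][j] for i in range(len(x))) + b1[j])
         for j in range(len(b1))]
    y = [sigmoid(sum(h[j] * u[j][k] for j in range(len(h))) + b2[k])
         for k in range(len(b2))]
    return h, y

def l2_loss(y, t):
    """L2-norm loss, assumed to be 1/2 * sum of squared errors."""
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))

def hidden_gradient(x, h, y, t, u):
    """dL/dw_ij = [sum_k (y_k - t_k) y_k (1 - y_k) u_jk] h_j (1 - h_j) x_i."""
    delta_out = [(y[k] - t[k]) * y[k] * (1.0 - y[k]) for k in range(len(y))]
    grad_w = []
    for i in range(len(x)):
        row = []
        for j in range(len(h)):
            # dL/dh_j: every output unit k contributes through u_jk.
            dL_dh = sum(delta_out[k] * u[j][k] for k in range(len(y)))
            row.append(dL_dh * h[j] * (1.0 - h[j]) * x[i])
        grad_w.append(row)
    return grad_w

# Hypothetical parameters and data, chosen only for illustration.
x = [0.5, -1.0]
w = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
b1 = [0.0, 0.1, -0.1]
u = [[0.2, -0.3], [0.4, 0.1], [-0.5, 0.6]]
b2 = [0.05, -0.05]
t = [1.0, 0.0]

h, y = forward(x, w, b1, u, b2)
grad_w = hidden_gradient(x, h, y, t, u)

# The w11 case written out term by term (w11 is w[0][0] in the code).
dL_dw11 = ((y[0] - t[0]) * y[0] * (1 - y[0]) * u[0][0]
           + (y[1] - t[1]) * y[1] * (1 - y[1]) * u[0][1]) \
          * h[0] * (1 - h[0]) * x[0]

def numeric_grad_w(i, j, eps=1e-6):
    """Central-difference estimate of dL/dw_ij (checking helper)."""
    wp = [row[:] for row in w]
    wm = [row[:] for row in w]
    wp[i][j] += eps
    wm[i][j] -= eps
    loss_p = l2_loss(forward(x, wp, b1, u, b2)[1], t)
    loss_m = l2_loss(forward(x, wm, b1, u, b2)[1], t)
    return (loss_p - loss_m) / (2.0 * eps)

print(grad_w[0][0], dL_dw11, numeric_grad_w(0, 0))
```

The three printed numbers should agree closely: the general formula, the explicit w11 expression, and the numerical estimate.
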
6. Backpropagation generalization

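Using the results for backpropagation for the output layer and the hidden layer, we can put them together in one formula, summarizing backpropagation, in the presence of L2-norm loss and sigmoid activations. Here wij stands for a weight of either layer and xi for the i-th input to that layer (hi in the case of the output layer):

$$\frac{\partial L}{\partial w_{ij}} = \delta_j x_i$$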

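where for the output layer δj = (yj − tj) yj (1 − yj), and for a hidden layer

$$\delta_j = \left(\sum_{k}\delta_k u_{jk}\right) h_j(1 - h_j),$$

with the sum running over the deltas δk of the layer that follows.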
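
Putting the pieces together, one backpropagation step for this net can be sketched as below: compute the output deltas, propagate them back to the hidden layer, then update every weight by the learning rate times delta times the input to that layer. This is an illustration under our own assumptions: the names, the learning rate `eta`, the single training example, and the number of steps are hypothetical, and the biases are kept fixed because only the weight updates were derived above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, b1, u, b2):
    """w[i][j]: input i -> hidden j; u[j][k]: hidden j -> output k."""
    h = [sigmoid(sum(x[i] * w[i][j] for i in range(len(x))) + b1[j])
         for j in range(len(b1))]
    y = [sigmoid(sum(h[j] * u[j][k] for j in range(len(h))) + b2[k])
         for k in range(len(b2))]
    return h, y

def l2_loss(y, t):
    """L2-norm loss, assumed to be 1/2 * sum of squared errors."""
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))

def backprop_step(x, t, w, b1, u, b2, eta):
    """One gradient-descent step on the weights; biases stay fixed."""
    h, y = forward(x, w, b1, u, b2)
    # Output layer: delta_k = (y_k - t_k) * y_k * (1 - y_k)
    delta_out = [(y[k] - t[k]) * y[k] * (1.0 - y[k]) for k in range(len(y))]
    # Hidden layer: delta_j = (sum_k delta_k * u_jk) * h_j * (1 - h_j),
    # computed with the weights u from before the update.
    delta_hid = [sum(delta_out[k] * u[j][k] for k in range(len(y)))
                 * h[j] * (1.0 - h[j]) for j in range(len(h))]
    # Gradient of each weight = delta_j * (i-th input to that layer).
    new_u = [[u[j][k] - eta * delta_out[k] * h[j] for k in range(len(y))]
             for j in range(len(h))]
    new_w = [[w[i][j] - eta * delta_hid[j] * x[i] for j in range(len(h))]
             for i in range(len(x))]
    return new_w, new_u, l2_loss(y, t)

# Hypothetical parameters, data, and learning rate.
x = [0.5, -1.0]
t = [1.0, 0.0]
w = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
b1 = [0.0, 0.1, -0.1]
u = [[0.2, -0.3], [0.4, 0.1], [-0.5, 0.6]]
b2 = [0.05, -0.05]
eta = 0.5

losses = []
for step in range(200):
    w, u, loss = backprop_step(x, t, w, b1, u, b2, eta)
    losses.append(loss)  # loss measured before this step's update

print("first loss:", losses[0])
print("last loss: ", losses[-1])
```

On a single example like this the loss should shrink as the steps accumulate. A real training loop would iterate over many examples and typically update the biases as well.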

Kudos to those of you who got to the end.

Thanks for reading.

If you want to dig deeper into the field of machine learning and deep learning, we encourage you to enrol in our Deep Learning with TensorFlow 2 course.


Iliya Valchanov

Co-founder of 365 Data Science

Iliya is a Finance Graduate from Bocconi University with expertise in mathematics, statistics, programming, machine learning, and deep learning. His passion for teaching inspired him to create some of the most popular courses in our program: Introduction to Data and Data Science, Introduction to R Programming, Statistics, Mathematics, Deep Learning with TensorFlow, Deep Learning with TensorFlow 2, and Machine Learning in Python.
