# Explain Backpropagation Like I'm Five

Jul 13 2018

Recently, there has been a lot of successful news about artificial intelligence (AI): machine translation, object detection, automatic modeling, motion capture, and so on... Even optical illusions, which you might think only your brain could experience, have been reproduced by AI.


Hi, I'm CreativeGP, a fifteen-year-old boy who loves programming <3

In this article, I'm going to explain the theory behind Backpropagation and its implementation in C++.

Before that, I'll show you a demo video (sorry, it's in Japanese)

and the source code here.

## First, What is the Backpropagation 🙄 ?


At the base of these achievements is the neural network, a rough model imitating the brain. A network consists of several neurons connected to each other with weights. If a weight is high, the signal of the previous neuron is conveyed strongly.

If you give the network inputs and the outputs you expect, the Backpropagation algorithm optimizes its weights so that the network outputs the expected values when those inputs are fed.

## So, What does the Backpropagation do 😧 ?

* First, define the Error function.

* Second, for each weight, check how the Error function value moves when the weight moves.

* Now we know how the Error function value moves when each weight moves. So, finally, change each weight so that the Error function value becomes smaller.

That's all the Backpropagation actually does.

And what I write from now on is just an explanation of these steps.

Let's break it down!

## Artificial Neuron Models 😲

This is a neuron. If we feed it several inputs, each input is multiplied by a weight (wi1, wi2, wi3), and the sum of the weighted inputs is fed to an activation function (f). The activation function value becomes the output signal of this neuron.

$$y = f(\sum_{k=0}^{n} x_kw_k + b)$$

And we use the Sigmoid function as the activation function.

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

This function maps any value into the range 0~1, and it's nonlinear (an important trait that an activation function must have).

Let's write a program that defines the artificial neuron and the Sigmoid function:

```cpp
struct Neuron {
    double value, in, bias;
    vector<Neuron*> backs;
    vector<Neuron*> nexts;
    vector<double> weights;
};

double s(double x)
{
    return 1.0 / (1.0 + exp(-x));
}
```


and calculate the neuron values from the shallowest layer to the deepest layer (this is called Forward Propagation):

```cpp
double calculate_value(Neuron& n, bool forcibly = false) {
    if (n.backs.size() != 0) {
        vector<double> values;
        for (auto e : n.backs) {
            // Don't recalculate an already calculated value
            // (unless forcibly is set).
            if (e->value == 0 || forcibly) calculate_value(*e, forcibly);
            values.push_back(e->value);
        }

        auto tmp = inner_product(values.begin(), values.end(), n.weights.begin(), 0.0);
        tmp += n.bias;
        n.in = tmp; // The value before the sigmoid is applied
        tmp = s(tmp);
        n.value = tmp;
        return tmp;
    } else {
        return n.value;
    }
}

void update_network(vector<vector<Neuron>>& net, bool forcibly = false) {
    for (auto &e : net.back())
        calculate_value(e, forcibly);
}
```

Now, let's begin following the aforementioned Backpropagation steps.

## Step1: The Error Function 😬

The Error function (aka Objective, aka Cost, aka Loss) represents how different the outputs of the network are from the expected outputs. There are several functions suitable for this; today I use the simplest Error function,

$$E = \sum_{k=0}^{outputs} \frac{(teacher_k - output_k)^2}{2}$$

the summation of half the squared difference between each expected output and each actual output.

Therefore, our goal is to let this Error function value approach zero.

Program:

```cpp
double error(vector<vector<Neuron>>& net, vector<double> teacher) {
    // Update the network
    update_network(net);

    double sum = 0;
    // E(y1,...,yK) = ∑_k (t_k − y_k)^2 / 2
    for (int k = 0; k < net.back().size(); ++k) {
        sum += pow((teacher[k] - net.back()[k].value), 2) / 2;
    }

    return sum;
}
```


## Step2: Storm of the Differential Calculus 😤

This step is the hardest to understand. But simply put, all we have to do in this step is calculate the derivative of the Error function with respect to each weight.

First, we define the neural network that we're going to work on (Figure 2):

input layer: 3 neurons, hidden layers: 3 neurons × 2 layers, output layer: 2 neurons.

Layer Name: $I, h1, h2, O$

Layer Index: $l, i, j, k$

Neuron name: [layer index][neuron index] e.g. $i2, i3, k2$

Value name: [layer name][out/in][neuron index] e.g. $h2_{out3}$, $O_{out1}$, $I_3$ (in means the unactivated value and out means the activated value through the activation function)

Weight: W[source neuron name][dst. neuron name] e.g. $w_{l1i1}, w_{j3k2}$

This is a very simple network, and I'm going to explain how the Backpropagation finds the derivatives by calculating three of them BY HAND. The Backpropagation starts calculating from the deeper layers, so let's begin with the weights between $h2$ and $O$.

### h2 -> O 😎

First, we want to get a function that represents how the Error function value moves when the weight moves. So we have to differentiate, but we have to use a special kind of derivative, the partial derivative, because the function we want to differentiate (i.e. the Error function) takes multiple variables.

To avoid complications, I won't explain partial derivatives here. There are many good websites. Google!

What we want to get in the end is this:$$\frac{\partial E}{\partial w_{j2k1}}$$I picked $w_{j2k1}$ (the red line in Figure 2). Note that in the implementation we have to calculate this for every weight.

From the chain rule (Google!), we can expand the expression:$$\frac{\partial E}{\partial w_{j2k1}}=\Biggl[\frac{\partial E}{\partial O_{out1}}\frac{\partial O_{out1}}{\partial O_{in1}}\Biggl]\frac{\partial O_{in1}}{\partial w_{j2k1}}$$

Solving each item:$$\frac{\partial E}{\partial O_{out1}} = O_{out1}-T_1$$$$\frac{\partial O_{out1}}{\partial O_{in1}} = \sigma'(O_{in1}) = \sigma(O_{in1}) × (1-\sigma(O_{in1}))$$$$\frac{\partial O_{in1}}{\partial w_{j2k1}} = h2_{out1}$$

Thus we got:$$\frac{\partial E}{\partial w_{j2k1}} = (O_{out1}-T_1) × \sigma'(O_{in1}) × h2_{out1}$$

You can solve for the other weights in the same way. Notice that the second and third items follow a pattern. Basically,

* The second item = Sigmoid'(the destination neuron's input)

* The third item = the source neuron's output

(law 1)

Then, let's go to the shallower layer!

You may be getting tired, but the interesting law doesn't appear until we reach the shallower layers. There, you will realize why the Backpropagation calculates from the deepest layer.

Program for the sigmoid derivative:

```cpp
double d_s(double x)
{
    return s(x) * (1 - s(x));
}
```


### h1 -> h2 😥

We want to get this (the green line in Figure2):$$\frac{\partial E}{\partial w_{i3j2}} = \frac{\partial E}{\partial h2_{out2}}\frac{\partial h2_{out2}}{\partial h2_{in2}}\frac{\partial h2_{in2}}{\partial w_{i3j2}}$$

The second and third items can be calculated immediately with law 1.

The cumbersome one is the first item, $\frac{\partial E}{\partial h2_{out2}}$. It represents how the Error function value moves when the output of the hidden-layer-2 neuron moves. This value influences all the neurons of the next layer ($O$). So you may suspect that this value is pretty complicated, and it actually is.

$$\frac{\partial E}{\partial h2_{out2}} = \sum_{e}^{E} \frac{\partial E_e}{\partial h2_{out2}}$$Thus, this value is the summation, over the neurons of $O$, of the derivative of each neuron's error with respect to $h2_{out2}$.

Expanding:$$\frac{\partial E_1}{\partial h2_{out2}} = \frac{\partial E}{\partial O_{out1}}\frac{\partial O_{out1}}{\partial O_{in1}}\frac{\partial O_{in1}}{\partial h2_{out2}}$$$$\frac{\partial E_2}{\partial h2_{out2}} = \frac{\partial E}{\partial O_{out2}}\frac{\partial O_{out2}}{\partial O_{in2}}\frac{\partial O_{in2}}{\partial h2_{out2}}$$

Therefore, $\frac{\partial E}{\partial h2_{out2}}$ for this network is:$$\frac{\partial E}{\partial h2_{out2}} =(\frac{\partial E}{\partial O_{out1}}\frac{\partial O_{out1}}{\partial O_{in1}}\frac{\partial O_{in1}}{\partial h2_{out2}}) + (\frac{\partial E}{\partial O_{out2}}\frac{\partial O_{out2}}{\partial O_{in2}}\frac{\partial O_{in2}}{\partial h2_{out2}})$$But look at the part $\frac{\partial E}{\partial O_{out1}} \frac{\partial O_{out1}}{\partial O_{in1}}$. Surprisingly, it is the same as the part we calculated in the previous chapter (the part enclosed in square brackets).

Thus, we can save computation by storing these results while working on the deeper layers. And this is why the Backpropagation calculates from the deepest layer!

## Step 3: Change the weight 😏

Now we have the derivative for each weight. Let's use them to change the weights so that the Error function value becomes smaller.

The value we've got is the gradient of the Error function. A positive gradient means that increasing the weight increases the error, so we should move the weight in the direction opposite to the gradient.

Therefore,$$w_{j2k1} += -\eta ×\frac{\partial E}{\partial w_{j2k1}}$$where $\eta$ is the Learning Rate. We can tune the Backpropagation algorithm by adjusting it.

Don't forget to update the bias as well. Because $\frac{\partial O_{in1}}{\partial b_{k1}} = 1$, the bias gradient is the weight gradient without the last factor:$$b_{k1} += -H ×\frac{\partial E}{\partial b_{k1}}$$where $H$ is the Bias Learning Rate and $\frac{\partial E}{\partial b_{k1}} = (O_{out1}-T_1) × \sigma'(O_{in1})$.

Here is the program for the Backpropagation algorithm:

```cpp
void fit(vector<vector<Neuron>>& net, vector<double> teacher) {
    // dE/d(in) values of the layer one step deeper
    vector<double> last_error_by_in;
    // Loop through the layers backward (deepest first)
    for (int i = net.size()-1; i > 0; --i) {
        vector<double> error_by_in; // dE/d(in) values of this layer
        // Loop through the neurons of this layer
        for (int j = 0; j < net[i].size(); ++j) {
            double error_by_out = 0;
            if (i == net.size()-1) {
                // Output layer: dE/d(out) = out - teacher
                error_by_out = net[i][j].value - teacher[j];
            } else {
                // Hidden layer: reuse the results saved for the deeper layer
                for (int e = 0; e < net[i+1].size(); ++e)
                    error_by_out += last_error_by_in[e] * net[i+1][e].weights[j];
            }

            double out_by_in = d_s(net[i][j].in);

            // Save this layer's result for the next (shallower) layer
            error_by_in.push_back(error_by_out * out_by_in);

            for (int k = 0; k < net[i][j].weights.size(); ++k) {
                double in_by_weights = net[i-1][k].value;
                net[i][j].weights[k] -= LEARNING_RATE * error_by_out * out_by_in * in_by_weights;
            }
            // d(in)/d(bias) = 1, so the last factor drops out
            net[i][j].bias -= LEARNING_BIAS_RATE * error_by_out * out_by_in;
        }
        last_error_by_in = error_by_in;
    }
}
```


## Summary 🤓

This algorithm is quite complicated, so I think it is important to prepare yourself before tackling this labyrinth.

I think it is very meaningful to learn how the Backpropagation algorithm works. With a solid understanding of the basics, you can create innovative AI and software.

I apologize for the parts that are hard to read.

Thank you.