Explain Backpropagation Like I'm Five

Jul 13 2018


Hi, I'm CreativeGP, a fifteen-year-old boy who loves programming <3

In this article, I'm going to explain the theory of Backpropagation and its implementation in C++.

Before we start, here is a demo video (sorry, it's in Japanese)

and the source code is here.

First, what is Backpropagation 🙄 ?

Recently, there has been a lot of news about successes in artificial intelligence (AI): machine translation, object detection, automatic modeling, motion capture, and so on...

Even optical illusions, which you might think only your brain could experience, have been reproduced by AI.

At the base of all of these is the neural network, a rough model that imitates the brain. A network consists of several neurons connected to each other by weights. If a weight is high, the signal of the previous neuron is conveyed strongly.

If you give the network pairs of inputs and expected outputs, the Backpropagation algorithm optimizes the weights so that the network outputs the expected values when those inputs are fed in.

So, what does Backpropagation do 😧 ?

* First, define the Error function.

* Second, for each weight, check how the Error function value moves when the weight moves.

* Finally, now that we know how the Error function value moves when each weight moves, change each weight so that the Error function value becomes smaller.

That is all Backpropagation actually does.

And what I write from now on is just an explanation of these steps.

Let's break it down!

Artificial Neuron Models 😲

This is a neuron. If we feed it several inputs, each input is multiplied by a weight ($w_{i1}, w_{i2}, w_{i3}$), the sum of the weighted inputs is fed to an activation function ($f$), and the value of the activation function becomes the output signal of this neuron.

$$y = f(\sum_{k=0}^{n} x_kw_k + b)$$

And we use the Sigmoid function as the activation function.

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

This function maps an arbitrary value into the range 0 to 1, and it is nonlinear (an important trait that an activation function must have).

Let's write the code that defines the artificial neuron and the Sigmoid function:
#include <cmath>
#include <numeric>
#include <vector>
using namespace std;

struct Neuron {
    double value = 0, in = 0, bias = 0; // initialized so the "not yet calculated" check below works
    vector<Neuron*> backs;   // neurons of the previous (shallower) layer
    vector<Neuron*> nexts;   // neurons of the next (deeper) layer
    vector<double> weights;  // one weight per neuron in `backs`
};

double s(double x)
{
    return 1.0 / (1.0 + exp(-x));
}

and calculate the neuron values from the shallowest layer to the deepest layer (this is called Forward Propagation):
double calculate_value(Neuron& n, bool forcibly = false) {
    if (n.backs.size() != 0) {
        vector<double> values;
        for (auto e : n.backs) {
            // Recurse only if the previous neuron hasn't been calculated yet
            // (or if we are forcibly recomputing the whole network).
            if (e->value == 0 || forcibly) calculate_value(*e, forcibly);
            values.push_back(e->value);
        }

        auto tmp = inner_product(values.begin(), values.end(), n.weights.begin(), 0.0);
        tmp += n.bias;
        n.in = tmp; // the raw input, before the sigmoid is applied
        tmp = s(tmp);
        n.value = tmp;
        return tmp;
    } else {
        // Input-layer neuron: its value is set from outside.
        return n.value;
    }
}

void update_network(vector<vector<Neuron>>& net, bool forcibly = false) {
    for (auto &e : net.back())
        calculate_value(e, forcibly);
}
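
The code so far never constructs a network, so here is a minimal sketch of a builder (the helper make_network and its random initialization range are my own choices for illustration, not part of the demo's code):
#include <cstdlib> // rand, RAND_MAX

// Hypothetical helper: build a fully connected network from a list of layer
// sizes, link each neuron to every neuron of the previous layer, and give
// every weight a small random value in [-0.5, 0.5).
vector<vector<Neuron>> make_network(vector<int> sizes) {
    vector<vector<Neuron>> net;
    for (int s : sizes) net.push_back(vector<Neuron>(s));
    for (size_t i = 1; i < net.size(); ++i) {
        for (auto &n : net[i]) {
            for (auto &prev : net[i-1]) {
                n.backs.push_back(&prev);
                prev.nexts.push_back(&n);
                n.weights.push_back((double)rand() / RAND_MAX - 0.5);
            }
        }
    }
    return net;
}
For example, make_network({3, 3, 3, 2}) would build the 3-3-3-2 network that appears later as Figure 2; after setting the input-layer values by hand, update_network runs Forward Propagation through it.
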
Now, let's begin following the aforementioned Backpropagation steps.

Step 1: The Error Function 😬

The Error function (aka Objective, aka Cost, aka Loss) represents how different the outputs of the network are from the expected outputs. There are several functions suitable for this; today I use the simplest Error function,

$$E = \sum_{k=0}^{outputs} \frac{(teacher_k - output_k)^2}{2}$$That is, the summation of half of the squared difference between each expected output and each actual output.

Therefore, our goal is to let this Error function value approach zero.

Program:
double error(vector<vector<Neuron>>& net, vector<double> teacher) {
    // Update the network first so the output values are fresh
    update_network(net);

    double sum = 0;
    // E(y1,...,yK) = ∑_k (t_k − y_k)^2 / 2
    for (int k = 0; k < net.back().size(); ++k) {
        sum += pow(teacher[k] - net.back()[k].value, 2) / 2;
    }

    return sum;
}

Step 2: Storm of the Differential Calculus 😤

This step is the hardest to understand. But all we have to do in this step is calculate the derivative of the Error function with respect to each weight.

First, we define the neural network that we're going to work on (Figure 2):

Input layer: 3 neurons, hidden layers: 3 neurons × 2 layers, output layer: 2 neurons.

Layer Name: $I, h1, h2, O$

Layer Index: $l, i, j, k$

Neuron name: [layer index][neuron index] e.g. $i2, i3, k2$

Value name: [layer name][out/in][neuron index] e.g. $h2_{out3}$, $O_{out1}$, $I_3$ (`in` means the value before the activation function and `out` means the value after it)

Weight: $w$[source neuron name][dst. neuron name] e.g. $w_{l1i1}, w_{j3k2}$

This is a very simple network, and I'm going to explain how Backpropagation finds the derivatives by calculating three of them BY HAND. Backpropagation starts its calculation from the deeper layers, so let's begin with the weights between $h2$ and $O$.

h2 -> O 😎

First, we want to get a function that represents how the Error function value moves when the weight moves. So we have to calculate a derivative, but we have to use a special one, the partial derivative, because the function we want to differentiate (i.e. the Error function) takes multiple variables.

To keep things simple, I won't explain partial derivatives here. There are many good websites. Google!

What we finally want to get is this:$$\frac{\partial E}{\partial w_{j2k1}}$$I picked $w_{j2k1}$ (the red line in Figure 2). Note that in the implementation we have to calculate this for every weight.

From the chain rule (Google!), we can expand the expression:$$\frac{\partial E}{\partial w_{j2k1}}=\Biggl[\frac{\partial E}{\partial O_{out1}}\frac{\partial O_{out1}}{\partial O_{in1}}\Biggr]\frac{\partial O_{in1}}{\partial w_{j2k1}}$$

Solve each item.$$\frac{\partial E}{\partial O_{out1}} = O_{out1}-T_1$$$$\frac{\partial O_{out1}}{\partial O_{in1}} = \sigma'(O_{in1}) = \sigma(O_{in1}) × (1-\sigma(O_{in1}))$$$$\frac{\partial O_{in1}}{\partial w_{j2k1}} = h2_{out2}$$

Thus we get:$$\frac{\partial E}{\partial w_{j2k1}} = (O_{out1}-T_1) × \sigma'(O_{in1}) × h2_{out2}$$
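
As a quick sanity check with made-up numbers (they are not taken from the demo): suppose $O_{out1} = 0.73$, $T_1 = 1$ and $h2_{out2} = 0.5$. Since $\sigma'(O_{in1}) = O_{out1} × (1-O_{out1}) = 0.73 × 0.27 ≈ 0.197$, we get$$\frac{\partial E}{\partial w_{j2k1}} ≈ (0.73 - 1) × 0.197 × 0.5 ≈ -0.027$$A negative derivative means that increasing $w_{j2k1}$ a little would make the error a little smaller.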

You can solve for the other weights in the same way, and a law for the second and third items has already appeared. Basically,

* The Second Item = Sigmoid'(the destination neuron's input)

* The Third Item = the source neuron's output

(Let's call this law 1.)

Then, let's go to the shallower layer!

You may be getting tired, but the most interesting law doesn't appear until we reach a shallower layer. There you will see why Backpropagation calculates from the deepest layer.

Program for sigmoid derivative:
double d_s(double x)
{
    return s(x) * (1 - s(x));
}
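
As a small aside (this variant is mine and is not used in the rest of the code), the identity $\sigma'(x) = \sigma(x) × (1-\sigma(x))$ means the derivative can also be computed directly from an output value that has already been stored in a neuron, skipping the two exp() calls:
// Alternative sketch: compute σ' from an already-activated output y = σ(x).
double d_s_from_output(double y)
{
    return y * (1 - y);
}
For any forward-propagated neuron n, d_s(n.in) and d_s_from_output(n.value) give the same number.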

h1 -> h2 😥

We want to get this (the green line in Figure 2):$$\frac{\partial E}{\partial w_{i3j2}} = \frac{\partial E}{\partial h2_{out2}}\frac{\partial h2_{out2}}{\partial h2_{in2}}\frac{\partial h2_{in2}}{\partial w_{i3j2}}$$

The second and third items can be calculated immediately with law 1.

The cumbersome one is the first item, $\frac{\partial E}{\partial h2_{out2}}$. It represents how the Error function value moves when the output of this hidden-layer-2 neuron moves. That output influences every neuron of the next layer ($O$), so you can probably guess that this value is pretty complicated, and it actually is.

$$\frac{\partial E}{\partial h2_{out2}} = \sum_{e} \frac{\partial E_e}{\partial h2_{out2}}$$Here the sum runs over the neurons of $O$ and $E_e$ is the error term of output neuron $e$. Thus, this value is a summation of the derivative of each output neuron's error with respect to $h2_{out2}$.

Expanding:$$\frac{\partial E_1}{\partial h2_{out2}} = \frac{\partial E}{\partial O_{out1}}\frac{\partial O_{out1}}{\partial O_{in1}}\frac{\partial O_{in1}}{\partial h2_{out2}}$$$$\frac{\partial E_2}{\partial h2_{out2}} = \frac{\partial E}{\partial O_{out2}}\frac{\partial O_{out2}}{\partial O_{in2}}\frac{\partial O_{in2}}{\partial h2_{out2}}$$

Therefore, $\frac{\partial E}{\partial h2_{out2}}$ for this network is:$$\frac{\partial E}{\partial h2_{out2}} =(\frac{\partial E}{\partial O_{out1}}\frac{\partial O_{out1}}{\partial O_{in1}}\frac{\partial O_{in1}}{\partial h2_{out2}}) + (\frac{\partial E}{\partial O_{out2}}\frac{\partial O_{out2}}{\partial O_{in2}}\frac{\partial O_{in2}}{\partial h2_{out2}})$$But look at the part $\frac{\partial E}{\partial O_{out1}} \frac{\partial O_{out1}}{\partial O_{in1}}$. Surprisingly, this is exactly the part we calculated in the previous section (the part enclosed in square brackets).

Thus, we can save computation by storing the calculation results while working on the deeper layers. And this is why Backpropagation calculates from the deepest layer!
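
To summarize the reuse in one place (the symbol $\delta$ is my own shorthand; it does not appear in the figures): write $\delta$ for the saved quantity $\frac{\partial E}{\partial (\mathrm{in})}$ of each neuron. Then, in this article's notation,$$\delta_{kn} = \frac{\partial E}{\partial O_{in\,n}} = (O_{out\,n} - T_n) × \sigma'(O_{in\,n})$$$$\delta_{jm} = \frac{\partial E}{\partial h2_{in\,m}} = \Bigl(\sum_{n} \delta_{kn} × w_{jm\,kn}\Bigr) × \sigma'(h2_{in\,m})$$and the derivative for any weight is the $\delta$ of the destination neuron times the output of the source neuron, e.g. $\frac{\partial E}{\partial w_{im\,jn}} = \delta_{jn} × h1_{out\,m}$. Each layer's $\delta$ only needs the next deeper layer's $\delta$, which is exactly what the implementation in Step 3 stores in last_error_by_in.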

Step 3: Change the weight 😏

Now we have the derivative for each weight. Let's use them to change the weights so that the Error function value becomes smaller.

The values we've got form the gradient of the Error function. A large positive gradient means that the weight is pushing the error up, so we should decrease that weight (and similarly for the bias); a negative gradient means we should increase it.

Therefore,$$w_{j2k1} += -\eta ×\frac{\partial E}{\partial w_{j2k1}}$$where $\eta$ is the learning rate. We can tune how the algorithm learns by adjusting this value.

Don't forget to update the bias. Since $\frac{\partial O_{in1}}{\partial b_{k1}} = 1$, its derivative is the same as the weight's but without the last factor:$$b_{k1} += -H ×\frac{\partial E}{\partial b_{k1}} = -H × (O_{out1}-T_1) × \sigma'(O_{in1})$$where $H$ is the bias learning rate.
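
Continuing the made-up numbers from before, with $\eta = 0.5$ the update would be$$w_{j2k1} += -0.5 × (-0.027) = +0.0135$$so the weight grows slightly, $O_{out1}$ moves a little closer to $T_1 = 1$, and the error shrinks.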

Here is the program for the Backpropagation algorithm:
// Learning rates (example values; tune as you like).
const double LEARNING_RATE = 0.5;
const double LEARNING_BIAS_RATE = 0.5;

void fit(vector<vector<Neuron>>& net, vector<double> teacher) {
    vector<double> last_error_by_in; // ∂E/∂in of every neuron in the previously processed layer
    // Loop through the layers backward (from the output layer towards the input layer)
    for (int i = net.size()-1; i > 0; --i) {
        vector<double> this_error_by_in; // ∂E/∂in of every neuron in this layer
        // Loop through the neurons of this layer
        for (int j = 0; j < net[i].size(); ++j) {
            // First item: ∂E/∂out of this neuron
            double error_by_out = 0;
            if (i == net.size()-1) {
                // Output layer
                error_by_out = net[i][j].value - teacher[j];
            } else {
                // Hidden layer: sum the contributions of every neuron in the next layer
                for (int e = 0; e < net[i+1].size(); ++e)
                    error_by_out += last_error_by_in[e] * net[i+1][e].weights[j];
            }

            // Second item: ∂out/∂in = σ'(in)
            double out_by_in = d_s(net[i][j].in);

            // Save ∂E/∂in of this neuron for the next (shallower) layer
            this_error_by_in.push_back(error_by_out * out_by_in);

            // Third item: ∂in/∂w = the source neuron's output
            for (int k = 0; k < net[i][j].weights.size(); ++k) {
                double in_by_weights = net[i-1][k].value;
                net[i][j].weights[k] -= LEARNING_RATE * error_by_out * out_by_in * in_by_weights;
            }
            // The bias has no third item (∂in/∂b = 1)
            net[i][j].bias -= LEARNING_BIAS_RATE * error_by_out * out_by_in;
        }
        last_error_by_in = this_error_by_in;
    }
}
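
Finally, here is a hypothetical way to tie everything together (my own illustration, not the code behind the demo video): build the Figure 2 network with the make_network sketch from earlier, feed it one fixed input, and repeatedly run the forward and backward passes while watching the error fall.
#include <cstdio> // printf

int main() {
    vector<vector<Neuron>> net = make_network({3, 3, 3, 2}); // 3-3-3-2, as in Figure 2
    vector<double> input   = {0.1, 0.5, 0.9}; // made-up input values
    vector<double> teacher = {1.0, 0.0};      // made-up expected outputs

    for (int k = 0; k < 3; ++k) net[0][k].value = input[k];

    for (int epoch = 0; epoch < 1000; ++epoch) {
        update_network(net, true); // Forward Propagation (recompute every neuron)
        if (epoch % 100 == 0)
            printf("epoch %4d  error %f\n", epoch, error(net, teacher));
        fit(net, teacher);         // Backpropagation
    }
}
If everything is wired correctly, the printed error should shrink towards zero, which is exactly the goal we set in Step 1.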

Summary 🤓

This algorithm is quite complicated, so I think it is important to prepare yourself before stepping into this labyrinth.

I think it is very meaningful to learn how the Backpropagation algorithm works. With a good understanding of the basics, you can go on to create innovative AI and software.

I apologize if some parts were hard to read.

Thank you.
