Deriving the Math of Artificial Neural Networks


In this article I want to mathematically derive the backpropagation algorithm that fully connected Artificial Neural Networks (ANNs) use to optimise a cost function of the kind commonly used in Deep Learning.

After giving some intuition for what kind of mathematical model a Neural Network is, I introduce a notation for an algebraic representation of the ANN. Based on this notation, I cover the feedforward algorithm, which is used to make predictions with the ANN. Last but not least, I derive the chain of derivatives that the backpropagation algorithm uses to adapt the weights and biases of the ANN so that the cost function is minimised.

But what even is an Artificial Neural Network and how is it represented?

Image Figure 1 missing…

An ANN consists of an input layer, at least one hidden layer and an output layer of neurons. Each neuron has an activation value and is connected to all of the neurons in the following layer via weights. In the hidden layers and the output layer, each neuron additionally has its own bias. The architecture of the ANN in Figure 1 suggests that a tuple of 2 real numbers, which can be interpreted geometrically as a point in 2D space, is fed into the network to activate one of the two neurons in the output layer. In simple terms, this particular architecture performs binary classification: it assigns each 2D point to one of two categories. More broadly, an Artificial Neural Network is a universal function approximator that, during training, adapts its weights and biases to minimise a predefined cost function, in order to either classify a vector of data points or regress on it. This is quite a technical explanation, and it calls for a more precise definition of the activation value of a neuron, its bias and the weights connecting the neurons of adjacent layers.

Algebraic notation of each building block of a Neural Network

Suppose the Neural Network consists of \(n\) layers, where the \(i\)-th layer has \(\#(i)\) neurons, with \(i, n \in \mathbb{N}\) and \(1 \le i \le n\).

\( a_j^i \) : Activation value of the \(j\)-th neuron in the \(i\)-th layer

\( b_j^i \) : Bias value of the \(j\)-th neuron in the \(i\)-th layer

\( w_{j,k}^{i-1,i} \) : Weight value of the connection from the \(j\)-th neuron of layer \(i-1\) to the \(k\)-th neuron of layer \(i\)

Column vector that contains all activation values of all neurons in layer \(i\):

\( \vec{a}^i := \begin{pmatrix} a_1^i \\ a_2^i \\ \vdots \\ a_{\#(i)}^i \end{pmatrix} \)

Column vector that contains all bias values of all neurons in layer \(i\):

\( \vec{b}^i := \begin{pmatrix} b_1^i \\ b_2^i \\ \vdots \\ b_{\#(i)}^i \end{pmatrix} \)

Row vector that contains all weight values from all neurons of layer \(i-1\) to the \(j\)-th neuron of layer \(i\):

\( \vec{w}_{*,j}^{i-1,i} := \begin{pmatrix} w_{1,j}^{i-1,i} & w_{2,j}^{i-1,i} & \dots & w_{\#(i-1),j}^{i-1,i} \end{pmatrix} \)

Matrix that contains all weight values from all neurons of layer \(i-1\) to all neurons of layer \(i\):

\( W^{i-1,i} := \begin{bmatrix} \vec{w}_{*,1}^{i-1,i} \\ \vec{w}_{*,2}^{i-1,i} \\ \vdots \\ \vec{w}_{*,\#(i)}^{i-1,i} \end{bmatrix} = \begin{bmatrix} w_{1,1}^{i-1,i} & w_{2,1}^{i-1,i} & \dots & w_{\#(i-1),1}^{i-1,i} \\ w_{1,2}^{i-1,i} & w_{2,2}^{i-1,i} & \dots & w_{\#(i-1),2}^{i-1,i} \\ \vdots & \vdots & \ddots & \vdots \\ w_{1,\#(i)}^{i-1,i} & w_{2,\#(i)}^{i-1,i} & \dots & w_{\#(i-1),\#(i)}^{i-1,i} \end{bmatrix} \)

Target vector given by the data set. It has one element per output neuron. All elements are zero except for a single one, whose position marks the class the data point belongs to (a one-hot encoding):

\( \vec{y} := \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{\#(n)} \end{pmatrix} \)
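
To make the shapes behind this notation concrete, here is a minimal sketch in plain Python. The input and output sizes of 2 follow the description of Figure 1. The hidden layer size of 3, the random initialisation and all names (`layer_sizes`, `make_weight_matrix`, `one_hot`) are assumptions made purely for illustration.

```python
import random

# Layer sizes #(1), #(2), #(3). The input and output sizes of 2 follow the
# text; the hidden size of 3 is an assumption made for illustration.
layer_sizes = [2, 3, 2]

random.seed(0)  # fixed seed so the sketch is reproducible


def make_weight_matrix(n_in, n_out):
    """Build W^{i-1,i}: one row per receiving neuron, one column per sending neuron."""
    return [[random.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


# weights[m] connects layer m+1 to layer m+2 (Python lists are 0-indexed).
weights = [make_weight_matrix(n_in, n_out)
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# biases[m] holds one bias per neuron of layer m+2; the input layer has none.
biases = [[0.0] * n_out for n_out in layer_sizes[1:]]


def one_hot(class_index, num_classes):
    """Target vector y: all zeros except a one at position class_index."""
    return [1.0 if k == class_index else 0.0 for k in range(num_classes)]


y = one_hot(1, layer_sizes[-1])

# Shape of each weight matrix as (rows, columns) = (#(i), #(i-1)).
shapes = [(len(W), len(W[0])) for W in weights]
print(shapes)  # [(3, 2), (2, 3)]
print(y)       # [0.0, 1.0]
```

Nested lists are used here because indexing `weights[m][k][j]` mirrors the index notation above; an array library would be the more usual choice in practice.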

How does an ANN make a prediction given its architecture?

FeedForward Algorithm explanation …
JS FeedForward Animation…

def feedforward(\(\vec{a}^1\)):
    for \(i\) in \(\{1, \dots, n-1\}\):
        \( \vec{a}^{i+1} \gets \sigma ( W^{i,i+1} \cdot \vec{a}^i + \vec{b}^{i+1} ) \)
    return \(\vec{a}^n\)

Here \(\sigma\) denotes the activation function, applied element-wise to the vector.
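
The pseudocode above can be sketched as runnable Python. The article does not say which activation function \(\sigma\) is, so the logistic sigmoid is assumed here. The helper name `sigmoid` and the all-zero example network are likewise hypothetical, chosen only so that the result is easy to check by hand.

```python
import math


def sigmoid(z):
    """Logistic function, assumed here as the activation sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def feedforward(a, weights, biases):
    """Propagate the input activations a through every layer and return a^n.

    weights[m][k][j] is the weight from neuron j of one layer to neuron k of
    the next layer; biases[m][k] is the bias of that receiving neuron.
    """
    for W, b in zip(weights, biases):
        # a_k^{i+1} = sigma( sum_j w_{j,k} * a_j^i + b_k^{i+1} )
        a = [sigmoid(sum(w_kj * a_j for w_kj, a_j in zip(row, a)) + b_k)
             for row, b_k in zip(W, b)]
    return a


# Hypothetical 2-3-2 network with all weights and biases set to zero, chosen
# only so that the result is easy to check by hand: sigma(0) = 0.5.
weights = [[[0.0, 0.0] for _ in range(3)], [[0.0, 0.0, 0.0] for _ in range(2)]]
biases = [[0.0, 0.0, 0.0], [0.0, 0.0]]

print(feedforward([1.0, -2.0], weights, biases))  # [0.5, 0.5]
```

Each pass through the loop computes one matrix-vector product row by row, which is exactly the update \( \vec{a}^{i+1} \gets \sigma ( W^{i,i+1} \vec{a}^i + \vec{b}^{i+1} ) \) written out per neuron.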

How does an ANN learn?

Backpropagation and Gradient Descent Algorithms …
Picture of a mountain (gradient descent)…
JS Backpropagation Animation…
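
As a rough illustration of the gradient descent idea, the sketch below repeatedly steps against the gradient of a toy cost function. The cost \( C(x, y) = (x - 3)^2 + (y + 1)^2 \), the learning rate of 0.1, the step count and the function names are all assumptions for this example. In an ANN, the parameters would be the weights and biases, and the gradient would come from backpropagation.

```python
def gradient_descent(grad, theta, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient: theta <- theta - eta * grad(theta)."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t - learning_rate * g_t for t, g_t in zip(theta, g)]
    return theta


def toy_gradient(theta):
    """Gradient of the toy cost C(x, y) = (x - 3)^2 + (y + 1)^2."""
    x, y = theta
    return [2.0 * (x - 3.0), 2.0 * (y + 1.0)]


# Start at the origin and walk downhill towards the minimum at (3, -1).
minimum = gradient_descent(toy_gradient, [0.0, 0.0])
print([round(v, 3) for v in minimum])  # [3.0, -1.0]
```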

Computing the gradient of the cost function with respect to each weight and bias using the chain rule of partial derivatives

Partial Derivatives …
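
As a sketch of what the chain of partial derivatives can look like in the notation introduced earlier, assume a quadratic cost \( C := \frac{1}{2} \sum_{k=1}^{\#(n)} (a_k^n - y_k)^2 \) and a differentiable activation function \(\sigma\). Both are assumptions, since the article does not fix either. The weighted input \( z_k^i \) is a helper symbol introduced only for this sketch:

\( z_k^i := \sum_{j=1}^{\#(i-1)} w_{j,k}^{i-1,i} \, a_j^{i-1} + b_k^i , \qquad a_k^i = \sigma(z_k^i) \)

Under these assumptions, the chain rule gives the partial derivative for each weight and bias of layer \(i\):

\( \frac{\partial C}{\partial w_{j,k}^{i-1,i}} = \frac{\partial C}{\partial a_k^i} \cdot \frac{\partial a_k^i}{\partial z_k^i} \cdot \frac{\partial z_k^i}{\partial w_{j,k}^{i-1,i}} = \frac{\partial C}{\partial a_k^i} \cdot \sigma'(z_k^i) \cdot a_j^{i-1} \)

\( \frac{\partial C}{\partial b_k^i} = \frac{\partial C}{\partial a_k^i} \cdot \frac{\partial a_k^i}{\partial z_k^i} \cdot \frac{\partial z_k^i}{\partial b_k^i} = \frac{\partial C}{\partial a_k^i} \cdot \sigma'(z_k^i) \)

The remaining factor \( \frac{\partial C}{\partial a_k^i} \) is known directly in the output layer and is passed backwards through the hidden layers, which is where the name backpropagation comes from:

\( \frac{\partial C}{\partial a_k^n} = a_k^n - y_k \)

\( \frac{\partial C}{\partial a_j^{i-1}} = \sum_{k=1}^{\#(i)} \frac{\partial C}{\partial a_k^i} \cdot \sigma'(z_k^i) \cdot w_{j,k}^{i-1,i} \)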



Marvin Fuchs Enthusiastic about Machine Learning - the intersection of Math, CS and Programming