Neural Network#

Note

A neural network is a circuit of artificial neurons, or nodes, connected together.
We will mainly discuss backpropagation here.
Neural networks are covered in more detail in the course “deep learning”.

Neuron#

a neuron takes an input \(x \in \mathbb{R}^{d}\), multiplies \(x\) by weights \(w\), adds a bias term \(b\), and finally applies an activation function \(g\).

that is:

\[f(x) = g(w^{T}x + b)\]

it is analogous to the functionality of a biological neuron.


some useful activation functions:

\[\begin{split} \begin{equation} \begin{split} \text{sigmoid:}\quad &g(z) = \frac{1}{1 + e^{-z}} \\ \text{tanh:}\quad &g(z) = \frac{e^{z}-e^{-z}}{e^{z} + e^{-z}} \\ \text{relu:}\quad &g(z) = \max(z,0) \\ \text{leaky relu:}\quad &g(z) = \max(z, \epsilon z)\ ,\ \epsilon\text{ is a small positive number}\\ \text{identity:}\quad &g(z) = z \end{split} \end{equation} \end{split}\]
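
as a quick reference, here is a minimal NumPy sketch of these activation functions (the function names and the default \(\epsilon = 0.01\) are our own illustrative choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, eps=0.01):
    # eps is a small positive number
    return np.maximum(z, eps * z)

def identity(z):
    return z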

linear regression’s forward pass is a neuron with the identity activation function.

logistic regression’s forward pass is a neuron with the sigmoid activation function.
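
for concreteness, a single neuron’s forward computation \(f(x) = g(w^{T}x + b)\) can be sketched as follows (the dimension, random values, and bias are illustrative; with sigmoid as \(g\), this is exactly logistic regression’s forward pass):

import numpy as np

rng = np.random.default_rng(0)

d = 3                            # input dimension (illustrative)
x = rng.normal(size=d)           # input x in R^d
w = rng.normal(size=d)           # weights w
b = 0.5                          # bias term b

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f_x = sigmoid(w @ x + b)         # f(x) = g(w^T x + b)
print(f_x)                       # a number in (0, 1)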

Structure#

building a neural network is analogous to building with Lego bricks: you take individual bricks and stack them together to build complex structures.


we use bracketed superscripts to denote layers; take a network with a single hidden layer as an example:

\([0]\) denotes the input layer, \([1]\) the hidden layer, and \([2]\) the output layer,

\(a^{[l]}\) denotes the output of layer \(l\), with \(a^{[0]} := x\),

\(z^{[l]}\) denotes the affine (pre-activation) output of layer \(l\).

we have:

\[z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}\]
\[a^{[l]} = g^{[l]}(z^{[l]})\]

where \(W^{[l]} \in \mathbb{R}^{d^{[l]} \times d^{[l-1]}}\), \(b^{[l]} \in \mathbb{R}^{d^{[l]}}\), and \(d^{[l]}\) is the number of units in layer \(l\).
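
a minimal NumPy sketch of this forward pass, for an illustrative network with one hidden layer (the layer sizes, random weights, and relu/identity activations are our own assumptions):

import numpy as np

rng = np.random.default_rng(0)

sizes = [4, 3, 1]                # d^[0]=4 inputs, d^[1]=3 hidden units, d^[2]=1 output

# W^[l] has shape (d^[l], d^[l-1]); b^[l] has shape (d^[l],)
W = [rng.normal(size=(sizes[l], sizes[l - 1])) for l in range(1, len(sizes))]
b = [np.zeros(sizes[l]) for l in range(1, len(sizes))]

relu = lambda z: np.maximum(z, 0.0)
identity = lambda z: z
g = [relu, identity]             # activations for layers 1 and 2

a = rng.normal(size=sizes[0])    # a^[0] := x
for l in range(len(W)):
    z = W[l] @ a + b[l]          # z^[l] = W^[l] a^[l-1] + b^[l]
    a = g[l](z)                  # a^[l] = g^[l](z^[l])

print(a)                         # output of the network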

Prerequisites for Back-Propagation#

suppose in forward propagation \(x \to y \to l\), where \(x \in \mathbb{R}^{n}\), \(y \in \mathbb{R}^{m}\), and the loss \(l \in \mathbb{R}\).

then:

\[\begin{split} \frac{\partial l}{\partial y} = \begin{bmatrix} \frac{\partial l}{\partial y_{1}} \\ \vdots\\ \frac{\partial l}{\partial y_{m}} \end{bmatrix} \quad \frac{\partial l}{\partial x} = \begin{bmatrix} \frac{\partial l}{\partial x_{1}} \\ \vdots\\ \frac{\partial l}{\partial x_{n}} \end{bmatrix} \end{split}\]

by the chain rule (the total derivative):

\[ \frac{\partial l}{\partial x_{k}} = \sum_{j=1}^{m}\frac{\partial l}{\partial y_{j}}\frac{\partial y_{j}}{\partial x_{k}} \]

then we can connect \(\frac{\partial l}{\partial x}\) and \(\frac{\partial l}{\partial y}\) by:

\[\begin{split} \frac{\partial l}{\partial x} = \begin{bmatrix} \frac{\partial l}{\partial x_{1}} \\ \vdots\\ \frac{\partial l}{\partial x_{n}} \end{bmatrix} = \begin{bmatrix} \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\ \vdots & \ddots & \vdots \\ \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} \end{bmatrix} \begin{bmatrix} \frac{\partial l}{\partial y_{1}} \\ \vdots\\ \frac{\partial l}{\partial y_{m}} \end{bmatrix} = \left(\frac{\partial y}{\partial x}\right)^{T}\frac{\partial l}{\partial y} \end{split}\]

here \(\frac{\partial y}{\partial x}\) is the Jacobian matrix.

unlike elementwise activation functions, each output of softmax depends on all the neurons in the same layer, so we need the full Jacobian of softmax:

\[ \frac{\partial a_{i}}{\partial z_{j}} = \frac{\partial}{\partial z_{j}}\left(\frac{\exp(z_{i})}{\sum_{s=1}^{k}\exp(z_{s})}\right) = a_{i}(\delta_{ij} - a_{j}) \]

where \(\delta_{ij}\) is the Kronecker delta.

it is easy to check that the Jacobian of matrix multiplication is:

\[\frac{\partial Mx}{\partial x} = M\]
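
a small NumPy sketch, with our own choice of test vectors, that checks the softmax Jacobian \(\frac{\partial a_{i}}{\partial z_{j}} = a_{i}(\delta_{ij} - a_{j})\) and the relation \(\frac{\partial l}{\partial x} = (\frac{\partial y}{\partial x})^{T}\frac{\partial l}{\partial y}\) against finite differences:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
k = 4
z = rng.normal(size=k)
a = softmax(z)

# analytic Jacobian: da_i/dz_j = a_i (delta_ij - a_j)
J = np.diag(a) - np.outer(a, a)

# finite-difference Jacobian for comparison
eps = 1e-6
J_num = np.zeros((k, k))
for j in range(k):
    dz = np.zeros(k); dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)
print(np.allclose(J, J_num, atol=1e-6))  # True

# chain rule: for a toy loss l = c^T softmax(z), dl/dz = J^T (dl/da) with dl/da = c
c = rng.normal(size=k)
dl_dz = J.T @ c
dl_dz_num = np.array([(c @ softmax(z + dz) - c @ softmax(z - dz)) / (2 * eps)
                      for dz in np.eye(k) * eps])
print(np.allclose(dl_dz, dl_dz_num, atol=1e-6))  # True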

Back-Propagation#

gradient descent update rule:

\[W^{[l]} = W^{[l]} - \alpha\frac{\partial{L}}{\partial{W^{[l]}}}\]
\[b^{[l]} = b^{[l]} - \alpha\frac{\partial{L}}{\partial{b^{[l]}}}\]

to proceed, we must compute the gradient with respect to the parameters.

we can define a three-step recipe for computing the gradients as follows:

1. for the output layer: if \(g^{[N]}\) is softmax, each output depends on every pre-activation, so we use the full Jacobian:

\[ \frac{\partial L(\hat{y}, y)}{\partial z^{[N]}} = (\frac{\partial \hat{y}}{\partial z^{[N]}})^{T}\frac{\partial L(\hat{y}, y)}{\partial \hat{y}} \]

if \(g^{[N]}\) is an elementwise activation (not softmax), the Jacobian is diagonal and this reduces to:

\[ \frac{\partial L(\hat{y}, y)}{\partial z^{[N]}} = \frac{\partial L(\hat{y}, y)}{\partial \hat{y}} \odot {g^{[N]}}'(z^{[N]}) \]

both computations are straightforward.

2. for \(l = N-1, \dots, 1\), we have:

\[z^{[l + 1]} = W^{[l + 1]}a^{[l]} + b^{[l + 1]}\]

so by our prerequisites:

\[ \frac{\partial L}{\partial a^{[l]}} = (\frac{\partial z^{[l+1]}}{\partial a^{[l]}})^{T}\frac{\partial L}{\partial z^{[l+1]}} = (W^{[l+1]})^{T}\frac{\partial L}{\partial z^{[l+1]}} \]

we also have:

\[a^{[l]} = g^{[l]}(z^{[l]})\]

we do not use softmax activation in hidden layers, so each output depends only on its own pre-activation:

\[\frac{\partial L}{\partial z^{[l]}} = \frac{\partial L}{\partial a^{[l]}} \odot {g^{[l]}}'(z^{[l]})\]

combining the two equations:

\[\frac{\partial L}{\partial z^{[l]}} = (W^{[l+1]})^{T}\frac{\partial L}{\partial z^{[l+1]}} \odot {g^{[l]}}'(z^{[l]})\]

3. final step: because

\[z^{[l]} = W^{[l]}a^{[l - 1]} + b^{[l]}\]

so:

\[\frac{\partial L}{\partial W^{[l]}} = \frac{\partial L}{\partial z^{[l]}}(a^{[l - 1]})^{T}\]
\[\frac{\partial L}{\partial b^{[l]}}=\frac{\partial L}{\partial z^{[l]}}\]
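
putting the three steps together, here is a minimal NumPy sketch of backpropagation for a tiny network with one hidden layer, sigmoid hidden activation, identity output, and squared-error loss (these choices, the sizes, and the learning rate are illustrative assumptions, not part of the derivation above):

import numpy as np

rng = np.random.default_rng(0)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

# tiny network: 3 inputs -> 4 hidden units (sigmoid) -> 2 outputs (identity)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
x, y = rng.normal(size=3), rng.normal(size=2)

# forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
yhat = z2                                 # identity output activation
L = 0.5 * np.sum((yhat - y) ** 2)         # squared-error loss

# step 1: output activation is not softmax, so dL/dz2 = dL/dyhat * g'(z2) = (yhat - y) * 1
dz2 = yhat - y

# step 2: propagate back to the hidden layer
da1 = W2.T @ dz2                          # dL/da1 = (W2)^T dL/dz2
dz1 = da1 * dsigmoid(z1)                  # dL/dz1 = dL/da1 (elementwise) g'(z1)

# step 3: gradients with respect to the parameters
dW2, db2 = np.outer(dz2, a1), dz2         # dL/dW2 = dL/dz2 (a1)^T
dW1, db1 = np.outer(dz1, x), dz1          # dL/dW1 = dL/dz1 (x)^T

# gradient descent update
alpha = 0.1
W2 -= alpha * dW2; b2 -= alpha * db2
W1 -= alpha * dW1; b1 -= alpha * db1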

Examples#

"""mlp classification"""
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=1)
# stratify=y makes sure the train and test sets have the same class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

clf = MLPClassifier(hidden_layer_sizes=(100, 50),
                    activation="relu",
                    max_iter=300)
clf.fit(X_train, y_train)
# score returns the mean accuracy on the given test data and labels
clf.predict_proba(X_test[:1]), clf.score(X_test, y_test)
(array([[0.02858299, 0.97141701]]), 0.96)
"""mlp regression"""
from sklearn.neural_network import MLPRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

regr = MLPRegressor(hidden_layer_sizes=(128, 64),
                    solver='adam', 
                    max_iter=1000)
regr.fit(X_train, y_train)

# score returns R^2 = 1 - u/v, where u = ((y_true - y_pred)**2).sum() and v = ((y_true - y_true.mean())**2).sum()
regr.predict(X_test[:2]), regr.score(X_test, y_test)
(array([15.80479452, 30.59838355]), 0.5456425414797952)