# A shallow neural network for simple nonlinear classification

Classification problems are a broad class of machine learning applications devoted to assigning input data to a predefined category based on its features. If the boundary between the categories has a linear relationship to the input data, a simple logistic regression algorithm may do a good job. For more complex groupings, such as in classifying the points in the diagram below, a neural network can often give good results.

In a shallow neural network, the values of the feature vector of the data to be classified (the input layer) are passed to a layer of nodes (also known as neurons or units) (the hidden layer) each of which generates a response according to some activation function, $g$, acting on the weighted sum of those values, $z$. The responses of each unit in the hidden layer is then passed to a final, output layer (which may consist of a single unit), whose activation produces the classification prediction output.

Identifying the layers with superscript labels $[0]$ (input layer), $[1]$ (hidden layer) and $[2]$ (output layer) so that $a^{[0]}_j \equiv x_j$ are the input feature values, and $a^{[2]} \equiv \hat{y}$ is the output classification we can write: $$z^{[\ell]}_j = \sum_{k=1}^{n^{[\ell]}} w_{jk}^{[\ell]} a_k^{[\ell-1]} + b_j^{[\ell]}; \quad a^{[\ell]} = g^{[\ell]}(z^{[\ell]}_j).$$ Here, $n^{[\ell]}$ is the number of units in the $\ell$th layer; we will consider a binary classification case for which $n^{[2]} = 1$.

To train the neural network, it is presented with $m$ training examples, $\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(m)}$ corresponding to known classifications $y^{(1)}, y^{(2)}, \ldots, y^{(m)}$ and the difference between $\hat{y}$ and $y$ (expressed in terms of some cost function) is minimized with respect to the parameters $w_{jk}^{[\ell]}$ and $b_j^{[\ell]}$ ($\ell=1,2$) as with regular logistic regression.

The $m$ column input feature vectors can be packed into a $m \times n^{[0]}$ matrix, $\boldsymbol{X}\equiv \boldsymbol{A}^{[0]}$, their classifications into a $1 \times m$ row vector, $\boldsymbol{Y}$, the unit weights into $n^{[\ell]} \times n^{[\ell-1]}$ matrices, $\boldsymbol{W}^{[\ell]}$ (for $\ell=1,2$) and the bias constants expressed as $n^{[\ell]}\times 1$ column vectors, $\boldsymbol{b}^{[\ell]}$. This allows the above equations to be implemented in matrix algebra as: \begin{align*} \boldsymbol{Z}^{[1]} &= \boldsymbol{W}^{[1]}\boldsymbol{A}^{[0]} + \boldsymbol{b}^{[1]}, \quad \boldsymbol{A}^{[1]} = g^{[1]}(\boldsymbol{Z}^{[1]})\\ \boldsymbol{Z}^{[2]} &= \boldsymbol{W}^{[2]}\boldsymbol{A}^{[1]} + \boldsymbol{b}^{[2]}, \quad \boldsymbol{\hat{y}} = \boldsymbol{A}^{[2]} = g^{[2]}(\boldsymbol{Z}^{[2]}) \end{align*} There are many nonlinear response functions to choose for $g^{[\ell]}$, including the logistic function, $\sigma(z) = (1+\mathrm{e}^{-z})^{-1}$, the closely-related hyperbolic tangent function, $\mathrm{tanh}$, and the rectified linear unit (ReLU) function, $x^+ = \mathrm{max}(0,x)$.

The code below implements a shallow neural network to classify the points in the file pts.txt. Progress in training the algorithm is animated by plotting the decision boundary for the set of parameters $\boldsymbol{W}^{[\ell]}$ and $\boldsymbol{b}^{[\ell]}$ at each 100th iteration of the cost function minimization.

import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.linear_model

# Training rate.
alpha = 2

# Load the points and ensure shapes are X: (2, m) and Y: (1, m).
X, Y = pts[:,:2].T, pts[:,2][None, :]

# Number of feature vector components (should be 2!).
n0 = X.shape[0]
# Number of training examples.
m = X.shape[1]

# Number of hidden layer units.
n1 = 16
# Number of output units.
n2 = 1

# Activation function for the hidden layer: hyperbolic tan.
g1 = np.tanh
# Activation function for the output layer: logistic function.
def g2(Z):
return 1 / (1 + np.exp(-Z))

def predictions(X, W1, b1, W2, b2):
"""
Return the predictions, A2 (and various intermediate quantities for
caching) for the parameters W1, b1, W2, b2 on the input matrix, X.
"""

Z1 = W1 @ X + b1
A1 = g1(Z1)
Z2 = W2 @ A1 + b2
A2 = g2(Z2)
return Z1, A1, Z2, A2

def cost(Yhat, Y):
"""Evaluate and return the cost function J(Yhat, Y)."""
return -np.sum(Y * np.log(Yhat) + (1 - Y) * np.log(1 - Yhat)) / m

def backprop(A1, Z1, A2, Z2, W1, b1, W2, b2):
"""Backpropagation to update the fit parameters."""

# The derivatives of the cost function with respect to the various
# quantities, dJ/dQ are assigned to variables named dQ.

# For g2 = sigmoid logistic function.
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
# For g1 = tanh activation function.
dZ1 = W2.T @ dZ2 * (1 - A1**2)
dW1 = dZ1 @ X.T / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

W1 -= alpha * dW1
b1 -= alpha * db1
W2 -= alpha * dW2
b2 -= alpha * db2
return W1, b1, W2, b2

def initialize_params():
"""Randomly initialize the W coefficients, initialize the biases to zero."""
W1 = np.random.randn(n1, n0) * 0.01
b1 = np.zeros((n1, 1))
W2 = np.random.randn(n2, n1) * 0.01
b2 = np.zeros((n2, 1))
return W1, b1, W2, b2

def plot_decision_boundary(X, y, params):
"""Plot the decision boundary for prediction trained on X, y."""

# Set min and max values and give it some padding.
x_min, x_max = X[0, :].min() - 0.1, X[0, :].max() + 0.1
y_min, y_max = X[1, :].min() - 0.1, X[1, :].max() + 0.1
h = 0.01
# Generate a meshgrid of points with orthogonal spacing.
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Prediction of the classified values across the whole grid.
Z = np.round(predictions(np.c_[xx.ravel(), yy.ravel()].T, *params)[-1])
Z = Z.reshape(xx.shape)
# Plot the decision boundary as a contour plot and training examples.
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
plt.scatter(X[0, :], X[1, :], c=y, cmap=plt.cm.Spectral, s=8)
plt.ylabel('x2')
plt.xlabel('x1')

def train(X, Y, max_it=1000):
"""Train the neural network on the features X classified as Y."""

W1, b1, W2, b2 = initialize_params()

for it in range(max_it):
Z1, A1, Z2, A2 = predictions(X, W1, b1, W2, b2)
J = cost(A2, Y)
W1, b1, W2, b2 = backprop(A1, Z1, A2, Z2, W1, b1, W2, b2)
if not it % 100:
# Provide an update on the progress we have made so far.
print('{}: J = {}'.format(it, J))
# Clear the current figure and redraw with the latest prediction.
plt.clf()
plot_decision_boundary(X, Y, (W1, b1, W2, b2))
plt.xlim(0,1)
plt.ylim(0,1)
plt.savefig('frames/{:05d}.png'.format(it))

return W1, b1, W2, b2

W1, b1, W2, b2 = train(X, Y, 7000)

plot_decision_boundary(X, Y, (W1, b1, W2, b2))
plt.xlim(0,1)
plt.ylim(0,1)
plt.show()


The frames of the animation are saved as PNG image files to the directory frames/ and can be turned into an animated GIF with, e.g. ImageMagick's convert utility.

Current rating: 3.8

#### J J Suriya Prakash 1 year, 5 months ago

Hello there what is the pt.txt that you have mentioned in the first line. Can you share the txt document

Current rating: 5

#### christian 1 year, 5 months ago

Hello. There is a link to pts.txt in the text, just above the animation.