Machine Learning for Beginners: Understanding the basics through a simple neural network built and trained using NumPy
keywords: machine learning, neural networks, back-propagation, polynomial regression
author: Shubham Shrivastava
Machine Learning can be defined simply as a process by which machines learn from experience (data), given a specific task and objective. These tasks can be as complex as understanding the 3D geometry of the scene around you from a single camera image, or as simple as predicting which quadrant of the 2D Euclidean xy-plane a data point falls in.
The taxonomy of Machine Learning includes supervised, unsupervised/self-supervised, semi-supervised, and reinforcement learning, among others. Here, we will focus on supervised learning and implement a simple non-linear regression pipeline in NumPy from scratch. Non-linear regression is very similar to linear regression, except that it uses a non-linear activation function such as ReLU, Tanh, GELU, or Sigmoid, which allows us to model non-linear functions, f(x).
The simplest form of a neural network contains an input layer, an output layer, and one or more hidden layers. Each hidden layer is composed of many basic computational blocks called neurons, each of which weighs its inputs and adds a bias, then applies a non-linear activation function. The weights and biases of each neuron are learned during model training. Combining many such blocks allows the neural network to learn complex non-linear models. Two such networks are shown below.
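As a minimal sketch of the neuron described above, the snippet below computes a weighted sum of the inputs, adds a bias, and applies Tanh. The function names `neuron` and `dense_layer` and all numeric values are illustrative assumptions, not taken from the original code.

```python
import numpy as np

def neuron(x, w, b):
    """One neuron: weigh the inputs, add a bias, apply a non-linear activation (Tanh here)."""
    return np.tanh(np.dot(w, x) + b)

def dense_layer(x, W, b):
    """A layer of neurons: each row of W holds the weights of one neuron."""
    return np.tanh(W @ x + b)

x = np.array([0.5, -1.0, 2.0])   # hypothetical input with three features
w = np.array([0.1, 0.2, -0.3])   # hypothetical weights of a single neuron
b = 0.05                         # hypothetical bias
single_output = neuron(x, w, b)

W = np.stack([w, -w])            # two neurons with illustrative weights
layer_output = dense_layer(x, W, np.array([b, b]))
```

Stacking the per-neuron weights into a matrix is what lets a whole layer be evaluated with a single matrix product.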
So now that we have defined a neural network, how do we learn its parameters (weights and biases) so that the model approximates an arbitrary function f(x)?
Gradient Descent through Backpropagation
A neural network, f(x), parameterized by weights Wi and biases bi, can be trained to approximate an arbitrary function, so that the predictions (ŷ) generated by a feed-forward pass over the input data samples (X) get very close to the true labels (y). Training means learning the parameters of all the neurons in the network. To learn these parameters, we start from randomly initialized weights and biases and make predictions for the input data samples. These predictions are then compared against the expected true outputs for the corresponding samples to compute an error. The error (loss) function depends on the objective we are trying to optimize: common choices for regression problems are the L1, SmoothL1, and L2 losses, whereas Cross-Entropy and negative log-likelihood are common choices for classification problems.
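To make the regression losses named above concrete, here is a small sketch of L1, L2 (mean squared error), and SmoothL1 in NumPy. The SmoothL1 form with a `beta` threshold is one common definition and is an assumption here, as are the sample values.

```python
import numpy as np

def l1_loss(y_pred, y_true):
    """Mean absolute error."""
    return np.mean(np.abs(y_pred - y_true))

def l2_loss(y_pred, y_true):
    """Mean squared error."""
    return np.mean((y_pred - y_true) ** 2)

def smooth_l1_loss(y_pred, y_true, beta=1.0):
    """Quadratic for small errors (< beta), linear for large ones."""
    diff = np.abs(y_pred - y_true)
    return np.mean(np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta))

y_true = np.array([1.0, 2.0, 3.0])   # hypothetical labels
y_pred = np.array([1.5, 2.0, 5.0])   # hypothetical predictions

l1 = l1_loss(y_pred, y_true)
l2 = l2_loss(y_pred, y_true)
sl1 = smooth_l1_loss(y_pred, y_true)
```

Note how the large error on the last sample dominates the L2 loss much more than it does the L1 or SmoothL1 loss.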
The gradients of this error function with respect to the neural network parameters indicate the direction in which each parameter needs to be nudged to bring the predictions closer to the ground truth. Specifically, moving each parameter in the direction of the negative gradient reduces the loss, which moves the predictions closer to the expected output. One way to think about these gradients is that they tell you how much a small change in each parameter changes the loss. For example, if the gradient of the loss function with respect to a weight Wi (∂L/∂Wi) comes out to be -1.5, then increasing Wi by a small amount δ while keeping all other parameters constant changes the loss by approximately -1.5δ, that is, it reduces the loss. Although these gradients give us a direction and magnitude for each parameter update, applying them at full magnitude tends to make the optimization unstable. To stabilize it, we usually take a small step in the loss-minimizing direction, with the step size dictated by a parameter called the learning rate. So, during the training process, we (1) take the feed-forward output of the neural network for the input data samples, (2) compute the error with respect to the true expected output, e = L(ŷ, y), where ŷ = f(x) and y is the ground truth, (3) compute gradients of the loss function with respect to the neural network parameters, (4) update the parameters in the direction of the negative gradients, and (5) repeat. An example of a loss function landscape is shown in the figure below, where the objective of the model is to move towards the global minimum.
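The five training steps can be sketched on the smallest possible model, a single weight w in ŷ = w·x with an MSE loss. The data, the initial value, and the learning rate below are assumptions chosen purely for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x                      # ground truth generated with w = 3 (hypothetical)

w = 0.0                          # initial parameter value (assumed)
learning_rate = 0.01             # assumed step size
losses = []

for step in range(200):
    y_hat = w * x                              # (1) feed-forward
    loss = np.mean((y_hat - y) ** 2)           # (2) error e = L(y_hat, y)
    grad_w = np.mean(2.0 * (y_hat - y) * x)    # (3) gradient dL/dw
    w -= learning_rate * grad_w                # (4) step against the gradient
    losses.append(loss)                        # (5) repeat
```

With this learning rate the weight moves steadily towards 3; a much larger rate would overshoot and diverge, which is the instability the learning rate is meant to control.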
One thing to consider during optimization is that computing gradients over all available samples before every parameter update is computationally expensive. This can instead be replaced by a stochastic approximation that computes the gradients over a mini-batch of data samples and then updates the model parameters. This is known as Stochastic Gradient Descent (SGD), and it provides a good trade-off between speed and convergence rate. While SGD works well in practice, many algorithms have been proposed to achieve faster and better convergence, such as Adam, AdamW, and RMSprop.
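A minimal sketch of mini-batch SGD on a linear model follows; each update uses the gradient from one shuffled mini-batch rather than the whole dataset. The synthetic data, batch size, learning rate, and epoch count are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: y = 2x + 1 plus a little noise.
x = rng.uniform(-1.0, 1.0, size=1000)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape)

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 32

for epoch in range(50):
    order = rng.permutation(len(x))              # reshuffle the samples every epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        err = (w * xb + b) - yb                  # prediction error on this mini-batch
        w -= learning_rate * np.mean(2.0 * err * xb)
        b -= learning_rate * np.mean(2.0 * err)
```

Each update touches only 32 samples, so an epoch performs many cheap, noisy steps instead of a single expensive exact one.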
With the basics out of the way, let's now focus on using these concepts for building and training a neural network from scratch for a real problem.
Let's now consider a toy problem of Polynomial Function Regression, where we define an n-th order polynomial function and then generate noisy data points from it. The goal of the neural network is then to approximate the underlying polynomial function from the data using a simple network with one hidden layer, as demonstrated in the figure below.
Traditionally, linear least squares is used to estimate polynomial coefficients; however, it requires explicit assumptions about the model and the polynomial order. A neural network, on the other hand, makes no such assumption about the function that generated the data; it simply tries to approximate the model from the available noisy dataset.
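For comparison, the least-squares baseline can be sketched with `np.polyfit`. The polynomial, the noise level, and the sample count are hypothetical; the point is that the order (`deg`) has to be supplied explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2nd-order polynomial, coefficients listed from the highest power down.
true_coeffs = np.array([0.5, -1.0, 2.0])
x = np.linspace(-3.0, 3.0, 200)
y = np.polyval(true_coeffs, x) + rng.normal(0.0, 0.1, size=x.shape)

# The polynomial order must be chosen up front: this is the explicit model assumption.
fitted_coeffs = np.polyfit(x, y, deg=2)
```

If `deg` is set lower than the true order, the fit cannot represent the data however many samples are available.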
To set up this toy problem, let's first generate a few data samples from a 6th-order polynomial function and add noise with a standard deviation of 1.0.
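The data generation step could look like the sketch below. The original coefficients, input range, and sample count are not given here, so the values used are assumptions, and the noise is assumed to be Gaussian.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical coefficients of a 6th-order polynomial (7 values, highest power first).
coeffs = np.array([0.01, -0.05, -0.2, 0.8, 1.0, -2.0, 0.5])

def polynomial(x, coeffs):
    """Evaluate the polynomial at x."""
    return np.polyval(coeffs, x)

num_samples = 500                                    # assumed dataset size
X = rng.uniform(-3.0, 3.0, size=(num_samples, 1))    # assumed input range
y_clean = polynomial(X, coeffs)
y = y_clean + rng.normal(0.0, 1.0, size=X.shape)     # noise with standard deviation 1.0
```

Keeping `X` and `y` as column vectors of shape `(num_samples, 1)` makes them directly usable as inputs and targets for a network with one input node and one output node.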
The generated data samples then serve as our training dataset. Let's now define a neural network with one input node (X), one output node (ŷ), and one hidden layer with n neurons. The objective of this network is to predict y given X; the neural network thus approximates the polynomial function. We need to compute gradients of the loss function (MSE/L2 in this example) with respect to the neural network parameters in order to backpropagate and update the parameters during optimization. This computation is shown below, although in practice these gradients are computed automatically by the automatic differentiation engines that ship with libraries such as PyTorch and TensorFlow.
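One way to write out this gradient computation for a one-hidden-layer network with a Tanh activation and MSE loss is given below. The matrix shapes (N samples as rows, W1 of shape 1×n, W2 of shape n×1) and the intermediate symbols z1 and a1 are notational assumptions.

```latex
\begin{aligned}
z_1 &= X W_1 + b_1, \qquad a_1 = \tanh(z_1), \qquad \hat{y} = a_1 W_2 + b_2, \\
L &= \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2, \\
\frac{\partial L}{\partial \hat{y}} &= \frac{2}{N} (\hat{y} - y), \\
\frac{\partial L}{\partial W_2} &= a_1^{\top} \frac{\partial L}{\partial \hat{y}}, \qquad
\frac{\partial L}{\partial b_2} = \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{y}_i}, \\
\frac{\partial L}{\partial z_1} &= \left( \frac{\partial L}{\partial \hat{y}} W_2^{\top} \right) \odot \left( 1 - a_1^{2} \right), \\
\frac{\partial L}{\partial W_1} &= X^{\top} \frac{\partial L}{\partial z_1}, \qquad
\frac{\partial L}{\partial b_1} = \sum_{i=1}^{N} \left( \frac{\partial L}{\partial z_1} \right)_i .
\end{aligned}
```

Here ⊙ denotes element-wise multiplication, and the factor (1 − a1²) is the derivative of Tanh evaluated at z1.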
Now that we have computed gradients of the loss function with respect to the neural network parameters (W1, b1, W2, b2), we can go ahead and implement the neural network's forward and backward functions. The former performs the feed-forward operation, whereas the latter performs backpropagation to compute gradients with respect to all parameters; these gradients are then used to update the model parameters.
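A minimal sketch of such forward and backward functions is shown below for a one-hidden-layer network with Tanh and MSE loss. The class name `PolynomialRegressionNet`, the initialization scheme, the hyperparameters, and the cubic target used in the short training run are all assumptions for illustration, not the original implementation.

```python
import numpy as np

class PolynomialRegressionNet:
    """One-hidden-layer network: y_hat = tanh(X W1 + b1) W2 + b2."""

    def __init__(self, hidden_size=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, size=(1, hidden_size))
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = rng.normal(0.0, 0.5, size=(hidden_size, 1))
        self.b2 = np.zeros((1, 1))

    def forward(self, X):
        """Feed-forward pass; caches the values needed by backward()."""
        self.X = X
        self.a1 = np.tanh(X @ self.W1 + self.b1)
        return self.a1 @ self.W2 + self.b2

    def backward(self, y_hat, y):
        """Backpropagate the MSE loss and return gradients for all parameters."""
        n = y.shape[0]
        d_yhat = 2.0 * (y_hat - y) / n                        # dL/dy_hat
        grads = {
            "W2": self.a1.T @ d_yhat,
            "b2": d_yhat.sum(axis=0, keepdims=True),
        }
        d_z1 = (d_yhat @ self.W2.T) * (1.0 - self.a1 ** 2)    # chain rule through tanh
        grads["W1"] = self.X.T @ d_z1
        grads["b1"] = d_z1.sum(axis=0, keepdims=True)
        return grads

    def step(self, grads, learning_rate):
        """Gradient descent update: move each parameter against its gradient."""
        for name, grad in grads.items():
            setattr(self, name, getattr(self, name) - learning_rate * grad)

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)

# Short illustrative run on a hypothetical cubic target (not the 6th-order dataset).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(256, 1))
y = X ** 3 - 0.5 * X

net = PolynomialRegressionNet(hidden_size=16)
initial_loss = mse(net.forward(X), y)
for _ in range(2000):
    y_hat = net.forward(X)
    net.step(net.backward(y_hat, y), learning_rate=0.05)
final_loss = mse(net.forward(X), y)
```

Caching `X` and `a1` in `forward` keeps `backward` short, at the cost of requiring that `forward` be called before every `backward` call.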
That's it! This implements a polynomial regression neural network that can be trained to approximate an arbitrary function. It uses Tanh(.) as the activation function, which allows it to model non-linearity. The chain rule is used above to compute the gradients of the (W1, b1) parameters, and this is generally how gradients are computed in a deep neural network.
Go ahead and try this yourself; feel free to clone the GitHub repo and give it a spin. Happy Learning!