PRACTICAL AI Research Blog

Transformers have proven to be a highly effective neural network architecture across many domains, most notably NLP and vision. The original design is an encoder-decoder architecture: the encoder block produces keys and values for the decoder block through multi-headed self-attention, while the decoder block produces queries through a masked multi-headed self-attention block. The keys and values from the encoder and the queries from the decoder are then passed through another multi-headed attention block (effectively cross-attention), whose output is used to generate the output probability vectors.
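
To make the data flow above concrete, here is a minimal single-head sketch in NumPy. It is not the notebook's implementation: the function names, the shapes, and the random projection matrices `w_q`, `w_k`, `w_v` are assumptions chosen for illustration, and the learned projections inside the decoder's self-attention, the multiple heads, and the feed-forward layers are left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention for a single head.

    q: (T_q, d), k: (T_k, d), v: (T_k, d_v).
    mask: optional (T_q, T_k) boolean array, True where attention is allowed.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        # Disallowed positions get a large negative score, so their weight is ~0.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8                                   # assumed embedding size
enc_out = rng.normal(size=(5, d_model))       # 5 encoder positions
dec_in = rng.normal(size=(3, d_model))        # 3 decoder positions

# Hypothetical projection matrices; in a trained model these are learned.
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Masked self-attention in the decoder: position t may only see positions <= t.
causal = np.tril(np.ones((3, 3), dtype=bool))
dec_hidden, self_w = attention(dec_in, dec_in, dec_in, mask=causal)

# Cross-attention: queries from the decoder, keys and values from the encoder.
cross_out, cross_w = attention(dec_hidden @ w_q, enc_out @ w_k, enc_out @ w_v)
```

Each row of `cross_w` is a probability distribution over the five encoder positions, which is the kind of quantity one would plot when visualizing attention layers.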

The original Transformer, introduced by Vaswani et al. in the paper "Attention Is All You Need", uses an encoder-decoder architecture. Several lines of work, however, explore encoder-only or decoder-only architectures, which can perform better on specific tasks while requiring less memory and training time. An example of an encoder-only architecture is BERT, where certain tokens in the input are masked and the network is tasked with predicting those missing tokens. Similarly, GPT-n is an example of a decoder-only architecture, where, given the past n tokens, the network is trained to predict the (n+1)th token. For the simple image classification example considered here, an encoder-only architecture would work well; however, I build an encoder-decoder architecture and also add support for an encoder-only architecture, to give a complete understanding of how to build and train such a system. I train both networks in this notebook and visualize the attention layers.
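
The two training objectives mentioned above can be sketched in a few lines of plain Python. This is a simplified illustration, not the procedure used by BERT or GPT themselves: the function names, the `[MASK]` string, the 15% masking probability, and the toy sentence are all assumptions, and BERT's actual masking scheme is more involved than replacing every selected token with a mask symbol.

```python
import random

def masked_lm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Encoder-only style objective: hide some tokens and predict them."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)   # the network sees the mask symbol
            targets.append(tok)         # and must recover the original token
        else:
            inputs.append(tok)
            targets.append(None)        # no loss is computed at this position
    return inputs, targets

def next_token_examples(tokens):
    """Decoder-only style objective: predict token n+1 from the first n tokens."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sentence = "the cat sat on the mat".split()
mlm_inputs, mlm_targets = masked_lm_example(sentence)
lm_examples = next_token_examples(sentence)
```

In both cases the labels come from the text itself, so no manual annotation is needed.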

tags: [machine learning, polynomial regression, non-linear regression, back-propagation]

This blog post provides a basic understanding of neural network components and then uses these concepts to build and train a neural network from scratch, using NumPy, on a real problem.

Traditionally, linear least squares is used to estimate the coefficients of a polynomial function; however, one needs to make explicit assumptions about the model and the polynomial order. Neural networks, on the other hand, make no assumptions about the function that generated the data; they simply try to approximate the model from the available noisy dataset.

In this post, we consider a toy problem of polynomial function regression, where we define an n-th order polynomial function and then generate noisy data points from it. The goal is then to learn, with a simple one-layer neural network, the polynomial coefficients that best fit the data.
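
The setup described above can be sketched as follows. This is an illustrative version under stated assumptions rather than the post's actual code: the true coefficients, noise level, learning rate, and iteration count are made up, and the "one-layer network" is taken to be a single linear layer acting on the powers of x, trained with plain gradient descent on the mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed ground-truth polynomial: y = 0.5 - 1.0*x + 2.0*x^2
true_coeffs = np.array([0.5, -1.0, 2.0])
order = len(true_coeffs) - 1

# Generate noisy samples from the polynomial.
x = rng.uniform(-1.0, 1.0, size=200)
features = np.vander(x, order + 1, increasing=True)   # columns: 1, x, x^2
y = features @ true_coeffs + rng.normal(scale=0.05, size=x.shape)

# Single linear layer: the weights play the role of the polynomial coefficients.
w = np.zeros(order + 1)
lr = 0.5
for _ in range(2000):
    pred = features @ w                                # forward pass
    grad = 2.0 * features.T @ (pred - y) / len(y)      # gradient of the MSE w.r.t. w
    w -= lr * grad                                     # gradient-descent update

mse = np.mean((features @ w - y) ** 2)
```

After training, `w` should lie close to `true_coeffs`, with the remaining error on the order of the injected noise.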

tags: [machine learning, back-propagation, neural networks, deep learning]

MuTorch is a deep learning framework built for educational purposes, to help develop a deep understanding of how neural networks work. It is not intended to be efficient, but rather as simple as possible. The goal is to build a framework that is easy to understand, modify, and learn from. It does not use any external libraries and is built from scratch using only Python lists and operators.

The framework is built in a modular way and can be extended with new layers, activation functions, and optimizers. Examples of how to use the framework to build node-level, tensor-level, or full-fledged Sequential MLP models are provided in the Demo Notebook. Loss function and optimizer implementations are also provided to build an end-to-end understanding of the neural network training process. The framework also provides a simple way to visualize the computational graph.
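
To give a feel for what "node-level" automatic differentiation in pure Python can look like, here is a small sketch. It is not MuTorch's API: the `Node` class, its attributes, and its methods are hypothetical names invented for this example, and only addition and multiplication are supported.

```python
class Node:
    """Hypothetical scalar node that remembers how it was computed."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents          # nodes this node was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

    def backward(self):
        # Build a topological order so each node is processed after all its users.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0
        # Chain rule: push each node's gradient back to its parents.
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad

x, w, b = Node(2.0), Node(3.0), Node(1.0)
y = w * x + b
y.backward()
print(y.value, w.grad, x.grad, b.grad)  # prints 7.0 2.0 3.0 1.0
```

The `parents` links form the computational graph, which is also the structure one would walk to draw it.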

tags: [nlp, natural language processing, word2vec, skip-gram, cbow]

word2vec is a family of algorithms, introduced about a decade ago by Mikolov et al. [1][2] at Google, that describes a way of learning word embeddings from large datasets in an unsupervised way. These embeddings have been tremendously popular and have paved the way for machines to understand context in natural language. They are essentially lower-dimensional vector representations of each word in a given vocabulary, learned from the contexts in which the word appears, and they help capture its semantic and syntactic meaning.

This is a two-part blog post where we will first build an intuition for how to formulate such word embeddings, and then move on to learning these embeddings in a self-supervised way using machine learning (ML). We will build the complete word2vec pipeline without using any ML framework, for a better understanding of the underlying mechanics. Towards this goal, we will also derive the equations used in the Stochastic Gradient Descent optimizer implementation.
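
As a small illustration of the self-supervised setup, the sketch below generates skip-gram training pairs, where each word is paired with the words around it. The function name `skipgram_pairs`, the window size, and the toy sentence are assumptions for this example; the posts themselves may structure the pipeline differently.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window=1)
# pairs == [("the", "quick"), ("quick", "the"), ("quick", "brown"),
#           ("brown", "quick"), ("brown", "fox"), ("fox", "brown")]
```

In skip-gram the model is trained to predict the context word from the center word; CBOW reverses the direction and predicts the center word from its surrounding context.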