What is 'word2vec' and how to build it from scratch?
Part 1 - Intuition and Theory

keywords: word2vec, co-occurrence matrix, continuous-bag-of-words (CBOW), skip-gram, negative-sampling
author: Shubham Shrivastava

What is word2vec ?

word2vec is a family of algorithms introduced about a decade ago by Mikolov et al. [1][2] at Google, and describes a way of learning word embeddings from large datasets in an unsupervised way. These embeddings have been tremendously popular and have paved a path for machines to understand context in natural language. They are essentially lower-dimensional vector representations of each word in a given vocabulary within a certain context, and help us capture each word's semantic and syntactic meaning. Taking it one step further, these vector representations can also formulate relationships with other words, e.g. woman + king - man = queen. Isn't this amazing?

Word Vectors (or Word Embeddings) have found their way into several applications such as question answering, text generation, translation, sentiment analysis, and recommendation engines, and are considered to be one of the major building blocks for any language modeling task.

Here, we will first build an intuition of how to formulate such word embeddings, and then move on to learning these embeddings in a self-supervised way using machine learning (ML). We will build the complete word2vec pipeline without using any ML framework for a better understanding of all the underlying mechanics. Towards this goal, we will also derive the various equations used within our Stochastic Gradient Descent (SGD) optimizer implementation.

How to build such a model?

John Rupert Firth famously said, "You shall know a word by the company it keeps", which captures the idea that similar words occur together in a corpus within the same context. So, can we use this idea to develop a simple model for word embeddings?

Let's start small: co-occurrence matrices

The most naive way to build a word vector would be to simply build a co-occurrence matrix of words within a context window over a corpus. To build this matrix, we take all the words occurring in the corpus as our vocabulary and then count how many times each word in that vocabulary co-occurs with every other word within a fixed window size. What this accomplishes is: if a word (wi) co-occurs with another word (wk) within a context window of n words most of the time, the two are highly correlated and thus receive a high score in our co-occurrence matrix. More concretely, in a co-occurrence matrix M, the element Mik is the number of times word wk appears inside word wi's window across the corpus.

Let's take a simple example of two sentences to build our co-occurrence matrix with a context window of size 1:

  1. "all that glitters is not gold"

  2. "all is well that ends well"

In NLP, <START> and <END> are special tokens often used to indicate the start and end of a sentence, document, or corpus. Consequently, we also consider them while building up the matrix M. Our resulting co-occurrence matrix is shown below:

The co-occurrence matrix is symmetric in nature, and each row of the matrix gives us an embedding for the corresponding word. This, however, does not scale well: for a vocabulary of size N, each word vector is also of size N. The English vocabulary contains roughly 1 million words, which would mean we need a vector of that size to represent each word; this is not at all efficient. One could also argue that we could represent each word in the vocabulary by encoding it as a one-hot vector, but simply put, that does not provide us with a means of relating words and finding similarities or dissimilarities between them, since each word vector would be orthogonal to every other word vector.
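Before moving on, here is a minimal numpy sketch of how such a co-occurrence matrix could be assembled for the two sentences above with a window of size 1 (the function and variable names are my own, not from the original implementation):

```python
import numpy as np

def build_cooccurrence(sentences, window=1):
    """Build a symmetric co-occurrence matrix over a tokenized corpus.

    sentences: list of token lists, already wrapped with <START>/<END>.
    Returns (M, word2idx), where M[i, k] counts how often word k appears
    within `window` positions of word i.
    """
    vocab = sorted({w for s in sentences for w in s})
    word2idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    M[word2idx[w], word2idx[s[j]]] += 1
    return M, word2idx

corpus = [
    "<START> all that glitters is not gold <END>".split(),
    "<START> all is well that ends well <END>".split(),
]
M, word2idx = build_cooccurrence(corpus, window=1)
print(M[word2idx["all"], word2idx["<START>"]])  # 2: "all" follows <START> in both sentences
```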

Can we think of a way to reduce the dimensionality of word vectors? SVD (Singular Value Decomposition) to the rescue.

SVD is often used to reduce a high-dimensional, highly variable set of data points into a lower-dimensional space that exposes the substructure of the original data more clearly and orders it from most variation to least. This is visualized in the figure below, where our co-occurrence matrix is A with n rows corresponding to n words. We can obtain a full matrix decomposition with the singular values ordered in the diagonal matrix S, and our new, shorter word vectors of length k are then given by Uk.

If you want to learn more about SVD, [here] is a good reference.
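Continuing the sketch above, the rank-k reduction itself is only a few lines of numpy (again with my own naming):

```python
import numpy as np

def reduce_to_k_dim(M, k=2):
    """Reduce an n x n co-occurrence matrix to n x k word vectors via SVD.

    np.linalg.svd returns M = U @ diag(S) @ Vt with the singular values in S
    sorted in descending order; keeping the first k columns of U gives the
    lower-dimensional word vectors Uk (optionally scaled by S[:k]).
    """
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k]

word_vectors = reduce_to_k_dim(M.astype(float), k=2)  # M from the co-occurrence sketch above
print(word_vectors.shape)                              # (vocab_size, 2)
```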

It turns out that with this simple implementation we can get pretty far and obtain word vectors such that similar words form a cluster in the lower-dimensional space. A visualization for a few words in a 2-dimensional space is shown below, although in practice these vectors tend to have a dimensionality somewhere between 128 and 1024. The Reuters (business and financial news) corpus, which consists of 10,788 news documents totaling 1.3 million words, was used to build these word embeddings. For more details, please refer to [this].

You can see the implementation of these word vectors here: [Here]

These word vectors allow us to compute similarity and dissimilarity between words. To do this, one could simply compute the angle between two vectors, i.e. theta = arccos((p.q) / (||p||.||q||)). Another approach is to express similarity in terms of cos(theta) instead of computing the actual angle between the vectors; this is formally known as cosine similarity, and the equation is shown below.
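In numpy, this is only a couple of lines (a quick sketch):

```python
import numpy as np

def cosine_similarity(p, q):
    """cos(theta) between two word vectors: 1 means same direction,
    0 means orthogonal, -1 means opposite direction."""
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
```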

While the co-occurrence matrix captures semantic and syntactic similarity across words by benefiting from efficient use of statistics, it does not capture complex patterns between words well. Also, since these matrices are built from raw co-occurrence counts, the most frequently occurring word pairs get disproportionately large importance. This is where learning-based approaches come into the picture and help alleviate these shortcomings.

Learning-based approaches

The statistics-based approach described earlier requires one to compute and store global information about a large set of corpora containing billions of words. We could instead create a model that is updated iteratively through back-propagation and eventually encodes the probability of a word within a certain context. One simple idea is to design a model whose parameters are the word vectors themselves, which get updated through optimization so that they yield high likelihoods for words truly present within a given context and low likelihoods for words not present within that context. There are two probabilistic methods presented by Mikolov et al. [1][2] to this end, namely Continuous Bag-of-Words (CBOW) and Skip-Gram. CBOW aims to predict a center word from the surrounding context in terms of word vectors, whereas Skip-Gram does the opposite and predicts the probability of context words given a center word.

1. Continuous Bag-of-Words (CBOW)

In CBOW, we maintain a context window of n words surrounding a center word (so the full window spans a maximum of 2n+1 words), and the goal of the model is to predict a probability distribution over all words in the vocabulary such that the likelihood of the true center word is maximized.

Let's look at a concrete example to understand this more clearly.

Given the sentence "I love transformers when used in computer vision applications", the sliding windows of center and context words for a window of size 2 are given below (center word shown in [brackets]):

[I] love transformers
I [love] transformers when
I love [transformers] when used
love transformers [when] used in
transformers when [used] in computer
when used [in] computer vision
used in [computer] vision applications
in computer [vision] applications
computer vision [applications]

If the complete dataset were made up of only this one sentence, the vocabulary would contain a total of 9 words, and each could be represented with a one-hot vector.
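These windows can be generated with a few lines of Python; the snippet below is a sketch with my own naming, not the original code:

```python
def training_windows(tokens, window=2):
    """Yield (center_word, context_words) pairs for a window of +/- `window` words."""
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield center, context

sentence = "I love transformers when used in computer vision applications".split()
for center, context in training_windows(sentence, window=2):
    print(f"{center:>12}: {context}")
# e.g. transformers: ['I', 'love', 'when', 'used']
```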

We maintain two matrices, W and W'. Each row of matrix W represents an N-dimensional word vector for the context (outside) words and can be looked up using the corresponding one-hot vector; the second matrix, W', contains the word vectors for center words organized as columns. The dimensionality of W is VxN, where V is the vocabulary size (9 in this example) and N is the length of the word vectors. As an example, if the outside word is "transformers", represented by x3 = [0,0,1,0,0,0,0,0,0], then the dot product of x3 with W gives us the N-dimensional word embedding for "transformers". Consider the CBOW architecture shown below: if there is only one input word represented by xik, the dot product simply copies/projects that word vector over to the hidden layer hi; for multiple input words, we average their word vectors to get an N-dimensional vector that is projected as the hidden layer.

Continuous Bag-of-Words [source: https://arxiv.org/abs/1411.2738]

If we take the center word to be "transformers", then the corresponding outside words will be ["I", "love", "when", "used"]. We look up the corresponding rows from matrix W and compute our hidden layer output as given below.

The dot product of the hidden layer output with the second matrix defined for center words, W', results in a V-dimensional score vector; applying a softmax to it gives a probability distribution over all the words in the vocabulary for the center word given the context words. Through this dot product we maximize the likelihood of similar words and minimize it for dissimilar words.
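Putting the pieces together, a minimal numpy sketch of this forward pass might look as follows (W and W_prime correspond to the W and W' matrices in the text; the random toy matrices and indices are purely illustrative):

```python
import numpy as np

def cbow_forward(context_ids, W, W_prime):
    """One CBOW forward pass.

    context_ids: vocabulary indices of the context (outside) words.
    W:       V x N matrix with context word vectors as rows.
    W_prime: N x V matrix with center word vectors as columns.
    Returns (y_hat, h): a softmax distribution over the vocabulary for the
    center word, and the hidden layer (average of the context vectors).
    """
    h = W[context_ids].mean(axis=0)        # N-dimensional hidden layer
    scores = h @ W_prime                   # one score per vocabulary word (V,)
    exp = np.exp(scores - scores.max())    # softmax, shifted for numerical stability
    y_hat = exp / exp.sum()
    return y_hat, h

V, N = 9, 5                                # toy sizes: 9-word vocabulary, 5-dim vectors
rng = np.random.default_rng(0)
W, W_prime = rng.normal(size=(V, N)), rng.normal(size=(N, V))
y_hat, h = cbow_forward([0, 1, 3, 4], W, W_prime)  # e.g. indices of "I", "love", "when", "used"
print(y_hat.sum())                         # 1.0: a valid probability distribution
```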

So how do we train this model?

We described above how to compute the probability distribution for the center word over the whole vocabulary. Given a one-hot vector for the ground-truth center word, we can set up our minimization objective as a cross-entropy function, as shown below, which simplifies to a negative log-likelihood.
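Using y_hat from the CBOW sketch above and the index of the ground-truth center word (index 2 here is an arbitrary example), this objective is a single line:

```python
import numpy as np

def cross_entropy_loss(y_hat, center_idx):
    """Cross-entropy against a one-hot target reduces to the negative
    log-probability assigned to the true center word."""
    return -np.log(y_hat[center_idx])

print(cross_entropy_loss(y_hat, center_idx=2))  # y_hat from the CBOW sketch above
```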

2. Skip-Gram

The Skip-Gram model is the opposite of CBOW: it predicts the probability of context words given the center word. This modifies our objective function slightly; instead of computing the negative log-likelihood for a single center word, we break out the probabilities of the individual context words by invoking a Naive Bayes assumption of conditional independence. Our objective function can thus be given as shown below.

Skip-Gram [source: https://arxiv.org/abs/1411.2738]
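Reusing the toy matrices from the CBOW sketch (here W supplies the input word vector and W_prime the output vectors), a minimal sketch of this objective for a single window could look like:

```python
import numpy as np

def skipgram_loss(center_id, context_ids, W, W_prime):
    """Naive-softmax Skip-Gram loss for one window.

    Under the conditional-independence assumption the loss is the sum of
    -log P(context word | center word) over every word in the window.
    W:       V x N matrix with input (center) word vectors as rows.
    W_prime: N x V matrix with output (context) word vectors as columns.
    """
    v_c = W[center_id]                      # center word vector (N,)
    scores = v_c @ W_prime                  # scores against every output vector (V,)
    exp = np.exp(scores - scores.max())
    y_hat = exp / exp.sum()                 # softmax over the vocabulary
    return -np.sum(np.log(y_hat[context_ids]))

loss = skipgram_loss(2, [0, 1, 3, 4], W, W_prime)  # reusing the toy matrices above
print(loss)
```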

Negative Sampling

If we closely examine our objective function, we realize that to evaluate the negative log-likelihood we first need to perform a softmax operation over the model output, which requires a summation over all words in the vocabulary. This is a huge computational overhead that scales linearly as O(|V|), where |V| can be in the millions. But what if we could approximate this with Monte-Carlo methods?

One simple idea is to sample k negative examples in every iteration instead of summing over the entire vocabulary when minimizing our objective function. This is still based on the Skip-Gram model; however, our objective function has to change to accommodate negative sampling. Instead of evaluating the probability of context words given the center word, we simply take a pair of context-center words and classify whether or not the pair came from our training set. All sample pairs taken from the training set should now receive high probability, whereas all pairs sourced by negative sampling should receive very low probability.

We denote positive samples as P( D = 1 | w, c ), which signifies the probability that the center word (w) and context word (c) pair came from the corpus. Similarly, P( D = 0 | w, c ) denotes that the pair (w, c) did not come from the corpus. With θ being the model parameters, we can then define our objective function to jointly maximize these two likelihoods, i.e.
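For a single (center, outside) word pair with k sampled negatives, this objective takes a sigmoid form; here is a minimal numpy sketch (reusing the earlier toy matrices, with my own naming):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_id, outside_id, negative_ids, W, W_prime):
    """Skip-Gram loss with negative sampling for one (center, outside) pair.

    Pushes sigmoid(u_o . v_c) towards 1 for the true pair and
    sigmoid(u_k . v_c) towards 0 for each of the k sampled negatives,
    avoiding a softmax over the whole vocabulary.
    """
    v_c = W[center_id]                  # input (center) word vector (N,)
    u_o = W_prime[:, outside_id]        # output vector of the true outside word (N,)
    u_neg = W_prime[:, negative_ids]    # output vectors of the k negatives (N, k)
    return -np.log(sigmoid(u_o @ v_c)) - np.sum(np.log(sigmoid(-(v_c @ u_neg))))

loss = negative_sampling_loss(2, 0, [5, 6, 7], W, W_prime)  # toy matrices from above
print(loss)
```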

Let's try to understand what this means. In language models, we often want to model the probability of a certain word or a number of words occurring in a sequence. These, however, do not occur independently, and can be modeled using the chain rule of probability as shown below:

We can further make a Markovian assumption within a context of n words (n-gram) and rewrite the above equation as:

This relationship can then be given for unigram, bigram, trigram, etc. as shown below:
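In standard notation, these factorizations read:

```latex
% Chain rule over a sequence of words w_1, ..., w_T:
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})

% Markov (n-gram) assumption: condition only on the previous n-1 words:
P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \dots, w_{t-1})

% Unigram, bigram, and trigram special cases:
P(w_1, \dots, w_T) \approx \prod_t P(w_t)
P(w_1, \dots, w_T) \approx \prod_t P(w_t \mid w_{t-1})
P(w_1, \dots, w_T) \approx \prod_t P(w_t \mid w_{t-2}, w_{t-1})
```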

Now, let's take an example corpus to understand the output of a unigram model better: "one cat two cat three cat four cat five cat"

In the above corpus, count(cat)=5, count(one)=1, number of words=10
A unigram model will yield the following probabilities: P(cat)=5/10=0.5, P(one)=1/10=0.1

Raising the word frequencies to the power of 3/4, here is how we compute these probabilities:

So what does raising the unigram distribution to the power of 3/4 achieve? Well, let's see what our new probabilities look like.

As we see here, the model probabilities are a smoothed version of the true probabilities, where words with low frequency get a slightly higher probability of being sampled and words with high frequency get a slightly lower probability. While many approaches could be taken to achieve this, raising a unigram model to the power of 3/4 works well in practice.
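A quick numpy sketch of this computation for the toy corpus above:

```python
import numpy as np
from collections import Counter

tokens = "one cat two cat three cat four cat five cat".split()
counts = Counter(tokens)                        # count(cat) = 5, every other word = 1

words = list(counts)
freq = np.array([counts[w] for w in words], dtype=float)

p_unigram = freq / freq.sum()                   # raw unigram probabilities
p_smoothed = freq**0.75 / (freq**0.75).sum()    # 3/4-power sampling distribution

for w, p, q in zip(words, p_unigram, p_smoothed):
    print(f"{w:>6}: unigram={p:.2f}  smoothed={q:.2f}")
# "cat" drops from 0.50 to about 0.40, while each rare word rises from 0.10 to about 0.12
```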

With the given objective function and the negative sampling technique described above, we can train both CBOW and Skip-Gram models using gradient-based optimization methods. The figure below visualizes the word vectors for a few words on a 2-dimensional plane through a rank-2 approximation computed using SVD. The next part of this post focuses on how to compute the various gradients and implement this word2vec model in Python.

References

  1. Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).

  2. Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems 26 (2013).

  3. Rong, Xin. "word2vec parameter learning explained." arXiv preprint arXiv:1411.2738 (2014).

  4. https://web.stanford.edu/class/cs224n/index.html