What is 'word2vec' and how to build it from scratch?
Part 2 - Skip-Gram Implementation

keywords: word2vec, co-occurrence matrix, continuous-bag-of-words (CBOW), skip-gram, negative-sampling
author: Shubham Shrivastava

Objective Function Gradients: naive softmax case

Here we will work with the Skip-Gram model's objective function and compute its gradients with respect to the center and outside word vectors, so that we can walk in the negative direction of the gradients during optimization. As defined earlier, the probability of an outside word given a center word, i.e. P(O=o | C=c), is given as:
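Written out, with vc denoting the center word vector, uo the outside word vector, and the sum in the denominator running over every word w in the vocabulary, this probability is the softmax of the dot products:

```latex
P(O = o \mid C = c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in \mathrm{Vocab}} \exp(u_w^{\top} v_c)}
```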

We can then define a naive-softmax-based objective function to minimize during optimization, which simplifies to a negative log-likelihood function.
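In other words, for a single (center, outside) word pair the loss is the negative log-likelihood of the true outside word:

```latex
J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c)
 = -u_o^{\top} v_c + \log \sum_{w \in \mathrm{Vocab}} \exp(u_w^{\top} v_c)
```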

U in the above equation is a matrix whose k-th column (uk) is the word vector of the outside word indexed by k. Note: this matrix was referred to as W earlier.

Gradients with respect to the center word vector, vc
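Writing y for the one-hot vector of the true outside word o and y-hat for the vector of predicted probabilities (its w-th entry being P(O=w | C=c)), the gradient works out to:

```latex
\frac{\partial J}{\partial v_c} = \sum_{w \in \mathrm{Vocab}} \hat{y}_w\, u_w - u_o = U(\hat{y} - y)
```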

Gradients with respect to each of the outside vectors, uw
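For an individual outside vector, the gradient depends on whether w is the true outside word:

```latex
\frac{\partial J}{\partial u_w} = (\hat{y}_w - y_w)\, v_c =
\begin{cases}
(\hat{y}_o - 1)\, v_c & \text{if } w = o \\
\hat{y}_w\, v_c & \text{if } w \neq o
\end{cases}
```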

Gradients with respect to all outside word vectors, U
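Stacking the previous result over all columns of U gives a single outer product:

```latex
\frac{\partial J}{\partial U} = v_c\, (\hat{y} - y)^{\top}
```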

Implementing naive-softmax based loss function and gradient computation

Now that we have derived the gradients of the loss function with respect to its parameters, we can implement them quite easily in Python, as shown in the sketch below.
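A minimal NumPy sketch of such a function might look like the following. The function and variable names, and the row-major layout (each row of outside_vectors is one u_w, i.e. the transpose of U from the equations), are my own choices here, not necessarily those of the repository:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def naive_softmax_loss_and_gradient(center_vec, outside_word_idx, outside_vectors):
    """Naive-softmax loss and gradients for a single (center, outside) pair.

    center_vec       : v_c, shape (d,)
    outside_word_idx : index o of the true outside word
    outside_vectors  : rows are the outside vectors u_w, shape (|V|, d)
    """
    # y_hat_w = P(O = w | C = c) for every word w in the vocabulary
    scores = outside_vectors @ center_vec            # shape (|V|,)
    y_hat = softmax(scores)

    loss = -np.log(y_hat[outside_word_idx])

    # delta = y_hat - y, where y is the one-hot vector of the true outside word
    delta = y_hat.copy()
    delta[outside_word_idx] -= 1.0

    # dJ/dv_c = sum_w delta_w * u_w
    grad_center_vec = outside_vectors.T @ delta      # shape (d,)
    # gradient w.r.t. the outside-vector matrix: row w is delta_w * v_c
    grad_outside_vectors = np.outer(delta, center_vec)  # shape (|V|, d)

    return loss, grad_center_vec, grad_outside_vectors
```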

Objective Function Gradients: negative-sampling case (simple assumption)

Now, let's consider the negative-sampling loss, which is an alternative to the naive-softmax loss. For this we will assume that K negative samples are drawn from the vocabulary, which we will refer to as w1, w2, ..., wK, with corresponding outside vectors uw1, uw2, ..., uwK. To keep the derivation simple, we will initially also assume that these K samples are distinct, i.e. no word is sampled more than once.
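With sigma denoting the sigmoid function, the negative-sampling loss for a single (center, outside) pair is:

```latex
J_{\text{neg-sample}}(v_c, o, U) = -\log \sigma(u_o^{\top} v_c) - \sum_{s=1}^{K} \log \sigma(-u_{w_s}^{\top} v_c)
```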

Gradients with respect to the center word vector, vc
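Using the identity 1 - sigma(-x) = sigma(x), the gradient with respect to the center word vector is:

```latex
\frac{\partial J}{\partial v_c} = \left(\sigma(u_o^{\top} v_c) - 1\right) u_o + \sum_{s=1}^{K} \sigma(u_{w_s}^{\top} v_c)\, u_{w_s}
```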

Gradients with respect to the outside word vector, uo
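Only the positive term involves uo, so its gradient is:

```latex
\frac{\partial J}{\partial u_o} = \left(\sigma(u_o^{\top} v_c) - 1\right) v_c
```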

Gradients with respect to s-th negative sample vector, uws
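Under the assumption that the K samples are distinct (and different from o), each negative sample contributes exactly one term:

```latex
\frac{\partial J}{\partial u_{w_s}} = \left(1 - \sigma(-u_{w_s}^{\top} v_c)\right) v_c = \sigma(u_{w_s}^{\top} v_c)\, v_c
```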

Objective Function Gradients: negative-sampling case

Dropping this simplifying assumption only affects our partial derivatives with respect to the negative samples. Let's recompute those without the assumption.

Gradients with respect to s-th negative sample vector, uws
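When the same word may be drawn more than once, the terms for all of its occurrences among the K samples accumulate:

```latex
\frac{\partial J}{\partial u_{w_s}} = \sum_{\substack{j=1 \\ w_j = w_s}}^{K} \sigma(u_{w_j}^{\top} v_c)\, v_c
```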

Implementing negative-sampling based loss function and gradient computation

Let's see what an implementation of this loss function and its gradient computation looks like; a sketch is given below.
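A matching NumPy sketch for the negative-sampling case follows. Again the names are my own; negative_sample_idxs is assumed to be the list of K sampled vocabulary indices, possibly containing repeats, which is why the gradients are accumulated per occurrence:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss_and_gradient(center_vec, outside_word_idx, outside_vectors,
                                   negative_sample_idxs):
    """Negative-sampling loss and gradients for a single (center, outside) pair.

    negative_sample_idxs : list of K sampled word indices w_1..w_K (may repeat)
    """
    loss = 0.0
    grad_center_vec = np.zeros_like(center_vec)
    grad_outside_vectors = np.zeros_like(outside_vectors)

    # Positive (true outside word) term: -log(sigmoid(u_o^T v_c))
    u_o = outside_vectors[outside_word_idx]
    s_o = sigmoid(u_o @ center_vec)
    loss -= np.log(s_o)
    grad_center_vec += (s_o - 1.0) * u_o
    grad_outside_vectors[outside_word_idx] += (s_o - 1.0) * center_vec

    # Negative samples: -log(sigmoid(-u_ws^T v_c)), one term per occurrence
    for k in negative_sample_idxs:
        u_k = outside_vectors[k]
        s_k = sigmoid(-u_k @ center_vec)
        loss -= np.log(s_k)
        grad_center_vec += (1.0 - s_k) * u_k
        grad_outside_vectors[k] += (1.0 - s_k) * center_vec

    return loss, grad_center_vec, grad_outside_vectors
```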

Skip-Gram model implementation

Now that we have implemented both the naive-softmax and negative-sampling loss and gradient computation functions, we can proceed to implement the skip-gram model, which takes as inputs the center word, window size, outside words, and the corresponding word vectors, and returns the loss along with its gradients with respect to the center, outside, and negatively sampled word vectors. A sketch of this implementation is given below.
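The sketch below reuses the loss functions defined above. The parameter names are my own, and I assume the dataset loader has already extracted the outside words from the context window, so the window size does not appear explicitly inside the function; for the negative-sampling variant, the sampled indices would be bound into loss_and_gradient (e.g. via functools.partial) before calling skipgram:

```python
import numpy as np

def skipgram(current_center_word, outside_words, word2idx,
             center_word_vectors, outside_vectors,
             loss_and_gradient=naive_softmax_loss_and_gradient):
    """Skip-gram: sum the loss and gradients over every outside word in the window.

    current_center_word : the center word (string)
    outside_words       : list of words inside the context window
    word2idx            : dict mapping words to vocabulary indices
    center_word_vectors : rows are the center vectors v_w, shape (|V|, d)
    outside_vectors     : rows are the outside vectors u_w, shape (|V|, d)
    loss_and_gradient   : naive-softmax or negative-sampling loss function
    """
    loss = 0.0
    grad_center_vecs = np.zeros_like(center_word_vectors)
    grad_outside_vectors = np.zeros_like(outside_vectors)

    c = word2idx[current_center_word]
    v_c = center_word_vectors[c]

    for outside_word in outside_words:
        o = word2idx[outside_word]
        l, grad_vc, grad_u = loss_and_gradient(v_c, o, outside_vectors)
        loss += l
        # only the row of the current center word receives a gradient
        grad_center_vecs[c] += grad_vc
        grad_outside_vectors += grad_u

    return loss, grad_center_vecs, grad_outside_vectors
```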

That's it! So far we saw how to compute the various loss functions and their gradients with respect to the model parameters, but we still need to implement several other pieces such as an optimizer, a dataset loader, a training pipeline, and so on. Please refer to the GitHub repo to see the complete implementation of the word2vec model with the Stanford Sentiment Analysis dataset, and feel free to give it a spin. Happy Learning!
