What is 'word2vec' and how to build it from scratch?
Part 2 - Skip-Gram Implementation

keywords: word2vec, co-occurrence matrix, continuous-bag-of-words (CBOW), skip-gram, negative-sampling
author: Shubham Shrivastava

Objective Function Gradients: naive softmax case

Here we will work with the Skip-Gram model's objective function and compute its gradients with respect to the center and outside word vectors, so that we can walk in the negative direction of the gradients during optimization. As defined earlier, the probability of an outside word given a center word, i.e. P(O=o | C=c), is given as:
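Written out, with vc denoting the center word vector, uo the outside word vector, and the sum in the denominator running over every word w in the vocabulary, this probability is the softmax of the dot products:

```latex
P(O = o \mid C = c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in \mathrm{Vocab}} \exp(u_w^{\top} v_c)}
```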

We can then define a naive-softmax-based objective function to minimize during optimization, which simplifies to a negative log-likelihood function.
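In other words, for a single (center, outside) word pair the loss is the negative log-likelihood of the true outside word:

```latex
J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c)
 = -u_o^{\top} v_c + \log \sum_{w \in \mathrm{Vocab}} \exp(u_w^{\top} v_c)
```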

U in the above equation is a matrix whose k-th column (uk) is the word vector of the outside word indexed by k. Note: this matrix was referred to as W earlier.

Gradients with respect to the center word vector, vc
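Writing y for the one-hot vector of the true outside word o and y-hat for the vector of predicted probabilities (its w-th entry being P(O=w | C=c)), the gradient works out to:

```latex
\frac{\partial J}{\partial v_c} = \sum_{w \in \mathrm{Vocab}} \hat{y}_w\, u_w - u_o = U(\hat{y} - y)
```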

Gradients with respect to each of the outside vectors, uw
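For an individual outside vector, the gradient depends on whether w is the true outside word:

```latex
\frac{\partial J}{\partial u_w} = (\hat{y}_w - y_w)\, v_c =
\begin{cases}
(\hat{y}_o - 1)\, v_c & \text{if } w = o \\
\hat{y}_w\, v_c & \text{if } w \neq o
\end{cases}
```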

Gradients with respect to all outside word vectors, U
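Stacking the previous result over all columns of U gives a single outer product:

```latex
\frac{\partial J}{\partial U} = v_c\, (\hat{y} - y)^{\top}
```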

Implementing naive-softmax based loss function and gradient computation

Now that we have derived the gradients of the loss function with respect to its parameters, we can implement them quite easily in Python, as shown in the sketch below.
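A minimal NumPy sketch of such a function might look like the following. The function and variable names, and the row-major layout (each row of outside_vectors is one u_w, i.e. the transpose of U from the equations), are my own choices here, not necessarily those of the repository:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def naive_softmax_loss_and_gradient(center_vec, outside_word_idx, outside_vectors):
    """Naive-softmax loss and gradients for a single (center, outside) pair.

    center_vec       : v_c, shape (d,)
    outside_word_idx : index o of the true outside word
    outside_vectors  : rows are the outside vectors u_w, shape (|V|, d)
    """
    # y_hat_w = P(O = w | C = c) for every word w in the vocabulary
    scores = outside_vectors @ center_vec            # shape (|V|,)
    y_hat = softmax(scores)

    loss = -np.log(y_hat[outside_word_idx])

    # delta = y_hat - y, where y is the one-hot vector of the true outside word
    delta = y_hat.copy()
    delta[outside_word_idx] -= 1.0

    # dJ/dv_c = sum_w delta_w * u_w
    grad_center_vec = outside_vectors.T @ delta      # shape (d,)
    # gradient w.r.t. the outside-vector matrix: row w is delta_w * v_c
    grad_outside_vectors = np.outer(delta, center_vec)  # shape (|V|, d)

    return loss, grad_center_vec, grad_outside_vectors
```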

Objective Function Gradients: negative-sampling case (simple assumption)

Now, let's consider the negative-sampling loss, which is an alternative to the naive-softmax loss. For this we will assume that K negative samples are drawn from the vocabulary, which we will refer to as w1, w2, ..., wK, with corresponding outside vectors uw1, uw2, ..., uwK. To keep the derivation simple, we will initially also assume that these K samples are distinct, i.e. no word is sampled more than once.
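With sigma denoting the sigmoid function, the negative-sampling loss for a single (center, outside) pair is:

```latex
J_{\text{neg-sample}}(v_c, o, U) = -\log \sigma(u_o^{\top} v_c) - \sum_{s=1}^{K} \log \sigma(-u_{w_s}^{\top} v_c)
```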

Gradients with respect to the center word vector, vc
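Using the identity 1 - sigma(-x) = sigma(x), the gradient with respect to the center word vector is:

```latex
\frac{\partial J}{\partial v_c} = \left(\sigma(u_o^{\top} v_c) - 1\right) u_o + \sum_{s=1}^{K} \sigma(u_{w_s}^{\top} v_c)\, u_{w_s}
```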

Gradients with respect to the outside word vector, uo
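Only the positive term involves uo, so its gradient is:

```latex
\frac{\partial J}{\partial u_o} = \left(\sigma(u_o^{\top} v_c) - 1\right) v_c
```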

Gradients with respect to s-th negative sample vector, uws
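Under the assumption that the K samples are distinct (and different from o), each negative sample contributes exactly one term:

```latex
\frac{\partial J}{\partial u_{w_s}} = \left(1 - \sigma(-u_{w_s}^{\top} v_c)\right) v_c = \sigma(u_{w_s}^{\top} v_c)\, v_c
```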

Objective Function Gradients: negative-sampling case

Dropping this simplifying assumption only affects our partial derivatives with respect to the negative samples. Let's recompute those without the assumption.

Gradients with respect to s-th negative sample vector, uws
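When the same word may be drawn more than once, the terms for all of its occurrences among the K samples accumulate:

```latex
\frac{\partial J}{\partial u_{w_s}} = \sum_{\substack{j=1 \\ w_j = w_s}}^{K} \sigma(u_{w_j}^{\top} v_c)\, v_c
```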

Implementing negative-sampling based loss function and gradient computation

Let's see what an implementation of this loss function and its gradient computation looks like; a sketch is given below.
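A matching NumPy sketch for the negative-sampling case follows. Again the names are my own; negative_sample_idxs is assumed to be the list of K sampled vocabulary indices, possibly containing repeats, which is why the gradients are accumulated per occurrence:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss_and_gradient(center_vec, outside_word_idx, outside_vectors,
                                   negative_sample_idxs):
    """Negative-sampling loss and gradients for a single (center, outside) pair.

    negative_sample_idxs : list of K sampled word indices w_1..w_K (may repeat)
    """
    loss = 0.0
    grad_center_vec = np.zeros_like(center_vec)
    grad_outside_vectors = np.zeros_like(outside_vectors)

    # Positive (true outside word) term: -log(sigmoid(u_o^T v_c))
    u_o = outside_vectors[outside_word_idx]
    s_o = sigmoid(u_o @ center_vec)
    loss -= np.log(s_o)
    grad_center_vec += (s_o - 1.0) * u_o
    grad_outside_vectors[outside_word_idx] += (s_o - 1.0) * center_vec

    # Negative samples: -log(sigmoid(-u_ws^T v_c)), one term per occurrence
    for k in negative_sample_idxs:
        u_k = outside_vectors[k]
        s_k = sigmoid(-u_k @ center_vec)
        loss -= np.log(s_k)
        grad_center_vec += (1.0 - s_k) * u_k
        grad_outside_vectors[k] += (1.0 - s_k) * center_vec

    return loss, grad_center_vec, grad_outside_vectors
```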

Skip-Gram model implementation

Now that we have implemented both the naive-softmax and negative-sampling loss and gradient computation functions, we can proceed to implement the skip-gram model, which takes as inputs the center word, window size, outside words, and the corresponding word vectors, and returns the loss along with its gradients with respect to the center, outside, and negatively sampled word vectors. A sketch of this implementation is given below.
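The sketch below reuses the loss functions defined above. The parameter names are my own, and I assume the dataset loader has already extracted the outside words from the context window, so the window size does not appear explicitly inside the function; for the negative-sampling variant, the sampled indices would be bound into loss_and_gradient (e.g. via functools.partial) before calling skipgram:

```python
import numpy as np

def skipgram(current_center_word, outside_words, word2idx,
             center_word_vectors, outside_vectors,
             loss_and_gradient=naive_softmax_loss_and_gradient):
    """Skip-gram: sum the loss and gradients over every outside word in the window.

    current_center_word : the center word (string)
    outside_words       : list of words inside the context window
    word2idx            : dict mapping words to vocabulary indices
    center_word_vectors : rows are the center vectors v_w, shape (|V|, d)
    outside_vectors     : rows are the outside vectors u_w, shape (|V|, d)
    loss_and_gradient   : naive-softmax or negative-sampling loss function
    """
    loss = 0.0
    grad_center_vecs = np.zeros_like(center_word_vectors)
    grad_outside_vectors = np.zeros_like(outside_vectors)

    c = word2idx[current_center_word]
    v_c = center_word_vectors[c]

    for outside_word in outside_words:
        o = word2idx[outside_word]
        l, grad_vc, grad_u = loss_and_gradient(v_c, o, outside_vectors)
        loss += l
        # only the row of the current center word receives a gradient
        grad_center_vecs[c] += grad_vc
        grad_outside_vectors += grad_u

    return loss, grad_center_vecs, grad_outside_vectors
```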

That's it! So far we saw how to compute the various loss functions and their gradients with respect to the model parameters, but we still need to implement several other pieces such as an optimizer, a dataset loader, a training pipeline, and so on. Please refer to the GitHub repo to see the complete implementation of the word2vec model with the Stanford Sentiment Analysis dataset, and feel free to give it a spin. Happy Learning!
