
Understanding how words relate to each other is crucial in natural language processing (NLP). In machine learning, these relationships also need to be quantified. word2vec addresses both concerns by learning word embeddings with a neural network: it learns a vector representation for each word in a vocabulary such that a word's vector is good at predicting the words that appear in its context.

Two model architectures are put forward in the original word2vec paper: skip-gram and continuous bag of words (CBOW). We will focus on skip-gram in this article.

Skip-gram

Skip-gram aims to learn word relationships by analyzing co-occurrence patterns, identifying which words frequently occur together within a given corpus of text.

Let $w = \{w_1, w_2, \ldots, w_T\}$ be our corpus of text. For each word, we attempt to predict the surrounding words in a window of "radius" $m$. For example, if our window size is 2 and our word of interest is $w_{10}$, we try to predict $w_8$, $w_9$, $w_{11}$, and $w_{12}$ given $w_{10}$. We refer to the word of interest (in this case, $w_{10}$) as the 'center word' and the surrounding words as 'context words'. Each word has two vector representations: a center word representation and a context word representation.

To illustrate, assume our text corpus is "I love learning NLP". Then,

$$w_1 = \text{I}, \quad w_2 = \text{love}, \quad w_3 = \text{learning}, \quad w_4 = \text{NLP}.$$
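As a rough sketch of how training examples can be extracted from such a corpus (the helper `skipgram_pairs` below is hypothetical, not something defined in the word2vec paper, and it assumes simple whitespace tokenization), we pair every center word with the context words inside its window:

```python
# Illustrative sketch: pair each center word with the context words
# inside a window of "radius" m (positions outside the corpus are skipped).
def skipgram_pairs(tokens, m=2):
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue  # skip the center word itself and out-of-range positions
            pairs.append((center, tokens[t + j]))
    return pairs

corpus = "I love learning NLP".split()
print(skipgram_pairs(corpus, m=2))
# [('I', 'love'), ('I', 'learning'), ('love', 'I'), ('love', 'learning'),
#  ('love', 'NLP'), ('learning', 'I'), ('learning', 'love'), ('learning', 'NLP'),
#  ('NLP', 'love'), ('NLP', 'learning')]
```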

As mentioned, each word has two vector representations. Let $U$ correspond to context word representations and let $V$ correspond to center word representations. Arranging one column per word in the vocabulary, these matrices look like

$$U = \begin{bmatrix} \vert & \vert & \vert & \vert \\ u_1 & u_2 & u_3 & u_4 \\ \vert & \vert & \vert & \vert \end{bmatrix} \qquad V = \begin{bmatrix} \vert & \vert & \vert & \vert \\ v_1 & v_2 & v_3 & v_4 \\ \vert & \vert & \vert & \vert \end{bmatrix}$$

where $u_i$ is the context vector and $v_i$ is the center vector for word $w_i$ (the dimensionality of each word vector is arbitrary).
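As a small sketch of what these matrices might look like in code (the embedding size `d` and the random initialization below are arbitrary choices for illustration, not values from the paper):

```python
import numpy as np

vocab = ["I", "love", "learning", "NLP"]
word_to_idx = {word: i for i, word in enumerate(vocab)}

d = 5  # embedding dimensionality -- an arbitrary choice for illustration
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(d, len(vocab)))  # context-word vectors, one column per word
V = rng.normal(scale=0.1, size=(d, len(vocab)))  # center-word vectors, one column per word

u_learning = U[:, word_to_idx["learning"]]  # context vector for "learning"
v_love = V[:, word_to_idx["love"]]          # center vector for "love"
```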

Training

Given a window size $m$ and a text corpus $w = \{w_1, w_2, \ldots, w_T\}$, the objective is to maximize the probability of any context word given the current center word. We do this for all words in the text corpus. Hence, the objective function is,

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} p(w_{t+j} \mid w_t; \theta),$$

with the negative log likelihood being,

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j} \mid w_t).$$
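To make the product concrete: for the four-word corpus above with window radius $m = 1$ (and dropping terms whose index falls outside the corpus), the likelihood unrolls to

$$L(\theta) = p(w_2 \mid w_1) \cdot p(w_1 \mid w_2)\, p(w_3 \mid w_2) \cdot p(w_2 \mid w_3)\, p(w_4 \mid w_3) \cdot p(w_3 \mid w_4).$$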

The first sum ($\sum_{t=1}^{T}$) iterates over all words in the text corpus, whereas the second sum ($\sum_{-m \le j \le m,\, j \ne 0}$) iterates over all the context words in the window of "radius" $m$. $p(w_{t+j} \mid w_t)$ is some measure of similarity between words. A common choice is the softmax,

$$p(w_{t+j} \mid w_t) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{|V|} \exp(u_w^T v_c)},$$

where $o$ is the index in the context vector matrix $U$ for word $w_{t+j}$, $c$ is the index in the center vector matrix $V$ for word $w_t$, and $|V|$ is the size of the vocabulary. Therefore, $v_c$ and $u_o$ are the "center" and "context" vectors for $w_t$ and $w_{t+j}$, respectively. Using this framework, we can train a neural network to minimize the negative log likelihood.
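As a rough end-to-end sketch, the snippet below evaluates the softmax probability and the negative log likelihood on the toy corpus, reusing `skipgram_pairs`, `corpus`, `word_to_idx`, `U`, and `V` from the earlier snippets; a real implementation would also compute gradients and update $U$ and $V$, which is omitted here.

```python
import numpy as np

def softmax_prob(center, context, U, V, word_to_idx):
    """p(context | center) under the softmax above."""
    v_c = V[:, word_to_idx[center]]    # center vector v_c
    u_o = U[:, word_to_idx[context]]   # context vector u_o
    scores = U.T @ v_c                 # u_w^T v_c for every word w in the vocabulary
    return np.exp(u_o @ v_c) / np.exp(scores).sum()

def negative_log_likelihood(tokens, U, V, word_to_idx, m=2):
    """J(theta): -(1/T) * sum of log p(context | center) over all window pairs."""
    pairs = skipgram_pairs(tokens, m)
    total = sum(np.log(softmax_prob(c, o, U, V, word_to_idx)) for c, o in pairs)
    return -total / len(tokens)

print(negative_log_likelihood(corpus, U, V, word_to_idx, m=2))
```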