
Understanding how words relate to each other is crucial in natural language processing (NLP). For machine learning, these relationships also need to be quantified. word2vec addresses both concerns by using a neural network to learn word embeddings: a vector representation for each word in a vocabulary, trained so that each word's vector is good at predicting the words that appear around it in context.

Two model architectures are put forward in the original word2vec paper: skip-gram and continuous bag of words (CBOW). We will focus on skip-gram in this article.

Skip-gram

Skip-gram aims to learn word relationships by analyzing co-occurrence patterns, identifying which words frequently occur together within a given corpus of text.

Let \(w=\{w_1, w_2, \ldots, w_T\}\) be our corpus of text. For each word, we attempt to predict the surrounding words in a window of "radius" \(m\). For example, if our window size is 2, and our word of interest is \(w_{10}\), we try to predict \(w_8, w_9, w_{11},\) and \(w_{12}\) given \(w_{10}\). We refer to the word of interest (in this case, \(w_{10}\)) as the 'center word' and the surrounding words as 'context words'. Each word has two vector representations - a center word representation and a context word representation.

To illustrate, assume our text corpus is "I love learning NLP". Then,

$$w_1=\text{I}, w_2=\text{love}, w_3=\text{learning}, w_4=\text{NLP}.$$
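To make the windowing concrete, here is a minimal Python sketch (the helper name `skipgram_pairs` is my own choice, not from the paper) that enumerates every (center word, context word) pair this corpus produces with a window of radius 2:

```python
def skipgram_pairs(corpus, m):
    """Enumerate (center, context) pairs with a window of radius m."""
    pairs = []
    for t, center in enumerate(corpus):
        # Context positions run from t-m to t+m, skipping the center
        # itself and any positions that fall outside the corpus.
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(corpus):
                pairs.append((center, corpus[t + j]))
    return pairs

corpus = ["I", "love", "learning", "NLP"]
print(skipgram_pairs(corpus, m=2))
# [('I', 'love'), ('I', 'learning'),
#  ('love', 'I'), ('love', 'learning'), ('love', 'NLP'),
#  ('learning', 'I'), ('learning', 'love'), ('learning', 'NLP'),
#  ('NLP', 'love'), ('NLP', 'learning')]
```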

As mentioned, each word has two vector representations. Let \(U\) correspond to context word representations and let \(V\) correspond to center word representations. These matrices could look something like this,

$$U=\begin{bmatrix}0.4 & 0.3 & 7 & 3\\0.1 & 5 & 0.4 & 6\end{bmatrix}$$ $$V=\begin{bmatrix}0.3 & 0.1 & 0.7 & 2\\5 & 2 & 0.9 & 0.4\end{bmatrix}$$

where each column corresponds to a single word in the vocabulary (the dimensionality of each word vector, here 2, is a hyperparameter).
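As a rough sketch in NumPy (reusing the toy numbers above, with a `word_to_index` mapping of my own choosing), the two matrices and the per-word lookups might be set up like this:

```python
import numpy as np

# Columns follow the vocabulary order: I, love, learning, NLP.
word_to_index = {"I": 0, "love": 1, "learning": 2, "NLP": 3}

U = np.array([[0.4, 0.3, 7.0, 3.0],   # context word vectors (one column per word)
              [0.1, 5.0, 0.4, 6.0]])
V = np.array([[0.3, 0.1, 0.7, 2.0],   # center word vectors (one column per word)
              [5.0, 2.0, 0.9, 0.4]])

c = word_to_index["learning"]
o = word_to_index["NLP"]
v_c = V[:, c]   # center vector for "learning": [0.7, 0.9]
u_o = U[:, o]   # context vector for "NLP":     [3.0, 6.0]
print(u_o @ v_c)  # dot product used in the softmax below: 7.5
```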

Training

Given a window size \(m\) and a text corpus \(w=\{w_1, w_2, \ldots, w_T\}\), the objective is to maximize the probability of any context word given the current center word. We do this for all words in the text corpus. Hence, the objective function is,

$$J'(\theta)=\prod_{t=1}^T\prod_{-m\leq j\leq m,\ j\neq 0}p(w_{t+j} \mid w_t;\theta)$$ with the (average) negative log likelihood being, $$J(\theta)=-\frac{1}{T}\sum_{t=1}^T\sum_{-m\leq j\leq m,\ j\neq 0}\log{p(w_{t+j}\mid w_t)}.$$

The first sum (\(\sum_{t=1}^T\)) iterates over all positions in the text corpus, whereas the second sum (\(\sum_{-m\leq j\leq m,\ j\neq 0}\)) iterates over all the context words in the window of "radius" \(m\), skipping the center word itself. All that remains is to specify \(p(w_{t+j}\mid w_t)\), the model's probability of a context word given the center word, which should be large when the two words are similar.
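Ignoring for a moment how \(p(w_{t+j}\mid w_t)\) is computed, the double sum translates directly into two nested loops. Below is a minimal sketch of \(J(\theta)\); the function name `negative_log_likelihood` and the `log_prob` callback are my own placeholders for \(\log p(w_{t+j}\mid w_t)\), which the softmax sketch further down fills in.

```python
def negative_log_likelihood(corpus, m, log_prob):
    """J(theta): average negative log-probability of each context word
    given its center word, over every window position in the corpus."""
    T = len(corpus)
    total = 0.0
    for t in range(T):                     # sum over corpus positions
        for j in range(-m, m + 1):         # sum over the window
            if j != 0 and 0 <= t + j < T:  # skip the center word and corpus edges
                total += log_prob(corpus[t], corpus[t + j])
    return -total / T
```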

A common choice is the softmax of the dot product between word vectors, $$p(w_{t+j}\mid w_t)=\frac{\exp(u_o^Tv_c)}{\sum_{w=1}^{W}\exp(u_w^Tv_c)},$$ where \(W\) is the number of words in the vocabulary, \(o\) is the index in the context vector matrix \(U\) for word \(w_{t+j}\), and \(c\) is the index in the center vector matrix \(V\) for word \(w_t\). Therefore, \(v_c\) and \(u_o\) are the "center" and "context" vectors for \(w_{t}\) and \(w_{t+j}\), respectively. Using this framework, we can train a neural network to minimize the negative log likelihood.
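To tie the sketches together, here is one way the softmax probability could be implemented, reusing the `U`, `V`, `word_to_index`, and `negative_log_likelihood` objects from the earlier snippets (the name `softmax_log_prob` is my own). A practical implementation would replace the full softmax with a cheaper alternative such as negative sampling or a hierarchical softmax, which the word2vec paper discusses; this sketch keeps the naive version for clarity.

```python
def softmax_log_prob(center, context):
    """log p(context | center) under the naive softmax model."""
    v_c = V[:, word_to_index[center]]   # center vector v_c
    scores = U.T @ v_c                  # u_w^T v_c for every word w in the vocabulary
    scores -= scores.max()              # shift for numerical stability (cancels out)
    log_denominator = np.log(np.exp(scores).sum())
    return scores[word_to_index[context]] - log_denominator

corpus = ["I", "love", "learning", "NLP"]
print(negative_log_likelihood(corpus, m=2, log_prob=softmax_log_prob))
```

Training then amounts to adjusting the columns of \(U\) and \(V\) by gradient descent so that this quantity decreases.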