The Skip-gram Model
As an example, let's consider the dataset
the quick brown fox jumped over the lazy dog
We first form a dataset of words and the contexts in which they appear. We could define 'context' in any way that makes sense, and in fact people have looked at syntactic contexts (i.e. the syntactic dependents of the current target word, see e.g. Levy et al.), words-to-the-left of the target, words-to-the-right of the target, etc. For now, let's stick to the vanilla definition and define 'context' as the window of words to the left and to the right of a target word. Using a window size of 1, we then have the dataset
([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...
of (context, target) pairs. Recall that skip-gram inverts contexts and targets, and tries to predict each context word from its target word, so the task becomes to predict 'the' and 'brown' from 'quick', 'quick' and 'fox' from 'brown', etc. Therefore our dataset becomes
(quick, the), (quick, brown), (brown, quick), (brown, fox), ...
of (input, output) pairs. The objective function is defined over the entire dataset, but we typically optimize it with stochastic gradient descent (SGD) using one example at a time (or a 'minibatch' of batch_size examples, where typically 16 <= batch_size <= 512). So let's look at one step of this process.
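To make the pair-generation step concrete, here is a minimal Python sketch, assuming only the standard library, that slides a window of size 1 over the example sentence and emits the (input, output) pairs described above. The function name skipgram_pairs is illustrative, not part of the tutorial's code.

```python
def skipgram_pairs(words, window_size=1):
    """Pair each target word with every context word inside its window."""
    pairs = []
    for i, target in enumerate(words):
        # Context = up to window_size words to the left and to the right.
        lo = max(0, i - window_size)
        hi = min(len(words), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

sentence = "the quick brown fox jumped over the lazy dog".split()
print(skipgram_pairs(sentence)[:5])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox')]
```

Note that boundary words such as the first 'the' have only one neighbor, so they contribute fewer pairs than interior words.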
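For the SGD step itself, the following is a hypothetical sketch of assembling one minibatch of (input, output) pairs; the random sampling and the batch_size value are illustrative assumptions, not the tutorial's exact procedure (in practice batches are usually drawn by scanning the corpus).

```python
import random

def next_batch(pairs, batch_size=16):
    """Draw batch_size (input, output) pairs and split them into parallel
    lists of input words and label words for one optimization step."""
    batch = random.sample(pairs, batch_size)
    inputs = [inp for inp, _ in batch]
    labels = [out for _, out in batch]
    return inputs, labels

pairs = [('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'),
         ('brown', 'fox'), ('fox', 'brown'), ('fox', 'jumped'),
         ('jumped', 'fox'), ('jumped', 'over'), ('over', 'jumped'),
         ('over', 'the'), ('the', 'over'), ('the', 'lazy'),
         ('lazy', 'the'), ('lazy', 'dog'), ('dog', 'lazy'), ('the', 'quick')]
inputs, labels = next_batch(pairs, batch_size=4)
print(inputs, labels)
```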