python, machine-learning, nlp, keras, text-processing

I cannot understand the skipgrams() function in keras


I am trying to understand the skipgrams() function in Keras using the following code:

from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import skipgrams

text = "I love money"  # My test sentence
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
word2id = tokenizer.word_index
id2word = {v: k for k, v in word2id.items()}  # Reverse lookup used when printing
wids = [word2id[w] for w in text_to_word_sequence(text)]
pairs, labels = skipgrams(wids, len(word2id), window_size=1)

for i in range(len(pairs)):  # Visualizing the result
    print("({:s} , {:s} ) -> {:d}".format(
          id2word[pairs[i][0]],
          id2word[pairs[i][1]],
          labels[i]))

For the sentence "I love money", with window_size=1 as defined in Keras, I would expect the following (context, word) pairs:

([i, money], love)
([love], i)
([love], money)

From what I understand of Keras' documentation, it outputs the label 1 for (word, word in the same window) pairs, and the label 0 for (word, random word from the vocabulary) pairs.

Since I am using a window size of 1, I would expect the label 1 for the following pairs:

(love, i)
(love, money)
(i, love)
(money, love)

And the label 0 for the following pairs:

(i, money)
(money, i)
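
One way to check the window logic in isolation is to turn off the random negatives via the negative_samples argument of skipgrams(); the snippet below is only a minimal sketch, reusing wids, word2id and id2word from the code above:

# negative_samples=0.0 disables the random negative pairs, so only
# (word, word in the same window) couples with label 1 are returned
pos_pairs, pos_labels = skipgrams(wids, len(word2id),
                                  window_size=1, negative_samples=0.0)

for (target, context), label in zip(pos_pairs, pos_labels):
    print("({:s} , {:s} ) -> {:d}".format(id2word[target], id2word[context], label))

For this three-word sentence, that should print exactly the four label-1 pairs listed above, in shuffled order.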

Yet, the original code above gives me results like this:

(love , i ) -> 1
(love , money ) -> 1
(i , love ) -> 1
(money , love ) -> 1    
(i , i ) -> 0
(love , love ) -> 0
(love , i ) -> 0
(money , love ) -> 0

How can the pairs (love, i) and (money, love) be labelled as both 0 and 1? And where are the (i, money) and (money, i) pairs in the result?

Am I misunderstanding something, given that none of the label-0 pairs match my expectations? The label-1 pairs, on the other hand, seem to match my understanding quite well.


Solution

  • That's because your vocabulary is very small: it contains only the same three words ("love", "i", "money"). That's why a "random word from the vocabulary" is always drawn from the same sentence, and moreover from the same context.

    As an experiment, do this:

    text = "I love money" #My test sentence
    text2 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, " \
            "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua"
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts([text, text2])
    ...
    

    Basically, let the tokenizer know that there are more words in the text. You should see that the negative examples are now generated mostly from the second sentence, for example:

    (i , sit ) -> 0
    (love , i ) -> 1
    (love , money ) -> 1
    (love , ut ) -> 0
    (love , sit ) -> 0
    (money , consectetur ) -> 0
    (money , love ) -> 1
    (i , love ) -> 1
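
    For completeness, here is one way the elided part of the experiment might be filled in. This is only a sketch, not the answer's exact code: the reverse id2word mapping and the printing loop are carried over from the question, and len(word2id) + 1 is used as the vocabulary size so that every (1-based) Tokenizer id can be drawn as a negative sample.

    from keras.preprocessing.text import Tokenizer, text_to_word_sequence
    from keras.preprocessing.sequence import skipgrams

    text = "I love money"  # the sentence the skip-grams are generated from
    text2 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, " \
            "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua"

    # Fit the tokenizer on both texts so the vocabulary is larger than one sentence
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts([text, text2])
    word2id = tokenizer.word_index
    id2word = {v: k for k, v in word2id.items()}

    # The skip-grams still come from the first sentence only,
    # but the random negatives are now drawn from the whole vocabulary
    wids = [word2id[w] for w in text_to_word_sequence(text)]
    pairs, labels = skipgrams(wids, len(word2id) + 1, window_size=1)

    for (target, context), label in zip(pairs, labels):
        print("({:s} , {:s} ) -> {:d}".format(id2word[target], id2word[context], label))

    The label-1 pairs still come only from the "I love money" windows, while the label-0 pairs now mix in words such as sit, ut or consectetur, as in the output above.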