Search code examples
pythonnlpn-gramlanguage-model

How to compute skipgrams in python?


A k skipgram is an ngram which is a superset of all ngrams and each (k-i )skipgram till (k-i)==0 (which includes 0 skip grams). So how to efficiently compute these skipgrams in python?

Following is the code i tried but it is not doing as expected:

<pre>
    input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
    def find_skipgrams(input_list, N,K):
  bigram_list = []
  nlist=[]

  K=1
  for k in range(K+1):
      for i in range(len(input_list)-1):
          if i+k+1<len(input_list):
              nlist=[]
              for j in range(N+1):
                  if i+k+j+1<len(input_list):
                    nlist.append(input_list[i+k+j+1])

          bigram_list.append(nlist)
  return bigram_list

</pre>

The above code is not rendering correctly, but print find_skipgrams(['all', 'this', 'happened', 'more', 'or', 'less'],2,1) gives following output

[['this', 'happened', 'more'], ['happened', 'more', 'or'], ['more', 'or', 'less'], ['or', 'less'], ['less'], ['happened', 'more', 'or'], ['more', 'or', 'less'], ['or', 'less'], ['less'], ['less']]

The code listed here also does not give correct output: https://github.com/heaven00/skipgram/blob/master/skipgram.py

print skipgram_ndarray("What is your name") gives: ['What,is', 'is,your', 'your,name', 'name,', 'What,your', 'is,name']

name is a unigram!


Solution

  • From the paper that OP links, the following string:

    Insurgents killed in ongoing fighting

    Yields:

    2-skip-bi-grams = {insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}

    2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing, insurgents killed fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}.

    With slight modification to NLTK's ngrams code (https://github.com/nltk/nltk/blob/develop/nltk/util.py#L383):

    from itertools import chain, combinations
    import copy
    from nltk.util import ngrams
    
    def pad_sequence(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
        if pad_left:
            sequence = chain((pad_symbol,) * (n-1), sequence)
        if pad_right:
            sequence = chain(sequence, (pad_symbol,) * (n-1))
        return sequence
    
    def skipgrams(sequence, n, k, pad_left=False, pad_right=False, pad_symbol=None):
        sequence_length = len(sequence)
        sequence = iter(sequence)
        sequence = pad_sequence(sequence, n, pad_left, pad_right, pad_symbol)
    
        if sequence_length + pad_left + pad_right < k:
            raise Exception("The length of sentence + padding(s) < skip")
    
        if n < k:
            raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")    
    
        history = []
        nk = n+k
    
        # Return point for recursion.
        if nk < 1: 
            return
        # If n+k longer than sequence, reduce k by 1 and recur
        elif nk > sequence_length: 
            for ng in skipgrams(list(sequence), n, k-1):
                yield ng
    
        while nk > 1: # Collects the first instance of n+k length history
            history.append(next(sequence))
            nk -= 1
    
        # Iterative drop first item in history and picks up the next
        # while yielding skipgrams for each iteration.
        for item in sequence:
            history.append(item)
            current_token = history.pop(0)      
            # Iterates through the rest of the history and 
            # pick out all combinations the n-1grams
            for idx in list(combinations(range(len(history)), n-1)):
                ng = [current_token]
                for _id in idx:
                    ng.append(history[_id])
                yield tuple(ng)
    
        # Recursively yield the skigrams for the rest of seqeunce where
        # len(sequence) < n+k
        for ng in list(skipgrams(history, n, k-1)):
            yield ng
    

    Let's do some doctest to match the example in the paper:

    >>> two_skip_bigrams = list(skipgrams(text, n=2, k=2))
    [('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
    >>> two_skip_trigrams = list(skipgrams(text, n=3, k=2))
    [('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
    

    But do note that if n+k > len(sequence), it will yield the same effects as skipgrams(sequence, n, k-1) (this is not a bug, it's a fail safe feature), e.g.

    >>> three_skip_trigrams = list(skipgrams(text, n=3, k=3))
    >>> three_skip_fourgrams = list(skipgrams(text, n=4, k=3))
    >>> four_skip_fourgrams  = list(skipgrams(text, n=4, k=4))
    >>> four_skip_fivegrams  = list(skipgrams(text, n=5, k=4))
    >>>
    >>> print len(three_skip_trigrams), three_skip_trigrams
    10 [('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
    >>> print len(three_skip_fourgrams), three_skip_fourgrams 
    5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
    >>> print len(four_skip_fourgrams), four_skip_fourgrams 
    5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
    >>> print len(four_skip_fivegrams), four_skip_fivegrams 
    1 [('Insurgents', 'killed', 'in', 'ongoing', 'fighting')]
    

    This allows n == k but it disallow n > k as shown in the lines :

    if n < k:
            raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")    
    

    For understanding sake, let's try to understand the "mystical" line:

    for idx in list(combinations(range(len(history)), n-1)):
        pass # Do something
    

    Given a list of unique items, combinations produce this:

    >>> from itertools import combinations
    >>> x = [0,1,2,3,4,5]
    >>> list(combinations(x,2))
    [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
    

    And since the indices of a list of tokens is always unique, e.g.

    >>> sent = ['this', 'is', 'a', 'foo', 'bar']
    >>> current_token = sent.pop(0) # i.e. 'this'
    >>> range(len(sent))
    [0,1,2,3]
    

    It's possible to compute the possible combinations (without replacement) of the range:

    >>> n = 3
    >>> list(combinations(range(len(sent)), n-1))
    [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    

    If we map the indices back to the list of tokens:

    >>> [tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)
    [('is', 'a'), ('is', 'foo'), ('is', 'bar'), ('a', 'foo'), ('a', 'bar'), ('foo', 'bar')]
    

    Then we concatenate with the current_token, we get the skipgrams for the current token and context+skip window:

    >>> [tuple([current_token]) + tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)]
    [('this', 'is', 'a'), ('this', 'is', 'foo'), ('this', 'is', 'bar'), ('this', 'a', 'foo'), ('this', 'a', 'bar'), ('this', 'foo', 'bar')]
    

    So after that we move on to the next word.