Quanteda: Create ngrams and skipgrams from tokens in R


I have been exploring the quanteda package in R and cannot fully work out how tokens_skipgrams behaves. Below is an example from the package manual that I am not sure I have understood correctly:

tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " ")   
tokens from 1 document.
text1 :
[1] "insurgents killed in"        "insurgents killed ongoing"  
[3] "insurgents killed fighting"  "insurgents in ongoing"      
[5] "insurgents in fighting"      "insurgents ongoing fighting"
[7] "killed in ongoing"           "killed in fighting"         
[9] "killed ongoing fighting"     "in ongoing fighting"        

I would expect the output to consist of the following:

 "insurgents killed in"    "killed in ongoing"    "in ongoing fighting" 
 "insurgents in fighting"

Why does the result include:

  "insurgents killed ongoing"  
  "insurgents killed fighting"  
  "insurgents in ongoing"      
  "insurgents ongoing fighting"
  "killed in fighting"         
  "killed ongoing fighting" 

In the example above, skip = 0:2, that is, skip takes the values 0, 1, and 2. I therefore thought the command could safely be broken into three calls, one per skip value, and that combining their results would reproduce the output above, which it does not.

tokens_skipgrams(toks, n = 3, skip = 0, concatenator = " ")   
tokens from 1 document.
text1 :
[1] "insurgents killed in" "killed in ongoing"    "in ongoing fighting" 

tokens_skipgrams(toks, n = 3, skip = 1, concatenator = " ")   
tokens from 1 document.
text1 :
[1] "insurgents in fighting"


tokens_skipgrams(toks, n = 3, skip = 2, concatenator = " ")   
tokens from 1 document.
text1 :
character(0)

But the combination of these results gives me exactly what I expected, not the output shown above.

Can anyone explain this behaviour?


Solution

  • The behaviour you are observing follows the definition of a skip-gram in Guthrie et al. (2006): "A k skip-gram is an ngram which is a superset of all ngrams and each (k-i) skipgram until (k-i)==0 (which includes 0 skip-grams)." This definition is cited on the quanteda man page, ?tokens_skipgrams; the original source is Guthrie, D., B. Allison, W. Liu, and L. Guthrie. 2006. "A Closer Look at Skip-Gram Modelling." The s02 example below is taken directly from that paper, which calls these "2-skip-tri-grams".

    For scalar values of skip, however, this recursive definition is not applied, in order to give the user maximum control.

    This explains the difference between supplying the skip values as individual scalars, as above, and supplying them as the sequence 0:2. For

    toks <- tokens("insurgents killed in ongoing fighting")
    toks
    # tokens from 1 document.
    # text1 :
    # [1] "insurgents" "killed"     "in"         "ongoing"    "fighting" 
    

    we observe combinations such as "insurgents killed fighting" when skip = 0:2 because each gap between adjacent tokens may use any of the listed skip values: here a skip of 0 (between "insurgents" and "killed") and a skip of 2 (between "killed" and "fighting"). For this phrase, that means only two additional skipgrams appear when going from skip = 0:1 to skip = 0:2:

    (s01 <- tokens_skipgrams(toks, n = 3, skip = 0:1, concatenator = " "))
    # tokens from 1 document.
    # text1 :
    # [1] "insurgents killed in"      "insurgents killed ongoing" "insurgents in ongoing"    
    # [4] "insurgents in fighting"    "killed in ongoing"         "killed in fighting"       
    # [7] "killed ongoing fighting"   "in ongoing fighting"      
    
    (s02 <- tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " "))
    # tokens from 1 document.
    # text1 :
    # [1] "insurgents killed in"        "insurgents killed ongoing"   "insurgents killed fighting" 
    # [4] "insurgents in ongoing"       "insurgents in fighting"      "insurgents ongoing fighting"
    # [7] "killed in ongoing"           "killed in fighting"          "killed ongoing fighting"    
    # [10] "in ongoing fighting"        
    
    setdiff(as.character(s02), as.character(s01))
    # [1] "insurgents killed fighting"  "insurgents ongoing fighting"
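
The rule described above (each gap between adjacent tokens may independently take any value in skip) can be sketched in base R. This is only an illustration of the logic, not quanteda's actual implementation; skipgrams_sketch and its argument names are hypothetical and do not exist in the package.

```r
# Hypothetical helper, NOT part of quanteda: enumerate n-token skip-grams in
# which every gap between adjacent chosen tokens is one of the values in `skip`.
skipgrams_sketch <- function(words, n, skip, concatenator = " ") {
  # Recursively extend a vector of chosen positions until it holds n entries.
  extend <- function(pos) {
    if (length(pos) == n) {
      return(paste(words[pos], collapse = concatenator))
    }
    # Skipping s tokens places the next token s + 1 positions later.
    nxt <- pos[length(pos)] + skip + 1
    nxt <- nxt[nxt <= length(words)]
    unlist(lapply(nxt, function(p) extend(c(pos, p))))
  }
  # as.character() turns an empty (NULL) result into character(0).
  as.character(unlist(lapply(seq_along(words), extend)))
}

words <- c("insurgents", "killed", "in", "ongoing", "fighting")

# Vector skip: gaps of 0, 1 and 2 may be mixed within one skip-gram.
vector_result <- skipgrams_sketch(words, n = 3, skip = 0:2)

# Scalar skips: every gap in a skip-gram must have the same size.
scalar_union <- unique(unlist(lapply(0:2, function(s) {
  skipgrams_sketch(words, n = 3, skip = s)
})))

# Skip-grams that only appear when gap sizes can be mixed.
setdiff(vector_result, scalar_union)
```

Under these assumptions, vector_result holds the ten skip-grams shown for skip = 0:2, scalar_union holds the four the question expected, and the setdiff() call returns the six mixed-gap combinations the question asks about.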