r nlp text-mining n-gram quanteda

Keyword in context (kwic) for skipgrams?


I do keyword-in-context analysis with quanteda for n-grams and tokens, and it works well. I now want to do it for skip-grams, capturing the context of "barriers to entry" but also "barriers to [...] [and] entry".

The following code returns a kwic object that is empty, but I don't know what I did wrong. doc.corpus refers to the text document. I also tried the tokenized version, but nothing changes.

The result is:

"kwic object with 0 rows"

x <- tokens("barriers entry")
ntoken_test <- tokens_ngrams(x, n = 2, skip = 0:4, concatenator = " ")
twic_skipgram <- kwic(doc.corpus, pattern = list(ntoken_test), window = 20, valuetype = "glob")

twic_skipgram


Solution

  • Probably the easiest way is to use a wildcard ("*") to represent each skipped token.

    library("quanteda")
    ## Package version: 2.1.1
    
    txt <- c(
      "There are barriers to entry.",
      "Also barriers against entry.",
      "Just barriers entry."
    )
    
    # for skip of 1
    kwic(txt, phrase("barriers * entry"))
    ##                                                     
    ##  [text1, 3:5] There are |   barriers to entry    | .
    ##  [text2, 2:4]      Also | barriers against entry | .
    
    # for skip of 0 and 1
    kwic(txt, phrase(c("barriers * entry", "barriers entry")))
    ##                                                     
    ##  [text1, 3:5] There are |   barriers to entry    | .
    ##  [text2, 2:4]      Also | barriers against entry | .
    ##  [text3, 2:3]      Just |     barriers entry     | .
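
The same idea can probably be stretched to the skip = 0:4 range from the question by generating one wildcard pattern per skip length. This is only a sketch: `max_skip`, `pats`, and the fourth example sentence are names and data made up here for illustration, and the text is tokenized explicitly because newer quanteda releases expect a tokens object in `kwic()`.

```r
library("quanteda")

txt <- c(
  "There are barriers to entry.",
  "Also barriers against entry.",
  "Just barriers entry.",
  "High barriers to market entry."  # illustrative extra sentence (skip of 2)
)

# assumed upper bound on skipped tokens, mirroring skip = 0:4 in the question
max_skip <- 4

# build one pattern per skip length: k wildcard tokens between the two words
pats <- vapply(
  0:max_skip,
  function(k) paste(c("barriers", rep("*", k), "entry"), collapse = " "),
  character(1)
)

# tokenize first, then match every multi-token pattern in one kwic() call
toks <- tokens(txt)
res <- kwic(toks, pattern = phrase(pats), window = 20)
res
```

Note that each "*" stands for exactly one token of any kind, so with the default tokenizer settings a punctuation mark can also fill a skipped slot. If that is not wanted, tokenizing with `remove_punct = TRUE` should avoid it.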