r nlp text-mining quanteda

How to use quanteda to find instances where certain words appear before certain others in a sentence


As an R newbie, I am trying to use quanteda to find instances where a certain word appears somewhere before another word in the same sentence. More specifically, I am looking for instances where the word "investors" appears somewhere before the word "shall" in a sentence, in a corpus consisting of an international treaty concluded between Morocco and Nigeria (the text can be found here: https://edit.wti.org/app.php/document/show/bde2bcf4-e20b-4d05-a3f1-5b9eb86d3b3b).

The problem is that there are sometimes several words between these two words, for instance "investors and investments shall". I tried to apply similar solutions offered on this website. When I tried the solution from (Keyword in context (kwic) for skipgrams?) and ran the following code:

 kwic(corpus_mar_nga, phrase("investors * shall"))

I get 0 observations, since this only counts instances where there is exactly one word between "investors" and "shall".

And when I followed another solution, offered on (Is it possible to use `kwic` function to find words near to each other?), and ran the following code:

toks <- tokens(corpus_mar_nga)
toks_investors <- tokens_select(toks, "investors", window = 10)
kwic(toks_investors, "shall")

I also get instances where "investors" appears after "shall", which changes the context fundamentally, since in that case the subject of the sentence is something different.

In the end, in addition to instances of "investors shall", I should also be getting, for example, instances such as "Investors, their investment and host state authorities shall", but I can't do this with the code above.

Could anyone offer me a solution to this issue?

Huge thanks in advance!


Solution

  • Good question. Here are two methods: one relying on regular expressions applied to the corpus text, and a second using (as @Kohei_Watanabe suggests in the comment) the window argument of tokens_select().

    First, create some sample text.

    library("quanteda")
    ## Package version: 2.1.2
    
    # sample text
    txt <- c("The investors and their supporters shall do something.
              Shall we tell the investors?  Investors shall invest.
              Shall someone else do something?")
    

    Now reshape this into sentences, since your search takes place within sentences.

    # reshape to sentences
    corp <- txt %>%
      corpus() %>%
      corpus_reshape(to = "sentences")
    

    Method 1 uses regular expressions. We add a word boundary (\\b) before "investors", and the .+ matches one or more of any character between "investors" and "shall". (The . would not match newlines, but corpus_reshape(x, to = "sentences") will remove them.) Note that the pattern has no boundary around "shall", so it would also match a longer word that contains "shall".

    # method 1: regular expressions
    corp$flag <- stringi::stri_detect_regex(corp, "\\binvestors.+shall",
      case_insensitive = TRUE
    )
    print(corpus_subset(corp, flag == TRUE), -1, -1)
    ## Corpus consisting of 2 documents and 1 docvar.
    ## text1.1 :
    ## "The investors and their supporters shall do something."
    ## 
    ## text1.2 :
    ## "Investors shall invest."
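
    If partial-word matches are a concern, the pattern can be tightened with word boundaries on both words. The following is only a sketch: it swaps in base R's grepl() with perl = TRUE instead of stringi so that it runs without extra packages, and the sentences (including the "Marshall" one) are made up for illustration, not taken from the treaty.

    ```r
    # Hypothetical sample sentences (not from the treaty text)
    sents <- c(
      "The investors and their supporters shall do something.",
      "Shall we tell the investors?",
      "Investors shall invest.",
      "The investors met Marshall yesterday."
    )

    # Pattern as in method 1: boundary only before "investors"
    loose <- grepl("\\binvestors.+shall", sents,
                   ignore.case = TRUE, perl = TRUE)

    # Variant with word boundaries around both words
    strict <- grepl("\\binvestors\\b.*\\bshall\\b", sents,
                    ignore.case = TRUE, perl = TRUE)

    # Compare the two flags side by side
    print(data.frame(sentence = sents, loose = loose, strict = strict))
    ```

    The two patterns should agree on the first three sentences and differ only on the last one, where "shall" occurs inside "Marshall".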
    

    A second method combines tokens_select(), using an asymmetric window, with kwic(). First we select from all documents (which are sentences) containing "investors", discarding the tokens before it and keeping all tokens after it; a window of 1000 tokens after should be enough. Then we apply kwic() to "shall", keeping all context words. Any match must come after "investors", since everything before the first "investors" was discarded.

    # method 2: tokens_select()
    toks <- tokens(corp)
    tokens_select(toks, "investors", window = c(0, 1000)) %>%
      kwic("shall", window = 1000)
    ##                                                                     
    ##  [text1.1, 5] investors and their supporters | shall | do something.
    ##  [text1.3, 2]                      Investors | shall | invest.
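
    To see why this works, it can help to inspect what the asymmetric window keeps before kwic() is applied. This is a minimal sketch on two made-up sentences; the document names s1 and s2 and the helper objects kept_list and has_shall are only for illustration.

    ```r
    library("quanteda")

    # Hypothetical two-sentence example, named s1 and s2 for readability
    toks <- tokens(c(
      s1 = "The investors and their supporters shall do something.",
      s2 = "Shall we tell the investors?"
    ))

    # Keep "investors" plus up to 1000 tokens after it; drop everything before it
    kept <- tokens_select(toks, "investors", window = c(0, 1000))
    kept_list <- as.list(kept)
    print(kept_list)

    # "shall" survives only in sentences where it followed "investors"
    has_shall <- vapply(kept_list, function(x) "shall" %in% tolower(x), logical(1))
    print(has_shall)
    ```

    In s2 the leading "Shall" is dropped because it comes before "investors", so a later kwic() for "shall" has nothing to find there.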
    

    The choice depends on what suits your needs best.