How do I find the location of tokens in a quanteda token object?


I have created a quanteda tokens object from a plain text file, and selected the specific words I seek using

tokens_select(truePdfAnnualReports.toks, unlist(strategicKeywords.list), padding = TRUE)

I set padding = TRUE to maintain the token sequence found in the original text file. I now wish to assign a position number (both absolute and relative) to each token selected by the function. How do I do that?


Solution

  • Use kwic() rather than tokens_select(): its from and to columns give the absolute position of each match within its document. Below is a reproducible example using the built-in data_corpus_inaugural.

    library("quanteda")
    ## Package version: 3.1
    ## Unicode version: 13.0
    ## ICU version: 69.1
    ## Parallel computing: 12 of 12 threads used.
    ## See https://quanteda.io for tutorials and examples.
    
    toks <- tokens(tail(data_corpus_inaugural, 10))
    keywords <- c("nuclear", "security")
    
    # form a data.frame from kwic() results; window = 0 drops the context words
    # (nested call, so nothing depends on %>% being attached)
    kw <- as.data.frame(kwic(toks, keywords, window = 0))
    
    # for illustration
    kw[10:14, ]
    ##         docname from   to pre  keyword post  pattern
    ## 10  1985-Reagan 2385 2385     security      security
    ## 11    1989-Bush 2149 2149     security      security
    ## 12 1997-Clinton  259  259     security      security
    ## 13 1997-Clinton 1660 1660      nuclear       nuclear
    ## 14    2001-Bush  872  872     Security      security
    

    To get the relative positions, compute each document's total token count with ntoken(), join it to the kwic() results with dplyr, and divide each position by that count:

    doc_lengths <- data.frame(
      docname = docnames(toks),
      toklength = ntoken(toks)
    )
    
    # the answer: join on the shared docname column, then divide position by length
    answer <- dplyr::mutate(
      dplyr::left_join(kw, doc_lengths),
      from_relative = from / toklength,
      to_relative = to / toklength
    )
    ## Joining, by = "docname"
    head(answer)
    ##       docname from   to pre  keyword post  pattern toklength from_relative
    ## 1 1985-Reagan 2005 2005     security      security      2909     0.6892403
    ## 2 1985-Reagan 2152 2152     security      security      2909     0.7397731
    ## 3 1985-Reagan 2189 2189      nuclear       nuclear      2909     0.7524923
    ## 4 1985-Reagan 2210 2210      nuclear       nuclear      2909     0.7597112
    ## 5 1985-Reagan 2245 2245      nuclear       nuclear      2909     0.7717429
    ## 6 1985-Reagan 2310 2310     security      security      2909     0.7940873
    ##   to_relative
    ## 1   0.6892403
    ## 2   0.7397731
    ## 3   0.7524923
    ## 4   0.7597112
    ## 5   0.7717429
    ## 6   0.7940873
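
If you would rather keep the padded tokens_select() result from the question, the positions can also be read off the padding directly, because padding = TRUE leaves an empty string wherever a token was removed and so keeps the original indices. The following is only a minimal sketch under assumptions: the two toy documents and the names toks_sel, toks_list and positions are made up for illustration, and it only covers single-token patterns.

```r
library("quanteda")

# Hypothetical toy documents standing in for the real corpus
toks <- tokens(c(
  doc1 = "security and nuclear policy",
  doc2 = "no match here security"
))
keywords <- c("nuclear", "security")

# padding = TRUE replaces every unselected token with "", so each
# selected token stays at its original index within the document
toks_sel <- tokens_select(toks, keywords, padding = TRUE)

# One character vector per document, named by docname
toks_list <- as.list(toks_sel)

# Collect the index of every non-empty token, document by document
positions <- do.call(rbind, lapply(names(toks_list), function(d) {
  x <- toks_list[[d]]
  idx <- which(x != "")
  data.frame(
    docname = rep(d, length(idx)),
    keyword = x[idx],
    position = idx,                        # absolute position
    position_relative = idx / length(x)    # position / document length
  )
}))

positions
```

Here the document length is simply the length of the padded vector, which matches what ntoken() reports for the unfiltered tokens. For multi-word patterns, kwic() remains the simpler route, since its from and to columns can differ.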