
Create keyword column with dictionary discarding longer matches


I am using tokens_lookup() with nested_scope = "dictionary" (as described in this answer) to check whether my texts contain words from my dictionary while discarding matches that are nested inside a longer dictionary pattern. The idea is to match a target word only when it is not part of a longer dictionary entry (e.g. match Ireland but not Northern Ireland).

Now I'd like to:

(1) create a dummy variable indicating whether the text contains the words in the dictionary. I managed to do it with the code below but I don't understand why I have to write IE as lowercase in as.logical.

df <- structure(list(num = c(2345, 3564, 3636), text = c("Ireland lorem ipsum", "Lorem ipsum Northern 
Ireland", "Ireland lorem ipsum Northern Ireland")), row.names = c(NA, -3L), 
class = c("tbl_df", "tbl", "data.frame"))


dict <- dictionary(list(IE = "Ireland", "Northern Ireland" = "Northern Ireland"), 
                   tolower = F)
corpus <- corpus(df, text_field = "text")
toks <- tokens(corpus)
dfm <- tokens_lookup(toks, dictionary = dict, nested_scope = "dictionary", case_insensitive = F) %>%
  tokens_remove("Northern Ireland") %>% 
  dfm()
df$contains <- as.logical(dfm[, "ie"], case_insensitive = FALSE)

(2) Store the matches in another column by using kwic. Is there a way to exclude a dictionary key in kwic (Northern Ireland in the example)? In my attempt, I get a keyword column that contains both Ireland and Northern Ireland matches. (I don't know if it makes any difference, but in my full dataset I have multiple matches per row.) Thank you.

words <- kwic(toks, pattern = dict, case_insensitive = FALSE)
df$docname = dfm@Dimnames[["docs"]]
df_keywords <- merge(df, words[ , c("keyword")], by = 'docname', all.x = T)
df_keywords <- df_keywords %>% group_by(docname, num) %>% 
  mutate(n = row_number()) %>% 
  pivot_wider(id_cols = c(docname, num, text, contains), 
              values_from = keyword, names_from = n, names_prefix = 'keyword')


Solution

  • You could do it this way:

    df <- structure(list(
      num = c(2345, 3564, 3636),
      text = c("Ireland lorem ipsum", "Lorem ipsum Northern
    Ireland", "Ireland lorem ipsum Northern Ireland")
    ),
    row.names = c(NA, -3L),
    class = c("tbl_df", "tbl", "data.frame")
    )
    
    library("quanteda")
    ## Package version: 3.1.0
    ## Unicode version: 13.0
    ## ICU version: 69.1
    ## Parallel computing: 12 of 12 threads used.
    ## See https://quanteda.io for tutorials and examples.
    dict <- dictionary(list(IE = "Ireland", "Northern Ireland" = "Northern Ireland"),
      tolower = FALSE
    )
    corpus <- corpus(df, text_field = "text", docid_field = "num")
    toks <- tokens(corpus)
    

    Here you need to set tolower = FALSE in the dfm() call; otherwise dfm() will lowercase the keys returned by tokens_lookup(), which is why you had to write "ie" in your code.

    dfmat <- tokens_lookup(toks, dict, nested_scope = "dictionary", case_insensitive = FALSE) %>%
      dfm(tolower = FALSE)
    dfmat
    ## Document-feature matrix of: 3 documents, 2 features (33.33% sparse) and 0 docvars.
    ##       features
    ## docs   IE Northern Ireland
    ##   2345  1                0
    ##   3564  0                1
    ##   3636  1                1
    
    df$contains_Ireland <- as.logical(dfmat[, "IE"])
    df
    ## # A tibble: 3 × 3
    ##     num text                                   contains_Ireland
    ##   <dbl> <chr>                                  <lgl>           
    ## 1  2345 "Ireland lorem ipsum"                  TRUE            
    ## 2  3564 "Lorem ipsum Northern\nIreland"        FALSE           
    ## 3  3636 "Ireland lorem ipsum Northern Ireland" TRUE
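
    To see the lowercasing effect in isolation, here is a minimal sketch that compares the feature names with and without tolower = FALSE. The two short texts and the names looked, lower_names and keep_names are made up for illustration; the dictionary is the one from the question.

    ```r
    library("quanteda")

    # Two illustrative texts (not the asker's data)
    toks <- tokens(c(d1 = "Ireland lorem ipsum", d2 = "Lorem ipsum Northern Ireland"))
    dict <- dictionary(
      list(IE = "Ireland", "Northern Ireland" = "Northern Ireland"),
      tolower = FALSE
    )
    looked <- tokens_lookup(toks, dict, nested_scope = "dictionary", case_insensitive = FALSE)

    # dfm() lowercases features unless told otherwise, so the key "IE" becomes "ie"
    lower_names <- featnames(dfm(looked))
    # with tolower = FALSE the dictionary keys keep their original case
    keep_names <- featnames(dfm(looked, tolower = FALSE))

    print(lower_names)
    print(keep_names)
    ```

    If the first vector contains "ie" and the second "IE", that confirms the column name you have to use in as.logical() depends only on the tolower setting of dfm().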
    

    For part 2, nested match handling is not implemented for kwic(). As a workaround, you can search for "Ireland" and then drop the matches where the preceding token is "Northern":

    words <- kwic(toks, pattern = "Ireland", case_insensitive = FALSE, window = 2) %>%
      as.data.frame() %>%
      # removes the matches on IE value "Ireland" nested within "Northern Ireland"
      dplyr::filter(!stringr::str_detect(pre, "Northern$")) %>%
      dplyr::mutate(num = as.numeric(docname))
    words
    ##   docname from to pre keyword        post pattern  num
    ## 1    2345    1  1     Ireland lorem ipsum Ireland 2345
    ## 2    3636    1  1     Ireland lorem ipsum Ireland 3636
    
    dplyr::full_join(df, words, by = "num")
    ## # A tibble: 3 × 10
    ##     num text    contains_Ireland docname  from    to pre   keyword post  pattern
    ##   <dbl> <chr>   <lgl>            <chr>   <int> <int> <chr> <chr>   <chr> <fct>  
    ## 1  2345 "Irela… TRUE             2345        1     1 ""    Ireland lore… Ireland
    ## 2  3564 "Lorem… FALSE            <NA>       NA    NA  <NA> <NA>    <NA>  <NA>   
    ## 3  3636 "Irela… TRUE             3636        1     1 ""    Ireland lore… Ireland
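
    Since the full dataset has multiple matches per row, a possible alternative to filtering on pre is to compound "Northern Ireland" into a single token first, so that a search for "Ireland" can no longer hit it. The following is only a sketch under assumptions: the texts are invented (the third one has two standalone matches), and the names toks_comp, kw, counts and result are hypothetical.

    ```r
    library("quanteda")

    # Illustrative texts; the names play the role of the num column
    txt <- c(
      "2345" = "Ireland lorem ipsum",
      "3564" = "Lorem ipsum Northern Ireland",
      "3636" = "Ireland lorem Ireland ipsum Northern Ireland"
    )
    toks <- tokens(txt)

    # Join the longer phrase into one token ("Northern_Ireland" with the
    # default concatenator), so the token "Ireland" only survives on its own
    toks_comp <- tokens_compound(toks, pattern = phrase("Northern Ireland"),
                                 case_insensitive = FALSE)

    # Keyword-in-context search on the compounded tokens
    kw <- as.data.frame(kwic(toks_comp, pattern = "Ireland",
                             case_insensitive = FALSE, window = 2))

    # One row per document, including documents without any match
    counts <- table(factor(kw$docname, levels = names(txt)))
    result <- data.frame(
      docname = names(counts),
      n_matches = as.integer(counts),
      contains_Ireland = as.integer(counts) > 0
    )
    print(result)
    ```

    The per-document table can then be joined back to df by document name in the same way as the full_join() above. Compounding changes the tokens themselves, so it is better done on a copy if toks is reused elsewhere.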