
R tidytext Remove word if part of relevant bigrams, but keep if not


Using unnest_tokens, I want to create a tidy text tibble that combines two different token types: single words and bigrams. The reasoning behind this is that sometimes single words are the more reasonable unit to study, and sometimes higher-order n-grams are.

If two words show up as a "sensible" bigram, I want to store the bigram and not the individual words. If the same words show up in a different context (i.e. not as a bigram), then I want to save them as single words.

In the toy example below, "of the" is an important bigram. Thus, I want to remove the single words "of" and "the" wherever they actually appear as "of the" in the text. But if "of" and "the" show up in other combinations, I would like to keep them as single words.

library(janeaustenr)
library(data.table)
library(dplyr)
library(tidytext)
library(tidyr)


# make unigrams
tide <- unnest_tokens(austen_books(), output = word, input = text)
# make bigrams
tide2 <- unnest_tokens(austen_books(), output = bigrams, input = text, token = "ngrams", n = 2)

# keep only the most frequent bigrams (in reality, use a more sensible metric)
keepbigram <- names(sort(table(tide2$bigrams), decreasing = TRUE)[1:10])
keepbigram
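#> expected output (inferred from the compound tokens shown in the answer below):
#>  [1] "of the"   "to be"    "in the"   "it was"   "i am"     "she had" 
#>  [7] "of her"   "to the"   "she was"  "had been"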
tide2 <- tide2[tide2$bigrams %in% keepbigram, ]

# this removes all unigrams which show up in the relevant bigrams,
# even where they do not actually occur as part of one of those bigrams
biwords <- unlist(strsplit(keepbigram, " "))
biwords
tide[!(tide$word %in% biwords), ]

# want to keep these words in tide when they are not part of the relevant bigrams

Solution

  • You could do this by replacing the bigrams you're interested in with a compound token in the text before tokenisation (i.e. before unnest_tokens):

    keepbigram_new <- stringi::stri_replace_all_regex(keepbigram, "\\s+", "_")
    keepbigram_new
    #>  [1] "of_the"   "to_be"    "in_the"   "it_was"   "i_am"     "she_had" 
    #>  [7] "of_her"   "to_the"   "she_was"  "had_been"
    

    Using _ instead of whitespace is common practice for this. stringi::stri_replace_all_regex is essentially the same as gsub or str_replace from stringr, but a little faster and with more features.
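
    For comparison, base R's gsub produces the same compounds here; a quick sanity check (not from the original answer):

    identical(keepbigram_new, gsub("\\s+", "_", keepbigram))
    #> [1] TRUE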

    Now replace the bigrams in the text with these new compounds before tokenisation. I use word boundary regular expressions (\\b) at the beginning and end of each bigram so as not to accidentally capture e.g. "of them" (see the quick check after the next block):

    topwords <- austen_books() %>% 
      mutate(text = stringi::stri_replace_all_regex(text, paste0("\\b", keepbigram, "\\b"), keepbigram_new, vectorize_all = FALSE)) %>% 
      unnest_tokens(output = word, input = text) %>% 
      count(word, sort = TRUE) %>% 
      mutate(rank = seq_along(word))
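
    To see why the word boundaries matter, here is a quick check (a minimal illustration, not from the original answer). Without \\b, the pattern "of the" would also match inside "of them":

    stringi::stri_replace_all_regex("think of them", "\\bof the\\b", "of_the")
    #> [1] "think of them"
    stringi::stri_replace_all_regex("think of the sea", "\\bof the\\b", "of_the")
    #> [1] "think of_the sea"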
    

    Looking at the most common words, the first bigram now appears at rank 40:

    topwords %>% 
      slice(1:4, 39:41)
    #> # A tibble: 7 x 3
    #>   word       n  rank
    #>   <chr>  <int> <int>
    #> 1 and    22515     1
    #> 2 to     20152     2
    #> 3 the    20072     3
    #> 4 of     16984     4
    #> 5 they    2983    39
    #> 6 of_the  2833    40
    #> 7 from    2795    41
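
    If you want the original spelling back later (e.g. for plots or tables), you can simply undo the compounding; a small follow-up sketch, not part of the original answer:

    topwords %>% 
      mutate(word = stringi::stri_replace_all_fixed(word, "_", " "))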