Using unnest_tokens, I want to create a tidy text tibble which combines two different tokens: single words and bigrams. The reasoning behind this is that sometimes single words are the more reasonable unit to study and sometimes it is rather higher-order n-grams.
If two words show up as a "sensible" bigram, I want to store the bigram and not the individual words. If the same words show up in a different context (i.e. not as a bigram), then I want to store them as single words.
In the simple example below, "of the" is an important bigram. Thus, I want to remove the single words "of" and "the" where they actually appear as "of the" in the text. But if "of" and "the" show up in other combinations, I would like to keep them as single words.
library(janeaustenr)
library(data.table)
library(dplyr)
library(tidytext)
library(tidyr)
# make unigrams
tide <- unnest_tokens(austen_books(), output = word, input = text)
# make bigrams
tide2 <- unnest_tokens(austen_books(), output = bigrams, input = text, token = "ngrams", n = 2)
# keep only the most frequent bigrams (in reality, use a more sensible metric, e.g. PMI)
keepbigram <- names(sort(table(tide2$bigrams), decreasing = TRUE)[1:10])
keepbigram
tide2 <- tide2[tide2$bigrams %in% keepbigram,]
# this removes all unigrams which show up in relevant bigrams
biwords <- unlist(strsplit(keepbigram, " "))
biwords
tide[!(tide$word %in% biwords),]
# want to keep biwords in tide if they are not part of bigrams
You could do this by replacing the bigrams you're interested in with a compound in the text before tokenisation (i.e. before unnest_tokens):
keepbigram_new <- stringi::stri_replace_all_regex(keepbigram, "\\s+", "_")
keepbigram_new
#> [1] "of_the" "to_be" "in_the" "it_was" "i_am" "she_had"
#> [7] "of_her" "to_the" "she_was" "had_been"
Using _ instead of whitespace is common practice for this. stringi::stri_replace_all_regex is pretty much the same as gsub or str_replace from stringr, but a little faster and with more features.
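For a single pattern like this one, base R gives the same result; here is a quick equivalence check (purely illustrative, not part of the solution):

# base R equivalent for one pattern
gsub("\\s+", "_", keepbigram)
#> [1] "of_the" "to_be" "in_the" "it_was" "i_am" "she_had"
#> [7] "of_her" "to_the" "she_was" "had_been"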
Now replace the bigrams in the text with these new compounds before tokenisation. I use word boundary regular expressions (\\b) at the beginning and end of the bigrams so as not to accidentally capture e.g., "of them":
topwords <- austen_books() %>%
  mutate(text = stringi::stri_replace_all_regex(
    text,
    pattern = paste0("\\b", keepbigram, "\\b"),
    replacement = keepbigram_new,
    vectorize_all = FALSE
  )) %>%
  unnest_tokens(output = word, input = text) %>%
  count(word, sort = TRUE) %>%
  mutate(rank = seq_along(word))
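To see why the word boundaries matter, here is a quick check on a made-up snippet (again, just an illustration):

# without boundaries, "of the" also matches inside "of them"
stringi::stri_replace_all_regex("some of them", "of the", "of_the")
#> [1] "some of_them"
# with boundaries, "of them" is left untouched
stringi::stri_replace_all_regex("some of them", "\\bof the\\b", "of_the")
#> [1] "some of them"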
Looking at the most common words, the first bigram now appears at rank 40:
topwords %>%
slice(1:4, 39:41)
#> # A tibble: 7 x 3
#> word n rank
#> <chr> <int> <int>
#> 1 and 22515 1
#> 2 to 20152 2
#> 3 the 20072 3
#> 4 of 16984 4
#> 5 they 2983 39
#> 6 of_the 2833 40
#> 7 from 2795 41
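If you want the bigrams displayed with a space again in the final tibble, you could convert the compounds back as a last step (this assumes no other tokens contain underscores):

# turn compounds like "of_the" back into "of the"
topwords <- topwords %>%
  mutate(word = stringi::stri_replace_all_fixed(word, "_", " "))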