r, nlp, n-gram

Output text with both unigrams and bigrams in R


I'm trying to figure out how to identify unigrams and bigrams in a text in R, and then keep both in the final output based on a threshold. I've done this in Python with gensim's Phraser model, but haven't figured out how to do it in R.

For example:

strings <- data.frame(text = c('This is a great movie from yesterday', 'I went to the movies', 'Great movie time at the theater', 'I went to the theater yesterday'))
#Pseudocode below
bigs <- tokenize_uni_bi(strings, n = 1:2, threshold = 2)
print(bigs)
# Desired output:
# [['this', 'great_movie', 'yesterday'], ['went', 'movies'], ['great_movie', 'theater'], ['went', 'theater', 'yesterday']]

Thank you!


Solution

  • You could use the quanteda framework for this:

    library(quanteda)
    # tokenize, tolower, remove stopwords and create ngrams
    my_toks <- tokens(strings$text) 
    my_toks <- tokens_tolower(my_toks)
    my_toks <- tokens_remove(my_toks, stopwords("english"))
    bigs <- tokens_ngrams(my_toks, n = 1:2)
    
    # turn into a document-feature matrix and keep only features that occur at least twice
    my_dfm <- dfm(bigs)
    dfm_trim(my_dfm, min_termfreq = 2)
    
    Document-feature matrix of: 4 documents, 6 features (50.0% sparse).
           features
    docs    great movie yesterday great_movie went theater
      text1     1     1         1           1    0       0
      text2     0     0         0           0    1       0
      text3     1     1         0           1    0       1
      text4     0     0         1           0    1       1
    
    # use convert() to turn the trimmed dfm into a data.frame, e.g.
    # convert(dfm_trim(my_dfm, min_termfreq = 2), to = "data.frame")
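
    If the goal is closer to gensim's Phraser output (one token list per document, with frequent bigrams merged into single tokens), a minimal sketch could look like the following. It relies on quanteda's tokens_compound(), phrase() and tokens_select(); the object names (bigram_keep, compounded, keep, result) are only illustrative. Applying the threshold of 2 first to bigrams and then to the remaining tokens is an assumption about the desired behaviour, not a replica of gensim's scoring.

    ```r
    library(quanteda)

    strings <- data.frame(text = c("This is a great movie from yesterday",
                                   "I went to the movies",
                                   "Great movie time at the theater",
                                   "I went to the theater yesterday"))

    # tokenize, lowercase, and drop stopwords (same preprocessing as above)
    toks <- tokens(strings$text)
    toks <- tokens_tolower(toks)
    toks <- tokens_remove(toks, stopwords("english"))

    # find bigrams that occur at least twice across all documents
    bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
    bigram_keep <- featnames(dfm_trim(dfm(bigrams), min_termfreq = 2))

    # merge those bigrams into single tokens such as "great_movie"
    compounded <- tokens_compound(toks, pattern = phrase(bigram_keep),
                                  concatenator = "_")

    # keep only tokens (unigrams or merged bigrams) that occur at least twice
    keep <- featnames(dfm_trim(dfm(compounded), min_termfreq = 2))
    result <- tokens_select(compounded, pattern = keep, valuetype = "fixed")

    # one character vector per document
    as.list(result)
    ```

    For the four example sentences this should leave great_movie and yesterday in the first document, went in the second, great_movie and theater in the third, and went, theater and yesterday in the fourth. Unlike the output sketched in the question, "this" is dropped as a stopword and "movies" is dropped because it occurs only once.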
    

    Alternatively, you could use the tidytext package, tm, tokenizers, etc. Which one fits best depends on the output you expect.

    An example using tidytext / dplyr looks like this:

    library(tidytext)
    library(dplyr)
    strings %>% 
      unnest_ngrams(bigs, text, n = 2, n_min = 1, ngram_delim = "_", stopwords = stopwords::stopwords()) %>% 
      count(bigs) %>% 
      filter(n >= 2)
    
             bigs n
    1       great 2
    2 great_movie 2
    3       movie 2
    4     theater 2
    5        went 2
    6   yesterday 2
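
    The count above is corpus-wide. To get the frequent unigrams and bigrams back per document, one possible sketch is to carry a document id through the pipeline and collect the surviving terms into a list column. The doc and terms columns and the per_doc object are names made up for this illustration. Note that this keeps both the bigram and its component unigrams; it does not merge them the way gensim's Phraser does.

    ```r
    library(dplyr)
    library(tidytext)

    strings <- data.frame(text = c("This is a great movie from yesterday",
                                   "I went to the movies",
                                   "Great movie time at the theater",
                                   "I went to the theater yesterday"))

    per_doc <- strings %>%
      mutate(doc = row_number()) %>%            # document id (illustrative column)
      unnest_ngrams(bigs, text, n = 2, n_min = 1, ngram_delim = "_",
                    stopwords = stopwords::stopwords()) %>%
      add_count(bigs) %>%                       # corpus-wide frequency of each n-gram
      filter(n >= 2) %>%                        # keep n-grams seen at least twice
      group_by(doc) %>%
      summarise(terms = list(bigs))             # one character vector per document

    per_doc$terms
    ```

    Documents in which no term reaches the threshold would drop out of per_doc entirely, so a join back to the original ids may be needed on other data.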
    

    Both quanteda and tidytext have a lot of online help available. See the vignettes for both packages on CRAN.