
R: find ngram using dfm when there are multiple sentences in one document


I have a big dataset (>1 million rows) where each row is a multi-sentence text. For example, the following is a sample of 2 rows:

mydat <- data.frame(text = c('I like apple. Me too', 'One two. Thank you'), stringsAsFactors = FALSE)

What I am trying to do is extract the bigram terms in each row (the "." should act as a separator, so that ngram terms do not span sentences). If I simply use the dfm function:

mydfm <- dfm(mydat$text, toLower = TRUE, removePunct = FALSE, ngrams = 2)
dtm <- as.DocumentTermMatrix(mydfm)
txt_data <- as.data.frame(as.matrix(dtm))

These are the terms I got:

"i_like"     "like_apple" "apple_."    "._me"       "me_too"     "one_two"    "two_."      "._thank"    "thank_you" 

This is what I expect: basically "." is skipped and only acts as a separator between terms:

"i_like"     "like_apple"  "me_too"     "one_two"    "thank_you" 

I believe writing slow loops could solve this as well, but given that it is a huge dataset, I would prefer an efficient approach similar to dfm() in quanteda. Any suggestions would be appreciated!


Solution

  • If your goal is just to extract those bigrams, then you could use tokens twice: once to tokenize into sentences, then again to make the ngrams within each sentence.

    library("quanteda")
    tokens(mydat$text, what = "sentence") %>%           # first pass: one token per sentence
        as.character() %>%                              # back to a character vector of sentences
        tokens(ngrams = 2, remove_punct = TRUE) %>%     # second pass: bigrams within each sentence
        as.character()
    #[1] "I_like"     "like_apple" "Me_too"     "One_two"    "Thank_you"
    

    Insert a tokens_tolower() after the first tokens() call if you like, or use char_tolower() at the end; a sketch of the lowercased variant follows.
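
    For example, here is a minimal sketch of that lowercased variant, keeping the same tokens(ngrams = ) interface used above (that argument belongs to the older quanteda API this answer targets; in current releases you would call tokens_ngrams() instead, as sketched further below):

    library("quanteda")
    tokens(mydat$text, what = "sentence") %>%
        tokens_tolower() %>%                            # lowercase the sentence tokens
        as.character() %>%
        tokens(ngrams = 2, remove_punct = TRUE) %>%
        as.character()
    #[1] "i_like"     "like_apple" "me_too"     "one_two"    "thank_you"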
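
    If you also want the result as a document-feature matrix with one row per original row of mydat (rather than a flat character vector), one option is to reshape the corpus into sentences, build the bigrams there, and then group the sentence-level rows back to their source documents. This is a sketch assuming the newer quanteda API (corpus_reshape(), tokens_ngrams(), dfm_group()), and it relies on corpus_reshape() naming the sentence documents "text1.1", "text1.2", and so on:

    library("quanteda")
    corp <- corpus(mydat$text)                          # one document per row
    sents <- corpus_reshape(corp, to = "sentences")     # one document per sentence
    sent_dfm <- tokens(sents, remove_punct = TRUE) %>%
        tokens_tolower() %>%
        tokens_ngrams(n = 2) %>%                        # bigrams cannot cross sentences
        dfm()
    # strip the ".1", ".2" sentence suffix and sum the counts per original document
    row_dfm <- dfm_group(sent_dfm, groups = gsub("\\.\\d+$", "", docnames(sent_dfm)))

    dfm_group() sums the bigram counts within each group, so row_dfm has exactly one row per element of mydat$text.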