I have a big dataset (>1 million rows), and each row is a multi-sentence text. For example, here is a sample of 2 rows:
mydat <- data.frame(text=c('I like apple. Me too','One two. Thank you'),stringsAsFactors = F)
What I am trying to do is extract the bigram terms in each row, with "." acting as a separator so that no bigram spans two sentences. If I simply use the dfm function:
mydfm = dfm(mydat$text,toLower = T,removePunct = F,ngrams=2)
dtm = as.DocumentTermMatrix(mydfm)
txt_data = as.data.frame(as.matrix(dtm))
These are the terms I got:
"i_like" "like_apple" "apple_." "._me" "me_too" "one_two" "two_." "._thank" "thank_you"
This is what I expect; basically, "." is skipped and used to separate the terms:
"i_like" "like_apple" "me_too" "one_two" "thank_you"
I believe writing slow loops could solve this as well, but given that it is a huge dataset I would prefer an efficient approach similar to dfm() in quanteda. Any suggestions would be appreciated!
If your goal is just to extract those bigrams, then you could use tokens() twice: once to tokenize to sentences, then again to make the ngrams for each sentence.
library("quanteda")
mydat$text %>%
tokens(what = "sentence") %>%
as.character() %>%
tokens(ngrams = 2, remove_punct = TRUE) %>%
as.character()
#[1] "I_like" "like_apple" "Me_too" "One_two" "Thank_you"
Insert a tokens_tolower() call after the first tokens() call if you like, or use char_tolower() at the end.
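If it helps to see the two-pass idea without any package, here is a minimal base R sketch. It is only an illustration, not how quanteda does it internally: the helpers sentence_bigrams() and text_bigrams() are hypothetical names, sentences are split naively on ".", "!" and "?" (so abbreviations such as "e.g." would be split incorrectly), and lowercasing is applied to match the output expected in the question.

```r
# Hypothetical helper (not part of quanteda): bigrams within a single sentence.
sentence_bigrams <- function(sentence) {
  # Lowercase, then split on anything that is not a letter, digit, or apostrophe.
  words <- strsplit(tolower(sentence), "[^[:alnum:]']+")[[1]]
  words <- words[nzchar(words)]            # drop empty strings from leading separators
  if (length(words) < 2) return(character(0))
  # Pair each word with the one that follows it.
  paste(head(words, -1), tail(words, -1), sep = "_")
}

# Hypothetical helper: split one text into sentences, then collect bigrams per sentence.
text_bigrams <- function(text) {
  # Naive sentence split (an assumption): break at runs of ".", "!" or "?".
  sentences <- strsplit(text, "[.!?]+")[[1]]
  unlist(lapply(sentences, sentence_bigrams))
}

mydat <- data.frame(text = c("I like apple. Me too", "One two. Thank you"),
                    stringsAsFactors = FALSE)

bigrams <- unlist(lapply(mydat$text, text_bigrams))
print(bigrams)
# [1] "i_like"     "like_apple" "me_too"     "one_two"    "thank_you"
```

Because each sentence is handled on its own, no bigram can cross a sentence boundary. For a million rows, the quanteda pipeline is likely to be faster than this lapply() version.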