Text mining in R - Extracting 2-grams for only a few terms and 1-grams for the rest

text = c('the nurse was extremely helpful', 'she was truly a gem','helping', 'no issue', 'not bad')

I want to extract 1-gram tokens for most words and 2-gram tokens for words such as extremely, no, and not.

For example, the resulting tokens should be: the, nurse, was, extremely helpful, she, truly, gem, helping, no issue, not bad

These are the terms that should appear in the term-document matrix.

Thank you for the help!!

Solution

Here is a possible solution, assuming you want to avoid splitting not only after c("extremely", "no", "not") but also after words similar to them. The qdapDictionaries package provides word lists such as amplification.words (e.g. "extremely"), negation.words (e.g. "no" and "not"), and more.

Here is an example of how to split on a space except when the space follows a word in a predefined vector (here we define the vector using amplification.words, negation.words, and deamplification.words from qdapDictionaries). You can change the definition of no_split_words if you want to use a more customized list of words.

performing split

library(stringr)
library(qdapDictionaries)

text <-  c('the nurse was extremely helpful', 'she was truly a gem','helping', 'no issue', 'not bad')

# define the words after which we don't want to split on a space
no_split_words <- c(amplification.words, negation.words, deamplification.words)
# collapse the words into the form "word1|word2|...|wordn"
regex_or       <- paste(no_split_words, collapse="|")
# split on whitespace only if the preceding text does not end in one of no_split_words
# (paste0 is used because paste would insert spaces into the pattern)
split_regex    <- regex(paste0("((?<!", regex_or, "))\\s"))

# perform split
str_split(text, split_regex)

#output
[[1]]
[1] "the"               "nurse"             "was"               "extremely helpful"

[[2]]
[1] "she"     "was"     "truly a" "gem"    

[[3]]
[1] "helping"

[[4]]
[1] "no issue"

[[5]]
[1] "not bad"
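
If only the three words from the question should stay attached to the following word, a hand-written list can replace the dictionaries. The following is a minimal sketch of that variant, not part of the code above: `custom_words` and `custom_regex` are illustrative names, and the `\\b` word boundary is an added assumption so that a word merely ending in "no" or "not" (such as "piano") is still split normally.

```r
library(stringr)

text <- c('the nurse was extremely helpful', 'she was truly a gem',
          'helping', 'no issue', 'not bad')

# hypothetical hand-picked list: words after which we do NOT split
custom_words <- c("extremely", "no", "not")

# negative lookbehind: match whitespace only when it is not directly
# preceded by one of custom_words as a whole word
custom_regex <- regex(
  paste0("(?<!\\b(?:", paste(custom_words, collapse = "|"), "))\\s")
)

tokens <- str_split(text, custom_regex)
tokens
```

With this narrower list, "truly" is no longer treated as a no-split word, so "truly" and "a" should come out as separate tokens, which is closer to the token list in the question.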

creating dtm with `tidytext`

(assumes above code chunk was already run)

library(tidytext)
library(dplyr)

# tibble() replaces the deprecated data_frame()
doc_df <- tibble(text = text) %>% 
  mutate(doc_id = row_number())

# cast_dtm() creates a DocumentTermMatrix (the class defined in the tm package)
# value = 1 for every token gives a binary dtm
# for a non-binary dtm, set value to term frequency, tf-idf, etc.
tm_dtm <- doc_df %>% 
  unnest_tokens(tokens, text, token="regex", pattern=split_regex) %>% 
  mutate(value = 1) %>%  
  cast_dtm(doc_id, tokens, value)

# can coerce to matrix if desired
matrix_dtm <- as.matrix(tm_dtm)
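
The comments above mention that `value` could hold term frequencies instead of a binary flag. As a rough sketch of that variant (an assumption about how one might do it, not the code from this answer), the tokens can be tallied per document with `count()` before casting. To keep the example self-contained it uses a simplified three-word pattern rather than the qdapDictionaries lists; `split_pattern`, `tf_dtm`, and `tf_matrix` are illustrative names, and `cast_dtm()` needs the tm package to be installed.

```r
library(dplyr)
library(tidytext)

text <- c('the nurse was extremely helpful', 'she was truly a gem',
          'helping', 'no issue', 'not bad')

# simplified pattern for illustration (assumption): only these three
# words stay attached to the word that follows them
split_pattern <- "(?<!\\b(?:extremely|no|not))\\s"

tf_dtm <- tibble(doc_id = seq_along(text), text = text) %>%
  unnest_tokens(tokens, text, token = "regex", pattern = split_pattern) %>%
  count(doc_id, tokens, name = "value") %>%  # term frequency per document
  cast_dtm(doc_id, tokens, value)

# one row per document, one column per term
tf_matrix <- as.matrix(tf_dtm)
```

In this small example every count is 1, so the result matches the binary version; the difference only shows once a term repeats within a document.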
